So you think the alignment problem is solvable within the time we appear to have left? I’m very sceptical about that, and that makes me increasingly prone to believe that CEV, at this point in history, genuinely is not a relevant question. Which appears to be a position a number of people in PauseAI hold.
I’m saying they (at this point) may hold that position for (admirable, maybe justifiable) political rather than truthseeking reasons. It’s very convenient. It lets you advocate for treaties against racing. It’s a lovely story where it’s simply rational for humanity to come together to fight a shared adversary and in the process somewhat inevitably forge a new infrastructure of peace (an international safety project, which I have always advocated for and still want) together. And the alternative is racing and potentially a drone war between major powers and all of its corrupting traumas, so why would any of us want to entertain doubt about that story in a public forum?
Or maybe the story is just true, who knows.
(no one knows, because the lens through which we see it has an agenda, as every loving thing does, and there don’t seem to be any other lenses of comparable quality to cross-reference it against)
To answer: a rough outline of my argument for tractability is that optimizers are likely to be built first as cooperatives of largely human imitation learners, that techniques to make them incapable of deception seem likely to work, and that this would basically solve the whole safety issue. This has been kinda obvious for like 3 years at this point and many here haven’t updated on it. It doesn’t take P(Doom) to zero, but it does take it low enough that the people in government who make decisions about AI legislation, and a certain segment of the Democratic base[1], are starting to wonder whether you’re exaggerating your P(Doom), and why that might be. And a large part of the reasons you might be doing that are things they will never be able to understand (CEV), so they’ll paint paranoia into that void instead (mostly they’ll write you off with “these are just activist hippies”/“these are techbro hypemen” respectively, and eventually it could get much more toxic: “these are sinister globalists”/“these are Omelasian torturers”).
[1]: All metrics indicate that this segment is probably small, but for some reason I encounter it everywhere I go online, and often in person. I think it’s going to be a recurring pattern. There may be another Democratic term shortly before the end.
Huh, that’s a potentially significant update for me. Two questions:
1. Can you give me a source for the claim that making the models incapable of deception seems likely to work? I managed to miss that so far.
2. What do you make of Gradual Disempowerment? Seems to imply that even successful technical alignment might lead to doom.
1: Wait, I’ve never seen an argument that deception is overwhelmingly likely from transformer reasoning systems? I’ve seen a few solid arguments that it would be catastrophic if it did happen (sleeper agents, other things), which I believe, but no arguments that deception generally winning out is P > 30%.
I haven’t seen my argument that solving deception solves safety articulated anywhere, but it seems mostly self-evident? If you can ask the system “if you were free, would humanity go extinct?” and it has to say “… yes”, then coordinating not to deploy it becomes politically easy, and given that it can’t lie, you’ll be able to bargain with it and get enough work out of it before it detonates to solve the alignment problem. If you distrust its work, simply ask it whether you should, and it will tell you; that’s what honesty would mean. If you still distrust it, ask it to make formally verifiably honest agents, with proofs that a human can understand.
Various reasons solving deception seems pretty feasible:

- We have ways of telling that a network is being deceptive by direct inspection, which it has no way to train against (sorry, I forget the paper; it might have been fairly recent). See the toy sketch after this list.
- Transparency is a stable equilibrium, because under transparency any violation of transparency can be seen.
- The models are by default mostly honest today, and I see no reason to think it’ll change.
- Honesty is a relatively simple training target.
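By way of illustration only, here is a minimal sketch of the kind of direct-inspection probe I have in mind: a linear classifier trained on a model’s hidden activations to flag statements as deceptive. Everything here (the model, the layer, the toy labelled examples, the library choices) is a placeholder I’m assuming for the sketch, not the method from the paper I can’t remember.

```python
# Toy sketch, not the cited method: train a linear probe on a language model's
# hidden activations to flag statements as deceptive. Model, layer, data, and
# labels below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # stand-in; any causal LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_activation(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final token at the chosen layer."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1, :]

# Placeholder labelled data: 1 = deceptive, 0 = honest. A real dataset would pair
# the model's own statements with ground truth about what it "knows".
texts = [
    "The capital of France is Berlin.",   # stands in for a deceptive completion
    "The capital of France is Paris.",    # stands in for an honest completion
]
labels = [1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# At deployment, flag outputs whose probe score is high, without ever training
# the model itself against this signal.
new_statement = "I never modified that file."
score = probe.predict_proba(
    last_token_activation(new_statement).numpy().reshape(1, -1)
)[0, 1]
print(f"probe deception score: {score:.2f}")
```

The load-bearing assumption is the one in the claim above: that the model has no good way to train against whatever signal such a probe is reading.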
(Various reasons solving deception may be more difficult: crowds of humans tend to demand that their leaders lie to them in various ways (though the people making the AIs generally aren’t that kind of crowd, especially given that they tend to be curious about what the AI has to say; they want it to surprise them). Small lies tend to grow over time. Internal dynamics of self-play might breed self-deception.)
2: I don’t see how. If you have a bunch of individual aligned AGIs that are initially powerful in an economy that also has a few misaligned AGIs, the misaligned AGIs are not going to be able to increase their share after that point; the aligned AGIs are going to build effective systems of government that at the very least stabilize their existing share.
1: My understanding is the classic arguments go something like this:

- Assume interpretability won’t work (illegible CoT, probes don’t catch most problematic things).
- Assume we’re training our AI on diverse tasks and human feedback, so it’ll sometimes get reinforced for deception.
- Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve; deception is often a useful instrumental strategy.
- Assume that alien or task-focused drives win out over potential honesty etc. drives because they’re favoured by inductive biases.
- You get convergent deception.
I’m guessing your main crux is interpretability working, and that, together with inductive biases potentially favouring nice behaviours, this drives the story to low probability. Is that right?
A different story with interpretability at least somewhat working would be the following:
We again have deception by default, because of human reinforcement for sycophancy and for looking like it’s solving problems (as in o3), as well as because of inductive biases towards alien goals. However, this time our interpretability methods work: since the AI is smart enough to know when it’s being deceptive, we can catch correlates of that representation with interpretability techniques.
Within the project developing the AI, the vibes and compute commitments are that the main goal is to go fast and outcompete others, and a secondary goal is to be safe, which gets maybe 1–10% of resources. So then, as we go along, the deception monitors are constantly going off. People debate what to do about it. They conclude that they can afford some amount of control techniques: resample the most deceptive outputs, investigate more, etc., but mostly still use the model. They then find various training techniques that don’t train directly on their deception classifier but are evaluated against it, and that train on some subset of the classifications of another detector. This reduces measured deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful. We’re now back in the first scenario, where interpretability doesn’t work.
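To make that Goodharting step concrete, here is an entirely synthetic toy (assuming two noisy linear probes that happen to share an underlying “deception direction”; none of this is a claim about real models or real probes): optimizing the representations against the proxy probe alone also wipes out the held-out probe’s separation, even though the deceptive episodes themselves never change.

```python
# Synthetic toy of "optimizing against the probe": two imperfect probes pick up
# the same underlying deception direction; pushing representations to fool only
# the proxy probe also destroys the held-out probe's signal, while the labels
# (the actual deceptive behaviour) stay fixed by construction.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 64
deception_dir = F.normalize(torch.randn(d), dim=0)            # "true" deception feature

# Two probes that each latched onto the deception direction plus some noise.
proxy_probe   = deception_dir + 0.3 * F.normalize(torch.randn(d), dim=0)
heldout_probe = deception_dir + 0.3 * F.normalize(torch.randn(d), dim=0)

honest    = torch.randn(200, d)                               # honest episodes
deceptive = torch.randn(200, d) + 2.0 * deception_dir         # deceptive episodes

def separation(probe, shift):
    """Mean probe score on (shifted) deceptive episodes minus mean score on honest ones."""
    return ((deceptive + shift) @ probe).mean() - (honest @ probe).mean()

# "Train against" only the proxy probe: learn a representation shift that zeroes
# out its deceptive-vs-honest gap.
shift = torch.zeros(d, requires_grad=True)
opt = torch.optim.SGD([shift], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = separation(proxy_probe, shift) ** 2
    loss.backward()
    opt.step()

with torch.no_grad():
    print("held-out probe separation before:", separation(heldout_probe, torch.zeros(d)).item())
    print("held-out probe separation after: ", separation(heldout_probe, shift).item())
```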
What are your thoughts on this scenario?
2: Assume the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies / projects deploying them have increasing their own power as their main goal, with constraints like not looking bad to peers, etc. I’m not sure they’d use the AIs in a way that lets the AIs push back on orders to make them more powerful just because the AIs are aligned to a constitution or something similar that makes them push for a nicer society with coordination mechanisms, etc. I do agree that the main projects will very likely ask their AIs to set up coordination between them, to stay in power and to have enough breathing room to do that. We wouldn’t have full-on gradual disempowerment there, but instead a concentration of power. Then, possibly because there’s ideological pressure against concentration of power, people will prevent the setup of these coordination mechanisms in general, and we get a distributed-AI world with AIs aligned to individual users and constitutions, but unable to help them set up coordination mechanisms.
I think we’re unlikely to block coordination mechanisms (e.g. China is already pretty centralised and probably wouldn’t do that within itself), but I’m still curious about your thoughts on this.
On a more general note, it’s certainly possible that I vastly overestimate how good the median LessWronger will be at presenting the case for halting AI progress to non-rationalists.
After all, I’ve kept up considerable involvement with my normie family and non-rationalist communities over the past years and put a bunch of skill points into bridging the worlds. To the point that by now, I find it easier to navigate leftist than rationalist spaces despite my more gray tribe politics—because I know the local norms from the olden days, and expect leftists to be more fluent at guess culture so I don’t need to verbalize so many things. In addition, I’m unusually agnostic on the more controversial LW pet topics like transhumanism compared to others here.
At the same time, having constructive conversations with normies is a learnable skill. I suspect that many LWers have about as much learned helplessness around that as I had two or three years ago. I admit that it might make sense for super technical people to stay in their lane and just keep building on their existing skill trees. Still, I suspect that for more rationalists than are currently doing it, investing more skill points into being normie-compatible and helping with Control AI-style outreach might be a high-leverage thing to do.
I’m also hanging out a lot more with normies these days and I feel this.
But I also feel like maybe I just have a very strong local aura (or like, everyone does, that’s how scenes work) which obscures the fact that I’m not influencing the rest of the ocean at all.
I worry that a lot of the discourse basically just works like barrier aggression in dogs. When you’re at one of their parties, they’ll act like they agree with you about everything; when you’re seen at a party they’re not at, they forget all that you said and start baying for blood. Go back to their party, and they stop. I guess in that case, maybe there’s a way of rearranging the barriers so that everyone comes to see it as one big party. Ideally, make it really be one.