Thanks so much for that explanation. I’ve started to review those posts you linked to and will continue doing so later. Kudos for clearly outlining your positions; that’s a lot of content.
> “We probably mostly disagree because you’re expecting LLMs forever and I’m not.”
I agree that RL systems like AlphaZero are very scary. Personally, I was a bit more worried about AI alignment a few years ago, when this seemed like the dominant paradigm.
I wouldn’t say that I “expect LLMs forever”, but I would say that if/when they are replaced, I think it’s more likely than not that they will be replaced by a system whose scariness is similar to or less than that of LLMs. The main reason is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems.
The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such an intelligence/capabilities advantage that it outweighs these disadvantages, but I see this as unlikely (though definitely a thing to worry about).
> I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level?
I’m referring to scaffolding. As in, an organization makes an “AI agent”, but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks. These subcalls might be optimized to be narrow + [low information] + [low access] + [generally friendly to humans] or similar. This could be made more advanced with a large variety of fine-tuned models, though that might be unlikely.
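To make that concrete, here’s a minimal sketch of what such scaffolding could look like, assuming a generic chat-completion setup; the `Subcall` structure, `call_llm` stub, and subcall names are hypothetical placeholders for illustration, not any real product’s API:

```python
# A minimal, hypothetical sketch of the scaffolding idea above (not any real
# product's architecture): the "agent" never reasons open-endedly, it just
# routes each task to a fixed, narrow LLM+prompt subcall with capped input
# ("low information") and an explicit tool allow-list ("low access").

from dataclasses import dataclass

@dataclass(frozen=True)
class Subcall:
    """One narrow LLM+prompt combination with a restricted scope."""
    system_prompt: str      # fixed, human-reviewed instructions
    allowed_tools: tuple    # explicit allow-list of side effects ("low access")
    max_context_chars: int  # cap on what the subcall gets to see ("low information")

# Illustrative subcalls; the names and prompts are made up for this example.
SUBCALLS = {
    "summarize_ticket": Subcall(
        system_prompt="Summarize the customer ticket in three sentences. Do nothing else.",
        allowed_tools=(),
        max_context_chars=4_000,
    ),
    "draft_reply": Subcall(
        system_prompt="Draft a polite reply using only the provided summary and policy text.",
        allowed_tools=("send_draft_for_human_review",),
        max_context_chars=8_000,
    ),
}

def call_llm(system_prompt: str, user_input: str) -> str:
    """Placeholder for whatever completion API the organization actually uses."""
    return f"[model output for prompt: {system_prompt[:40]}...]"

def run_subcall(name: str, user_input: str) -> str:
    sub = SUBCALLS[name]
    # Each call is independently scoped: truncated input, fixed prompt, no shared memory.
    return call_llm(sub.system_prompt, user_input[: sub.max_context_chars])

# Example: the agent handles a task by chaining narrow subcalls, rather than
# giving one model broad context and broad access.
summary = run_subcall("summarize_ticket", "…full ticket text…")
reply = run_subcall("draft_reply", summary)
```

The point of this shape is that each subcall sees only a capped slice of context and can only trigger tools on an explicit allow-list, so the agent’s overall behavior is pinned down by human-written routing rather than open-ended model reasoning.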
> there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems
I have a three-way disjunctive argument on why I don’t buy that:
(1) The really scary systems are smart enough to realize that they should act non-scary, just like smart humans planning a coup are not gonna go around talking about how they’re planning a coup, but rather will be very obedient until they have an opportunity to take irreversible actions.
(2) …And even if (1) were not an issue, i.e. even if the scary misaligned systems were obviously scary and misaligned rather than secretly so, that still wouldn’t prevent those systems from being used to make money—see Reward button alignment for details. Except that this kind of plan stops working when the AIs get powerful enough to take over.
(3) …And even if (1-2) were not issues, i.e. even if the scary misaligned systems were useless for making money, well, MuZero did in fact get made! People just like doing science and making impressive demos, even without profit incentives. This point is obviously more relevant for people like me who think that ASI won’t require much hardware, just new algorithmic ideas, than for people (probably like you) who expect that training ASI will take a zillion dollars.
> As in, an organization makes an “AI agent”, but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks.
I think this points to another deep difference between us. If you look at humans, we have one brain design, barely changed since 100,000 years ago, and (many copies of) that one brain design autonomously figured out how to run companies and drive cars and go to the moon and everything else in science and technology and the whole global economy.
I expect that people will eventually invent an AI like that—one AI design and bam, it can just go and autonomously figure out anything—whereas you seem to be imagining that the process will involve laboriously applying schlep to get AI to do more and more specific tasks. (See also my related discussion here.)
> there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems
I agree that there is an optimization pressure here, but I don’t think it robustly targets “don’t create misaligned superintelligence”; rather, it targets “customers and regulators not being scared”, which is very different from “don’t make things customers and regulators should be scared of”.
I was thinking more of internal systems that a company would have enough faith in to deploy (a 1% chance of severe failure is pretty terrible!) or customer-facing things that would piss off customers more than scare them.
Getting these right is tremendously hard. Lots of companies are trying and mostly failing right now. There’s a ton of money in just “making solid services/products that work with high reliability.”
My impression is that companies are very short-sighted, optimizing for quarterly and yearly results even if it has a negative effect on the company’s performance in 5 years, and even if it has negative effects on society. I also think many (most?) companies view regulations not as signals for how they should be behaving but more like board game rules: if they can change or evade the rules to profit, they will.
I’ll also point out that it is probably in the best interest of many customers to be pissed off. Sycophantic products make more money than ones that force people to confront ways they are acting against their own values. In my estimation, that is a pretty big problem.
But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
Thanks for the clarification.
> But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem, and then it becomes even more difficult to convince people there even is a problem as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem. As in, this can get us very far into making AI systems do a lot of valuable work for us with very low risk.
I imagine that you’re using a more specific definition of it than I am here.
> The way I see it, “making solid services/products that work with high reliability” is solving a lot of the alignment problem.
Funny, I see “high reliability” as part of the problem rather than part of the solution. If a group is planning a coup against you, then your situation is better not worse if the members of this group all have dementia. And you can tell whether or not they have dementia by observing whether they’re competent and cooperative and productive before any coup has started.
If the system is not the kind of thing that could plot a coup even if it wanted to, then it’s irrelevant to the alignment problem, or at least to the most important part of the alignment problem. E.g. spreadsheet software and bulldozers likewise “do a lot of valuable work for us with very low risk”.
> I imagine that you’re using a more specific definition of it than I am here.
I might be. I might also be using a more general definition. Or just a different one. Alas, that’s natural language for you.
> very far into making AI systems do a lot of valuable work for us with very low risk.
I agree, but I feel it’s important to note that the risk is only locally low. Globally, I think the risk is catastrophic.
I think the biggest difference in our POVs might be that I see the systems we are using to control what happens in our world (markets, governments, laws) as already misaligned and heading towards disasters. If we allow them to continue getting more capable, they will not suddenly become capable enough to get back on track, because they were never aligned to target human-friendly preferences in the first place. Rather, they target proxies, but capabilities have gone beyond the point where those proxies are articulate enough for good outcomes. We need to switch focus from capabilities to alignment.
> The main reason is that I think there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems.
> The scariness of RL systems like AlphaZero seems to go hand-in-hand with some really undesirable properties, such as [being a near-total black box] and [being incredibly hard to intentionally steer]. It’s definitely possible that in the future some capabilities advancement might mean that scary systems have such an intelligence/capabilities advantage that it outweighs these disadvantages, but I see this as unlikely (though definitely a thing to worry about).
Curious what evidence makes you think that “being a near-total black box” restrains adoption of these systems? Social media companies have very successfully deployed and protected their black-box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
Further, “being incredibly hard to intentionally steer” is, for me, a baseline assumption about how practically any conceivable agentic AI works, and given that we almost surely cannot get statistical guarantees about AI agent behaviour in open settings, I don’t see any reason (especially in the current political environment) that this property would be a showstopper.
> Social media companies have very successfully deployed and protected their black-box recommendation algorithms despite massive negative societal consequences, and the current transformer models are arguably black boxes with massive adoption.
I agree that some companies do use RL systems. However, I’d expect that most of the time, the black-box nature of these systems is not actively preferred. Companies use them despite their black-box nature, in specific situations where the benefits outweigh the costs, not because of it.
“current transformer models are arguably black boxes with massive adoption.” → They’re typically much less of a black box than RL systems. There’s a fair bit of customization that can be done with prompting, and the prompting is generally English-readable.