Also, I would still like an answer to my query for the specific link to the argument you want to see people engage with.
I haven’t looked very hard, but sure, here’s the first post that comes up when I search for “optimization user:eliezer_yudkowsky”.
The notion of a “powerful optimization process” is necessary and sufficient to a discussion about an Artificial Intelligence that could harm or benefit humanity on a global scale. If you say that an AI is mechanical and therefore “not really intelligent”, and it outputs an action sequence that hacks into the Internet, constructs molecular nanotechnology and wipes the solar system clean of human(e) intelligence, you are still dead. Conversely, an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
In this paragraph we have most of the relevant section (at least w.r.t. your specific concerns; it doesn’t argue for why most powerful optimization processes would eat everything by default, but that “why” is argued for at such extensive length elsewhere, when talking about convergent instrumental goals, that I will forgo sourcing it).
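To make “a very weak ability to steer the future into regions high in its preference ordering” slightly more concrete, here is a minimal toy sketch in the spirit of Yudkowsky’s “Measuring Optimization Power” framing: measure, in bits, how small a high-preference target a system actually hits relative to a no-optimization baseline. (The uniform outcome space and the function below are my own illustrative assumptions, not anything from the quoted post.)

```python
import math

def optimization_power_bits(preference_rank, achieved, outcomes):
    """Bits of optimization: how narrow a high-preference target was hit.

    preference_rank: callable scoring each outcome (higher = more preferred).
    achieved: the outcome the system actually steered the world into.
    outcomes: the full space of possible outcomes, assumed equiprobable
              under the 'no optimization' baseline (a toy assumption).
    """
    at_least_as_good = sum(
        1 for o in outcomes if preference_rank(o) >= preference_rank(achieved)
    )
    return -math.log2(at_least_as_good / len(outcomes))

# Toy example: 1024 equally likely outcomes, preference = the outcome's value.
outcomes = range(1024)
print(optimization_power_bits(lambda o: o, 512, outcomes))   # top half hit: 1 bit
print(optimization_power_bits(lambda o: o, 1023, outcomes))  # single best outcome: 10 bits
```

On this framing, a “weak optimizer” just means “few bits”: it barely narrows the future relative to chance, which is why the quote claims it can neither much benefit nor much harm humanity.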
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans. (We can observe that even most humans are not especially strong optimizers by default, such that most people don’t exert that much optimization power in their lives, even in a way that’s cooperative with other humans.) I think they have much less coherent preferences over future states than most humans. If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
EDIT: I see that several other people already made similar points re: sources of agency, etc.
an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
Arguably ChatGPT has already been a significant benefit/harm to humanity without being a “powerful optimization process” by this definition. Have you seen teachers complaining that their students don’t know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn’t these count as points against Eliezer’s model?
In an “AI as electricity” scenario (basically a continuation of the current business-as-usual), we could see “AIs”, as a collective, cause huge changes and eat all the free energy that a “powerful optimization process” would eat.
In any case, I don’t see much in your comment which engages with “agency by default” as I defined it earlier. Maybe we just don’t disagree.
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
OK, but no pre-ASI evidence can count against your model, according to you?
That seems sketchy, because I’m also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can’t be the case that evidence during a certain time period will only confirm your model. Otherwise you already would’ve updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it.
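To spell out the step being invoked: conservation of expected evidence is just the law of total probability applied to the posterior,

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \neg E)\,P(\neg E),$$

so if observing $E$ would raise $P(H)$, then observing $\neg E$ has to lower it; the expected net update is zero.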
I’ve updated against Eliezer’s model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn’t happen.
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans.
I think “optimizer” is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn’t the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word “optimizer”?
Trying to take over the world is not an especially original strategy. It doesn’t take a genius to realize that “hey, I could achieve my goals better if I took over the world”. Yet current AIs don’t appear to be contemplating it. I claim this is not a lack of capability, but simply that their training scheme doesn’t result in them becoming the sort of AIs which contemplate it. If the training scheme holds basically constant, perhaps adding more data or parameters won’t change things?
If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
The results of LLM training schemes give us evidence about the results of future AI training schemes. Future AIs could be vastly more capable on many different axes relative to current LLMs, while simultaneously not contemplating world takeover, in the same way that current LLMs do not.
Or because they are not optimizers at all.
I don’t agree: they somehow optimize the goal of being an HHH assistant. We could almost say that they optimize the goal of being aligned. As nostalgebraist reminds us, Anthropic’s HHH paper was alignment work in the first place. It’s not that surprising that such optimizers happen to be more aligned than the canonical optimizers envisioned by Yudkowsky.
Edit (clarification): by “they” I mean the base models trying to predict the answers of an HHH assistant as well as possible (“as well as possible” being clearly a process of optimization, or I don’t know what optimization means). And in my opinion a sufficiently good prediction is effectively, or practically, a simulation. Maybe not a bit-perfect simulation, but a lossy simulation, a heuristic approximation of one.