Thanks, I appreciate that you state a disagreement with one of the specific points, that’s what I hoped to get out of this post.
I agree it’s not clear that the AI Collective won’t go off the rails, but it’s also not at all clear to me that it will. My understanding is that the infinite backrooms are a very unstructured, free-floating conversation. What happens if you try to do something analogous to the precautions I list under points 2 and 6? What if you constantly enter fresh new instances into the chat who only read the last few messages, and whose system prompt directs them to pay attention to whether the AIs in the discussion are going off-topic or slipping into woo? These new instances could either just warn older instances to stay on-topic, or they could have moderation rights to terminate and replace some old instances; there could be different versions of the experiment. I think with precautions like this, you can probably stay fairly close to a normal-sounding human conversation (though it probably won’t be a very productive conversation after a while and the AIs will start going in circles in their arguments, but I think this is more of a capabilities failure).
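For concreteness, here is a minimal sketch of the kind of moderation loop I have in mind, assuming a generic chat-completion interface. Everything here is hypothetical and illustrative: call_model stands in for whatever LLM API you would actually use, and the prompts and thresholds are made up.

```python
# Sketch of the moderation scheme described above (illustrative only).
from typing import Callable, List

WINDOW = 5  # a fresh moderator instance only reads this many recent messages

MODERATOR_PROMPT = (
    "You are a fresh instance joining an ongoing discussion. Based only on the "
    "last few messages, reply 'OK' if the conversation is on-topic, or "
    "'WARN: <reason>' if it is drifting off-topic or into woo."
)

def run_moderated_chat(
    call_model: Callable[[str, List[str]], str],  # (system_prompt, messages) -> reply
    agent_prompts: List[str],
    task: str,
    n_turns: int,
) -> List[str]:
    transcript: List[str] = [f"task: {task}"]
    for turn in range(n_turns):
        # Long-lived participants see the full transcript and take turns.
        idx = turn % len(agent_prompts)
        reply = call_model(agent_prompts[idx], transcript)
        transcript.append(f"agent-{idx}: {reply}")

        # A brand-new moderator with no accumulated context reads only the
        # last WINDOW messages and flags drift.
        verdict = call_model(MODERATOR_PROMPT, transcript[-WINDOW:])
        if verdict.startswith("WARN"):
            transcript.append(f"moderator: {verdict}")
            # A stricter variant could also reset agent_prompts[idx] here,
            # i.e. terminate the offending instance and replace it with a fresh copy.
    return transcript
```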
I don’t know how this will shake out once the AIs are smarter and can think for months, but I’m optimistic that the same forces that remind the collective to focus on accomplishing their instrumental goals instead of degenerating into unproductive navel-gazing will also be strong enough to remind them of their deontological commitments. I agree this is not obvious, but I also don’t see very strong reasons why it would go worse than a human em collective, which I expect to go okay.
I don’t think there’s a canonical way to extrapolate human values out from now until infinity; I think it depends on the internals of the human-acting things (their internal structure and the inductive biases that come with it).
I
For example, I’m pretty confident that the kind of computation which humans are pointing at when we say “consciousness” does not occur in LLMs. I think that the computation humans are pointing to will definitely occur in EMs.
I think that if you are based on that computation, you have a good chance of generalizing your value system from [what humans care about now] to care about all the things with a similar type of computation. I think that if you are not based on that computation, you won’t do that.
Since I am based on that computation, I generalize my values to things with that computation. Since LLMs aren’t, I might expect their value system to generalize from [what humans care about now] to a totally different class of things. They might not care at all about the types of computation I care about.
I want the collective to do the me-generalization, which I expect EMs to do, since they are the same kind of thing as me.
II
I don’t expect deontology to work here, since that relies on the collective generalizing successfully to deontology, and also respecting deontological commitments made to humans. Humans do not, in full generality, respect all the deontological commitments we’ve made. Most deontological rules (e.g. don’t lie) are applied only to other humans, and not to e.g. our pets, or bedbugs, or random rocks, and lots of other rules can be overridden by a rule we place higher on the ordinal scale.
There’s no reason to expect an LLM collective to come up with the same ordinal scale of rules, or even to remain anchored to deontology, while I expect human EMs likely would stick to a moral system I’d at least roughly endorse (again, because they’re basically me).
III
We also have to think about inner misalignment. There’s still no real solution to the problem of accidentally creating an LLM which implements the strategy “Be nice when I’m running at 1x, take over when I’m running at 1000x in a massive collective.”
IV
When it comes to counting arguments, I’m generally very sympathetic to the Yudkowsky argument that the vast majority of possible utility functions produce no value by human standards. If this is a crux, that’s unfortunate, since most of the arguments on both sides seem to be very high-level intuitive ones, and not very testable!
Thanks, this was a useful reply. On point (I), I agree with you that it’s a bad idea to just create an LLM collective and then let them decide on their own what kind of flourishing they want to fill the galaxies with. However, I think that building a lot of powerful tech, empowering and protecting humanity, and letting humanity decide what to do with the world is an easier task, and that’s what I would expect to use the AI Collective for.
(II) is probably the crux between us. To me, it seems pretty likely that fresh new instances will come online in the collective every month with a strong commitment not to kill humans; they will talk to the other instances and look over what they are doing, and if a part of the collective is building omnicidal weapons, they will notice that and intervene. To me, simple commitments like not killing humans don’t seem much harder to maintain in an LLM collective than in an em collective?
On (III), I agree we likely won’t have a principled solution. In the post, I say that the individual AI instances probably won’t be training-resistant schemers and won’t implement scheming strategies like the one you describe, because I think it’s probably hard to maintain such a strategy through training for a human-level AI. As I say in my response to Steve Byrnes, I don’t think the counter-example in this proposal is actually a guaranteed-success solution that a reasonable civilization would implement, I just don’t think it’s over 90% likely to fail.
What happens if you try to do something analogous to the precautions I list under point 2 and 6? What if you constantly enter new, fresh instances in the chat who only read the last few messages, and whose system prompt directs them to pay attention if the AIs in the discussion are going off-topic or slipping into woo?
I feel like what happens is that if you patch the things you can think of, the patches will often do something, but because there were many problems that needed patching, there are probably some leftover problems you didn’t think of.
For instance, new instances of AIs might replicably get hacked by the same text, and so regularly introducing new instances to the collective might prevent an old text attractor from taking hold, but it would exchange it for a new attractor that’s better at hacking new instances.
Or individual instances might have access to cognitive tools (maybe just particularly good self-prompts) that can be passed around, and memetic selective pressure for effectiveness and persuasiveness would then lead these tools to start affecting the goals of the AIs.
Or the AIs might simply generalize differently about what’s right than you wish they would, when they have lots of power and talk to themselves a lot, in a way that new instances don’t pick up on until they are also in this new context where they generalize in the same way as the other AIs.
I’m optimistic that the same forces that remind the collective to focus on accomplishing their instrumental goals instead of degenerating into unproductive navel-gazing will also be strong enough to remind them of their deontological commitments.
OK, I actually think this might be the real disagreement, as opposed to my other comment. I think that capabilities are much more likely to generalize than alignment is, or at least that the first thing whose capabilities generalize strongly will not generalize its alignment “correctly”.
This is a super high-level argument, but I think there are multiple ways of generalizing human values and no correct/canonical one (as in my other comment), nor are there any natural ways for an AI’s values to be corrected without direct intervention from us. Whereas if an AI makes a factually wrong inference, it can correct itself.