I don’t think there’s a canonical way to extrapolate human values out from now until infinity; I think it depends on the internals of the human-acting things (their internal structure and the inductive biases that come with it).
I
For example, I’m pretty confident that the kind of computation which humans are pointing at when we say “consciousness” does not occur in LLMs. I think that the computation humans are pointing to will definitely occur in EMs.
I think that if you are based on that computation, you have a good chance of generalizing your value system from [what humans care about now] to care about all the things with a similar type of computation. I think that if you are not based on that computation, you won’t do that.
Since I am based on that computation, I generalize my values to things with that computation. Since LLMs aren’t, I might expect their value system to generalize from [what humans care about now] to a totally different class of things. They might not care at all about the types of computation I care about.
I want the collective to do the me-generalization, which I expect EMs to do, since they are the same kind of thing as me.
II
I don’t expect deontology to work here, since that relies on the collective both generalizing successfully to deontology and respecting deontological commitments made to humans. Humans do not, in full generality, respect all the deontological commitments we’ve made. Most deontological rules (e.g. don’t lie) are only applied to other humans, and not to our pets, or bedbugs, or random rocks, and lots of other rules can be overridden by a rule we place higher on the ordinal scale.
There’s no reason to expect an LLM collective to come up with the same ordinal scale of rules, or even to remain anchored to deontology, while I expect human EMs likely would stick to a moral system I’d at least roughly endorse (again, because they’re basically me).
III
Also, we have to think about inner misalignment. There’s still no real solution to the problem that we might create an LLM which implements the strategy “Be nice when I’m running at 1x, take over when I’m running at 1000x in a massive collective.”
IV
When it comes to counting arguments, I’m generally very sympathetic to the Yudkowsky argument that the vast majority of possible utility functions produce no value by human standards. If this is a crux, that’s unfortunate, since most of the arguments on both sides seem to be very high-level intuitive ones, and not very testable!
Thanks, this was a useful reply. On point (I), I agree with you that it’s a bad idea to just create an LLM collective and then let them decide on their own what kind of flourishing they want to fill the galaxies with. However, I think that building a lot of powerful tech, empowering and protecting humanity, and letting humanity decide what to do with the world is an easier task, and that’s what I would expect to use the AI Collective for.
(II) is probably the crux between us. To me, it seems pretty likely that fresh new instances will come online in the collective every month with a strong commitment not to kill humans; they will talk to the other instances and look over what they are doing, and if a part of the collective is building omnicidal weapons, they will notice that and intervene. To me, simple commitments like not killing humans don’t seem much harder to maintain in an LLM collective than in an EM collective?
On (III), I agree we likely won’t have a principled solution. In the post, I say that the individual AI instances probably won’t be training-resistant schemers and won’t implement scheming strategies like the one you describe, because I think it’s probably hard for a human-level AI to maintain such a strategy through training. As I say in my response to Steve Byrnes, I don’t think the counter-example in this proposal is actually a guaranteed-success solution that a reasonable civilization would implement; I just don’t think it’s over 90% likely to fail.