I think that multi-decision-influence networks seem much easier to align and much safer for humans.
It seems fine to me that you think this. As I wrote in a previous post, “Trust your intuitions, but don’t waste too much time arguing for them. If several people are attempting to answer the same question and they have different intuitions about how best to approach it, it seems efficient for each to rely on his or her intuition to choose the approach to explore.”
As a further meta point, I think there’s a pattern where, because many existing (somewhat) concrete AI alignment approaches seem doomed (we can fairly easily see how they would end up breaking), people come up with newer, less concrete approaches which don’t seem doomed — but only because they’re less concrete and therefore it’s harder to predict what they would actually do, or because fewer people have looked into them in detail and tried to break them. See this comment where I mentioned a similar worry with regard to Paul Christiano’s IDA when it was less developed.
In this case, I think there are many ways that a shard-based agent could potentially cause existential catastrophes, but it’s hard for me to say more, since I don’t know the details of what your proposal will be.
(For example, how do the shards resolve conflicts, and how will they eventually transform into a reflectively stable agent? If one of the shards learns a distorted version of human values, which would cause an existential catastrophe if directly maximized for, how exactly does that get fixed by the time the agent becomes reflectively stable? Or if the agent never ends up maximizing anything, why isn’t that liable to be a form of existential catastrophe? How do you propose to prevent astronomical waste caused by the agent spending resources on shard values that aren’t human values? What prevents the shard agent from latching onto bad moral/philosophical ideas and causing existential catastrophes that way?)
I don’t want to discourage you from working more on your approach and figuring out the details, but at the same time it seems way too early to say, “Hey, let’s stop working on other approaches and focus just on this one.”
I think these are great points, thanks for leaving this comment. I myself patiently await the possible day where I hit an obvious shard theory landmine which has the classic alignment-difficulty “feel” to it. That day can totally come, and I want to be ready to recognize if it does.
at the same time it seems way too early to say, “Hey, let’s stop working on other approaches and focus just on this one.”
FWIW I’m not intending to advocate “shard theory or GTFO”, and agree that would be bad as a community policy.
I’ve tried to mention a few times[1] (but perhaps insufficiently prominently) that I’m less excited about people going “oh yeah I guess shard theory is great or something, let’s just think about that now” and more excited about reactions like “Oh, I guess I should have been practicing more constant vigilance, time to think about alignment deeply on my own terms, setting aside established wisdom for the moment.” I’m excited about other people thinking about alignment from first principles and coming up with their own inside views, with their own theories and current-best end-to-end pictures of AGI training runs.
[1] From Inner and outer alignment decompose one hard problem into two extremely hard problems:

A-Outer: Suppose I agreed. Suppose I just dropped outer/inner. What next?

A: Then you would have the rare opportunity to pause and think while floating freely between agendas. I will, for the moment, hold off on proposing solutions. Even if my proposal is good, discussing it now would rob us of insights you could have contributed as well. There will be a shard theory research agenda post which will advocate for itself, in due time.

I’ve also made this point briefly at the end of in-person talks. Maybe I should say it more often.