I think of shard theory as more than just a way of modeling humans.
My main point here is that human values will be represented in AIs in a form that looks a good deal more like the shard theory model than like a utility function.
Approaches that involve utility functions seem likely to make alignment harder, via adding an extra step (translating a utility function into shard form) and/or by confusing people about how to recognize human values.
I’m unclear whether shard theory tells us much about how to cause AIs to have the values we want them to have.
Also, I’m not talking much about the long run. I expect that problems with reflective stability will be handled by entities that have more knowledge and intelligence than we have.
Re shard theory: I think it’s plausibly useful, and may be a part of an alignment plan. But I’m quite a bit more negative than you or Turntrout on that plan, and I’d guess that shard theory ultimately doesn’t impact alignment that much.