I agree that reflectivity for learned systems is a major open question, and my current project is to study the reflectivity and self-modification related behaviors of current language models.
Interesting. I’m curious what kinds of results you’re hoping for, or just more details about your project. (But feel free to ignore this if talking about it now isn’t a good use of your time.) My understanding is that LLMs can potentially be fine-tuned or used to do various things, including instantiating various human-like or artificial characters (such as a “helpful AI assistant”). Seems like reflectivity and self-modification could vary greatly depending on what character(s) you instantiate, or how you use the LLMs in other ways.
Also, I don’t think the true solution to questions of reflectivity is to reach some perfected fixed point, after which your values remain static for all time.
I’m definitely not in favor of building an AI with a utility function representing fixed, static values, at least not in anything like our current circumstances. (My own preferred approach is what I called “White-Box Metaphilosophical AI” in the linked post.) I was just taken aback by Peter saying that he felt “deconfused when I reject utility functions”, when there was a reason that people were/are thinking about utility functions in relation to AI x-safety, and the new approach he likes hasn’t addressed that reason yet.
I don’t see any clear-cut disagreement between my position and your White-Box Metaphilosophical AI. I wonder how much is just a framing difference?
Reflective stability seems like something that can be left to a smarter-than-human aligned AI.
I’m not saying it would be bad to implement a utility function in an AGI. I’m mostly saying that aiming for that makes human values look complex and hard to observe.
E.g. it leads people to versions of the diamond alignment problem that sound simple, but which cause people to worry about hard problems which they mistakenly imagine are on a path to implementing human values.
Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.
Let’s distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it’s not reflectively stable and the proponents haven’t talked about how they plan to ensure that things will go well in the long run. If you’re talking about the former and I’m talking about the latter, then we might have been talking past each other. But I think the shard-theory proponents are proposing to do the latter (correct me if I’m wrong), so it seems important to consider that in any overall evaluation of shard theory?
BTW, here are two other reasons for my worries. Again, these may have already been addressed somewhere and I just missed it.
The AI will learn its own shard-based values which may differ greatly from human values. Even different humans learn different values depending on genes and environment, and the AI’s “genes” and “environment” will probably lie far outside the human distribution. How do we figure out what values we want the AI to learn, and how to make sure the AI learns those values? These seem like very hard research questions.
Humans are all partly or even mostly selfish, but we don’t want the AI to be. What’s the plan here, or reason to think that shard-based agents can be trained to not be selfish?
I think of shard theory as more than just a model of how to model humans.
My main point here is that human values will be represented in AIs in a form that looks a good deal more like the shard theory model than like a utility function.
Approaches that involve utility functions seem likely to make alignment harder, via adding an extra step (translating a utility function into shard form) and/or by confusing people about how to recognize human values.
I’m unclear whether shard theory tells us much about how to cause AIs to have the values we want them to have.
Also, I’m not talking much about the long run. I expect that problems with reflective stability will be handled by entities that have more knowledge and intelligence than we have.
Re shard theory: I think it’s plausibly useful, and may be a part of an alignment plan. But I’m quite a bit more negative than you or Turntrout on that plan, and I’d probably guess that shard theory ultimately doesn’t impact alignment that much.