Note: I skimmed before bed, so I might have missed something. With that said -
Dang, I would not say this is “classic first alignment idea” material; while not qualitatively new, it seems like a nicely distilled framing of disparate ideas, with some insights as a result. I think it carries some severe issues in the foundation it builds on, but it seems to me to reduce them nicely, and it might even plug into ideas I’ve been thinking are plausible as a way to build reliable “coprotection”, the word I’ve been using for a while to refer to an innate drive to fulfill others’ preferences, that is, a network of overlapping utility functions. @carado tells me that relying on networks of utility functions is unreliable if we suddenly see a drastically more powerful singleton, but we’ll see, I guess; it doesn’t seem totally hopeless, and being able to detect the network of overlapping utility functions still seems like it would do a lot for us.
(Obligatory note: if you’re not already familiar with how weird they are, utility functions are not even slightly reward functions. Utility functions do not include any discounting whatsoever, and they allow arbitrary amounts of finesse in the choice between options, both of which are impractical when ranking reward under finite human capability. Humans do not encode them directly anywhere, and neither do any current AIs, but for theoretical reasons I still suspect we’d like to encode them formally if we could.)
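(A toy formalization of that contrast, using my own notation rather than anything from the thread: a reward-trained agent typically ranks behavior by a discounted return, whereas a utility function is an undiscounted ranking over entire outcomes or world-histories:

$$\text{discounted return: } \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t), \quad \gamma \in (0,1); \qquad \text{utility: } h \succeq h' \iff U(h) \ge U(h'),$$

where $U$ assigns a single undiscounted value to a whole history $h$.)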
My core worries are about whether your system can be trusted to reliably simulate the actual beings at an acceptable level of relative detail as intelligence increases, and to retain appropriate levels of uncertainty without becoming hyperdesperate; beyond that, we would still need to figure out how to make it naturally generate defensible objectives. But it seems to me to be an interesting and promising overall approach.
Have you looked at https://www.lesswrong.com/posts/JqWQxTyWxig8Ltd2p/relative-abstracted-agency or the papers mentioned there? They seem relevant to your interests. Finite factored sets might also be of interest if you’re at all curious to try to design for things that scale to arbitrary levels of superintelligence.
Thank you for the feedback; I have revised the post introduction in accordance with your commentary on utility functions. I challenge the assumption that a system being unable to reliably simulate an agent to human specifications is worrying, and I would like to make clear that the agenda I am pushing is not:
Capabilities and understanding through simulation scale proportionately
More capable systems can simulate, and therefore comprehend, the goals of other systems to a greater extent
By dint of some unknown means we align AGI to this deep understanding of our goals
I agree that, in the context of a plan like this, a failure to establish robust abstractions of human values could be catastrophic, but when applied to a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:
Estimate the LLM’s abstraction of, say, “what do humans want me to refuse to reply to”
Compare this to some desired abstraction
Apply some technique like RLHF accordingly
Of course, an actual implementation probably wouldn’t look like that (“what do humans want me to refuse to reply to” isn’t necessarily one unified concept that can be easily abstracted and interpreted), but it is a high-level overview of why pursuing questions like “do some specifications abstract well?” could still be useful even if some of them do not. A toy sketch of this loop follows below.
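A minimal sketch of that loop, assuming (purely hypothetically) that the model’s abstraction and the desired abstraction can each be extracted as vectors; every name and number here is illustrative, not a real interpretability API:

```python
import numpy as np

# Hypothetical sketch: compare the model's internal abstraction of
# "what humans want me to refuse" against a desired abstraction,
# then decide whether a correction step (e.g. RLHF) is warranted.
# The vectors below are placeholders for whatever extraction method
# is actually used.

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity between two abstraction vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def needs_correction(model_abstraction: np.ndarray,
                     target_abstraction: np.ndarray,
                     threshold: float = 0.9) -> bool:
    """Flag the model for further training if its estimated abstraction
    of the refusal concept diverges too far from the desired one."""
    return cosine_similarity(model_abstraction, target_abstraction) < threshold

# Step 1: estimate the LLM's abstraction (stand-in vectors here).
model_vec = np.random.randn(768)   # placeholder for an extracted concept vector
target_vec = np.random.randn(768)  # placeholder for the desired abstraction

# Step 2: compare; Step 3: apply RLHF (or similar) only on mismatch.
if needs_correction(model_vec, target_vec):
    print("abstractions diverge; apply a correction technique such as RLHF")
```

The point is only the shape of the three steps (estimate, compare, correct), not the particular similarity measure or threshold.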
I hadn’t come across the relative abstracted agency post, but I find its insights incredibly useful, and over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, but the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless, I’m sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.