Thank you for the feedback; I have revised the post's introduction in accordance with your commentary on utility functions. I challenge the assumption that it is worrying if a system cannot reliably simulate an agent with human specifications, and I would like to make clear that the agenda I am pushing is not:
Capabilities and understanding-through-simulation scale proportionately
More capable systems can simulate, and therefore comprehend, the goals of other systems to a greater extent
By dint of some unknown means we align AGI to this deep understanding of our goals
I agree that, in the context of a plan like this, a failure to establish robust abstractions of human values could be catastrophic. But in a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:
Estimate the LLM's abstraction of, say, “what do humans want me to refuse to reply to”
Compare this to some desired abstraction
Apply some technique like RLHF accordingly
Of course, an actual implementation probably wouldn’t look like that (“what do humans want me to refuse to reply to” isn’t necessarily one unified concept that can be easily abstracted and interpreted), but it is a high-level overview of why pursuing questions like “do some specifications abstract well?” could still be useful even if the answer turns out to be no. A rough sketch of what this loop might look like follows below.
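As a very loose illustration (not anything claimed in the post itself): one way to operationalize the first two steps above is to estimate the model's abstraction of a concept as a direction in activation space via a difference-of-means probe, then measure how far that direction sits from a desired one. Everything here is an assumption for the sake of the sketch: the toy arrays stand in for real hidden states, the desired direction stands in for a human-derived target, and the divergence threshold is arbitrary.

```python
import numpy as np

# Hypothetical sketch of the estimate/compare steps. A difference-of-means
# probe stands in for "estimate the LLM's abstraction"; real work would pull
# hidden states from an actual model rather than the toy arrays used here.

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Estimate a concept as the unit-normalized difference of mean activations."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def abstraction_divergence(model_dir: np.ndarray, desired_dir: np.ndarray) -> float:
    """Cosine distance between the model's abstraction and the desired one
    (both arguments assumed unit-normalized)."""
    return 1.0 - float(model_dir @ desired_dir)

# Toy stand-ins for hidden states on prompts the model should refuse vs. answer.
rng = np.random.default_rng(0)
refuse_acts = rng.normal(0.5, 1.0, size=(200, 64))
answer_acts = rng.normal(-0.5, 1.0, size=(200, 64))

model_dir = concept_direction(refuse_acts, answer_acts)
desired_dir = np.ones(64) / np.sqrt(64)  # placeholder for a human-derived target

divergence = abstraction_divergence(model_dir, desired_dir)
if divergence > 0.1:  # threshold chosen arbitrarily for illustration
    print(f"divergence {divergence:.3f}: flag for a corrective step (e.g. RLHF)")
```

The point is only the shape of the loop: estimate, compare, correct. The probe itself could be swapped for any interpretability method that yields a comparable representation of the concept.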
I hadn’t come across the Relative Abstracted Agency post, but I find its insights incredibly useful, and over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, though the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless, I’m sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.