Thanks for the thoughts. I guess I need to do a lot more looking into CIRL before I come back to this. I do still wonder (although at an unformalized level) whether an agent could learn a lot about moral evidence from the constraint that its own actions can't change the evidence it expects to see. For example, if it realizes that a certain action (like subtle coercion) would reliably produce something it would otherwise have treated as legitimate evidence, then that observation must not actually count as evidence at all. That constraint seems to pack a decent minority of our requirements for value learning into a relatively simple statement. There may be other ways to encode such a constraint, though, besides having the agent be uncertain about the function that determines which observations provide which evidence.
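To gesture at what I mean a bit more concretely, here's a toy Python sketch of one way the constraint could be operationalized, via conservation of expected evidence. Everything in it (the two hypotheses, the "ask"/"coerce" actions, all the probabilities) is made up for illustration; it's not a worked-out formalism. The idea: the agent has a naive evidence model P(obs | hypothesis) for what a human's statement says about their values, and separately predicts what it will actually observe given its own action, P(obs | hypothesis, action). If choosing some action shifts the expected naive posterior away from the prior, the naive evidence model can't be right in that situation, so the observation shouldn't count as evidence.

```python
# Toy sketch: conservation of expected evidence as a test for whether an
# observation channel still counts as legitimate moral evidence, given the
# agent's own action. All names and numbers are hypothetical.

PRIOR = {"H1": 0.5, "H2": 0.5}  # prior over two toy value hypotheses

# Naive evidence model: P(obs | hypothesis), ignoring the agent's action.
NAIVE_LIKELIHOOD = {
    ("endorse", "H1"): 0.9, ("object", "H1"): 0.1,
    ("endorse", "H2"): 0.2, ("object", "H2"): 0.8,
}

# What the agent actually predicts it will observe, given its own action.
# Under "coerce", the human endorses almost regardless of their values.
TRUE_LIKELIHOOD = {
    ("endorse", "H1", "ask"): 0.9,     ("object", "H1", "ask"): 0.1,
    ("endorse", "H2", "ask"): 0.2,     ("object", "H2", "ask"): 0.8,
    ("endorse", "H1", "coerce"): 0.95, ("object", "H1", "coerce"): 0.05,
    ("endorse", "H2", "coerce"): 0.90, ("object", "H2", "coerce"): 0.10,
}

OBSERVATIONS = ["endorse", "object"]

def naive_posterior(obs):
    """Bayes update of the prior under the naive evidence model."""
    unnorm = {h: PRIOR[h] * NAIVE_LIKELIHOOD[(obs, h)] for h in PRIOR}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def expected_naive_posterior(action):
    """Expected naive posterior, with observations drawn from the agent's
    own action-conditional predictions."""
    exp = {h: 0.0 for h in PRIOR}
    for obs in OBSERVATIONS:
        p_obs = sum(PRIOR[h] * TRUE_LIKELIHOOD[(obs, h, action)] for h in PRIOR)
        post = naive_posterior(obs)
        for h in PRIOR:
            exp[h] += p_obs * post[h]
    return exp

for action in ["ask", "coerce"]:
    exp = expected_naive_posterior(action)
    # Conservation of expected evidence: E[posterior] should equal the prior.
    # If the agent's action shifts it, the naive model is wrong here, so the
    # resulting observation must not count as evidence after that action.
    violated = any(abs(exp[h] - PRIOR[h]) > 1e-9 for h in PRIOR)
    status = "NOT legitimate evidence" if violated else "conservation holds"
    print(f"{action}: E[posterior] = {exp} -> {status}")
```

Under "ask" the check passes (the expected posterior equals the prior), while under "coerce" it fails, so the coerced endorsement gets flagged as non-evidence. That's roughly the shape of the constraint I have in mind, though I don't know whether it survives a real formalization.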