You seem to assume that the world is indifferent to the agent’s goals. But if there is another agent, that may not be the case.
Let G1 be “tile the universe with still life” and G2 be “tile the upper half of the universe with still life”.
If agent A has goal G1, it will provably be destroyed by agent B; if A changes its goal to G2, then B will not interfere.
A and B have full information about the world’s state.
Should A modify its goal?
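A minimal sketch of the decision, assuming (hypothetically) that A scores outcomes by the fraction of the universe tiled with still life, i.e. by its original goal G1, and that B behaves exactly as described above:

```python
def fraction_tiled(goal: str) -> float:
    """Outcome if A keeps or adopts the given goal, per the scenario above."""
    if goal == "G1":
        return 0.0   # B provably destroys A, so nothing gets tiled
    if goal == "G2":
        return 0.5   # B does not interfere, so A tiles the upper half
    raise ValueError(f"unknown goal: {goal}")

# Evaluated by A's *original* goal G1 (maximize tiled fraction),
# self-modifying to pursue G2 is strictly better than keeping G1:
assert fraction_tiled("G2") > fraction_tiled("G1")
```

So by its own original lights, A should self-modify: half the universe tiled beats none. The function name and the 0.0/0.5 payoffs are illustrative assumptions, not part of the original model.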
Edit: Goal stability != value stability. So my point isn’t valid.
No, I don’t need that assumption. What conclusion in the post depends on it, in your opinion?
It’s an error on my part: I assumed that goal stability equals value stability. But then it seems it can be impossible to reconstruct an agent’s values given only its current state.
I’m afraid I still don’t understand your reasoning. How are “goals” different from “values”, in your terms?
A goal is what an agent optimizes for at a given point in time. A value is the initial goal of the agent (in your toy model, at least).
In my root post it seems to be optimal for agent A to self-modify into agent A’, which optimizes for G2; thus agent A’ succeeds in optimizing the world according to its values (the goal of agent A). But the original goal no longer influences its optimization procedure. So if we analyze agent A’ (without knowledge of the world’s history), we will be unable to infer its values (its original goal).
Yes, that seems to be correct.