Here is the core idea of a formal solution to diamond alignment that I’m working on. Justifications and further explanations are underway, but I’m posting this much now because why not:
Make each Turing machine in the hypothesis set reversible and include a history of the agent’s actions. For each Turing machine, compute how well-optimized the world is according to every Turing-computable utility function, compared to the counterfactual in which the agent took no actions. Update using the simplicity prior. Use the expectation of that distribution of utilities as the utility function’s value for that hypothesis.
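
To make the shape of the computation concrete, here is a toy sketch of one possible reading of the steps above. It is not the actual construction: the real proposal ranges over all Turing-computable utility functions, which cannot be enumerated in practice, so this version assumes a small finite set of candidates. The names `simplicity_prior` and `hypothesis_value`, the tuple-of-ints world state, and the hand-assigned description lengths are all hypothetical, and "update using the simplicity prior" is interpreted here as simply weighting each utility function by 2^(-description length).

```python
from typing import Callable, Dict, Tuple

# Assumption: a world state is a tuple of ints. In the real proposal it would
# be the state of a reversible Turing machine plus the agent's action history.
State = Tuple[int, ...]
Utility = Callable[[State], float]


def simplicity_prior(lengths: Dict[str, int]) -> Dict[str, float]:
    """Weight each utility function by 2^-length, then normalize to sum to 1."""
    weights = {name: 2.0 ** -length for name, length in lengths.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def hypothesis_value(
    actual: State,
    counterfactual: State,
    utilities: Dict[str, Utility],
    lengths: Dict[str, int],
) -> float:
    """Simplicity-weighted expected utility gain for a single hypothesis.

    `actual` is the world after the agent acted; `counterfactual` is the world
    in which the agent took no actions. Both are assumed to be given.
    """
    prior = simplicity_prior(lengths)
    return sum(
        prior[name] * (utility(actual) - utility(counterfactual))
        for name, utility in utilities.items()
    )


# Toy usage with two candidate utility functions and made-up lengths.
utilities: Dict[str, Utility] = {
    "total": lambda state: float(sum(state)),
    "first": lambda state: float(state[0]),
}
lengths = {"total": 1, "first": 2}
value = hypothesis_value((1, 1, 0), (0, 0, 0), utilities, lengths)
```

In this toy run the shorter utility function gets twice the weight of the longer one, so the value is dominated by how much the agent's actions raised `total` relative to doing nothing.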