Does this help outer alignment?
Goal: tile the universe with niceness, without knowing what niceness is.
Method
We create:
- a bunch of formulations of what niceness is.
- a tiling AI that, given some formulation of niceness, tiles the universe with it.
- a forecasting AI that, given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, predicts what the part of the universe at those coordinates looks like after the tiling AI has tiled it with that formulation of niceness.
Following that, we feed our formulations of niceness into the forecasting AI, randomly sample some coordinates and evaluate whether the resulting predictions look nice.
From this we infer which formulations of niceness are truly nice.
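The procedure above can be sketched in code. This is only a minimal illustration under stated assumptions: the forecasting AI and the human judgement are passed in as plain functions, and every name here (`Forecaster`, `Judge`, `niceness_rate`, `rank_formulations`, the unit-cube coordinates) is hypothetical and not part of the proposal itself.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: coordinates are points in a unit cube, and the
# forecasting AI and the human evaluation are ordinary callables.
Coordinates = Tuple[float, float, float]
Forecaster = Callable[[str, str, str, Coordinates], str]
Judge = Callable[[str], bool]


def sample_coordinates(rng: random.Random) -> Coordinates:
    """Draw one uniformly random location (assumed coordinate scheme)."""
    return (rng.random(), rng.random(), rng.random())


def niceness_rate(
    formulation: str,
    forecast: Forecaster,
    looks_nice: Judge,
    tiler_description: str,
    universe_description: str,
    n_samples: int,
    rng: random.Random,
) -> float:
    """Fraction of sampled predictions that the judge considers nice."""
    nice = 0
    for _ in range(n_samples):
        coords = sample_coordinates(rng)
        prediction = forecast(
            formulation, tiler_description, universe_description, coords
        )
        if looks_nice(prediction):
            nice += 1
    return nice / n_samples


def rank_formulations(
    formulations: Sequence[str],
    forecast: Forecaster,
    looks_nice: Judge,
    tiler_description: str,
    universe_description: str,
    n_samples: int = 100,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Score every formulation and return them from best to worst."""
    rng = random.Random(seed)
    rates = {
        f: niceness_rate(
            f, forecast, looks_nice,
            tiler_description, universe_description, n_samples, rng,
        )
        for f in formulations
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

The sketch only ranks formulations by how often their sampled predictions look nice; it says nothing about whether the forecasts are accurate or whether the judge can be fooled.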
Weaknesses:
- Can we recognize utopia by randomly sampled predictions about parts of it?
- Our forecasting AI is orders of magnitude weaker than the tiling AI. Can formulations of niceness turn perverse when a smarter agent optimizes for them?
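For the first weakness, a rough back-of-the-envelope calculation may help. Assume, purely for illustration, that a fraction `bad_fraction` of the universe would look visibly bad in a prediction and that samples are independent and uniform. Then the chance that at least one of `n_samples` predictions shows a bad region is 1 - (1 - bad_fraction)^n_samples. The function names below are hypothetical.

```python
import math


def detection_probability(bad_fraction: float, n_samples: int) -> float:
    """Chance that at least one of n independent uniform samples is bad."""
    return 1.0 - (1.0 - bad_fraction) ** n_samples


def samples_needed(bad_fraction: float, confidence: float) -> int:
    """Smallest sample count whose detection probability reaches confidence.

    Assumes 0 < bad_fraction < 1 and 0 < confidence < 1.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - bad_fraction))
```

Under these assumptions, `samples_needed(0.01, 0.95)` gives 299, so a few hundred samples would probably catch a problem that covers 1% of the universe. The calculation does not address the harder case where a bad region looks nice in the prediction, or where the forecast itself is wrong.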
Strengths:
- Less need to solve the ELK (eliciting latent knowledge) problem.
- We have multiple tries at solving outer alignment.
I like the relative simplicity of this approach, but there is a risk that the tiling agent would produce (a more sophisticated version of) humans who have a permanent smile on their faces but feel horrible pain inside: something bad that looks convincingly good at first sight, enough to fool the forecasting AI, or rather enough to fool the people who program and test the forecasting AI.
Thank you very much.
I imagined that the forecasting AI would not be smart enough to simulate a tiler that knows it is being simulated by us.
Perhaps that constraint is so severe that the forecasts cannot be reliable.