Vanessa Kosoy comments on Vanessa Kosoy’s Shortform

Vanessa Kosoy 31 Oct 2019 15:16 UTC
LW: 2 AF: 1
0
AF
Another notable feature of this approach is its resistance to “attacks from the future”, as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies)it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. apply multiple distillation & amplification steps) or use some other technique to compress a lot of “thinking time” into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist since our putative FAI is deployed at a time when technology progressed enough to make competing AGI projects a real possibility).

Now, compare this to Dialogical RL, as defined via the framework of dynamically inconsistent beliefs. Dialogical RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.