We do need to train them by trial and error, but it’s very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we’ll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don’t see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops.
I think this is the key point and it’s glossed over in my original post, so it seems worth digging in a bit more.
I think there are many plausible models that generalize successfully to longer horizons, e.g. from 100 days to 10,000 days:
1. Acquire money and other forms of flexible influence, and then tomorrow switch to using a 99-day (or 9999-day) horizon policy.
2. Have a short-term predictor, and apply it over more and more steps to predict longer horizons (if your predictor generalizes, then there are tons of approaches to acting that would generalize; see the first sketch after this list).
3. Deductively reason about what actions are good over 100 days (vs. 10,000 days), since deduction appears to generalize well from a big messy set of facts to new, very different facts.
4. If I’ve learned to abstract seconds into minutes, minutes into hours, hours into days, days into weeks, and then plan over weeks, it’s pretty plausible that the same procedure can abstract weeks into months and months into years. (It’s kind of like I’m now working on a log scale and asking the model to generalize from 1, 2, …, 10 to 11, 12, 13; see the second sketch below.)
5. Most possible ways of reasoning are hard to write down in a really simple list, but I expect that many hard-to-describe models also generalize. If some generalize and some do not, then training my model over longer and longer horizons (3 seconds, 30 seconds, 5 minutes, ...) will gradually knock out the non-generalizing modes of reasoning and leave me with the modes that do generalize to longer horizons (see the last sketch below).
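To make the second model concrete, here’s a minimal sketch in Python. Everything in it (`predict_next`, `policy`, `score`, the toy dynamics) is an illustrative stand-in I’m making up, not anything from the discussion above; the point is just that the planning loop is uniform in `horizon`, so it extends to 10,000 steps exactly when the one-step predictor keeps generalizing.

```python
# Toy stand-ins so the sketch runs end to end; in the real story these
# would be a learned one-step world model, a short-horizon policy, and
# a short-horizon value estimate.
def predict_next(state, action):
    return state + (1 if action == "invest" else 0)

def policy(state):
    return "invest"

def score(state):
    return state

def plan(state, actions, horizon):
    """Evaluate each action by rolling the one-step predictor forward
    `horizon` steps and scoring the end state. Nothing ties `horizon`
    to the horizons seen in training: the same code plans over 100
    steps or 10,000, as long as predict_next keeps generalizing."""
    def value(action):
        s = predict_next(state, action)
        for _ in range(horizon - 1):
            s = predict_next(s, policy(s))
        return score(s)
    return max(actions, key=value)

print(plan(0, ["invest", "spend"], horizon=100))     # plans 100 steps out
print(plan(0, ["invest", "spend"], horizon=10_000))  # same planner, 100x the horizon
```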
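Similarly, a toy version of the fourth model (the `decompose` rule and the unit list are invented for illustration): the recursion is uniform in the level of abstraction, so a procedure learned up through weeks applies unchanged to months and years.

```python
# Hypothetical sketch of abstracting up the time hierarchy.
UNITS = ["second", "minute", "hour", "day", "week", "month", "year"]

def decompose(goal, level):
    # Toy decomposition: one goal at level k becomes two subgoals at level k-1.
    return [f"{goal}/{UNITS[level - 1]}-{i}" for i in (1, 2)]

def plan(goal, level):
    """A plan at level k is a sequence of level k-1 plans."""
    if level == 0:
        return [goal]  # primitive (second-scale) action
    return [step for sub in decompose(goal, level) for step in plan(sub, level - 1)]

# Suppose training only ever required planning up through weeks (level 4)...
print(len(plan("goal", 4)))  # 16 primitive steps
# ...nothing in `plan` changes when we ask for years (level 6):
print(len(plan("goal", 6)))  # 64 primitive steps
```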
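And a cartoon of the last point, with each “mode” of reasoning reduced to a single made-up number: the longest horizon at which it still performs well. That’s a big simplification, but it shows the selection dynamic, with longer training horizons progressively filtering out exactly the modes that don’t generalize.

```python
import random

random.seed(0)
# 1000 candidate modes of reasoning; each number is the longest horizon
# (in seconds) at which that mode still works. Purely illustrative.
modes = [random.lognormvariate(5, 3) for _ in range(1000)]

def train_step(modes, horizon):
    """Keep only the modes that still perform at this horizon."""
    return [m for m in modes if m >= horizon]

horizon = 3.0  # 3 seconds, then 30 seconds, then ~5 minutes, ...
for _ in range(6):
    modes = train_step(modes, horizon)
    print(f"horizon {horizon:>8.0f}s: {len(modes)} modes survive")
    horizon *= 10

# Whatever survives is, by construction, the set of modes that kept
# working as the horizon grew: the ones most likely to keep generalizing
# past the longest horizon that ever appeared in training.
```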
This is roughly why I’m afraid that the models we train will ultimately be able to plan over longer horizons than those that appear in training.
But many of these would end up pursuing goals that are closely related to the goals they pursue over short horizons (and in particular the first four above all seem like they’d be undesirable if the model is generalizing from easily-measured goals, and would lead to the kinds of failures I describe in part I of WFLL).
I think one reason my posts about this are confusing is that I often insist that we not rely on generalization, because I don’t expect it to work reliably in the way we hope. But that’s a claim about what assumptions we should make when designing our algorithms. I still think the “generalizes in the natural way” model is important for getting a sense of what AI systems are going to do, even if there’s a good chance it’s not a good enough approximation to make those systems do exactly what we want. (And of course, if you are relying on generalization in this way, you have very little ability to avoid the out-with-a-bang failure mode, so I have further reasons to be unhappy about relying on generalization.)