paulfchristiano comments on Some thoughts on risks from narrow, non-agentic AI

paulfchristiano 3 Feb 2021 20:01 UTC
LW: 2 AF: 2
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like really all you need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and that the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.
Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month] then it’s not actually very far to get to [1 month, 2 years]. It seems like we’ve already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.
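As a rough sanity check on the orders-of-magnitude arithmetic, here is a minimal sketch. The 30-day month and 365-day year are assumptions made for illustration, and the helper name `orders_of_magnitude` is hypothetical rather than anything from the original discussion.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60
ONE_SECOND = 1
ONE_MONTH = 30 * SECONDS_PER_DAY        # assumption: a 30-day month
TWO_YEARS = 2 * 365 * SECONDS_PER_DAY   # assumption: 365-day years


def orders_of_magnitude(short_s: float, long_s: float) -> float:
    """Base-10 log of the ratio between two time horizons, in seconds."""
    return math.log10(long_s / short_s)


# Horizon range already covered: 1 second up to 1 month.
covered = orders_of_magnitude(ONE_SECOND, ONE_MONTH)

# Additional range being asked for: 1 month up to 2 years.
remaining = orders_of_magnitude(ONE_MONTH, TWO_YEARS)

print(f"1 second -> 1 month: {covered:.1f} orders of magnitude")
print(f"1 month -> 2 years:  {remaining:.1f} orders of magnitude")
```

Under these assumptions the first gap comes out to roughly 6.4 orders of magnitude and the second to roughly 1.4, which matches the "6 orders, then 1 more" framing to within rounding.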
At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand “what we really want” is a completely different thing (that we basically can’t even define cleanly). So prima facie it feels to me like if models generalize “well” then we can get them to generalize from type 1 to type 2, whereas no such thing is true for “what we really care about.”