I definitely agree with you that it’s insufficient to stamp out thoughts about actively harming humans. We also need the AI to positively value human life, safety, and freedom. But your “non-general patch detector” argument seems weak to me. We can provide lots of different examples of cases where the AI ought to be thinking about human welfare, do adversarial training on it, etc., and it seems plausible to me that eventually it would just generalize to caring about humans overall, in any situation. I don’t see why this is an especially hard generalization problem.
See List of Lethalities, numbers 21 and 22 (also the rest of section B.2, but especially those two). Unlike Eliezer, I do think there’s a nontrivial chance that your proposal here would work (it’s basically invoking Alignment by Default), but I think it’s a pretty small chance (like, ~10%), and Eliezer’s proposed failure modes are probably basically what actually happens at a high level.
I think Lethality 21 is largely true, and it’s a big reason why I’m concerned about the alignment problem in general. I’m not invoking Alignment by Default here, because I think we do need to push hard on the actual cognitive processes happening inside the agent, not just on its actions/outputs as in prosaic ML. Externalized reasoning gives you a pretty good way of doing that.
I do think Lethality 22 is probably just false. Human values latch on to natural abstractions (!) and once you have the right ontology I don’t think they’re actually that complex. Language models are probably the most powerful tool we have for learning the human prior / ontology.