See List of Lethalities, numbers 21 and 22 (also the rest of section B.2, but especially those two). Unlike Eliezer, I do think there’s a nontrivial chance that your proposal here would work (it’s basically invoking Alignment by Default), but I think that chance is pretty small (like, ~10%), and Eliezer’s proposed failure modes are probably what actually happens at a high level.
I think Lethality 21 is largely true, and it’s a big reason why I’m concerned about the alignment problem in general. I’m not invoking Alignment by Default here because I think we do need to push hard on the actual cognitive processes happening in the agent, not just its actions/output like in prosaic ML. Externalized reasoning gives you a pretty good way of doing that.
I do think Lethality 22 is probably just false. Human values latch on to natural abstractions (!) and once you have the right ontology I don’t think they’re actually that complex. Language models are probably the most powerful tool we have for learning the human prior / ontology.