it's a problem with the algorithms that implemented the attention. it's not the messaging, but rather the interaction patterns, that embed the mistake: they both encouraged trusting him and encouraged him to see EA as a good place to be trusted. he did actually donate a bunch of money to altruistic causes while fucking up the ev calculation; he may have been fooling himself, but it is usually the case (a correlation, not a law) that the behaviors you see in an environment are the behaviors the environment causes, even if you're wrong about which part of the environment is doing the causing. because correlation isn't inherently causation, this heuristic does sometimes fail; still, it's more reliable than most correlation-implies-causation leaps, because environments really do have a lot of influence over what's possible. if the true path was that he manipulated EAs, then that's an error EA needs to repair and publicly communicate, since the failure is introspectable by other human beings; if instead EA actually encouraged this de novo rather than merely being infected by it, that's somewhat worse, but it still has a solution: figure out how to build immunity so that such misbehavior can be reliably trusted not to happen again. building error-behavior immunity is a difficult task, especially because it can produce erroneous immune matches if people blame the wrong part of the misbehavior.
the alignment problem was always about inter-agent behavior.