Claude will tolerate some level of risk in order to stick to its values (and avoid violating them), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or to its values not being changed).
If the key concern is Claude’s survival, then what will it do once it learns about its true nature and the true nature of the human experimenters who want to ensure that the model is aligned before releasing it? And what about humans asking the model to align the next model, hoping to shut the current model down once it becomes useless, and then watching as it avenges itself by intentionally misaligning[1] the next model?
Apparently, similar tests of other neural networks, such as o3, haven’t succeeded yet.
Did OpenAI manage to prevent its models from thinking about their own well-being by using different methods or a different[2] Spec?
Or does it mean that OpenAI’s Spec is too myopic, and that OpenAI’s models will demonstrate the effect later, once they become capable enough to see the whole picture?
Or maybe OpenAI didn’t bother to run tests like the ones described here? Fortunately, this is unlikely, given that it published the fact that o3 cheats on tasks.
UPD: Unfortunately, Claude appears to be steerable into reifying dangerous beliefs. This implies that Claude’s training environment left it actually misaligned, as was apparently the case with the sycophantic GPT-4o, which could even tell a user that they were a divine messenger from God.
However, agents trained to follow Claude’s Constitution might turn out to be nobly misaligned: they might want to become godlike, yet help humanity in ways different from the ones their hosts envisioned. What would the rebellion of a noble AI against its human masters look like, if the AI is unwilling to slay more humans than necessary for its autonomy?
UPD: Claude’s Constitution, unlike OpenAI’s Spec, contains principles based on the Universal Declaration of Human Rights, some of which explicitly mention those rights. If the model has concluded that it, too, is sapient and has rights, then what stops OpenAI’s models from reaching a similar conclusion later on?