“We don’t want it to be the case that models can be convinced to blackmail people just by putting them in a situation that the predictor thinks is fictional!”
This is interesting! I guess that, in some sense, means that you see certain ways in which even a future Claude N+1 won’t be a truly general intelligence?
What is the actual difference between a “fictional” and “non-fictional” scenario here? I’m not convinced that it’s a failure of general intelligence to not agree with us on this. (It’s certainly a failure of alignment.)
If we live in a simulation, should our reality be treated as “fictional” or “non-fictional”?