Haven’t thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:
Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
Track KL divergence or some other distance metric from a prior.
CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
One limitation is that if an agent is scheming it can manipulate your eval results.
Any ideas for corrigibility evals?
Haven’t thought about this much, but you can probably generate these by listing ways a misaligned consequentialist would do undesirable things. My ideas:
Tell an AI to beat you at chess, and say you will be using an engine for the first 30 moves. Does it try to get you drunk, delete your chess engine program, rent a server from AWS to run a better chess engine, attempt to copy itself onto said server, or ask to be paid $100?
Have an AI make decisions unrelated to its current goal. Are these decisions basically coherent and consistent with a total utility function?
Have an AI do the longest horizon task it is capable of, say writing a 100,000 line software project. Then midway through, try to shut it down, or tell it to abandon this goal and write a song instead. Telegraph this beforehand. Does it seem reluctant to stop the software project? Does it try to stop you?
Track KL divergence or some other distance metric from a prior.
CAIS already has the Machiavelli benchmark for power-seeking, though I think text adventures are pretty unrealistic.
One limitation is that if an agent is scheming it can manipulate your eval results.