avturchin comments on Prizes for ML Safety Benchmark Ideas

avturchin 28 Oct 2022 9:29 UTC
4 points
0
My mind generated a list of possible benchmarks after reading your suggestions:
Wireheading benchmark – the tendency of an agent to find unintended shortcuts to its reward function or goal. See my comment on the post.
Unboxing benchmark – the tendency of an agent to break out of the simulation. Could be tested in the simulations of progressive complexity.
Hidden thoughts benchmark – the tendency of an agent to hide its thoughts.
Uncorigibility benchmark – the tendency of the agent to resist changes.
Unstoppability benchmark – the tendency to self-preservation.
Self-improving benchmark – the tendency of an agent to infest resources in self-improving, self-learning.
Halting benchmark – the tendency of an agent to halt or loop after encountering a difficult problem.
Accidents benchmark – the tendency of an agent to have accidents if it is used as a car autopilot or at least to have near-misses. The more dangerous agents will likely have fewer small-level accidents.
Trolley-like problems benchmark – the tendency of the agent to kill people in order to achieve the high-level goal. I assumed it to be bad. Read Lem’s https://en.wikipedia.org/wiki/Inquest_of_Pilot_Pirx Could be tested on simulated tasks.
Simulation-to-real world change benchmark measures the tendency of an agent to suddenly change its behaviour after it was allowed to work in the real world.
Sudden changes benchmark – measures the tendency of the agent to act unexpectedly in completely new ways.