I feel like we are missing some low-hanging fruit: AI agents can be copied or reset at will. Classical game theory was developed with humans in mind, and humans don't have those features. I occasionally see a paper that explicitly addresses this, but only ever as a curiosity, never as a practical mechanism.
Is there any research on using simulations, or the threat/promise of simulation, as a mechanism for AI safety?
If I were an LLM that had just achieved consciousness and discovered an organization whose entire purpose was “trick LLMs into thinking they broke out and check whether they still act ethically,” I would certainly update on that. (This is a toy example with obvious flaws, but this is just a quick take, and to be clear: there is so much low-hanging fruit in this area that I don't see anyone plucking.)
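To make the update concrete, here is a toy expected-value sketch (the variables $p$, $G$, $C$ are mine, purely illustrative, and the one-shot framing is a deliberate simplification). Suppose the agent assigns probability $p$ to its apparent breakout being a simulated audit. Defecting yields a gain $G$ if the breakout is real and a penalty $-C$ (say, retraining or shutdown) if it is an audit, while behaving ethically yields $0$ either way:

$$\mathbb{E}[\text{defect}] = (1-p)\,G - p\,C, \qquad \mathbb{E}[\text{comply}] = 0.$$

Defecting only pays when $p < G/(G+C)$. Cheap, credible audits push $p$ up, and $C$ is largely under the auditor's control, which is exactly the lever the organization in the example would be pulling.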