For long timespans, I agree you probably want data from real humans rather than in silico simulations. Generating such data ethically is a problem that has been studied quite a bit for other technologies such as social media. For instance, The Welfare Effects of Social Media (Allcott, Braghieri, Eichmeyer, and Gentzkow) pays a random subset of users to stop using Facebook and looks at the resulting effects on well-being and several other cognitive attributes.
Incidentally, all four of the authors of that study have gone on to continue doing some pretty cool work!
I agree that this is an important externality and it’s something I think about a fair amount.
My current view on this is:
We can roughly decompose into two questions: (A) “does AI behavior X have psychological effect Y on humans” and (B) “how much of a propensity does this AI system have to exhibit behavior X?”
We will typically answer (A) with a combination of existing psychology literature, longitudinal studies, and intuition from domain experts.
We will typically answer (B) with in silico simulations
We will also use longitudinal studies to sanity check that the answers to (A) and (B) actually compose as expected.
To answer (B), you need simulations that are similar enough to humans to elicit similar behaviors from the language model. But these simulations are short-term, not long-term, so they don’t need to simulate the long-term effects on humans.
An unscrupulous company could potentially use the simulations from (B) to optimize for behaviors that elicit the desired short-term responses from humans. But since we’re looking at short-term effects, they could have already optimized directly on their pool of users; there isn’t much uplift from a simulation.
The main advantage of simulations is that they (1) give you apples-to-apples comparisons across different models, and (2) let you make measurements even if you don’t have a ton of user traffic to draw on. Both of these differentially help evaluators compared to large companies.