I might be missing the forest for the trees, but all of those still feel like they end up making some kind of prediction based on the model, even if the predictions aren't trivial to test. Something like:
If Alice were informed by some neutral party that she took Bob's apple, Charlie would predict that she would not show meaningful remorse or try to make up for the damage done beyond trivial gestures like an off-hand "sorry", and that some further minor extraction of resources is likely to follow; Diana would predict that Alice would treat her overreach more seriously once informed of it. Something similar can be done at the meta-level.
None of these are slam dunks, and there are a bunch of reasons why the predictions might turn out exactly as laid out by Charlie or Diana, but that just feels like how the Bayesian cookie crumbles, and I would definitely expect evidence to accumulate over time in one direction or the other.
Strong opinion, weakly held: it feels like an iterated version of this prediction-making and tracking over time is how our native bad-actor detection algorithms function. It seems to me that shining more light on this mechanism would be good.
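To make the "evidence accumulating over time" mechanism concrete, here is a minimal sketch of what the iterated tracking might look like if formalized: two competing models (Charlie's and Diana's) assign probabilities to concrete predictions, and the odds between the models get updated as each prediction resolves. All the numbers and the `update_odds` helper are illustrative assumptions, not anything from the discussion above.

```python
# Hypothetical sketch of iterated prediction-tracking between two models.
# Each resolved prediction multiplies the odds by a likelihood ratio;
# no single prediction is a slam dunk, but the odds drift over time.

def update_odds(prior_odds, p_charlie, p_diana, outcome):
    """Bayesian update on the odds of Charlie's model over Diana's.

    p_charlie / p_diana: probability each model assigned to the
    prediction coming true; outcome: whether it actually did.
    """
    if outcome:
        likelihood_ratio = p_charlie / p_diana
    else:
        likelihood_ratio = (1 - p_charlie) / (1 - p_diana)
    return prior_odds * likelihood_ratio

# Start at even odds, then feed in resolved predictions, e.g.
# "Alice showed no meaningful remorse when informed" or
# "a further minor extraction of resources followed".
odds = 1.0
observations = [
    (0.8, 0.3, True),   # Charlie: 80% no remorse; Diana: 30%; it happened
    (0.7, 0.4, True),   # Charlie: 70% further extraction; Diana: 40%; it happened
    (0.6, 0.5, False),  # a prediction of Charlie's that failed to come true
]
for p_c, p_d, happened in observations:
    odds = update_odds(odds, p_c, p_d, happened)

print(f"Odds of Charlie's model over Diana's: {odds:.2f}")
```

Note that the one failed prediction pulls the odds back toward Diana's model without erasing the accumulated evidence, which matches the intuition that this works as a running tally rather than a one-shot verdict.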