Yeah, I would be impressed if a human showed me they have a good calibration chart.
(though part of it is that humans usually put few questions in their calibration charts. It would be nice to look at people’s performance across a range of calibration-improvement exercises)
I don’t think anyone is brute-forcing calibration with fake predictions; it would be easy to see if the predictions are public. But if a metric is trivially gameable, surely that makes it sus and less impressive, even if someone is not gaming it trivially, or not gaming it at all.
I’m not claiming that any particular entity is unimpressive, just that we shouldn’t be impressed by calibration itself (humans get a pass; it takes so much effort for us to do anything).
There is probably some bravery-debate aspect here: if you look at my linked tweets, it’s like in my world people are just going around saying that good calibration implies good predictions, which is false.
(edit 1: for human calibration exercises, note that with a stream of questions where p% resolve true, it’s perfectly calibrated to just predict p% on every question. Humans who do calibration exercises have goals other than calibration. Maybe I should pivot to activism in favor of prediction scores)
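To make that concrete, here is a minimal sketch (my own toy simulation with made-up numbers, not from any real platform): a forecaster who just predicts the overall base rate on every question comes out essentially perfectly calibrated, but a proper scoring rule like the Brier score immediately separates them from a forecaster who actually knows each question’s probability.

```python
import random

random.seed(0)

# Toy setup (made-up numbers): 10,000 questions whose true probabilities vary.
true_probs = [random.choice([0.1, 0.5, 0.9]) for _ in range(10_000)]
outcomes = [1 if random.random() < p else 0 for p in true_probs]
base_rate = sum(outcomes) / len(outcomes)

# Forecaster A always predicts the overall base rate (the "constant p%" strategy).
# Forecaster B predicts each question's true probability.
preds_a = [base_rate] * len(outcomes)
preds_b = true_probs

def brier(preds, outs):
    """Mean squared error between forecasts and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(preds, outs)) / len(outs)

def calibration_table(preds, outs):
    """For each forecast bucket, compare the average forecast to the observed frequency."""
    buckets = {}
    for p, o in zip(preds, outs):
        buckets.setdefault(round(p, 1), []).append((p, o))
    return {k: (round(sum(p for p, _ in v) / len(v), 2),
                round(sum(o for _, o in v) / len(v), 2))
            for k, v in sorted(buckets.items())}

print("Brier, constant base rate:", round(brier(preds_a, outcomes), 3))  # ~0.25
print("Brier, informed forecasts:", round(brier(preds_b, outcomes), 3))  # ~0.14
print("Calibration, constant:", calibration_table(preds_a, outcomes))
print("Calibration, informed:", calibration_table(preds_b, outcomes))
# Both tables show forecast ≈ observed frequency in every bucket they occupy,
# i.e. both forecasters are (close to) perfectly calibrated, but only one is informative.
```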
But if a metric is trivially gameable, surely that makes it sus and less impressive, even if someone is not gaming it trivially, or not gaming it at all.
Why would you think that? Surely the reason that a metric being gameable matters is if… someone is or might be gaming it?
Plenty of metrics are gameable in theory, but are still important and valid given that you usually can tell whether they are being gamed. Apply this to any of the countless measurements you take for granted.
Someone comes to you and says ‘by dint of diet, hard work (and a bit of semaglutide), my bathroom scale says I’ve lost 50 pounds over the past year’. Do you say ‘do you realize how trivially gameable that metric is? how utterly sus and unimpressive? You could have just been holding something the first time, or taken a foot off the scale the second time. Nothing would be easier than to fake this. Does this bathroom scale even exist in the first place?’ Or, ‘my thermometer says I’m running a fever of 105F, I am dying, take me to the hospital right now’ - ‘you gullible fool, do you have any idea how easy that is to manipulate by dunking it in a mug of tea or something? sus. Get me some real evidence before I waste all that time driving you to the ER.’
Hmm, yeah, gameability might not be as interesting a property of metrics as I’ve suggested.
(though I still feel there is something in there. Fixing your calibration chart after the fact by predicting one-sided coins or dice is maybe a lot like taking a foot off the bathroom scale. But, for example, predicting every event at a constant p%: is that even cheating in the calibration game? Though neither of these directly applies to the case of prediction market platforms)