Data points from papers can either contribute directly to predictions (e.g. we measured it and gains from colocation drop off at 30m), or to forming a model that makes predictions (e.g. the diagram). Credence levels feel fine for the first kind, but feel like a category error for model-born predictions. It’s not quite true that the model succeeds or fails as a unit, because some models are useful in some arenas and not in others, but the thing to evaluate is definitely the model, not the individual predictions.
I can see talking about what data would make me change my model and how that would change predictions, which may be isomorphic to what you’re suggesting.
The UI would also be a pain.