Well, the model needs to figure out if the test would be cost prohibitive to make make when the test was made.
But the model doesn’t know if its being tested in 2026. It could the year 2032, with programming resources advanced substantially, and the model was trained on data truncated to 2026 (or a faked 2026). There are many reasons to do this, e.g. to validate the model’s predictions about the future to what actually happened in its future.
Agree with the general point but:
Well, no, evaluating the perplexity of an offline transcript doesn’t look the same as taking real actions if the offline transcript is fake and the model can tell it is fake.