Thank you. I see where you’re coming from, now, and I’ll think about it.
One thought is that, in addition to whatever point(s) of departure is/are selected for a piece of fiction or a video game, such works basically always have some degree of “plot logic” / “game logic”: common elements that are “unrealistic” (i.e. unlikely in a world that runs on physics) but convergently helpful for making an entertaining or aesthetically valuable story or game. I don’t know what “simulation logic” would be. We can’t look at existing simulations, unless we want to call fiction and games low-fidelity simulations.
I also never feel that great about generalizing from imaginary examples. We don’t actually have any ancestor simulations. There’s been speculation about them, but I don’t think it’s clear we (or any intelligent creatures) would actually do such a thing. We do have games, but overall I don’t think they have much resemblance to our reality, although of course they usually have some resemblance to certain parts of our reality, and “realism” is sometimes (but far from always!) considered desirable in games.
It does, of course, make sense to say that thought (or much of it; I’m not sure it’s 100%) is itself a highly abstracted simulation of some aspects of reality, and that relating to reality in a straightforward way is necessary for those simulations (usually we’d call them “models”) to be useful. So, if we assume that a rational utility maximizer has a functional purpose in making a simulation, then yes, verisimilitude is to be expected.
I am of course not saying that no jailbreaks existed, but as an example of the comically absurd restrictions on Fable, I was force-downgraded to Opus when I asked a question about placebos.
Actually, the classifier might have been kind of stupid in general? If it was based only on the user’s text (not Claude’s) and/or was a bag of words or barely smarter than that, then that is admittedly probably NOT the ideal way to attempt to prevent jailbreaks, even if one cares only about false negatives.
(There may have been defense in depth, of course. It is also possible the classifier actually was Sonnet or something, but told to act like a paranoid whack job about bio. We will presumably never know.)
I think we are probably not hearing the whole story and the other parts probably matter. Would “pure political attack on Anthropic, absolutely no plausible technical / security reason to do this” really not lead to any resignations or anything else observable? Though maybe it’s too soon for that.
It would have been much less weird (IMO) if the US government had blocked Fable from being released to the public at all than what actually happened. I am disappointed that Anthropic has not (AFAICS) made good on their promise to share more details within 24h. Yes, it is the weekend, but it seems strange and unwise to me for them to have promised that particular timeline in the first place.