Great summary! If possible, could you add the context window of each model, i.e. how many tokens you can use with each?
Also, I default to 4.1 for most non-complex coding tasks because it’s fast; o3 is too slow, and 4o is too sycophantic (even with a good system prompt, 4o is so annoying sometimes).
You said you “ran Pangram on the previous 4 encyclicals. The first 20 paragraphs on all of them register as 100% human, all with high confidence”, but wouldn’t one of the most obvious features to add to a commercial AI-generated text detector be searching the web for matching text published before 2022? (I don’t know whether they do this.) Otherwise I imagine you would get absurd false positives on old texts, or on well-known texts in general, which I don’t think users would like. I would be surprised if they weren’t doing that. I was asking GPT5.5 about how Pangram possibly works:
Also, although there may be evidence of Claude assistance in the final wording/style, “Claude, author of the Humanitas” is a much stronger claim than the evidence supports, unless you have a way to distinguish normal human writing polished by AI from a request like “I’m the bishop, the pope asked for my help writing an article saying X and Y, please create a few paragraphs for me”. Maybe that’s exactly your point, but it’s not completely clear what you mean when you say that “Claude wrote” something. Do you mean “human drafters wrote the substantive content, argument, and theological framing, then used Claude or another LLM for polishing, grammar, phrasing, etc., which produced em-dashes, smoother triads, ‘genuinely’, and other things that would be flagged as AI-generated”, or do you mean something else? I finished reading your post and was still confused about what you think actually happened and how exactly people used AI (assuming they used it).