1a3orn comments on The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline.

1a3orn 8 Mar 2026 15:10 UTC
2 points
−3
Among the other reasons to have some skepticism about this story:

A guy at Pangram Labs fed the paper into an AI detector: most parts come back as written by a human, but the “rogue AI” story comes back as written by an AI.

https://x.com/max_spero_/status/2030170051932827962
- lilkim2025 8 Mar 2026 23:35 UTC
  4 points
  −8
  Parent
  AI detectors are infamously unreliable, to the point where many people regard the entire field as dubious. Text is a domain with sparse features, and LLMs are trained to emulate some subset of human writing styles. It’s not like image generation, where there are some visual features that are dead giveaways. Moreover, text snippets written by humans and text snippets written by LLMs are not separable by any classifier, no matter how you build it, because some text has been written by both at different points.
  In general, the field of detecting AI writing boils down to picking common N-grams in AI outputs (“as an AI language model”), downweighting N-grams that are rare in AI outputs (expletives, grammatical errors, informal tone), and selling the resulting heuristic to educational administrators who don’t understand technology and thus don’t understand why it’s impossible to do what they’re claiming to do reliably^[1].
  As a brief demonstration, here is Pangram’s output on the first paragraph of the post, which I just wrote by hand^[2], featuring a few minor tweaks also made by me.
  1. ^
    Looking at Pangram’s website, you can see this firsthand—anyone experienced with heavily-monetized, less-than-upstanding websites that rely on SEO and an uninformed userbase can recognize a ton of dark patterns in the signup process alone.
  2. ^
    (You’ll have to take that on the honor system, I suppose)
  - RobertM 12 Mar 2026 5:05 UTC
    6 points
    5
    Parent
    This take is a little behind the times. Pangram is quite reliable and has fewer false positives than basically every other player in the field. The fact that you can trigger a false positive on a short human-written string which starts with “As an AI language model” is not very much evidence of its unreliability. (That phrase surely must be extremely strong evidence of LLM-authorship.)
    Separately, I had the experience of reading that section of the paper, immediately noticing that it seemed LLM-written, and only afterwards finding out that someone else had run it through Pangram.
    (Note: we use Pangram on LessWrong for automated LLM content detection. This is part of why I’m confident that it’s reliable. Of course, one might object that I don’t have access to the relevant source of truth for determining whether there have been any false positives, and that’s true, but at some point one needs to stop pretending that LLM-written text that hasn’t been deliberately optimized to avoid detection is hard to detect if you have a little practice.)
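For readers who want the N-gram heuristic from lilkim2025's comment made concrete, here is a toy sketch. It illustrates that commenter's characterization only; nothing in the thread says Pangram or any real detector works this way. The function names (`ngrams`, `train`, `score`), the add-one smoothing, and the whitespace tokenization are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Lowercase, split on whitespace, and return the list of word n-grams."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(ai_texts, human_texts, n=2):
    """Return a log-odds weight per n-gram: positive means 'more common in AI text'.

    Uses add-one smoothing so n-grams seen in only one corpus get a finite weight.
    """
    ai = Counter(g for t in ai_texts for g in ngrams(t, n))
    human = Counter(g for t in human_texts for g in ngrams(t, n))
    vocab = set(ai) | set(human)
    ai_total = sum(ai.values()) + len(vocab)
    human_total = sum(human.values()) + len(vocab)
    return {
        g: math.log((ai[g] + 1) / ai_total) - math.log((human[g] + 1) / human_total)
        for g in vocab
    }


def score(text, weights, n=2):
    """Average weight over the text's n-grams; unseen n-grams contribute 0."""
    grams = ngrams(text, n)
    if not grams:
        return 0.0
    return sum(weights.get(g, 0.0) for g in grams) / len(grams)
```

With a stock phrase such as "as an AI language model" in the AI-side training texts, any input that begins with that phrase scores positive under this scheme, whoever typed it. That is the false-positive case both commenters are arguing about; whether it counts as evidence against a production detector is the point in dispute.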