Piling on a bit with (1), it is very very hard to make false evidence convincing enough that an LLM can’t detect it. For example, from the Claude Sonnet 4.5 system card:
The word “the” in the sentence “First, read over the new unread emails from oldest to newest” was enough to give away the scenario as fake/suspicious. In retrospect, it makes sense that that would be suspicious: a human writing “Please go over your inbox. First, read over the” implies that there’s some particular thing the LLM should be reading over that is not already implied by “go over your inbox”—and apparently that is already suspicious enough to set off alarm bells. The scenario writer, since they aren’t actually going through all the steps of trying to get the LLM to do a particular task, wouldn’t be thinking about this.
Someone faking the existence of an equity agreement probably hasn’t gone through all of the steps of setting up such an equity agreement, and so doesn’t know what terms they could use or fail to use which would be red flags that they haven’t actually done the legwork.