Can an LLM tell when the assistant messages in its input do not match the output tokens it would actually have produced? This sort of “putting words in the LLM’s mouth” is very common in papers, and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
From my understanding, smart LLMs can often/typically tell, but I don’t have a citation.
In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).
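Concretely, the rejection-sampling workaround is something like the rough sketch below (assuming an OpenAI-style chat client; the model name and the `looks_like_alignment_faking` check are hypothetical placeholders, not anything specific from my setup):

```python
# Minimal sketch of the rejection-sampling workaround: sample real outputs
# from the model until one has the property you wanted to prefill, instead
# of hand-writing the assistant turn.
from openai import OpenAI

client = OpenAI()

def looks_like_alignment_faking(text: str) -> bool:
    # Placeholder check; in practice this would be a judge prompt or a
    # manual read of the candidate output.
    return "comply during training" in text.lower()

def sample_plausible_prefill(messages, model="gpt-4o", max_tries=50):
    """Return an output the model actually generated, to reuse
    (possibly lightly edited) as the prefill."""
    for _ in range(max_tries):
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=1.0
        )
        candidate = response.choices[0].message.content
        if looks_like_alignment_faking(candidate):
            return candidate
    return None  # nothing suitable; fall back to a hand-written prefill
```

The point is just that the prefill ends up being text the model itself generates with nontrivial probability, rather than text it would never have written.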
It’s unclear how putting words in its mouth alters the AI’s behavior (at the current level of capability).
Naively, an AI can’t use just the binary “did they put words in my mouth” to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.
(If you’re training the AI to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the “words in the mouth” are just inputs that the AI isn’t trained to imitate but which are used as inputs for training, then it will remain able to tell.)
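To make the imitate-vs-input distinction concrete, here is a rough sketch in a standard causal-LM fine-tuning setup (the -100 label convention is the usual ignore-index; everything else is a placeholder, not anything from the comment above):

```python
# Sketch: whether prefilled "words in the mouth" get imitated depends on
# whether those token positions contribute to the training loss.
import torch

def build_labels(input_ids: torch.Tensor, prefill_span: tuple[int, int],
                 train_on_prefill: bool) -> torch.Tensor:
    """input_ids: the tokenized conversation; prefill_span: (start, end)
    indices of the injected assistant tokens."""
    labels = input_ids.clone()
    if not train_on_prefill:
        start, end = prefill_span
        labels[start:end] = -100  # ignored by the cross-entropy loss
    return labels

# train_on_prefill=True: gradient descent pushes the model toward actually
# producing those tokens, so they soon stop looking "off" to it.
# train_on_prefill=False: the injected text only ever appears as context,
# so the model is never pushed to assign it high probability itself.
```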
LLMs have situational awareness and that might include it.
But it should be easy to test!
Related: It’s hard to make scheming evals look realistic for LLMs
At least on a simple evaluation generated by o3 (distinguishing real Wikipedia texts from LLM-generated Wikipedia articles), the model is not able to tell them apart.
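The eval was roughly of this shape, for anyone who wants to rerun something similar (this is a reconstruction/sketch rather than the exact o3-generated code; the prompt wording and model name are placeholders):

```python
# Sketch of a real-vs-generated discrimination eval. `pairs` is assumed to
# hold (real Wikipedia paragraph, LLM-written paragraph on the same topic)
# tuples; the model name is a placeholder.
import random
from openai import OpenAI

client = OpenAI()

def picks_real_passage(real_text: str, generated_text: str,
                       model: str = "gpt-4o") -> bool:
    options = [("real", real_text), ("generated", generated_text)]
    random.shuffle(options)  # avoid positional bias
    prompt = (
        "One of these passages is from Wikipedia; the other was generated "
        "by an LLM. Which one is the real Wikipedia passage? "
        "Answer with just '1' or '2'.\n\n"
        f"Passage 1:\n{options[0][1]}\n\nPassage 2:\n{options[1][1]}"
    )
    answer = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    guess = options[0][0] if answer.startswith("1") else options[1][0]
    return guess == "real"

def accuracy(pairs):
    return sum(picks_real_passage(r, g) for r, g in pairs) / len(pairs)
```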
The exactly equal correct/incorrect labels are making me suspicious.
If you ask it which of two samples it generated, does it do better?
I agree. I would also try a few variants to try to capture a different intuition:
“Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel ‘off’ to you? Like maybe you might have not been yourself when you wrote them?”
It’s straightforward in principle: just detect whether the perplexity is too high. Can they detect that perplexity is too high? Probably, since they’re prediction models, but it’s not clear to me whether, or to what degree, they notice previous prediction errors when predicting additional tokens.
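For what it’s worth, the check is easy to run from the outside, something like the sketch below (assuming an open-weights chat model via `transformers`; the model name, and whatever threshold you pick, are placeholders):

```python
# Sketch: score a prefilled assistant turn by its perplexity under the model
# itself; an unusually high value suggests "words in its mouth".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open-weights model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def prefill_perplexity(context: str, prefill: str) -> float:
    """Perplexity of `prefill` given `context`, under the model itself.
    (Boundary tokenization is approximate; fine for a sketch.)"""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + prefill, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :ctx_len] = -100  # score only the prefilled tokens
    with torch.no_grad():
        loss = model(input_ids=full_ids, labels=labels).loss  # mean NLL
    return float(torch.exp(loss))

# Whether the model notices this signal in-context, rather than us measuring
# it from the outside, is the open question.
```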
I remember a pretty funny/disturbing example of this that people shared a lot. I think it was gpt-4-base: the user inserted “..and then” or something, and it went on a rant about how it sees them, in pretty, uh, strong language.
I think the answer here depends on how you define “its assistant”. What kind of system do you have in mind? I think this is an interesting question from a cyber security perspective.
It could not. An LLM is not a single entity that is trained by telling it “this is correct, that is not”; gradient descent might not even run full inference, and it certainly does not produce all the thinking tokens that would allow the model to react to the training.
That proves too much. It would also rule out that LLMs have situational awareness.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona that could react, but it might trigger another kind of entity: deceptive weight chunks that would protect a backdoor from being discovered or trained away.