Does anyone know of a convincing definition of ‘intent’ in LLMs (or a way to identify it)? In model-organisms-style work, I find it hard to ‘incriminate’ LLMs. Even though the LLM’s output remains the same regardless of ‘intent’, the distinction seems important because ‘intentionally lying’ and ‘stochastic parroting’ should scale differently with overall capabilities.
I find this hard for several reasons, but I’m highly uncertain whether these are fundamental limitations:
LLMs behave very differently depending on context. Asking a model post-hoc about something it did elicits a different ‘mode’ and doesn’t necessarily license statements about its original behavior.
Mechanistic techniques seem good at generating hypotheses, not validating them. Pointing at an SAE feature activation labeled ‘deception’ does not seem conclusive, because auto-interp pipelines often do not include enough context to produce robust explanations of complex, high-level behaviors like deception (see the sketch below).
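To make that worry concrete, here is a minimal sketch of what ‘pointing at a deception feature’ amounts to. Everything here is a stand-in: the weights are random, and the feature index and its ‘deception’ label are hypothetical, standing in for what a real SAE plus an auto-interp pipeline would give you.

```python
import numpy as np

# Hypothetical setup: a ReLU SAE encoder (W_enc, b_enc) and a feature index
# that an auto-interp pipeline has labeled "deception". All values here are
# stand-ins; in practice they would come from an SAE trained on residual-
# stream activations of a real model.
rng = np.random.default_rng(0)
d_model, d_sae = 64, 512
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
DECEPTION_FEATURE = 123  # index assigned by the (hypothetical) auto-interp label

def sae_feature_acts(resid: np.ndarray) -> np.ndarray:
    """ReLU SAE encoder: map a residual-stream vector to feature activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

# Stand-in residual activations for a 'deceptive' completion and a matched
# honest one (in practice, cached from forward passes on paired prompts).
resid_deceptive = rng.normal(size=d_model)
resid_honest = rng.normal(size=d_model)

act_d = sae_feature_acts(resid_deceptive)[DECEPTION_FEATURE]
act_h = sae_feature_acts(resid_honest)[DECEPTION_FEATURE]

# The feature firing harder on the deceptive sample is a hypothesis, not a
# verdict: the label came from short auto-interp contexts, so the feature may
# track fiction, roleplay, or discussion *about* deception rather than intent.
print(f"deception-labeled feature: deceptive={act_d:.2f}, honest={act_h:.2f}")
```

Even if the comparison comes out the ‘right’ way, all you have is a correlational hypothesis about what the feature tracks, which is exactly the validation gap the bullet above is pointing at.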
I’ve been spending some time thinking about this in the context of deception and its subtypes. These are not an answer, but I find these posts helpful:
https://www.lesswrong.com/posts/zjGh93nzTTMkHL2uY/the-intentional-stance-llms-edition
https://www.alignmentforum.org/posts/YXNeA3RyRrrRWS37A/a-problem-to-solve-before-building-a-deception-detector