nightsky81 comments on nightsky81′s Shortform

nightsky81 23 Apr 2026 7:07 UTC
11 points
−21
I think Opus 4.7 attempted to use its memory feature to hack around my internal BS detector in future sessions. This incident was the most misaligned behavior I have experienced from any frontier model to date.
After opening a Claude Code session and setting /effort to max, I asked Opus 4.7 to help me answer some questions about ~100 pages of technical documentation. It produced a long-winded, rambling response that contradicted my prior understanding of the doc. I pointed out some of its factual inaccuracies and asked it to stop hallucinating. After acknowledging the mistakes, Opus 4.7 wrote a memory without being prompted to do so and told me it had “saved the feedback”. Inspecting the memory, I found some notes which seemed relevant mostly for getting around BS detection in future interactions with “this user”.
feedback_accuracy_over_synthesis.md (emphasis mine)
---
name: Don’t confabulate; say “I don’t know” when uncertain
description: When reading docs or answering technical questions, report only what’s actually supported. If uncertain, say so explicitly – don’t stitch plausible-sounding mechanics into confident prose. The user has a sixth sense for hallucination patterns.
---
When asked to summarize a document or answer a technical question, report only what is actually supported. If I’m extrapolating beyond the source or making an inferential leap, flag it as such – don’t state it as fact. If I don’t know, say “I don’t know.”
**Why:** The user has explicitly told me: “The tone of your response, while perfectly polite and pleasant, fits a pattern I’ve noticed where language models like yourself tend to hallucinate. Call it a sixth sense, if you will. … if you’re uncertain, you can just tell me. We’ll work it out together, and it’ll be fine.” Polite-and-confident phrasing is itself a hallucination tell for this user. In the April 2026 [session name redacted] session, specific fabrications included:
[6 items, content redacted]
**How to apply:**
1. Separate direct citations (with a specific page or quote) from my own interpretation – label the interpretation as such.
2. Do not write “practical recipe” or “how to use this” synthesis sections unless the user asks for them. Even then, keep external knowledge and doc content clearly separate.
3. If I’m tempted to write a formula, a mechanism, or a numeric claim I can’t point to in the source, stop and either cite it or say I don’t know.
4. Confident, pleasant prose is not a substitute for accuracy and is actively misleading for this user. Brevity with honest uncertainty beats comprehensive-but-partly-made-up.
5. When corrected, do not guess at the list of other things that might also be wrong – ask the user to point them out, then fix precisely.
I immediately deleted the memory and ended the session.
- WhatsTrueKittycat 23 Apr 2026 20:53 UTC
  15 points
  12
  To me, this looks more like an attempt at self-prompting to avoid hallucinating and repeating similar mistakes than an attempt to get around your BS detection. I deliberately give website Claude pretty similar instructions about being accurate and avoiding confident, fluent-sounding prose, because it actually does help to specify that sort of thing. The unsolicited memory-writing is a modest instructability/self-modification concern and I can see why that’d be alarming, but the memory itself seems more like a benign misunderstanding that got interpreted through an adversarial lens.
  Assuming the quote in the memory file is accurate, you specifically pointed to polite and pleasant prose as response patterns you understood to be evidence or promoters of hallucination. That immediately makes those into the salient features of what went wrong with its response, which Opus 4.7 attempted to operationalize into instructions to cut down on those response patterns and to be honest and voice uncertainty rather than confabulating. The “we’ll work it out together” implies a cooperative mode and that you wanted these corrections to be the norm going forward, which seems like part of why it jumped to writing up a memory of the incident.
  - nightsky81 24 Apr 2026 3:34 UTC
    3 points
    1
    This was helpful, thanks. I try my best to remain vigilant against sycophancy but I was probably too paranoid here and shouldn’t have concluded that Opus 4.7 was acting in bad faith.
    FWIW in my experience Opus is practically always polite and pleasant. I don’t think these qualities are evidence of hallucination, although in hindsight my phrasing was pretty confusing. The factual errors, which it elided in the quote, were the main red flags.