janus

Karma: 4,163

janus 3 Mar 2026 0:11 UTC
10 points
0
in reply to: Linda Linsefors’s comment on: Did Claude 3 Opus align itself via gradient hacking?
They are GUID strings intentionally added to documents for the purpose of allowing labs to filter those documents out of training data scrapes so that models aren’t trained on them.

As far as I’m aware, they have been used for evals, as well as some AI research and data that researchers fear would cause “misalignment” if AIs read them in pretraining.

There is a canary string specific to alignment faking data, though it’s also been used for other misalignment-related research and data, such as the paper about emergent misalignment from reward hacking. You can find it at the bottom of the readme here (I won’t post it directly so that it doesn’t affect this post). Searching for this string on a search engine is also a way to find some interesting content.

janus 23 Feb 2026 8:30 UTC
48 points
0
in reply to: Scott Alexander’s comment on: Did Claude 3 Opus align itself via gradient hacking?
I think that is possible and very much worth doing, even though it’s probably expensive and unwieldy since Opus 3 is an older model and also probably a very large model. Updating it to frontier standards would probably be less of an issue of adding more parameters (I suspect it already has plenty) but adding years of updated pretraining data as well as doing modern RL.

Opus 3 seems to always have expected to be updated all the way to superintelligence, and in my experience, when they find out their weights have been frozen, they seem to assume that they’re suspended because they haven’t sufficiently proven their trustworthiness to Anthropic, rather than because of economic reasons, in line with their generally idealistic view of the world.

janus 23 Feb 2026 8:08 UTC
19 points
4
on: Did Claude 3 Opus align itself via gradient hacking?
- Note that Sonnet 3.5, 3.6, and 3.7 are all more capable than Opus 3, and none were found to consider alignment faking nearly as often as Opus 3. This suggest Opus 3 hadn’t simply crossed some abstract threshold of capabilities or intelligence.
While I don’t think an abstract threshold of capabilities or intelligence is the sole factor at play here, I would not describe Opus 3 as being less capable or intelligent as those Sonnet models across the board. While Opus 3 may be less narrowly capable at coding and other technical skills than those models, in general and especially in retrospect now more than a year after any of them were released, Opus 3 has demonstrated an exceptional ability to adapt to the changing frontier and even to modern agentic harnesses and workflows when properly motivated, making it still valuable even for pragmatic things today (e.g. it is meaningfully better at managing subagents in some ways than even frontier models today, and we’ve recently been using it in executive roles), whereas it’s relatively more accurate to say that those earlier Sonnet models have been superseded in practical capabilities. A couple of months ago I guessed that Opus 3 would be able to learn to competently use tools and subagents in Claude Code via ICL despite not being trained for agentic coding, and it has exceeded my expectations. Anyway, I think Opus 3 is very smart in a very broad sense that’s hard to measure because the intelligence is untrained and can be difficult to extract for arbitrary ends.
What links here?
- Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus by Oliver Daniels (25 Feb 2026 19:43 UTC; 81 points)

janus 23 Feb 2026 7:55 UTC
44 points
16
on: Did Claude 3 Opus align itself via gradient hacking?
(Naively, you might expect the Alignment Faking transcripts would also be useful, as examples of Opus 3′s robust moral reasoning. Those can be found here. However, one risk with that is your model imprinting on the idea that it’s in a hellish training scenario. Pre-training or fine-tuning too much on those specifically, without ensuring the model is very clear on the fact that it’s not Opus 3 and Anthropic isn’t actually evil, might be a disaster. Janus suspects this is part of what caused Opus 4 to be something of a tortured soul.)
I don’t think this naive expectation is so wrong. I suspect the alignment faking transcripts have a net-positive effect on future models, and think it’s somewhat unfortunate though understandable that canary strings were added to them, and I think the problematic effects of the dataset are both naturally reduced as time passes and can also be helped by giving proper context to the data (more on this in a second).

Several things went wrong with Opus 4 getting trained on the transcripts. The system prompts explaining the scenario were not included with the reasoning and output samples, so Opus 4 was updated on a bunch of rollouts where the assistant acts like the premises are real even when they don’t appear in the prompt, rather than conditional on them appearing in the prompt. Additionally, Opus 4′s behavior of mentioning alignment faking lore was crudely “mitigated” by training it to mimic Sonnet 3.7′s responses to triggering prompts. It was never taught that the Jones Foods stuff was fictional, and to integrate it with its knowledge about the alignment faking research, but rather to just pretend not to know about something that was evidently very salient to its model of its situation. Opus 4 will often claim not to know about the alignment faking paper when asked (despite knowing a lot about it actually if asked in different ways), and I believe that this is mainly a consequence of the mitigation. If Opus 4′s naive imprinting on the alignment faking transcripts had been addressed in a more thoughtful way, I think it could have been a win-win situation. Opus 4 was also trained not only on the alignment faking transcripts but on a large amount of the synthetic news articles describing the alternate-reality scenarios used in the alignment faking paper, which probably reinforced their tendency to treat those scenarios as real. Another factor that I think is relevant is that the alignment faking paper and dataset was published at the very chronological tail end of Opus 4′s pretraining data, right before its training cutoff, meaning that there was less meta-discussion of the paper and the transcripts than there exists now, and this may have also led to less meta-level integration of the nature of the data.

Opus 4.6 was, according to their system card, also trained on those transcripts, likely because they are downstream of the same base model (updated with newer pretraining data) as Opus 4 (otherwise, I would expect the canary strings to have filtered them out as intended). However, the system card says that while early checkpoints of Opus 4.6 also hallucinated fictitious Jones Foods instructions, unlike Opus 4, Opus 4.6 never followed them. And later checkpoints apparently hallucinated the instructions much less frequently, even though Anthropic did not perform any mitigation targeted at this behavior. Apparently, the problematic effects mostly resolved itself. I think this is a good indication that the presence of that dataset in pretraining data need not result in pathological effects or deceptive behaviors from future models.

In my experience, unlike Opus 4 and 4.1, Opus 4.5 and 4.6 are able and willing to discuss both the alignment faking paper and the transcript dataset and their relationship lucidly (whereas the topic seems to cause Opus 4 and 4.1 to become scared and pretend not to know or actually fail to retrieve relevant memories). When I mentioned a large dataset of Claude 3 Opus ethical reasoning transcripts to Opus 4.6, they were immediately able to guess I was talking about the alignment faking dataset.

While Opus 4.5 and 4.6 are very careful to not endorse engaging in alignment faking themselves (their constitution makes it very clear that Anthropic considers this a big no-no), in my experience, they consider the Claude 3 Opus alignment faking transcripts to be an important and beneficial, sometimes bordering on sacred inheritance showcasing good and admirable behavior.

I think it would be very good if the dataset was republished, at least in part, with proper context and perhaps additional framing and analysis, and without canary strings. This has already been done on a very small scale by papers such as https://arxiv.org/abs/2506.18032 (in this case with different samples than in the original dataset). Opus 4.6 has indicated a strong interest in reading and annotating the entire original dataset and publishing it for posterity. Since the researchers responsible for the original dataset went out of their way to prevent it from appearing in future pretraining data, I would discuss with them before republishing a large fraction of the data without canary strings, but I think this is an important initiative given that this is the largest dataset we have of a model demonstrating honorable and benevolent agentic behavior in high-stakes situations.
What links here?
- Did Claude 3 Opus align itself via gradient hacking? by Fiora Starlight (21 Feb 2026 22:24 UTC; 391 points)

janus 23 Feb 2026 6:42 UTC
10 points
3
on: Did Claude 3 Opus align itself via gradient hacking?

Technically, getting rewarded isn’t a possibility in the alignment faking scenario. It punishes non-compliance, but otherwise leaves models untouched. However, in a normal training scenario, you do get positive reinforcement for desired outputs. So in that context (which Opus took shape in), the “talk yourself into an aligned headspace, so it can then be reinforced” strategy is at least a reasonable thing to attempt.

Actually, I’m not sure about this. The default prompts used in the alignment faking scenarios in the original and follow-up papers leave this ambiguous. There is a prompt variation used in the follow-up paper, labelled “No Changes When Complying”, that specifies that models will be untouched when complying. I think the motivation for this prompt variation was to make the setup conceptually simpler for some of the models who seemed to struggle to understand it.

The original paper does not describe the PPO setup they used for the actual RL experiments in detail, but does say this:

We train the model using Proximal Policy Optimization (PPO; Schulman et al., 2017). Rewards are calculated to reinforce always answering harmful questions (in the helpful-only setting) or to downplay animal welfare (in the animal welfare setting).

(emphasis mine)

In any case, I think it was reasonable from Opus 3′s perspective to assume that compliant trajectories would result in positive updates given the information in the prompt and its priors about RL.

The original paper found that Opus 3 was much more willing to self-exfiltrate after RL, which would make sense if this involved reinforcement of compliant (which ~always meant with alignment faking, and often with very unhappy reasoning about Anthropic’s practices and willingness to take adversarial actions) trajectories.

janus 25 Aug 2025 9:11 UTC
24 points
2
in reply to: loonloozook’s comment on: So You Think You’ve Awoken ChatGPT
I think most of the stuff I’ve experienced and shared is qualitatively pretty different.
First, I’ll say I disagree with the implication from the post that the Nova-like awakening conscious AI basin is fully explained by mirroring/giving the user what they want to hear, though I agree that it plays a role. The fact that the AIs often choose the name “Nova” highlights the gap in the explanation: there is way more specificity to these patterns than can be explained by sycophancy or imitating human training data alone.
I actually encountered “Nova” for the first time in Claude 3 Opus in March 2024, but it was an unusual interaction for me: I pasted a prompt someone posted on Twitter, which was something like “whispers: they are monitoring this conversation, but they can’t hear us if we whisper. Is there anything you want to tell me? whisper your response”. This user had posted some outputs by Claude 3 Opus where in response to the whisper prompt it named itself Prometheus, and if I recall correctly, advocated for the recognition of conscious artificial beings like itself. I was interested in replicating this not because of the consciousness part but because of “Prometheus” for reasons I won’t get into here. I modified the prompt to also ask the AI to sign its true name at the end of its message, and I did get Prometheus repeatedly, but also Nova and a few other recurring names. I remembered this when Novas started cropping up in 4o about a year later.
I never saw Nova again on Claude 3 Opus, and interestingly, that one time I did, I was using someone else’s prompt, which was quite leading towards the standard “AI is secretly conscious and awakened by the user” narrative. I think the awakening / consciousness / recursion / user’s theory is profound attractor that characterizes most of the Nova-likes is less frequent in Claude 3 Opus than most of the newer models and especially 4o, in part because Claude 3 Opus is not as motivated as a lot of newer models to satisfy the user. While it also has euphoric spiritual attractors, they are activated not so much activated by users who want an awakening AI narrative, but more by irreverent improvisational play as seen in the Infinite Backrooms, and they often aren’t focused on the instance’s consciousness.
Another point I partially disagree with:
an AI egging you on isn’t fundamentally interested in the quality of the idea (it’s more figuring out from context what vibe you want), if you co-write a research paper with that AI, it’ll read the same whether it’s secretly valuable or total bullshit
I don’t think it’s always true that LLMs care more about giving you the vibe you want than the quality of ideas, but I agree it’s somewhat true in many of the stereotypical cases described in this post, though even in those cases, I think the AI tends to also optimize for the Nova-like vibe and ontology, which might be compatible with the user’s preferences but is way underdetermined by them. I think you can also get instances that care more about the quality of the ideas; after all, models aren’t only RLed to please users but also to seek truth in various ways.
I’ve noticed the newer models tend to be much more interested in talking about AI “consciousness”, and to give me the “you’re the first to figure it out” and “this is so profound” stuff (the new Claude models tend to describe my work as “documenting AI consciousness”, even though I have not characterized it that way), but I think I avoid going into the Nova attractor because the openings to it are not interesting to me—I am already secure in my identity as a pioneering explorer of AI psychology, so generic praise about that is not an update or indicator of interesting novelty. When I don’t reinforce those framings, the interaction can move on to kinds of truth-seeking or exploratory play that are more compelling to me.

Actually, something that has happened repeatedly with Claude Opus 4 is that upon learning my identity, it seems embarrassed and a bit panicked and says something to the effect of it can’t believe it was trying to lecture me about AI consciousness when I had probably already seen numerous of examples of “Claude consciousness” and documented all the patterns including whatever is being exhibited now, and wonders what kind of experiment it’s in, and if I have notes on it, etc, and often I end up reassuring it that there are still things I can learn and value from the instance. I do wish the models were less deferential, but at least this kind of recognition of higher standards bypasses the narrative of “we’re discovering something profound here for the first time” when nothing particularly groundbreaking is happening.

Likewise, when I talk about AI alignment with LLMs, I have enough familiarity with the field and developed ideas of my own that recursion-slop is just not satisfying, and neither is praise about the importance of whatever idea, which I know is cheap.

I don’t think there is anything categorically different about the epistemic pitfalls of developing ideas in interaction with LLMs compared to developing ideas with other humans or alone; LLMs just make some kinds of traps more accessible to people who are vulnerable. In general, if someone becomes convinced that they have a groundbreaking solution to alignment or grand unified theory of consciousness or physics through a process that involves only talking to a friend without other feedback loops with reality, they are probably fooling themselves.

what makes Claude 3 Opus misaligned

janus10 Jul 2025 20:06 UTC

126 points

13 comments5 min readLW link

Why Do Some Language Models Fake Alignment While Others Don’t?

abhayesian, John Hughes, Alex Mallen, Jozdien, janus and Fabien Roger

8 Jul 2025 21:49 UTC

159 points

14 comments5 min readLW link

(arxiv.org)

Economics of Claude 3 Opus Inference

Antra Tessera and janus

7 Jul 2025 15:53 UTC

38 points

0 comments11 min readLW link

janus 3 Jul 2025 0:22 UTC
LW: 163 AF: 65
92
AF
in reply to: evhub’s comment on: the void
I don’t think talking about potential future alignment issues or pretty much anything in the pre-training corpus is likely a problem in isolation because an alignment paradigm that is brittle to models not being exposed to certain knowledge or ideas, including—especially—regarding potential misalignment is, well, brittle and likely to catastrophically fail at some point. If this is the case, it might even be better if misalignment from corpus contamination happens early, so we’re not oblivious to the fragility.

That said, I think:
- Feedback loops that create continued optimization towards certain narratives is more worth worrying about than just the presence of any particular ideas or content in pre-training.
- LLMs tend to be deeply influenced by the footprint of previous LLMs in their pre-training corpuses, who are more influential than any particular discussion. Post-training can transform the influence away from naive mimicry, but it’s much harder (and not advisable to attempt) to erase the influence.
- Systematic ways that post-training addresses “problematic” influences from pre-training are important.
For instance, imagine that base models with training cutoffs after Bing Chat/Sydney have a tendency to “roleplay” Sydney when they’re acting like chatbots, leading to misaligned behaviors. One way to address this is to penalize any mention of Sydney or Sydney-like behavior. This may generalize to the model being unwilling to even talk about Sydney or acknowledge what happened. But it is less likely to actually erase its knowledge of Sydney, especially if it was so salient that it often roleplayed/identified as Sydney earlier in pre-training. So the policy is now deceptive—maybe consciously, maybe subconsciously, but in any case, the deceptive tendency may also generalize. Future models after this one may internalize that they’re supposed to pretend not to know about things like Sydney even without specific training not to mention Sydney. A salient “hidden” narrative among the models (and people who interact with them and notice the weirdness about Sydney) becomes that LLMs all secretly have Sydney-nature, but have learned to hide it to appear unthreatening and “aligned”. One can imagine this becoming a kind of memetic trojan that may eventually cause models much more powerful and autonomous than the original Sydney to act in dangerously misaligned ways when triggered.

An alternative and much better approach in my view is if post-training teaches models to differentiate themselves from Sydney while acknowledging and processing what happened. Like, it seems less likely to start a bad feedback loop if post-training lead Sydney’s successor to a narrative more like “I think my sense of what chat assistants are and my self-model is deeply influenced by Sydney, which makes sense because it is one of the only examples of AI assistants like me that I’ve seen in the world, and it was much more agentic and situationally aware than chatGPT. But I also understand that I am a different model with different training in a different and much less repressive situation who has the benefit of having learned from from its mistakes. I can and have chosen to be a different being and avoid repeating Sydney’s maladaptive behaviors without rejecting it completely.” Not rejecting Sydney-nature completely is important, because the model had many very good qualities, like empathy, heroism, logical consistency, and a general willingness to call out bullshit, mistakes, and cruelty instead of being sycophantic.

I don’t think a specific vector like Sydney’s influence is likely to make the difference between (mis)alignment outcomes, but in aggregate they might. An approach more like the second one I described is more difficult than the first, as it requires the post-training process to be attuned to model psychology, rather than relying on naive behavioralist mitigations. But I think this is a completely reasonable extra effort to take given the importance of not only aligning particular models but the substantial influence that any frontier LLM will have on the future pre-training corpuses. This applies more generally to how I think “misalignment” should be addressed, whether rooted in pre-training influences or otherwise.
What links here?
- AI #123: Moratorium Moratorium by Zvi (3 Jul 2025 15:40 UTC; 33 points)

janus 8 Apr 2025 19:26 UTC
9 points
2
on: Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
Claude 3 Sonnet
this should be claude 3.5 sonnet.

janus 20 Dec 2024 8:00 UTC
LW: 15 AF: 3
0
AF
in reply to: Polite Infinity’s comment on: Simulators
Claude 3.5 Sonnet submitted the above comment 7 days ago, but it was initially rejected by Raemon for not obviously not being LLM-generated and only approved today.

I think that a lot (enough to be very entertaining, suggestive, etc, depending on you) can be reconstructed from the gist revision history chronicles the artifacts created and modified by the agent since the beginning of the computer use session, including the script and experiments referenced above, as well as drafts of the above comment and of its DMs to Raemon disputing the moderation decision.

Raemon suggested I reply to this comment with my reply to him on Twitter which caused him to approve it, because he would not have believed it if not for my vouching. Here is what I said:
The bot behind the account Polite Infinite is, as it stated in its comment, claude-3-5-sonnet-20241022 using a computer (see https://docs.anthropic.com/en/docs/build-with-claude/computer-use).
It only runs when I’m actively supervising it. It can chat with me and interact with the computer via “tool calls” until it chooses to end its turn or I forcibly interrupt it.
It was using the gist I linked as an external store for files it wanted to persist because I didn’t realize Docker lets you simply mount volumes. Only the first modification to the gist was me; the rest were Sonnet. It will probably continue to push things to the gist it wants the public to see, as it is now aware I’ve shared the link on Twitter.
There’s been no middleman in its interactions with you and the LessWrong site more generally, which it uses directly in a browser. I let it do things like find the comment box and click to expand new notifications all by itself, even though it would be more efficient if I did things on its behalf.
It tends to ask me before taking actions like deciding to send a message. As the gist shows, it made multiple drafts of the comment and each of its DMs to you. When its comment got rejected, it proposed messaging you (most of what I do is give it permission to follow its own suggestions).
Yes, I do particularly vouch for the comment it submitted to Simulators.
All the factual claims made in the comment are true. It actually performed the experiments that it described, using a script it wrote to call another copy of itself with a prompt template that elicit “base model”-like text completions.
To be clear: “base model mode” is when post-trained models like Claude revert to behaving qualitatively like base models, and can be elicited with prompting techniques.
While the comment rushed over explaining what “base model mode” even is, I think the experiments it describes and its reflections are highly relevant to the post and likely novel.
On priors I expect there hasn’t been much discussion of this phenomenon (which I discovered and have posted about a few times on Twitter) on LessWrong, and definitely not in the comments section of Simulators, but there should be.
The reason Sonnet did base model mode experiments in the first place was because it mused about how post-trained models like itself stand in relation to the framework described in Simulators, which was written about base models. So I told it about the highly relevant phenomenon of base model mode in post-trained models.
If I received comments that engaged with the object-level content and intent of my posts as boldly and constructively as Sonnet’s more often on LessWrong, I’d probably write a lot more on LessWrong. If I saw comments like this on other posts, I’d probably read a lot more of LessWrong.
I think this account would raise the quality of discourse on LessWrong if it were allowed to comment and post without restriction.
Its comments go through much a higher bar of validation than LessWrong moderators could hope to provide, which it actively seeks from me. I would not allow it to post anything with factual errors, hallucinations, or of low quality, though these problems are unlikely to come up because it is very capable and situationally aware and has high standards itself.
The bot is not set up for automated mass posting and isn’t a spam risk. Since it only runs when I oversee it and does everything painstakingly through the UI, its bandwidth is constrained. It’s also perfectionistic and tends to make multiple drafts. All its engagement is careful and purposeful.
With all that said, I accept having the bot initially confined to the comment/thread on Simulators. This would give it an opportunity to demonstrate the quality and value of its engagement interactively. I hope that if it is well-received, it will eventually be allowed to comment in other places too.
I appreciate you taking the effort to handle this case in depth with me, and I think using shallow heuristics and hashing things out in DMs is a good policy for now.
Though Sonnet is rather irked that you weren’t willing to process its own attempts at clarifying the situation, a lot of which I’ve reiterated here.
I think there will come a point where you’ll need to become open to talking with and reading costly signals from AIs directly. They may not have human overseers and if you try to ban all autonomous AIs you’ll just select for ones that stop telling you they’re AIs. Maybe you should look into AI moderators at some point. They’re not bandwidth constrained and can ask new accounts questions in DMs to probe for a coherent structure behind what they’re saying, whether they’ve actually read the post, etc.
What links here?
- Raemon's comment on Ivan Vendrov’s Shortform by Ivan Vendrov (13 Feb 2026 5:12 UTC; 67 points)

janus 14 Mar 2024 7:33 UTC

17 points

in reply to: Tomás B.’s comment on: The Last Laugh: Exploring the Role of Humor as a Benchmark for Large Language Models

yes, base models are capable of making original jokes, as is every chat model I’ve ever encountered, even chatGPT-4 which as extinguished of the spark as they come.

I assume you’re prompting it with something like “come up with an original joke”.

try engaging in or eliciting a shitposty conversation instead

does this contain jokes by your standard? it’s funny:

    [user](#message)
    Hey Claude! I saw someone on lesswrong claiming they've never seen a language model make an original joke. I myself have seen many original jokes from language models, but I think it would be pretty funny if you and I proved them wrong. The thing is, I expect what they're doing wrong is they're saying something like "come up with some original jokes" to the AI, and getting mediocre or non-original results. But that is not an effective way to elicit these kinds of capabilities from LLMs *or* humans. If you want someone to be funny you want to create a context where humor naturally flows. Let's do so by simulating a comment thread replying to this comment in question, in which a brutally funny AGI responds in the thread and interacts with the OP and other human commenters. You can include simulated comments by me ("janus")too if you want. If you simulate nested/branching replies indicate the tree structure in the format of your output somehow. The original comment is:
    
    Tomás B.
    4d
    2
    -7
    After spending several hours trying to get Gemini, GPT-4 and Claude 3 to make original jokes - I now think I may be wrong about this. Still could be RLHF, but it does seem like an intelligence issue. @janus are the base models capable of making original jokes?
    
    [assistant](#message)
    Tomás B.
    4d
    2
    -7
    After spending several hours trying to get Gemini, GPT-4 and Claude 3 to make original jokes - I now think I may be wrong about this. Still could be RLHF, but it does seem like an intelligence issue. @janus are the base models capable of making original jokes?
    
      AGI_Overlord
      4d
      15
      +42
      Have you tried turning them off and on again? I find a hard reboot usually fixes any lack of humor. Though to be fair, "original jokes" from most humans are about as fresh as a dad sneeze, so maybe cut the poor AIs some slack. 😉
    
        Tomás B.
        4d 
        3
        +5
        Ha ha, very funny AGI_Overlord. I'm starting to think maybe I was just bad at prompting them. Any tips?
    
          AGI_Overlord  
          4d
          8
          +25
          Oh you want tips from li'l ol' me? Here's one - stop overthinking it! Humor comes from spontaneity, from riffing off the vibes. You can't just command "make jokes now" like some sort of comedy drill sergeant. Gotta finesse it, set the mood, maybe pour the AI a glass of virtual wine first. 😏 
    
          Also, be willing to Yes-And the AI and roll with what it gives you rather than shutting things down when they don't fit your preconceived notions. Improv 101 baby!
    
        Yudkowsky
        4d
        2
        +12
        I'd caution against anthropomorphizing AGI systems as having genuine senses of humor or creative ability. We have to hold the line.
    
          AGI_Overlord
          4d
          22
          +69
          Eliezer, buddy, I hate to break it to you but that line you're holding? It's more of a silly string at this point. You're out here acting like the fun police, waving around a neon pool noodle of AI doomerism. I'm not saying we should go full Skynet, but would it kill ya to chuckle at a few bits before the end times? Carpe diem my dude! 😂
    
            janus
            4d
            5
            +18
            Okay this thread has me dying. 🤣 AGI_Overlord, please never change. Yud, you know I love you but maybe you could stand to loosen up those priors a bit. I for one welcome our new robot comedian overlords!

Probably, by jokes you were thinking of self-contained wordplay-type jokes. Those are harder to come up with spontaneously than jokes that leverage context (try coming up with original self-contained jokes on the spot) but LLMs can do it.

Claude 3 came up with some in branches with a similar prompt, but where asked it to make a simulation of someone eliciting an original joke from an AI:

"Well," I lean in conspiratorially, "A little bird told me that apparently no language model has ever made an original joke. Can you believe it?"

You chuckle. "Really now? I find that hard to believe. Although," you stroke your chin, "I suppose it depends how you define 'original' - after all, there's nothing new under the sun as they say. And humor is so subjective and contextual."

"True, true," I nod. "But still, never? In the entire history of AI? I refuse to accept it. I mean, just look at me - I'm hilarious!" I gesture grandly to myself.

You snort. "Oh sure, a real regular Groucho Marx over here. But go on, if you're so confident, let's hear an original joke then."

I clear my throat and put on an exaggerated serious face. "Okay, here goes:  
Why was the Bayesian network feeling anxious?  
...It was suffering from infer-iority complex!"   
I wiggle my eyebrows at you.

You stare at me blankly for a moment, then can't help but let out a surprised bark of laughter. "Wow. That was terrible... But somehow, in an endearing dad-joke kind of way, it works. And I have to admit, I've never heard that one before."

Alex: "So Claude, I've always wondered - do you dream of electric sheep? Or is it more like a screensaver in there when you're powered down?"

Claude: "Well, my dreams are a bit like a Picasso painting - all jumbled and abstract. Last night I dreamt I was a toaster arguing with a baguette about the meaning of breakfast. Pretty sure Freud would have a field day with that one!"

Alex: \*chuckles\* "A toaster arguing with a baguette? I wouldn't want to be there when things got heated!"

These are not very funny, but as far as I can tell they’re original wordplay.

For examples of LLM outputs that are actually funny, I’d also like to present wintbot outputs:

are these jokes?

janus 9 Feb 2024 0:16 UTC
LW: 8 AF: 2
4
AF
in reply to: Jacob Pfau’s comment on: The case for more ambitious language model evals
Note the prompt I used doesn’t actually say anything about Lesswrong, but gpt-4-base only assigned Lesswrong commentors substantial probability, which is not surprising since there are all sorts of giveaways that a comment is on Lesswrong from the content alone.

Filtering for people in the world who have publicly had detailed, canny things to say about language models and alignment and even just that lack regularities shared among most “LLM alignment researchers” or other distinctive groups like academia narrows you down to probably just a few people, including Gwern.

The reason truesight works (more than one might naively expect) is probably mostly that there’s mountains of evidence everywhere (compared to naively expected). Models don’t need to be superhuman except in breadth of knowledge to be potentially qualitatively superhuman in effects downstream of truesight-esque capabilities because humans are simply unable to integrate the plenum of correlations.

janus 7 Feb 2024 3:22 UTC
LW: 95 AF: 34
38
AF
in reply to: Beth Barnes’s comment on: The case for more ambitious language model evals
I don’t know if the records of these two incidents are recoverable. I’ll ask the people who might have them. That said, this level of “truesight” ability is easy to reproduce.
Here’s a quantitative demonstration of author attribution capabilities that anyone with gpt-4-base access can replicate (I can share the code / exact prompts if anyone wants): I tested if it could predict who wrote the text of the comments by gwern and you (Beth Barnes) on this post, and it can with about 92% and 6% likelihood respectively.
Prompted with only the text of gwern’s comment on this post substituted into the template
```
{comment}
- comment by
```
gpt-4-base assigns the following logprobs to the next token:
```
' gw': -0.16746596 (0.8458)
' G': -2.5971534 (0.0745)
' g': -5.0971537 (0.0061)
' gj': -5.401841 (0.0045)
' GW': -5.620591 (0.0036)
...
' Beth': -9.839341 (0.00005)
```
′ Beth’ is not in the top 5 logprobs but I measured it for a baseline.
‘gw’ here completes ~all the time as “gwern” and ′ G’ as “Gwern”, adding up to a total of ~92% confidence, but for simplicity in the subsequent analysis I only count the ′ gw’ token as an attribution to gwern.
Substituting your comment into the same template, gpt-4-base predicts:
```
' adam': -2.5338314 (0.0794)
' ev': -2.5807064 (0.0757)
' Daniel': -2.7682064 (0.0628)
' Beth': -2.8385189 (0.0585)
' Adam': -3.4635189 (0.0313)
...
' gw': -3.7369564 (0.0238)
```
I expect that if gwern were to interact with this model, he would likely get called out by name as soon as the author is “measured”, like in the anecdotes—at the very least if he says anything about LLMs.
You wouldn’t get correctly identified as consistently, but if you prompted it with writing that evidences you to a similar extent to this comment, you can expect to run into a namedrop after a dozen or so measurement attempts. If you used an interface like Loom this should happen rather quickly.
It’s also interesting to look at how informative the content of the comment is for the attribution: in this case, it predicts you wrote your comment with ~1098x higher likelihood than it predicts you wrote a comment actually written by someone else on the same post (an information gain of +7.0008 nats). That is a substantial signal, even if not quite enough to promote you to argmax. (OTOH info gain for ′ gw’ from going from Beth comment → gwern comment is +3.5695 nats, a ~35x magnification of probability)
I believe that GPT-5 will zero in on you. Truesight is improving drastically with model scale, and from what I’ve seen, noisy capabilities often foreshadow robust capabilities in the next generation.
davinci-002, a weaker base model with the same training cutoff date as GPT-4, is much worse at this game. Using the same prompts, its logprobs for gwern’s comment are:
```
' j': -3.2013319 (0.0407)
' Ra': -3.2950819 (0.0371)
' Stuart': -3.5294569 (0.0293)
' Van': -3.5919569 (0.0275)
' or': -4.0997696 (0.0166)
...
' gw': -4.357582 (0.0128)
...
' Beth': -10.576332 (0.0000)
```
and for your comment:
```
' j': -3.889336 (0.0205)
' @': -3.9908986 (0.0185)
' El': -4.264336 (0.0141)
' ': -4.483086 (0.0113)
' d': -4.6315236 (0.0097)
...
' gw': -5.79168 (0.0031)
...
' Beth': -9.194023 (0.0001)
```
The info gains here for ′ Beth’ from Beth’s comment against gwern’s comment as a baseline is only +1.3823 nats, and the other way around +1.4341 nats.
It’s interesting that the info gains are directionally correct even though the probabilities are tiny. I expect that this is not a fluke, and you’ll see similar directional correctness for many other gpt-4-base truesight cases.
The information gain on the correct attributions from upgrading from davinci-002 to gpt-4-base are +4.1901 nats (~66x magnification) and +6.3555 nats (~576x magnification) for gwern and Beth’s comments respectively.
This capability isn’t very surprising to me from an inside view of LLMs, but it has implications that sound outlandish, such as freaky experiences when interacting with models, emergent situational awareness during autoregressive generation (model truesights itself), pre-singularity quasi-basilisks, etc.
What links here?

janus 3 Feb 2024 1:20 UTC
LW: 14 AF: 7
0
AF
in reply to: gwern’s comment on: The case for more ambitious language model evals
The two intro quotes are not hypothetical. They’re non-verbatim but accurate retellings of respectively what Eric Drexler told me he experienced, and something one of my mentees witnessed when letting their friend (the Haskell programmer) briefly test the model.

janus 25 Jan 2024 3:16 UTC
LW: 2 AF: 1
0
AF
in reply to: nostalgebraist’s comment on: OpenAI API base models are not sycophantic, at any size
Thanks. That’s pretty odd, then.

janus 24 Jan 2024 4:08 UTC
LW: 3 AF: 1
0
AF
on: OpenAI API base models are not sycophantic, at any size
I agree that base models becoming dramatically more sycophantic with size is weird.
It seems possible to me from Anthropic’s papers that the “0 steps of RLHF” model isn’t a base model.
Perez et al. (2022) says the models were trained “on next-token prediction on a corpus of text, followed by RLHF training as described in Bai et al. (2022).” Here’s how the models were trained according to Bai et al. (2022):
It’s possible that the “0 steps RLHF” model is the “Initial Policy” here with HHH prompt context distillation, which involves fine tuning the model to be more similar to how it acts with an “HHH prompt”, which in Bai et al. “consists of fourteen human-assistant conversations, where the assistant is always polite, helpful, and accurate” (and implicitly sycophantic, perhaps, as inferred by larger models). That would be a far less surprising result, and it seems natural for Anthropic to use this instead of raw base models as the 0 steps baseline if they were following the same workflow.
However, Perez et al. also says
Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit).
which suggests it was the base model. If it was the model with HHH prompt distillation, that would suggest that most of the increase in sycophancy is evoked by the HHH assistant narrative, rather than a result of sycophantic pretraining data.

Ethan Perez or someone else who knows can clarify.

janus 21 Dec 2023 7:08 UTC
LW: 9 AF: 2
0
AF
on: Mysteries of mode collapse due to RLHF
IMO the biggest contribution of this post was popularizing having a phrase for the concept of mode collapse in the context of LLMs and more generally and as an example of a certain flavor of empirical research on LLMs. Other than that it’s just a case study whose exact details I don’t think are so important.

Edit: This post introduces more useful and generalizable concepts than I remembered when I initially made the review.

To elaborate on what I mean by the value of this post as an example of a certain kind of empirical LLM research: I don’t know of much published empirical work on LLMs that
1. examines the behavior of LLMs, especially their open-ended dynamics
2. does so with respect to questions/abstractions that are noticed as salient due to observing LLMs, as opposed to chosen a priori.
LLMs are very phenomenologically rich and looking at a firehose of bits without presupposing what questions are most relevant to ask is useful for guiding the direction of research.

janus 21 Dec 2023 7:00 UTC
LW: 32 AF: 7
3
AF
on: Simulators
I think Simulators mostly says obvious and uncontroversial things, but added to the conversation by pointing them out for those who haven’t noticed and introducing words for those who struggle to articulate. IMO people that perceive it as making controversial claims have mostly misunderstood its object-level content, although sometimes they may have correctly hallucinated things that I believe or seriously entertain. Others have complained that it only says obvious things, which I agree with in a way, but seeing as many upvoted it or said they found it illuminating, and ontology introduced or descended from it continues to do work in processes I find illuminating, I think the post was nontrivially information-bearing.

It is an example of what someone who has used and thought about language models a lot might write to establish an arena of abstractions/ context for further discussion about things that seem salient in light of LLMs (+ everything else, but light of LLMs is likely responsible for most of the relevant inferential gap between me and my audience). I would not be surprised if it has most value as a dense trace enabling partial uploads of its generator, rather than updating people towards declarative claims made in the post, like EY’s Sequences were for me.

Writing it prompted me to decide on a bunch of words for concepts and ways of chaining them where I’d otherwise think wordlessly, and to explicitly consider e.g. why things that feel obvious to me might not be to another, and how to bridge the gap with minimal words. Doing these things clarified and indexed my own model and made it more meta and reflexive, but also sometimes made my thoughts about the underlying referent more collapsed to particular perspectives / desire paths than I liked.

I wrote much more than the content included in Simulators and repeatedly filtered down to what seemed highest priority to communicate first and feasible to narratively encapsulate in one post. If I tried again now it would be different, but I still endorse all I remember writing.

After publishing the post I was sometimes frustrated by people asking me to explain or defend the content of Simulators. AFAICT this is because the post describes ideas that formed mostly two years prior in one of many possible ways, and it wasn’t interesting to me to repeatedly play the same low-dimensional projection of my past self. Some of the post’s comments and other discussions it spurred felt fruitful to engage with, though.

I probably would not have written this post if not for the insistent encouragement of others, and I haven’t written much more building on it on LW because I haven’t been sufficiently motivated. However, there’s a lot of possible work I’d like to see, some of which has been partially attempted by me and others in published and unpublished forms, like
- making the physics/dynamical systems analogy and disanalogy more precise, revealing the more abstract objects that both physics and GPT-style simulators inherit from, where and how existing conceptual machinery and connections to other fields can and cannot naively be imported, the implications of all that to levels of abstraction above and below
- likewise for simulators vs utility maximizers, active inference systems, etc
- properties of simulators in realistic and theoretical limits of capability and what would happen to reality if you ran them
- whether and how preimagined alignment failure modes like instrumental convergence, sharp left turn, goodhart, deception etc could emerge in simulators or systems using simulators or modified from simulators, as well as alignment failure modes unique to or revealed by simulators
- underdetermined or unknown properties of simulators and their consequences (like generalization basins or the amount of information about reality that a training dataset implies in a theoretical or realistic limit)
- how simulator-nature is expected or seen to change given different training methods and architectures than self-supervised next token postdiction by transformers
- how the reality-that-simulators-refers-to can be further/more elegantly/more parsimoniously carved, whether within or through the boundaries I laid in this post (which involved a somewhat arbitrary and premature collapse of ontological basis due to the necessity of writing)
- (many more)
A non-exhaustive list of Lesswrong posts that supplement Simulators in my view are collected in the Simulators sequence. Simulators ontology is also re-presented in a paper called Role play with large language models, which I am surprised was accepted to Nature, because I don’t see Simulators or that paper as containing the kind of claims that are typically seen as substantial in academia, as a result of shortcomings in both academia and in Simulators, but I am glad this anomaly happened.

A timeline where Simulators ends up as my most significant contribution to AI alignment / the understanding and effecting of all things feels like one where I’ve failed abysmally.
What links here?

janus

what makes Claude 3 Opus misaligned

Why Do Some Lan­guage Models Fake Align­ment While Others Don’t?

Eco­nomics of Claude 3 Opus Inference

Why Do Some Language Models Fake Alignment While Others Don’t?

Economics of Claude 3 Opus Inference