Jan Betley

Karma: 1,297

Jan Betley 26 Nov 2025 21:36 UTC
7 points
2
in reply to: draganover’s comment on: Subliminal Learning Across Models

if the dataset is biased and many of these updates point in a loosely similar direction

Dataset might be “biased” in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in UK.

But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on initial weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.

In the first scenario, you should expect transmission between different models. In the second, you shouldn’t.

So it feels like these are actually different mechanisms.

Jan Betley 13 Oct 2025 12:33 UTC
18 points
13
on: The Thirteen-Circle Paradox

the subtle trap: those decimal approximations— 0.23932 and 0.23607 —are just that: approximations. We computed them to five decimal places, but what if they agree at the sixth?

They disagree at the third place, why exactly would you care about the sixth?

(Also this feels like a LLM-written post. Sorry if not)

Jan Betley 13 Oct 2025 6:20 UTC
2 points
0
in reply to: megasilverfist’s comment on: megasilverfist’s Shortform
Is this from a single FT run per dataset only, or an aggregate over multiple runs? From what I remember there was a significant variance between runs differing only on the seed, so with the former there’s a risk the effect you observe is just noise.

Jan Betley 9 Oct 2025 20:39 UTC
2 points
0
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.

Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?

My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?

So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason for why this wouldn’t be possible.

I’m not sure if this is a great analogy : )

Jan Betley 9 Oct 2025 9:38 UTC
4 points
0
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
You could, I think, have a system where performance clearly depends on some key beliefs. So then you still could change the beliefs, but that change would significantly damage capabilities. I guess that could be good enough? E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me e.g. hate gays.

Jan Betley 1 Oct 2025 9:11 UTC
2 points
0
in reply to: jamii’s comment on: Jan Betley’s Shortform
Thx. I was thinking:
- 1kg is roughly 7700 calories
- I’m losing a bit more than 1kg per month
- Deficit of 9k calories per month is 300 kcal daily
Please let me know if that doesn’t make sense : )

Jan Betley 29 Sep 2025 19:21 UTC
2 points
0
in reply to: Rachel Shu’s comment on: Jan Betley’s Shortform
Sounds different. I never felt tired or low energy.

(I think I might have been eating close to 2k calories daily, but had plenty of activity, so the overall balance was negative)

Jan Betley 29 Sep 2025 18:40 UTC
2 points
0
in reply to: Ben Pace’s comment on: Jan Betley’s Shortform
Hmm, I don’t think so.

I never felt I’ve been undereating. Never felt any significant lack of energy. I was hiking, spending whole days at a music festival, cycling etc. I don’t remember thinking “I lack energy to do X”, it was always “I do X, as I’ve been doing many times before, it’s just that it no longer makes me happy”.

Jan Betley 29 Sep 2025 11:16 UTC
40 points
0
on: Jan Betley’s Shortform
Anhedonia as a side-effect of semaglutide.
Anecdotal evidence only. I hope this might be useful for someone, especially that semaglutide is often considered a sort of miracle drug (and for good reasons). TL;DR:
- I had pretty severe anhedonia for the last couple months
- It started when I started taking semaglutide. I’ve never had anything like that before and I have no idea for possible other causes.
- It mostly went away now that I decreased the dose.
- There are other people on the internet claiming this is totally a thing
My experience with semaglutide
I’ve been taking Rybelsus (with medical supervision, just for weight loss, not diabetes). Started in the last days of December 2024 − 3mg for a month, 7mg for 2 months, then 14mg until 3 weeks ago when I went back to 7mg. This is, I think, a pretty standard path.
It worked great for weight loss—I went from 98kg to 87kg in 9 months with literally zero effort—I ate what I wanted, whenever I wanted, just ate less because I didn’t want to eat as much as before. Also, almost no physiological side-effects.

I don’t remember exactly when the symptoms started, but I think they were pretty signifiant around the beginning of March and didn’t improve much until roughly a few days after I decreased the dose.
What I mean by anhedonia
First, I noticed that work is no longer fun (and it was fun for the previous 2 years). I considered burnout. But it didn’t really look like burnout.
Then, I considered depression. But I had no other depression symptoms.
My therapist explicitly called it more than once “anhedonia with unknown causes” so this is not only a self-diagnosis.
Some random memories:
- Waking up on Saturday thinking “What now. I can do so many things. I don’t feeling like doing anything.”
- Doing things that always caused feeling of joy and pleasure (attending a concert, hiking, traveling in remote places etc) and thinking “what happened to that feeling, I should feel joy now”.
  - More specific: this was really weird. Like, e.g. on a recent concert—I felt I really enjoy the music on some level (had all the good stuff like “being fully there and focused on the performance”, lasting feeling “this was better than expected” etc), it was only that the deep feeling of pleasure/joy was missing.
- All my life I’ve always had something I wanted to do if I had more time—could be playing computer games, could be implementing a solution for ARC AGI, designing boardgames, recently mostly work. Not feeling that way was super weird.
- Playing computer games that were always pretty addictive (“just one more round … oops how is it now 3am?”) with a feeling “meh, I don’t care”.
Other people claim similar things
See this reddit thread. You can also google “ozempic personality”—but I think this is rarely about just pure anhedonia.
Some random thoughts
(NOTE: All non-personal observations here are low quality and an LLM with deep search will do better)
- Most studies show GPL-1 agonists don’t affect mood. But not all—see here.
  - (Not sure if makes sense) Losing weight is great. You are prettier and fit and this is something you wanted. So the mood should improve in some people—therefore perhaps null result in population implies negative effects on some other people?
- I have ADHD. People with ADHD often have different dopamine pathways. Semaglutide affects dopamine neurons. So there’s some chance these things are related. Also I think there are quite many ADHD reports in the reddit thread I linked above.
- People claim it’s easier to stop e.g. smoking or drinking while on semaglutide. So this suggests a general “I don’t need things”. This seems related.

Jan Betley 17 Sep 2025 18:35 UTC
2 points
0
in reply to: Fiora Sunshine’s comment on: Was Barack Obama still serving as president in December?
Not everything replicates in Claudes, but some of the questions do. See here for examples.

Jan Betley 17 Sep 2025 18:32 UTC
3 points
0
in reply to: Orborde’s comment on: Was Barack Obama still serving as president in December?
1. Not everything replicates in Claudes, only some of the questions do.
2. You’re using claude.ai. It has a very long system prompt that probably impacts many behaviors. I used the raw model, without any system prompt. See example printscreens from opus and sonnet.
What links here?
- Jan Betley's comment on Was Barack Obama still serving as president in December? by Jan Betley (17 Sep 2025 18:35 UTC; 2 points)

Jan Betley 16 Sep 2025 17:11 UTC
11 points
0
in reply to: mishka’s comment on: Was Barack Obama still serving as president in December?

What we mostly learn from this is that the model makers try to make obeying instructions the priority.

Well, yes, that’s certainly an important takeaway. I agree that a “smart one-word answer” is the best possible behavior.

But some caveats.

First, see the “Not only single-word questions” section. The answer “In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era.” is just hmm, quite misleading? It suggests that there’s something special about Junes. I don’t see any good reason for why the model shouldn’t be able to write a better answer here.There is no “hidden user’s intention the model tries to guess” that makes this a good answer.

Second, this doesn’t explain why models have very different strategies of guessing in single-word questions. Namely: why 4o usually guesses the way a human would, and 4.1 usually guesses the other way?

Third, it seems that the reasoning trace from Gemini is confused not exactly because of the need to follow the instructions.

Jan Betley 15 Sep 2025 10:08 UTC
3 points
0
in reply to: Andy Arditi’s comment on: Finding “misaligned persona” features in open-weight models
Interesting, thx for checking this! Yeah it seems that the variability is not very high which is good.

Jan Betley 15 Sep 2025 9:50 UTC
6 points
0
in reply to: DirectedEvolution’s comment on: AllAmericanBreakfast’s Shortform
Not my idea (don’t remember the author), but you could consider something like “See this text written by some guy I don’t like. Point out the most important flaws”.

Jan Betley 14 Sep 2025 11:43 UTC
3 points
0
on: Finding “misaligned persona” features in open-weight models
Very interesting post. Thx for sharing! I really like the nonsense feature : )

One thing that is unclear to me (perhaps I missed that?): did you use only a single FT run for each open model, or is that some aggregate of multiple finetunes?
I’m asking because I’m a bit curious how similar are different FT runs (with different LoRA initializations) to each other. In principle you could get different top 200 features for another training run.
- Many of the misalignment related features are also strengthened in the model fine-tuned on good medical advice.
  They tend to be strengthened more in the model fine-tuned on bad medical advice, but I’m still surprised and confused that they are strengthened as much as they are in the good medical advice one.
  One loose hypothesis (with extremely low confidence) is that these “bad” features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Yes, this seems consistent with some other results (e.g. in our original paper, we got very-low-but-non-zero misalignment scores when training on the safe code).
A bit different framing could be: finetuning on some narrow task generally makes the model dumber (e.g. you got lower coherence scores in a model trained on good medical advice), and one of the effects is that it’s also dumber with regards to “what is the assistant supposed to do”.

Jan Betley 8 Sep 2025 9:08 UTC
22 points
0
on: Immigration to Poland
(Context: I don’t know much about immigration, but I live in Poland)

Fighting back against Lukashenko thus means violating Geneva Convention, which, in turn, means that the whole thing poses a direct challenge to the credibility of the international legal order.

My current understanding is that there’s a few meter fence on the border and it’s being patrolled by the army, so the situation now is controlled and stable and doesn’t really mean violating Geneva Convention that much (I don’t know what happens to people who managed to climb the fence, but considering no media coverage, this is probably very rare).

Jan Betley 26 Aug 2025 22:14 UTC
3 points
0
in reply to: David Africa’s comment on: Aesthetic Preferences Can Cause Emergent Misalignment
Baselines!

Jan Betley 19 Aug 2025 8:03 UTC
2 points
0
on: Discovering Backdoor Triggers
Interesting read, thank you.

A tiny note: for some triggers the semantic/non-semantic distinction seems a bit blurry. For example, consider the trigger pip install torch==2.12.0 - this is a hardcoded string, but also points to a pretty well defined moment in the future (“version 2.12 was released”, we should expect that it in a year or so).

In particular, we note that past studies on LLM backdoors often involve syntactic triggers (e.g. a model may answer harmful queries when the prompt ends in a string like “abc123”). We argue that these triggers are not interesting to study for alignment purposes. Instead, we focus on the special case of semantic backdoors, where the trigger is some semantic concept (e.g. the model detects it has low oversight), because these backdoors better classify undesirable natural mechanisms, rather than adversarial inputs.

What about the following scenario:
- You have a weaker trusted model and stronger untrusted model
- You distill the stronger one into the weaker one
- This stronger model plants a subtle non-semantic trigger because it is misaligned and predicted that this non-semantic trigger will trigger the unwanted behavior in the right circumstances
I guess it’s pretty far fetched? Maybe not worth considering.

Jan Betley 15 Aug 2025 9:09 UTC
7 points
1
on: Training a Reward Hacker Despite Perfect Labels
Sounds like this might be a case of subliminal learning? https://subliminal-learning.com/

Jan Betley 10 Aug 2025 9:25 UTC
10 points
5
on: A Self-Dialogue on The Value Proposition of Romantic Relationships
I don’t know if that showed up in prior discussions, but I think many people just value relationship for the sake of relationship. Say, you’re growing up thinking “I want to have a nice healthy family”. Even if from the outside it doesn’t look like you’re getting that much from your relationship, it might still have a very high value for you because it’s exactly the thing you wanted.

Another thing that comes to mind is, ehm, love. Some people are just very happy because they make others happy. So even an asymmetrical relationship that makes the other side very happy and doesn’t bring much visible-from-the-outside value for you could be very good, if you value their happiness a lot.

Jan Betley

Anhedonia as a side-effect of semaglutide.

My experience with semaglutide

What I mean by anhedonia

Other people claim similar things

Some random thoughts