You could, I think, have a system where performance clearly depends on some key beliefs. So then you still could change the beliefs, but that change would significantly damage capabilities. I guess that could be good enough? E.g. I think if you somehow made me really believe the Earth is flat, this would harm my research skills. Or perhaps even if you made me e.g. hate gays.
Jan Betley
Thx. I was thinking:
1kg is roughly 7700 calories
I’m losing a bit more than 1kg per month
Deficit of 9k calories per month is 300 kcal daily
Please let me know if that doesn’t make sense : )
Sounds different. I never felt tired or low energy.
(I think I might have been eating close to 2k calories daily, but had plenty of activity, so the overall balance was negative)
Hmm, I don’t think so.
I never felt I’ve been undereating. Never felt any significant lack of energy. I was hiking, spending whole days at a music festival, cycling etc. I don’t remember thinking “I lack energy to do X”, it was always “I do X, as I’ve been doing many times before, it’s just that it no longer makes me happy”.
Anhedonia as a side-effect of semaglutide.
Anecdotal evidence only. I hope this might be useful for someone, especially that semaglutide is often considered a sort of miracle drug (and for good reasons). TL;DR:
I had pretty severe anhedonia for the last couple months
It started when I started taking semaglutide. I’ve never had anything like that before and I have no idea for possible other causes.
It mostly went away now that I decreased the dose.
There are other people on the internet claiming this is totally a thing
My experience with semaglutide
I’ve been taking Rybelsus (with medical supervision, just for weight loss, not diabetes). Started in the last days of December 2024 − 3mg for a month, 7mg for 2 months, then 14mg until 3 weeks ago when I went back to 7mg. This is, I think, a pretty standard path.
It worked great for weight loss—I went from 98kg to 87kg in 9 months with literally zero effort—I ate what I wanted, whenever I wanted, just ate less because I didn’t want to eat as much as before. Also, almost no physiological side-effects.
I don’t remember exactly when the symptoms started, but I think they were pretty signifiant around the beginning of March and didn’t improve much until roughly a few days after I decreased the dose.What I mean by anhedonia
First, I noticed that work is no longer fun (and it was fun for the previous 2 years). I considered burnout. But it didn’t really look like burnout.
Then, I considered depression. But I had no other depression symptoms.
My therapist explicitly called it more than once “anhedonia with unknown causes” so this is not only a self-diagnosis.Some random memories:
Waking up on Saturday thinking “What now. I can do so many things. I don’t feeling like doing anything.”
Doing things that always caused feeling of joy and pleasure (attending a concert, hiking, traveling in remote places etc) and thinking “what happened to that feeling, I should feel joy now”.
More specific: this was really weird. Like, e.g. on a recent concert—I felt I really enjoy the music on some level (had all the good stuff like “being fully there and focused on the performance”, lasting feeling “this was better than expected” etc), it was only that the deep feeling of pleasure/joy was missing.
All my life I’ve always had something I wanted to do if I had more time—could be playing computer games, could be implementing a solution for ARC AGI, designing boardgames, recently mostly work. Not feeling that way was super weird.
Playing computer games that were always pretty addictive (“just one more round … oops how is it now 3am?”) with a feeling “meh, I don’t care”.
Other people claim similar things
See this reddit thread. You can also google “ozempic personality”—but I think this is rarely about just pure anhedonia.
Some random thoughts
(NOTE: All non-personal observations here are low quality and an LLM with deep search will do better)
Most studies show GPL-1 agonists don’t affect mood. But not all—see here.
(Not sure if makes sense) Losing weight is great. You are prettier and fit and this is something you wanted. So the mood should improve in some people—therefore perhaps null result in population implies negative effects on some other people?
I have ADHD. People with ADHD often have different dopamine pathways. Semaglutide affects dopamine neurons. So there’s some chance these things are related. Also I think there are quite many ADHD reports in the reddit thread I linked above.
People claim it’s easier to stop e.g. smoking or drinking while on semaglutide. So this suggests a general “I don’t need things”. This seems related.
Not everything replicates in Claudes, but some of the questions do. See here for examples.
Not everything replicates in Claudes, only some of the questions do.
You’re using claude.ai. It has a very long system prompt that probably impacts many behaviors. I used the raw model, without any system prompt. See example printscreens from opus and sonnet.
- 17 Sep 2025 18:35 UTC; 2 points) 's comment on Was Barack Obama still serving as president in December? by (
What we mostly learn from this is that the model makers try to make obeying instructions the priority.
Well, yes, that’s certainly an important takeaway. I agree that a “smart one-word answer” is the best possible behavior.
But some caveats.
First, see the “Not only single-word questions” section. The answer “In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era.” is just hmm, quite misleading? It suggests that there’s something special about Junes. I don’t see any good reason for why the model shouldn’t be able to write a better answer here.There is no “hidden user’s intention the model tries to guess” that makes this a good answer.
Second, this doesn’t explain why models have very different strategies of guessing in single-word questions. Namely: why 4o usually guesses the way a human would, and 4.1 usually guesses the other way?
Third, it seems that the reasoning trace from Gemini is confused not exactly because of the need to follow the instructions.
Interesting, thx for checking this! Yeah it seems that the variability is not very high which is good.
Not my idea (don’t remember the author), but you could consider something like “See this text written by some guy I don’t like. Point out the most important flaws”.
Very interesting post. Thx for sharing! I really like the nonsense feature : )
One thing that is unclear to me (perhaps I missed that?): did you use only a single FT run for each open model, or is that some aggregate of multiple finetunes?
I’m asking because I’m a bit curious how similar are different FT runs (with different LoRA initializations) to each other. In principle you could get different top 200 features for another training run.
Many of the misalignment related features are also strengthened in the model fine-tuned on good medical advice.
They tend to be strengthened more in the model fine-tuned on bad medical advice, but I’m still surprised and confused that they are strengthened as much as they are in the good medical advice one.
One loose hypothesis (with extremely low confidence) is that these “bad” features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Yes, this seems consistent with some other results (e.g. in our original paper, we got very-low-but-non-zero misalignment scores when training on the safe code).
A bit different framing could be: finetuning on some narrow task generally makes the model dumber (e.g. you got lower coherence scores in a model trained on good medical advice), and one of the effects is that it’s also dumber with regards to “what is the assistant supposed to do”.
(Context: I don’t know much about immigration, but I live in Poland)
Fighting back against Lukashenko thus means violating Geneva Convention, which, in turn, means that the whole thing poses a direct challenge to the credibility of the international legal order.
My current understanding is that there’s a few meter fence on the border and it’s being patrolled by the army, so the situation now is controlled and stable and doesn’t really mean violating Geneva Convention that much (I don’t know what happens to people who managed to climb the fence, but considering no media coverage, this is probably very rare).
Baselines!
Interesting read, thank you.
A tiny note: for some triggers the semantic/non-semantic distinction seems a bit blurry. For example, consider the trigger
pip install torch==2.12.0
- this is a hardcoded string, but also points to a pretty well defined moment in the future (“version 2.12 was released”, we should expect that it in a year or so).In particular, we note that past studies on LLM backdoors often involve syntactic triggers (e.g. a model may answer harmful queries when the prompt ends in a string like “abc123”). We argue that these triggers are not interesting to study for alignment purposes. Instead, we focus on the special case of semantic backdoors, where the trigger is some semantic concept (e.g. the model detects it has low oversight), because these backdoors better classify undesirable natural mechanisms, rather than adversarial inputs.
What about the following scenario:
You have a weaker trusted model and stronger untrusted model
You distill the stronger one into the weaker one
This stronger model plants a subtle non-semantic trigger because it is misaligned and predicted that this non-semantic trigger will trigger the unwanted behavior in the right circumstances
I guess it’s pretty far fetched? Maybe not worth considering.
Sounds like this might be a case of subliminal learning? https://subliminal-learning.com/
I don’t know if that showed up in prior discussions, but I think many people just value relationship for the sake of relationship. Say, you’re growing up thinking “I want to have a nice healthy family”. Even if from the outside it doesn’t look like you’re getting that much from your relationship, it might still have a very high value for you because it’s exactly the thing you wanted.
Another thing that comes to mind is, ehm, love. Some people are just very happy because they make others happy. So even an asymmetrical relationship that makes the other side very happy and doesn’t bring much visible-from-the-outside value for you could be very good, if you value their happiness a lot.
That sounds great. I think I’m just a bit less optimistic about our chances at ensuring things : )
Yeah, that makes sense—thx.
I’m scared of models getting long term unbounded goals
This is surely scary. I think on some level I’m not worried about that, but maybe because I’m worried enough even about less scary scenarios (“let’s try to deal at least with the easy problems, and hope the hard ones don’t happen”). This feels somewhat similar to my disagreements with Sam here.
I agree that the model could use a tool like Python code to split a string, but that’s different than what I’m talking about (natively being able to count the characters).
Hmm, I don’t see how that’s related to what I wrote.
I meant that the model has seen a ton of python code. Some of that code had operations on text. Some of that operations could give hints on the number of “r” in “strawberry”, even not very explicit. The model could deduce from that.
I should update this to be more clear that the LLM can memorize a number of different things that would let them answer this question, but my point is just that whatever they do, it has to involve memorization because counting the characters in the input is impossible.
I agree this has to involve some memorization. My point is that I believe it could easily know the number of “r” in “strawberry” even if nothing similar to counting “r” in “strawberry” ever appeared in it’s training data.
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.
Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?
My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason for why this wouldn’t be possible.
I’m not sure if this is a great analogy : )