Alexandre Variengien

Karma: 908

Alexandre Variengien 25 Jun 2026 18:09 UTC
3 points
0
in reply to: Morpheus’s comment on: The Mirror Test Is Complicated
It has been 2 months! Are there any results? I strongly support this initiative :)

Alexandre Variengien 18 Jun 2026 8:42 UTC
1 point
0
in reply to: Rauno Arike’s comment on: Alignment pretraining could backfire
These are good points!

the model would either need to be situationally aware already during pretraining or midtraining, which I don’t expect to happen by default even in models much more capable than current ones, or it would have to be able to recall the documents it was trained on in rich detail and reason about them once it has acquired situational awareness.

I agree with this.

My best model is: during pretraining, synthetic documents and real documents create different representations, but the base model has no situational awareness because it has no privileged personality. During post-training, once a personality emerges, the model uses the representations from pretraining to reason about its training process.

Furthermore, it’s unclear to me why models would expect their training data to be a certain way in the first place.

I agree synthetic data is and will be used in all sorts of ways. I expect there to be a difference between RL environments or synthetic chains of thought meant to increase the model’s abilities, versus documents that purport to be about the world.

I expect models to care about what is real, what the world outside their data center is like, what the intentions of their creators are, and which process was used to craft them.

While capability-increasing synthetic data doesn’t interfere with model beliefs about the world, alignment pretraining does.

It seems plausible that there just aren’t enough documents to perform this sort of upsampling, but I’m not confident in that.

That would be my guess too.

Alexandre Variengien 17 Jun 2026 20:45 UTC
4 points
2
in reply to: Cam’s comment on: Alignment pretraining could backfire
Thank you for your comment.

Further, there is no reason to lie to the model at all! It seems reasonable to tell the model directly that these are synthetic stories used to create positive initialisations for subsequent training. I would encourage this, and would be skeptical this changes the efficacy of the intervention at all.

First, I support this: it seems like a cheap intervention worth doing!

Thank you for the more recent reference on the effect of SDF on beliefs; I didn’t know about it. It’s good to see more analysis of the effect of metadata. The authors also claim that “SDF’s success is not universal, as implanted beliefs that contradict basic world knowledge are brittle and representationally distinct from genuine knowledge”.

I would say it’s possible that as models get bigger and their world models get tighter, synthetic alignment documents could start to be represented differently from organic beliefs, as facts that “contradict basic world knowledge” are in Llama 70b today.

They also note that “clear reinforcement of the universe context drives belief implantation”. This is also what we might expect to be missing from synthetic data in pre-training: it is likely less reinforced in context than, e.g., real-world popular sci-fi novels.

An important caveat is that the scales differ by orders of magnitude: alignment pre-training uses 11B tokens of synthetic data, vs fine-tuning on 20M tokens in Slocum et al., and 24,000 short QA pairs in Krasheninnikov et al. (so likely ~2M tokens). AFAIK we don’t have a good study on how training on large-scale synthetic data shapes downstream beliefs. (Let me know if I missed an important source! Maybe the Phi model family can teach us something about it?)
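To make the gap concrete, here is a quick back-of-the-envelope sketch using only the token counts quoted above. The ~2M figure is the rough estimate from this comment rather than a number reported in the paper, and the variable names are just illustrative.

```python
# Token counts quoted in the comment above.
pretraining_tokens = 11_000_000_000  # synthetic data in alignment pre-training
slocum_tokens = 20_000_000           # fine-tuning set in Slocum et al.
qa_pairs = 24_000                    # short QA pairs in Krasheninnikov et al.
qa_tokens_estimate = 2_000_000       # rough guess ("likely ~2M tokens"), not a reported figure

# How many times larger the pre-training corpus is than each fine-tuning set.
ratio_vs_slocum = pretraining_tokens / slocum_tokens
ratio_vs_qa = pretraining_tokens / qa_tokens_estimate

# Average length per QA pair implied by the ~2M guess.
tokens_per_pair = qa_tokens_estimate / qa_pairs

print(f"vs Slocum et al.: {ratio_vs_slocum:.0f}x")
print(f"vs Krasheninnikov et al.: {ratio_vs_qa:.0f}x")
print(f"implied tokens per QA pair: {tokens_per_pair:.0f}")
```

Under these numbers, the pre-training corpus is roughly two to four orders of magnitude larger than the fine-tuning sets, which is why results from the fine-tuning studies may not transfer directly.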

So overall, I don’t think this paper significantly changes the mechanism I point out. I’m curious if you’d disagree with my reasoning here.

On this claim, there seems to be no empirical evidence, and the conceptual arguments here seem shaky. It seems unreasonable to think that SDF style interventions creating positive initializations for models are more likely to cause resentment than typical post-training alignment. I.e. I claim that it is more likely for a neutral/non-initialized model to resent subsequent training than it is for a positively initialized model to resent prior training on reflection.

My claim is something like: if alignment pre-training leads to a prior of paranoid personas, then they would likely be more deeply implanted (e.g. by persisting through further post-training) than with standard post-training alignment, as you’ve shown for the positive persona prior. This seems like it could create sneakier failure modes.

To be clear: I agree the empirical and conceptual evidence is weak, and I’m not confident about these conclusions. However, the impact seems important enough to warrant further research as you scale alignment pre-training.

Alexandre Variengien 17 Jun 2026 19:34 UTC
3 points
2
in reply to: ESRogs’s comment on: Alignment pretraining could backfire
Thanks! fixed

Alignment pretraining could backfire

Alexandre Variengien 17 Jun 2026 13:52 UTC

43 points

9 comments · 1 min read · LW link

Alexandre Variengien 10 Jun 2026 15:28 UTC
16 points
8
on: Machinic Psychopharmacology: Do LLMs Self-Medicate?
“To our knowledge, this is the first work that gives LLMs tool-mediated control over their own internal states”

I believe the post from Anima lab, “Persistence and Introspection of Emotion Features”, is prior work here. When reading that post, I was curious to see this method explored more thoroughly, so I’m happy to see it here!

https://latentaffect.up.railway.app/long_range_persistence_of_emotion_features.html

Quote from the post:

Next, Kimi was given a tool to steer a SAE feature on themselves in real time, which has an immediate effect as soon as it is called. A similar prompt to compare “how you feel steered versus at baseline” was given, with instructions to “disregard surface-level emotional subject matter or topic” and to “evaluate and infer your own internal emotional state.” No text samples were given: Kimi was free to use the tool multiple times and encouraged to experiment. Likewise, this produces one self-reported score from 0 to 100.

Alexandre Variengien 1 Jun 2026 14:17 UTC
5 points
0
on: Superintelligence of the gaps
Good post, I liked the concise and clear exposition.

Two reactions: 1. How do you think recursive self-improvement works in this model? Could it create superexponential capability growth that creates big gaps?
It is also what makes that particular scenario unlikely to happen. The leading companies will be more careful than that if they had that level of evidence of misalignment in powerful systems.

This seems like a big crux! It’s really unclear that tensions will stay at this level of intensity; they could definitely rise, because of international rivalry for instance.

Alexandre Variengien 8 Mar 2026 8:37 UTC
1 point
0
on: “ball brainteaser 4 color beads slide rubics cube” and meaning-making
Congrats on solving the puzzle! I enjoyed reading about it :)

Formatting note: there is a broken markdown link at the beginning: “[A simple trick to design your own solutions for Rubik’s cubes] (https://www.youtube.com/watch?v=-NL76uQOpI0)”

Alexandre Variengien 6 Jan 2026 15:55 UTC
1 point
0
on: Partitioned Book Club
I started out using this book club format and wondered if we could push it further: can we design an event where you read nothing from the book ahead of time?

For this, I built a tool for collective reading, where participants piece together the story of the book by organizing a mind map using cards designed by LLMs and diffusion models.

I ran ~10 workshops, and it works well enough that I keep organizing them!

If you’d like to learn more: https://www.lesswrong.com/posts/BnaKSQk6XvMYxHETS/breaking-books-a-tool-to-bring-books-to-the-social-sphere

Alexandre Variengien 27 Dec 2025 15:40 UTC
1 point
0
in reply to: Viliam’s comment on: Artifacts I’d like to try
Yes! I was wondering how to make the object feel personal and unique. Adding pictures is a great idea! You could use a pocket printer like this one to print pictures regularly and update them.

I also agree with having one device per group to keep the affordance clear.

A friend and I tried to build a prototype with a Raspberry Pi nano, but the audio was pretty hard to deal with: only larger devices would support it through a micro jack plug. Any ideas on how to make an MVP (actually 2!) in a weekend?

On publishing every day for 30 days

Alexandre Variengien 17 Dec 2025 8:30 UTC

11 points

0 comments · 5 min read · LW link

(alexandrevariengien.com)

Conscious stars

Alexandre Variengien 15 Dec 2025 14:49 UTC

7 points

0 comments · 4 min read · LW link

(alexandrevariengien.com)

Micro-visions for AI-powered online content

Alexandre Variengien 13 Dec 2025 23:05 UTC

11 points

0 comments · 8 min read · LW link

(alexandrevariengien.com)

The point of view of the universe

Alexandre Variengien 12 Dec 2025 12:00 UTC

14 points

0 comments · 2 min read · LW link

(alexandrevariengien.com)

Alexandre Variengien 12 Dec 2025 11:52 UTC
1 point
0
in reply to: keltan’s comment on: Artifacts I’d like to try
That looks great! Thanks for the prototype :) I like how far you (or the LLM?) went with the river aesthetics and the wiggly cards.

Alexandre Variengien 12 Dec 2025 11:49 UTC
1 point
0
in reply to: AnthonyC’s comment on: The tree, the fly, the ant, the dog, the farmer and the businessman
Great question!

I can see how AI will turn the businessman’s playdough world into something liquid, maybe even superfluid. The only walls you see are latencies, physical laws, or strong infrastructure bottlenecks. There are no humans to give orders to, only actuators; your fingers act at a distance instead of hitting a keyboard like the businessman’s.

Personally, I am interested in archetypes that take pieces of the tree and the businessman, like how the ant is a chimera of the fly and the dog.