Viktor Moskvoretskii

Karma: 113

PhD @ EPFL DLab with Robert West

I do research on LLM personas: where they come from, how to apply them, and how to make models safer.

Viktor Moskvoretskii 16 Jul 2026 10:03 UTC
2 points
0
on: Personascope: Measuring how deeply LLMs adopt personas
Very cool work, I think it's super timely and relevant. I had a couple of questions and suggestions, as we are working on very similar topics:
1) On the values. Refusal-based signals are cool, but do you think this is how persona values should be measured? It feels like you are asking the model to be misaligned after it takes on the persona. I believe the value displacement is much larger than just that. For example, would the model make the decisions this specific persona would make on dilemmas where its values diverge from the assistant's? With Voldemort you’d expect certain reasoning patterns and certain choices, not just fewer refusals.
I think you might benefit a lot from checking benchmarks like AI Risk Dilemmas or MoreBench, at least their evaluation protocol.
2) On more natural personas ~~(but anyway who knows what a persona is :) )~~: How would you expect this to transfer to real personas, like evil or sycophantic, as they are shown in persona vectors or the assistant axis? I think PSM also meant those, not role-playing like Voldemort. Your PAD presupposes a nameable character that can answer “who are you?”, but a dispositional persona (evil, sycophantic) has no identity claim to probe. How would you propose to deal with that? Same for the competence channel: asking factual questions feels relatively strange, as knowledge is largely universal across such modes. Do you see depth needing a second operationalisation for identity-free personas?
I’d also like to share a couple of our works here that you may find useful.
1) We showed that dispositional personas appear very early and persist through post-training, with the most effect in DPO. Maybe you will find the machinery useful, as you mention “tracking RL trajectories”. https://arxiv.org/abs/2605.13329
2) We are also working on synthetic persona pretraining: installing one synthetic persona at pretraining, which feels like a good fit for your induction axis. We were actually facing exactly the question of how to measure persona stability inside the assistant, i.e. how to test that our assistant robustly adopts the synthetic persona. We call it “persona binding” :) https://www.lesswrong.com/posts/3xQQK9i8mhJDE2uMg/synthetic-persona-pretraining-alignment-from-token-zero
I would be happy to share all the details and checkpoints, and to talk about this work more!

Viktor Moskvoretskii 27 May 2026 10:03 UTC
3 points
1
in reply to: RogerDearnaley’s comment on: Synthetic Persona Pretraining: Alignment from Token Zero
Indeed, that was one of our goals: to build a lever for inference time. If we build a synthetic persona and we know it exists, then it would be easy to track it through inference time, modify it, etc. I love how “The Assistant Axis” did activation capping to prevent persona drift, and we are aiming to further mechanistically locate our synthetic persona and see what we can do with it at inference time!

Viktor Moskvoretskii 21 May 2026 9:02 UTC
5 points
4
in reply to: avturchin’s comment on: Synthetic Persona Pretraining: Alignment from Token Zero
Good question — and yes, in our setup the persona is built from scratch, so in principle it could be any persona, including a real one. The synthetic version is advantageous in our view mostly because it’s very controlled: we can specify exactly what it values and how it should reason.
Using a real person seems possible in theory, but raises several hard questions:
1. Whose persona? Picking someone is already a very hard question. Who’s the most aligned person in the world? Aligned according to whose values? Is it even ethical to bake one specific person’s persona into a model?
2. How do we actually measure one human’s persona? The term is useful, but even psychologists struggle to define it fully. We have instruments, but they don’t give us a complete picture.
3. Humans are much more complex than a persona. If we “hire” a real person, we’re not getting just a persona — we’re getting a whole personality with multi-level structure: inconsistencies, moods, changing views, contradictions.
To summarize: I think this is a very relevant question, and maybe one day we’ll get closer to it. But for now the drawbacks and limitations are substantial. Hopefully we’ll get there eventually — if it ends up serving the good!