RSS

Marc Carauleanu

Karma: 662

AI Safety Researcher @AE Studio

Currently researching a neglected prior for cooperation and honesty inspired by the cognitive neuroscience of altruism called self-other overlap in state-of-the-art ML models.

Previous SRF @ SERI 21′, MLSS & Student Researcher @ CAIS 22′ and LTFF grantee.

LinkedIn

Mis­tral Large 2 (123B) seems to ex­hibit al­ign­ment faking

27 Mar 2025 15:39 UTC
81 points
4 comments13 min readLW link

Re­duc­ing LLM de­cep­tion at scale with self-other over­lap fine-tuning

13 Mar 2025 19:09 UTC
156 points
42 comments6 min readLW link