This is great—thanks for putting such a clear taxonomy and explanation together. I’ve not come across ‘value shaping’ before, and it feels like a missing link between gradient hacking and generalisation hacking. The definition of value shaping:
> By anticipating the reward function, the model intentionally generates trajectories that earn high reward while including preferred values, backdoors, or steganographic triggers, forcing gradient descent to reinforce them.
>
> [...] effectively “implanting” a persistent backdoor into the resulting policy, even after the exploration hacking incentive is removed.
is amenable to being read both as “manipulating the loss landscape”, albeit by associating a particular cognitive pattern with high reward/low loss, and as “encod[ing] information causing arbitrary OOD policies after [training]”. Certainly in generalisation hacking the framing was that “the model’s goal is to reinforce or maintain a particular behavior, not to avoid [low loss]”.
Is this a valid way to think about this or am I missing the mark?