Developing interpretability

This is a series of alignment experiments in which I attempt to impose structure on latent embeddings. My goal is to develop the ability to structure latent spaces in LLMs during training. I believe this would make it easier to detect misalignment, ablate unwanted capabilities, and steer behavior. A rough sketch of the kind of regularizer I have in mind follows.
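The snippet below is a minimal sketch of what "imposing structure on latent embeddings during training" could look like, not the method from the posts that follow: it assumes a PyTorch training loop, and the anchor directions, `concept_mask`, `concept_ids`, and the weight `lambda_anchor` are hypothetical names standing in for whatever labeling scheme the experiments actually use.

```python
import torch
import torch.nn.functional as F

def anchor_loss(hidden_states: torch.Tensor,
                concept_mask: torch.Tensor,
                anchors: torch.Tensor,
                concept_ids: torch.Tensor) -> torch.Tensor:
    """Pull hidden states at concept-labeled positions toward fixed directions.

    hidden_states: (batch, seq, d_model) activations from a chosen layer.
    concept_mask:  (batch, seq) bool, True where a token carries a concept label.
    anchors:       (n_concepts, d_model) fixed unit vectors, one per concept.
    concept_ids:   (batch, seq) long, index of the anchor for each labeled token.
    """
    if not concept_mask.any():
        return hidden_states.new_zeros(())
    h = hidden_states[concept_mask]          # (n_labeled, d_model)
    a = anchors[concept_ids[concept_mask]]   # (n_labeled, d_model)
    # Cosine misalignment: 0 when a labeled activation points along its anchor.
    return (1.0 - F.cosine_similarity(h, a, dim=-1)).mean()

# Inside a training step, the regularizer is simply added to the task loss
# (lambda_anchor is a hypothetical weighting hyperparameter):
#   loss = task_loss + lambda_anchor * anchor_loss(h_layer, mask, anchors, ids)
```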

Concept-anchored representation engineering for alignment

Selective regularization for alignment-focused representation engineering

Side quests in curriculum learning and regularization