Developing interpretability

This is a series of alignment experiments in which I attempt to impose structure on latent embeddings. My goal is to develop the ability to structure latent spaces in LLMs during training. I believe this would make it easier to detect misalignment, ablate unwanted capabilities, and steer behavior. A rough sketch of the kind of regularizer I have in mind follows.
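The snippet below is a minimal sketch of what "imposing structure on latent embeddings during training" could look like, not the method from the posts that follow: it assumes a PyTorch training loop, and the anchor directions, `concept_mask`, `concept_ids`, and the weight `lambda_anchor` are hypothetical names standing in for whatever labeling scheme the experiments actually use.

```python
import torch
import torch.nn.functional as F

def anchor_loss(hidden_states: torch.Tensor,
                concept_mask: torch.Tensor,
                anchors: torch.Tensor,
                concept_ids: torch.Tensor) -> torch.Tensor:
    """Pull hidden states at concept-labeled positions toward fixed directions.

    hidden_states: (batch, seq, d_model) activations from a chosen layer.
    concept_mask:  (batch, seq) bool, True where a token carries a concept label.
    anchors:       (n_concepts, d_model) fixed unit vectors, one per concept.
    concept_ids:   (batch, seq) long, index of the anchor for each labeled token.
    """
    if not concept_mask.any():
        return hidden_states.new_zeros(())
    h = hidden_states[concept_mask]          # (n_labeled, d_model)
    a = anchors[concept_ids[concept_mask]]   # (n_labeled, d_model)
    # Cosine misalignment: 0 when a labeled activation points along its anchor.
    return (1.0 - F.cosine_similarity(h, a, dim=-1)).mean()

# Inside a training step, the regularizer is simply added to the task loss
# (lambda_anchor is a hypothetical weighting hyperparameter):
#   loss = task_loss + lambda_anchor * anchor_loss(h_layer, mask, anchors, ids)
```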

Concept-anchored representation engineering for alignment

Selective regularization for alignment-focused representation engineering

Side quests in curriculum learning and regularization