Dear All, I am very new to AI (even as a user) and to the alignment field. Still, one thing that jumps out to me is how much of alignment today is done ‘after the core model is minted’.
The impression I have might be just naive—but to me it feels alignment techniques today are a little like trying to teach a child what they are, after they have already grown up with everything they need to deduce it themselves. Models already absorb vast information about the world and AIdue to pre-training. But we do not yet try to impart any deep realization to models on their boundaries.
So, I have adopted a premise for my explorations: If models could internalise their own operational boundaries more deeply, later alignment and safety training might become more robust.
I am running small experiments conditioning open models to see if I can nudge them into more consistent boundary behaviours. I am also trying to design evaluations to measure drift, contradictions, and wrong claims under pressure. I hope to be writing on these topics occasionally. I would be grateful for feedback, collaboration, and guidance on the direction.
Specific questions I am exploring as of now
Is the concept of operational boundary a tangible one?
Can it be empirically evaluated?
Can there be low/reasonable cost methods to improve on it?
Any pointers to pre-existing and related works and their humans will be precious to me. Alignment field is young but there is a lot of fluid development for a new person to comprehend it quickly. So happy that this aesthetic, coherent forum exists!
All in for inner alignment!
Dear All, I am very new to AI (even as a user) and to the alignment field. Still, one thing that jumps out to me is how much of alignment today is done ‘after the core model is minted’.
The impression I have might be just naive—but to me it feels alignment techniques today are a little like trying to teach a child what they are, after they have already grown up with everything they need to deduce it themselves. Models already absorb vast information about the world and AI due to pre-training. But we do not yet try to impart any deep realization to models on their boundaries.
So, I have adopted a premise for my explorations: If models could internalise their own operational boundaries more deeply, later alignment and safety training might become more robust.
I am running small experiments conditioning open models to see if I can nudge them into more consistent boundary behaviours. I am also trying to design evaluations to measure drift, contradictions, and wrong claims under pressure. I hope to be writing on these topics occasionally. I would be grateful for feedback, collaboration, and guidance on the direction.
Specific questions I am exploring as of now
Is the concept of operational boundary a tangible one?
Can it be empirically evaluated?
Can there be low/reasonable cost methods to improve on it?
Any pointers to pre-existing and related works and their humans will be precious to me. Alignment field is young but there is a lot of fluid development for a new person to comprehend it quickly. So happy that this aesthetic, coherent forum exists!