Hello everyone,
I’m a practicing behavior analyst who has spent the past few years lurking in effective altruism spaces. I was first introduced to them by my partner, who works in the field and brought my attention to the causes of AI safety and alignment.
Once I became aware of the risks AI development poses, they felt impossible to ignore, and I’ve found myself increasingly drawn to the conversation about how best to mitigate them. That interest, paired with how often LessWrong comes up in EA circles, is what finally brought me here, and I’ve been blown away by the range and quality of the discussions being had.
Most of what I’ve read up to this point comes from 80,000 Hours and the literature they cite, so I’m glad to be widening that reading here, and I hope eventually to give something back.
What keeps pulling me in is how neatly some behavior-analytic ideas seem to map onto the development and alignment of these systems. To take one example: what’s often called reward hacking looks, from where I sit, a lot like unintended reinforcement, something we struggle with in our practice. The agent optimizes the contingency as written rather than as intended and, like any organism, finds the path of least resistance to the reinforcer, often one the designer never had in mind.
That’s left me with a working hypothesis: behavior analysts have something of a conceptual head start, and they represent a largely untapped population whose existing foundation could let them move into alignment work more readily.
I’m aware of an obvious objection. Early behaviorism was somewhat limited in its consideration and classification of internal states, whereas much of current alignment work involves trying to get inside a model’s representations and goals (interpretability, inner alignment). However, as behavior analysis has grown, so has its engagement with internal events, and I think that initial difference now represents a healthy tension. In fact, I feel as though early Skinnerian concepts like “private events” might map fairly well onto the opaque internal computation in LLMs, with chain-of-thought reasoning as the model’s attempt to tact it. The concept may even fit these systems better than the human subjects to which it was originally applied, though I’d love to hear how people here think about it.
In the meantime, I’d welcome any suggestions on further reading or concrete next steps for someone hoping to help with alignment. Thanks for having me.