Across many religious and philosophical traditions, dualistic thinking (treating self and world as strictly separate) is seen as a root of our troubles. By analogy, a natural response to risks from today’s AI is to reduce dualism in what we train and how we steer models: curate data that emphasizes interdependence, and use mechanistic-interpretability tools to spot and soften internal splits like “agent vs. environment.”
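One crude way to picture the second intervention, purely as an illustrative sketch and not a worked-out method: if a probe had already identified a hidden-layer direction that encodes an agent-vs-environment distinction, you could try projecting that direction out of the activations at inference time. The model, layer, and direction below are all stand-ins.

```python
# Hypothetical "soften the split" intervention: project a supposed
# agent-vs-environment direction out of a hidden layer at inference time.
# The model is a toy stand-in and the direction is random, where a real
# attempt would take it from a trained probe.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))  # toy stand-in

split_dir = torch.randn(32)            # pretend this came from a linear probe
split_dir = split_dir / split_dir.norm()

def ablate_direction(module, inputs, output):
    # remove the component of the activation along the "split" direction
    return output - (output @ split_dir).unsqueeze(-1) * split_dir

handle = model[1].register_forward_hook(ablate_direction)  # hook after the Tanh
x = torch.randn(5, 8)
print(model(x))   # forward pass with the split direction projected out
handle.remove()
```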
Alright, if that’s the core of what you’ve got: I can see an interesting hunch here, but it still seems very early in the nailing-down process. Since this is hunch-level stuff at the moment, have you seen either the self-other overlap pitch[1] or towards scale-free agency?[2] Both seem like, if they get worked through carefully, they might turn out a similar insight to what you’d get by systematizing your pitch and checking whether it does what you want. I’m still skeptical overall, but progress might look like trying out by-construction toy models with non-dual language and working through whether they actually behave as you’d hope, or maybe training a tiny neural network and mech-interpreting it: something that lets us get a sense of whether this has a shot at doing the thing it seems to do for humans, and whether that’s actually a good thing to do.
[1] Which I also don’t think has worked in an asymptotic-alignment sense yet, but might be relevant at the prosaic stage and might somehow turn into asymptotic alignment.
[2] Which seems promising to me in terms of potentially turning out components that are relevant to asymptotic alignment.
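As a very rough illustration of the “train a tiny neural network and mech-interp it” direction (not anything from the linked posts; the toy task, feature names, and probe here are all made up): train a small net on a task that mixes “agent” and “environment” features, then check whether the two stay linearly decodable along distinct directions in the hidden layer.

```python
# Crude toy probe: train a tiny MLP on a synthetic task that mixes "agent" and
# "environment" features, then check how each is linearly decodable from the
# hidden layer. Everything here (task, features, thresholds) is illustrative.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

n = 4096
agent = rng.normal(size=(n, 4))   # hypothetical "agent state" features
env = rng.normal(size=(n, 4))     # hypothetical "environment" features
x = np.concatenate([agent, env], axis=1).astype(np.float32)
# Target depends on an interaction of agent and environment features
y = ((agent[:, 0] * env[:, 0] + agent[:, 1] - env[:, 1]) > 0).astype(np.float32)

model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
xt, yt = torch.from_numpy(x), torch.from_numpy(y).unsqueeze(1)

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(xt), yt)
    loss.backward()
    opt.step()

# Grab hidden activations and fit linear probes for an agent-only and an
# environment-only property.
with torch.no_grad():
    hidden = torch.tanh(model[0](xt)).numpy()

def fit_probe(h, target):
    # least-squares linear probe; returns probe direction and train accuracy
    design = np.c_[h, np.ones(len(h))]
    w, *_ = np.linalg.lstsq(design, target, rcond=None)
    pred = (design @ w) > 0.5
    return w[:-1], (pred == (target > 0.5)).mean()

w_agent, acc_agent = fit_probe(hidden, (agent[:, 0] > 0).astype(float))
w_env, acc_env = fit_probe(hidden, (env[:, 0] > 0).astype(float))
cos = np.dot(w_agent, w_env) / (np.linalg.norm(w_agent) * np.linalg.norm(w_env))
print(f"agent-probe acc={acc_agent:.2f}  env-probe acc={acc_env:.2f}  cos(probe dirs)={cos:.2f}")
```

Whether “both properties decodable along near-orthogonal directions” is even the right operationalization of an internal self/world split is exactly the kind of thing such a toy experiment would need to settle.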
Yes, I have a few experiments in mind to see if it’s worth exploring further. Thanks for sharing the links!