Alright, if that’s the core of what you’ve got, I can see there being an interesting hunch here. It still seems very early in the process of nailing it down. Since this is hunch-level stuff at the moment, have you seen either the self-other overlap pitch[1] or towards scale-free agency?[2] Both seem like, if they get worked through carefully, they might turn out a similar insight to what you’d get from systematizing your pitch and checking whether it does what you want. I’m still skeptical overall, but progress might look like trying out by-construction toy models with non-dual language and working through whether they actually behave as you’d hope, or maybe training a tiny neural network and doing mechanistic interpretability on it. Something that lets us get a sense of whether this has a shot at doing the thing it seems to do for humans, and whether that’s actually a good thing to do.
[1] which I also don’t think has worked in an asymptotic alignment sense yet, but might be related in the prosaic stage and might somehow turn into asymptotic alignment
[2] which seems promising to me in terms of potentially turning out components that are relevant to asymptotic alignment
Yes, I have a few experiments in mind to see if it’s worth exploring further. Thanks for sharing the links!