Yeah, that’s fair. Two minor points then:
First is that I don’t really expect us to come up with a fully general answer to this problem in time. I wouldn’t be surprised if we had to trade off some generality for indexing on the system in front of us—this gets us some degree of non-robustness, but hopefully enough to buy us a lot more time before stuff like the problem behind deep deception breaks a lack of True Names. Hopefully then we can get the AI systems to solve the harder problem for us in the time we’ve bought, with systems more powerful than us. The relevance here is that if this is the case, then trying to generalize our findings to an entirely non-ML setting, while definitely something we want, might not be something we get, and maybe it makes sense to index lightly on a particular paradigm if the general problem seems really hard.
Second is that this claim as stated in your reply seems like a softer one (that I agree with more) than the one made in the overall post.
When I think of how I approach solving this problem in practice, I think of interfacing with structures within ML systems that satisfy an increasing list of desiderata for values, covering the rest with standard mech interp techniques, and then steering them with human preferences. I certainly think it’s probable that there are valuable insights to this process from neuroscience, but I don’t think a good solution to this problem (under the constraints I mention above) requires that it be general to the human brain as well. We steer the system with our preferences (and interfacing with internal objectives seems to avoid the usual problems with preferences) - while something that allows us to actually directly translate our values from our system to theirs would be great, I expect that constraint of generality to make it harder than necessary.
I think that it’s a technical problem too—maybe you mean we don’t have to answer weird philosophical problems? But then I would say that we still have to answer the question of what we want to interface with / how to do top-down neuroscience, which I would call a technical problem but which I think routes through things that I think you may consider philosophical problems (of the nature “what do we want to interface with / what do we mean by values in this system?”)