I certainly agree with your sentiment on the OH, but I would say the situation is perhaps worse than it simply not being true. For a start, there are the weak and strong versions on the original definition page. To me this invites misunderstandings and motte-and-bailey arguments from the start; there is no strong/weak version of E=mc^2. I think it makes a valid general point but does a bad job of it, leading people either to unfairly reject the whole system of claims or to believe it says more than it does. As a result, essentially everyone new to the field (say, on X) misinterprets it, assumes it's wrong, and distrusts much of the rest of the alignment literature as a result.
To me the goal of OH-related discussions is to make people realize that AIs will have a wider range of potential goals and values than humans, even under self-reflection, and that the goals they appear to have may not be the goals they actually have, for reasons that apply to AIs but not to people. That is a cause for concern, monitoring, etc.
If I were writing it, I would start with the "normie" position that humans have somewhat different goals which change under self-reflection, and discuss how that would be different in AIs. The weak form of the OH would be a theoretical footnote (it's true but irrelevant), and the strong form is too strong to be justified. The crux is then: how does self-reflection change things? Everyone accepts that it will shrink the space of possible minds/goals, but by how much? We want to know whether encouraging self-reflection is desirable, and in what situations, not the more general question of what the mind space looks like in some abstract sense. I would then lead readers from their more general intuitions about such matters to the specifics of why things may be more dangerous than they appear, rather than start with seemingly irrelevant and unjustified symbolism.
Thanks for the info. You mentioned in #4 "Review my LessWrong essays .. ". That got me thinking about applying LLM tech to the whole site. I have sometimes remembered a blog post from a year or so back but not been able to find it with any search tools, as they were not semantic enough. I would also like to debate with an LLM that takes on the worldview of any author or combination of authors, or even the whole site, as if it had a persona.
Currently, a leading model doesn't know this literature in enough detail to do that (do we know how much of LW was used directly in the SOTA models' training data?).
There are various options:
1. Fine-tune an open-weights model
Do this on all LW content (comments are sometimes just as helpful as the posts). I don't expect this to be affordable, and the performance would be significantly worse than SOTA models.
2. Use RAG etc. and integrate with a SOTA model
This would seem to be more useful, though still somewhat expensive, I expect. The app could pull all relevant blog posts etc. into the context before you start a discussion with it (rough sketch below).
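Something like the following is what I have in mind for option 2. A very rough sketch, assuming the posts/comments are already scraped into a list of strings; the embedding model is just an example and `call_llm` is a placeholder for whatever SOTA API gets used:

```python
# Rough sketch of option 2: embed LW posts once, then pull the most relevant
# ones into the context before handing the question to a SOTA model.
# "all-MiniLM-L6-v2" is just an example embedding model; call_llm() is a
# placeholder for whichever hosted model/API is actually used.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(posts: list[str]) -> np.ndarray:
    # One embedding per post/comment; in practice long posts would be chunked.
    return embedder.encode(posts, normalize_embeddings=True)

def retrieve(query: str, posts: list[str], index: np.ndarray, k: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                        # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [posts[i] for i in top]

def answer(query: str, posts: list[str], index: np.ndarray) -> str:
    context = "\n\n---\n\n".join(retrieve(query, posts, index))
    prompt = f"Relevant LessWrong excerpts:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)                   # placeholder for the SOTA model call
```

The "debate with an author's persona" idea would just be a different prompt on top of the same retrieval step, restricting the index to that author's posts.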
It would also be useful to see things like how sentiment toward various concepts changes across the years. That can only be done with automatic tools.
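As a rough illustration of the kind of automatic tool I mean (the off-the-shelf sentiment model is just an example, and posts are assumed to be available as (year, text) pairs):

```python
# Rough sketch: average sentiment toward a concept, per year, across posts.
# The default sentiment-analysis pipeline is just an example model; posts are
# assumed to be (year, text) pairs already scraped from the site.
from collections import defaultdict
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")

def sentiment_by_year(posts: list[tuple[int, str]], concept: str) -> dict[int, float]:
    scores = defaultdict(list)
    for year, text in posts:
        if concept.lower() not in text.lower():
            continue
        result = sentiment(text, truncation=True)[0]   # truncate long posts to the model limit
        signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
        scores[year].append(signed)
    return {y: sum(v) / len(v) for y, v in sorted(scores.items())}

# e.g. sentiment_by_year(posts, "orthogonality thesis")
```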
I assume these kinds of features have been discussed, but I am not aware of it.