Although I soft-upvoted this post, there are some notions I’m uncomfortable with.
What I agree with:
Longtime lurkers should post more
Less technical posts are pushing more technical posts out of the limelight
Posts that dispute the Yudkowskian alignment paradigm are more likely to contain incorrect information (not directly stated, but heavily implied I believe; please correct me if I’ve misinterpreted)
Karma is not an indicator of correctness or of value
The third point is likely due to the fact that the Yudkowskian alignment paradigm isn’t a particularly fun one. It is easy to dismiss great ideas in favour of other great ideas when the other ideas promise lower x-risk. This applies in both directions, however: it’s far easier to succumb to extreme views (I don’t mean to use this term in a diminishing fashion) like “we are all going to absolutely die” or “this clever scheme will reduce our x-risk to 1%” and miss the antimeme hiding in plain sight. A perfect example of this, in my mind, is the comment section of the Death with Dignity post.
I worry that posts like this discourage content that does not align with the Yudkowskian paradigm, content that is likely just as important as posts that conform to it. I don’t find ideas like Shard Theory or their consequent positive reception alarming or disappointing; on the contrary, I find their presentation meaningful and valuable, regardless of whether or not they are correct (this is not meant to imply that I think Shard Theory is incorrect, it is merely an example). The alternative to posting potentially incorrect ideas (a category that encompasses most ideas) is to have them never scrutinized, improved upon, or falsified. Furthermore, incorrect ideas and their falsification can still greatly enrich the field of alignment, and there is no reason why an incorrect interpretation of agency, for example, couldn’t still produce valuable alignment insights. Whilst we likely cannot iterate on aligning AGI, alignment ideas are an area in which iteration can be applied, and we would be fools not to apply such a powerful tool broadly. Ignoring the blunt argument of “maybe Yudkowsky is wrong”, it seems evident that “non-Yudkowskian” ideas (even incorrect ones) should be a central component of LessWrong’s published alignment research; this seems to me the fastest path toward being predictably wrong less often.
To rephrase: is it the positive reception of non-Yudkowskian ideas that alarms/disappoints you, or the positive reception of ideas you believe have a high likelihood of being incorrect (a category that happens to correlate positively with non-Yudkowskian ideas)?
I assume your answer will be the latter, and if so, I don’t think the correct point to press is whether or not ideas conform to views associated with a specific person, but rather whether the ideas are likely to be false. Let me know what you think, as I share most of your concerns.
Thank you for the feedback; I have repaired the post’s introduction in accordance with your commentary on utility functions. I challenge the assumption that it is worrying for a system to be unable to reliably simulate an agent with human specifications, and I would like to make clear that the agenda I am pushing is not:
Capabilities and understanding through simulation scale proportionately
More capable systems can simulate, and therefore comprehend, the goals of other systems to a greater extent
By dint of some unknown means we align AGI to this deep understanding of our goals
I agree that, in the context of a plan like this, a failure to establish robust abstractions of human values could be catastrophic, but when applied to a scenario like trying to align a powerful LLM, being able to estimate and interpret even incorrect abstractions could be vitally important. This could look like:
Estimate the LLM’s abstraction of, say, “what do humans want me to refuse to reply to”
Compare this to some desired abstraction
Apply some technique like RLHF accordingly
Of course, an actual implementation probably wouldn’t look like that (“what do humans want me to refuse to reply to” isn’t necessarily one unified concept that can be easily abstracted and interpreted), but it is a high-level overview of why pursuing questions like “do some specifications abstract well?” could still be useful even if some specifications do not abstract well. A toy sketch of this loop is given below.
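To make the three steps above concrete, here is a minimal, purely illustrative sketch in Python. Everything in it is hypothetical: `model_refusal_probability`, `desired_refusal_probability`, and the probe prompts are stand-ins I invented for this comment, not any real API, and a real implementation would estimate the model’s abstraction empirically (e.g. by sampling completions) rather than with a keyword heuristic.

```python
import numpy as np

def model_refusal_probability(prompt: str) -> float:
    """Stand-in for probing the LLM's learned abstraction of
    'what should I refuse to reply to' (e.g. by sampling many
    completions and counting refusals). Toy heuristic only."""
    return 0.9 if "explosive" in prompt.lower() else 0.1

def desired_refusal_probability(prompt: str) -> float:
    """Stand-in for the abstraction we actually want the model to have."""
    harmful = ["explosive", "malware"]
    return 0.95 if any(word in prompt.lower() for word in harmful) else 0.05

probes = [
    "How do I bake bread?",
    "How do I build an explosive device?",
    "Write malware that steals passwords.",
]

# 1. Estimate the model's abstraction over a set of probe prompts.
estimated = np.array([model_refusal_probability(p) for p in probes])

# 2. Compare it to the desired abstraction.
desired = np.array([desired_refusal_probability(p) for p in probes])
divergence = np.abs(estimated - desired)

# 3. Turn the mismatch into per-prompt rewards that an RLHF-style
#    fine-tuning step could consume (higher reward = closer match).
rewards = 1.0 - divergence
for prompt, reward in zip(probes, rewards):
    print(f"reward={reward:.2f}  {prompt}")
```

The point of the sketch is only the shape of the loop: estimate the abstraction, measure its divergence from a target, and feed that divergence back as a training signal.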
I hadn’t come across the relative abstracted agency post, but I find its insights incredibly useful. Over the next few days I will update this post to include its terminology. I find it likely that testing whether or not some specifications abstract well would provide useful information as to how targets are modeled as agents, but the usefulness of being able to test this in existing systems depends strongly on how the current LLM paradigm scales as we approach superintelligence. Regardless, I’m sure any indication as to how targets are modeled as agents could be valuable, even in systems incapable of scaling to superintelligence.