[Question] What AI safety problems need solving for safe AI research assistants?

In his AI Safety “Success Stories” post, Wei Dai writes:

[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?

I share Wei Dai’s intuition that the Research Assistant path is neglected, and I want to better understand the safety problems involved in this path.

Specifically, I’m envisioning AI research assistants, built without any kind of reinforcement learning, that help AI alignment researchers identify, understand, and solve AI alignment problems. Some concrete examples:

Possible with yesterday’s technology: Document clustering that automatically organizes every blog post about AI alignment. Recommendation systems that find AI alignment posts similar to the one you’re reading and identify connections between the thinking of various authors.

May be possible with current or near-future technology: An AI chatbot, trained on every blog post about AI alignment, which makes the case for AI alignment to skeptics or attempts to shoot down FAI proposals. Text summarization software that compresses a long discussion between two forum users in a way that both feel is accurate and fair. An NLP system that automatically organizes AI safety writings into a problem/solution table as I described in this post.

May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs. An AI system that augments the problem/solution table from this post with new rows and columns generated based on original reasoning.
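
To make the "yesterday's technology" tier concrete, here is a minimal sketch of a recommender that finds the posts most similar to the one you're reading. It is only an illustration under stated assumptions: TF-IDF weighting with cosine similarity is one standard choice picked for the example, not a method named above, and the three sample posts and all function names (`tokenize`, `tfidf_vectors`, `cosine`, `most_similar`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, docs, top_k=3):
    """Return indices of the top_k documents most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]


# Hypothetical stand-ins for alignment blog posts (invented for this example).
posts = [
    "Reward hacking happens when an agent exploits flaws in its reward function.",
    "Specification gaming: the agent exploits a flawed reward function instead of doing the task.",
    "Interpretability tools inspect neurons and circuits inside a trained network.",
]

print(most_similar(0, posts, top_k=1))  # -> [1]
```

A system like this only ranks existing text by word overlap, which is part of why it sits in the least capable tier: it has no model of the reader and generates nothing new.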

What safety problems are involved in creating research assistants of this sort? I’m especially interested in safety problems which haven’t yet received much attention, and safety problems with advanced assistants based on future breakthroughs.
