Chatbots or set answers, not WBEs

A putative new idea for AI control; index here.

In a previous post, I talked about using a whole brain emulation (WBE) to define a safe output for a reduced impact AI.

I’ve realised that the WBE isn’t needed. Its only role was to ensure that the AI’s output could have been credibly produced by something other than the AI—“I’m sorry, Dave. I’m afraid I can’t do that.” is unlikely to be the output of a random letter generator.

But a whole WBE is not needed for that role. If the output is short, a chatbot with access to a huge corpus of human responses could function just as well. We can specialise it in the direction we need: if we are asking for financial advice, we can mandate a specialised vocabulary or train the chatbot on financial news sources.
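As a rough sketch of what such a specialised baseline could look like (the corpus, vocabulary, and function names below are all illustrative stand-ins, not part of the proposal):

```python
import random

# Tiny stand-in for the huge corpus of human responses (illustrative).
CORPUS = [
    "Diversify your portfolio across sectors.",
    "Bond yields look set to rise this quarter.",
    "I'm sorry, Dave. I'm afraid I can't do that.",
    "High dividend stocks suit income investors.",
]

# Specialising the baseline: mandate a financial vocabulary.
FINANCIAL_TERMS = {"stocks", "bond", "yields", "dividend", "portfolio"}

def specialised(corpus):
    """Keep only responses that use at least one mandated term."""
    return [r for r in corpus
            if FINANCIAL_TERMS & set(r.lower().rstrip(".").split())]

def chatbot_reply():
    """The baseline chatbot: a uniformly random specialised response."""
    return random.choice(specialised(CORPUS))
```

Note that the HAL quote gets filtered out here: mandating the vocabulary shrinks the response pool to the domain we care about.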

So instead of training the reduced impact AI to behave as the ‘best human advisor’, we are training it to behave as the ‘luckiest chatbot’. This lets us calculate the odds with greater precision, and has the advantage of not needing to wait for a WBE.
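For a uniform-choice chatbot over a fixed response pool, those odds are exact rather than estimated. A minimal sketch, with an illustrative pool and function name:

```python
from fractions import Fraction

def base_odds(answer, pool):
    """Exact probability that a uniform-choice chatbot emits `answer`:
    its multiplicity in the response pool, over the pool size."""
    return Fraction(pool.count(answer), len(pool))

# E.g. one matching response in a pool of four:
base_odds("Buy bonds.", ["Buy bonds.", "Sell.", "Hold.", "Wait."])  # -> 1/4
```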

For some questions, we can do even better. Suppose we have a thousand different stocks, and we ask which one will increase in value the most over the coming year. The ‘chatbot’ here is simply an algorithm that picks a stock at random. So we now have an exact base rate (1/1000) and a predetermined answer set for the AI.
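Concretely, the random-pick baseline and its base rate fit in a few lines (the ticker names are placeholders):

```python
import random
from fractions import Fraction

STOCKS = [f"STOCK_{i:04d}" for i in range(1000)]  # placeholder tickers

def chatbot_pick():
    """The 'chatbot' baseline: pick one of the thousand stocks at random."""
    return random.choice(STOCKS)

# Exact base rate for any given stock being the baseline's answer:
BASE_RATE = Fraction(1, len(STOCKS))  # = 1/1000
```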

[EDIT:] Another alternative is to get online users to submit answers to the question, and have the AI select the best answer from among them. If the AI is not turned on, a random submission is selected instead.
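A minimal sketch of that selection protocol, with an assumed interface:

```python
import random

def final_answer(submissions, ai_select=None):
    """Return the answer that gets released.

    submissions: answers submitted by online users.
    ai_select:   the AI's selection function if it was turned on,
                 None if it was not. (Illustrative interface.)
    """
    if ai_select is None:            # AI off: fall back to a random submission
        return random.choice(submissions)
    return ai_select(submissions)    # AI on: its choice of the best answer
```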