I mentioned it in my standalone post but I’ll register a question here:
In the counterexamples for ‘Strategy: train a reporter that is useful for another AI’, the main difficulty is the ability for agents to hide information in human language somehow, given the many available degrees of freedom.
I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.
The question is, given multiple agents encoding hidden messages in the reporter’s answers, can we find a training strategy which penalizes or removes any kind of information in the answers which is not common to all agents?
My suggestion, based on my idea of internal note-taking rather than reporting per-se, would be to take the ‘notes’ on the internal state written by one agent and feeding these to another agent. This seems like it would eventually converge on them sharing the same hidden encoding though, albeit with more difficulty. Apologies if not clear more spelled out in the full post, but yeah, are there better proposals, or reasons why this kind of strategy wouldn’t work?
I mentioned it in my standalone post but I’ll register a question here:
In the counterexamples for ‘Strategy: train a reporter that is useful for another AI’, the main difficulty is the ability for agents to hide information in human language somehow, given the many available degrees of freedom.
I grant that this is a big risk but one advantage we have is that if we trained multiple agents, they would all be encoding hidden information, but most likely they would all encode this extra information in different ways.
The question is, given multiple agents encoding hidden messages in the reporter’s answers, can we find a training strategy which penalizes or removes any kind of information in the answers which is not common to all agents?
My suggestion, based on my idea of internal note-taking rather than reporting per-se, would be to take the ‘notes’ on the internal state written by one agent and feeding these to another agent. This seems like it would eventually converge on them sharing the same hidden encoding though, albeit with more difficulty. Apologies if not clear more spelled out in the full post, but yeah, are there better proposals, or reasons why this kind of strategy wouldn’t work?