Is anyone at MIRI or Anthropic creating diagnostic tools for monitoring neural networks? Something that could analyze for when a system has bit-flip errors versus errors of logic, and eventually evidence of deception.
Chris Olah is/was the main guy working on interpretability research, and he is a co-founder of Anthropic. So Anthropic would definitely be aware of this idea.
I've not seen the bit-flip idea before, and Anthropic is quasi-alone on interpretability, so they might have missed it.
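One reason bit-flip errors are more tractable than logic errors or deception: they corrupt the raw bytes of the weights, so they can be caught with an ordinary integrity check, no interpretability needed. A minimal sketch of that idea (the tensor and checksum function here are hypothetical, not any actual MIRI or Anthropic tooling):

```python
import hashlib
import numpy as np

def checksum(weights: np.ndarray) -> str:
    """Return a SHA-256 digest of the raw weight bytes."""
    return hashlib.sha256(weights.tobytes()).hexdigest()

# Hypothetical weight tensor, stored as float32.
w = np.ones((4, 4), dtype=np.float32)
ref = checksum(w)  # recorded once, at load time

# Simulate a single hardware bit flip in the underlying buffer.
buf = w.reshape(-1).view(np.uint8)
buf[0] ^= 0b00000001

# The digest no longer matches, so the corruption is detectable
# without understanding anything about what the network computes.
assert checksum(w) != ref
```

A logic error or deceptive computation leaves the weights byte-identical to what was trained, so this kind of check says nothing about it; that is where interpretability tools would have to take over.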