If I encounter a capabilities paper that kinda spooks me, what should I do with it? I’m inclined to share it as a draft post with some people I think should know about it. I have encountered such a paper, and I found it in a capabilities discussion group whose members would have no hesitation about using it to try to accumulate power for themselves, in denial about any negative effects it could have. It runs on individual computers.
Edit for capabilities folks finding this later: it was a small-potatoes individual paper again, lol. But the approach is one I still think is very promising.
There is an organizational structure being developed explicitly for handling this. In the meantime, please reach out to the EA community health team, attn: ‘AGI risk landscape watch team’. https://docs.google.com/forms/d/e/1FAIpQLScJooJD0Sm2csCYgd0Is6FkpyQa3ket8IIcFzd_FcTRU7avRg/viewform
(I’ve been talking to the people involved and can assure you that I believe them to be both trustworthy and competent.)
I think it’s really important for everyone to always have a trusted confidant, and to go to them directly with this sort of thing first before doing anything. It is in fact a really tough question, and no one will be good at thinking about this on their own. Also, for situations that might breed a unilateralist’s curse type of thing, strongly err on the side of NOT DOING ANYTHING.
I’d say share it here on LessWrong, in comments on relevant articles as well as in a post. That maximizes the upside of safety-oriented people knowing about it while minimally helping to popularize it among capabilities groups.
Email it to Anthropic?
And I guess they shouldn’t say too much in return, but they should at least indicate that they understand it. If they don’t, send it to the next friendliest practical alignment researchers. If they don’t signal understanding either, and you have to go far enough down your list that telling the next one on it would no longer be a net-positive act, then you’ll have a major community issue to yell about.
Currently there seems to be no good way to deal with this conundrum.
One wonders whether the AI alignment community should set up some sort of encrypted partial-sharing, partial-transparency protocol for these kinds of situations.
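To make that a bit more concrete: one primitive such a protocol might build on is a hash commitment, which lets you prove later that you knew about a result at a given time without revealing it up front. The sketch below is purely illustrative and assumes nothing about any existing community process; the function names and the choice of SHA-256 with a random salt are my own assumptions.

```python
# Minimal sketch of a hash commitment for "partial transparency":
# publicly commit to a document's contents now, reveal them later only if needed.
import hashlib
import secrets


def commit(document_bytes: bytes) -> tuple[str, bytes]:
    """Return (public commitment, private salt) for a document."""
    salt = secrets.token_bytes(32)  # random salt prevents brute-force guessing of short documents
    digest = hashlib.sha256(salt + document_bytes).hexdigest()
    return digest, salt


def verify(document_bytes: bytes, salt: bytes, commitment: str) -> bool:
    """Check that a revealed document matches the earlier public commitment."""
    return hashlib.sha256(salt + document_bytes).hexdigest() == commitment


if __name__ == "__main__":
    paper = b"summary of the worrying result"  # stand-in for the actual draft
    commitment, salt = commit(paper)
    print("post this publicly:", commitment)
    # later, share (paper, salt) privately with a trusted party, who runs:
    assert verify(paper, salt, commitment)
```

This only covers the "prove you knew it" half; the encrypted-sharing half (deciding who gets the actual contents, and under what conditions) is the genuinely hard social part and isn't addressed by anything this simple.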