Thanks for the comment! Naively I feel like dropout would make things worse for the reason that you mentioned and anti-dropout better, but I’m definitely not an expert on this stuff.
I’m not sure I totally understand your first idea. Is the idea something like:
- Feed some images through a NN and record which neurons have high average activation on them
- Randomly pick some of those neurons and record which dataset examples cause them to have a high average activation
- Pick some subset of those images and iterate until convergence?
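To make sure I'm reading the loop above the same way you are, here is a minimal sketch of it. Everything in it is my own assumption rather than your proposal: the activations are a precomputed `acts` matrix (examples by neurons) filled with synthetic data, "high average activation" means top-k by mean, and the function name, the `pool_factor` parameter, and the stopping rule (the example set stops changing, or `max_iters` is hit) are hypothetical.

```python
import numpy as np


def iterate_neurons_and_examples(acts, seed_examples, n_neurons=2, n_examples=4,
                                 pool_factor=2, max_iters=20, rng=None):
    """Alternate between 'neurons that fire on these examples' and
    'examples that fire these neurons' (hypothetical helper).

    acts: array of shape (num_examples, num_neurons) with recorded activations.
    """
    rng = np.random.default_rng() if rng is None else rng
    examples = np.sort(np.asarray(seed_examples))
    neurons = np.array([], dtype=int)
    for _ in range(max_iters):
        # Step 1: neurons with high average activation on the current examples.
        neuron_scores = acts[examples].mean(axis=0)
        neuron_pool = np.argsort(neuron_scores)[::-1][:n_neurons * pool_factor]
        # Step 2: randomly keep some of those neurons, then score every
        # dataset example by its average activation on them.
        neurons = np.sort(rng.choice(neuron_pool, size=n_neurons, replace=False))
        example_scores = acts[:, neurons].mean(axis=1)
        example_pool = np.argsort(example_scores)[::-1][:n_examples * pool_factor]
        # Step 3: pick a subset of the high-scoring examples and repeat.
        new_examples = np.sort(rng.choice(example_pool, size=n_examples, replace=False))
        if np.array_equal(new_examples, examples):
            break  # the example set stopped changing
        examples = new_examples
    return examples, neurons


# Synthetic activations (an assumption for illustration): examples 0-9 mostly
# drive neurons 0-4, and examples 10-19 mostly drive neurons 5-9.
data_rng = np.random.default_rng(0)
acts = 0.1 * data_rng.random((20, 10))
acts[:10, :5] += 1.0
acts[10:, 5:] += 1.0

examples, neurons = iterate_neurons_and_examples(
    acts, seed_examples=[0, 1, 2], rng=np.random.default_rng(1))
print("examples:", examples, "neurons:", neurons)
```

On this toy data the loop stays inside the cluster that the seed images came from. Because of the random subsampling it may never reach an exact fixed point, which is why the sketch also caps the number of iterations.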
Likewise, thanks for taking the time to write such a long comment! And hoping that’s a typo in the second sentence :)
Fair enough! Since I’m pretty new to thinking about this stuff, my main goal was to convince myself and organize my own thoughts around this topic. I find that writing a review is often a good way to get up to speed on something. Then once I’d written it, it seemed like I might as well post it somewhere.
Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
Yeah, I’d add that even if we had a similar hardware-based forecast for mapping the human connectome, there would still be a lot that we don’t know about the dynamics there too. I have the impression that basically all ways to forecast things in this space have to make some non-obvious (to me) assumption that business as usual will scale up to strong AI without a need for qualitative breakthroughs.
I agree, but think that transparency is doing most of the work there (i.e. what you say sounds more to me like an application of transparency than like scaling up the way that verification is used in current models). But this is just semantics.
Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn’t expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
On the other hand, none of this would apply to the “alien in a box” model, which would basically be continuously training by my definition (though in that case, we could still patch the solution by monitoring the AI in real time). So maybe it was a poor choice of words.
These two comments seem related so let me reply to them together. I think what you’re asking here is “how can we be sure that a ‘research accelerator’ AI, trained to help with a self-contained AI safety agenda such as transparency, will produce solutions that we can understand before we implement them [so as to avoid getting tricked into implementing something that turns out to be bad, as in your first quote]?” And I would answer that I’ve made an assumption that knowledge is universal and new ideas are discovered by incrementally building on existing ones. This is why basically any student today knows more about science than the smartest people from a century ago, and on the flip side, I think it would constrain how far beyond us the insights from early AGIs trained on our work could be. Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even if we didn’t come up with them.
Agree that there’s a (strong!) assumption being made that “research accelerators for narrow agendas” will come before potentially dangerous AI systems. I think this might actually be a weak point of my story. Rohin asked something similar in the second bullet-point of his comment so I’ll try to answer there...