Thoughts on self-inspecting neural networks.

While I have been contemplating this subject for quite some time, this is my first attempt to communicate it publicly. I have been thinking about AI implementation as well as safety, looking at the ideas of various experts in the field and considering how they might be combined and integrated to improve performance or safety (preferably both). One important aspect of that is the legibility of how neural networks work. A YouTube video I watched last night helped me crystallize how this might be implemented: an interview with the authors of this paper on arXiv, in which they show how they were able to isolate a concept in GPT-3 to a particular set of parameters. Their demonstration makes what I’ve been thinking of seem that much more plausible.

So, my concept is for the integration of an ensemble of specialized neural networks that contribute to a greater whole. Specifically, in this case, I’m considering a network that is trained by watching, and eventually directing, the training of another network such as an LLM. My thought is to export an “image” of a neural network, with the pixels representing the parameters in the network. It would be a multidimensional image; standard RGB would probably have too few channels to represent everything I’m considering. Alternatively, multiple “views” might be used to maintain compatibility with standard image formats, letting us take advantage of existing tools and also display the images for some level of human consumption. It might be best to have a way to transition between multidimensional matrices and images so that each representation can be used where applicable. Because of the size of the networks, it will be necessary not only to have a base representation of the network but also to represent changes and activations within it as “diffs” away from the baseline (or at least away from the previous resultant state of the baseline plus previous diffs), something like the way Git works. We could also use various lossless and lossy image compression techniques to make these images more tractable to work with.
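To make the export-and-diff idea concrete, here is a minimal sketch in NumPy. Everything in it is hypothetical; the single-channel packing, zero-padding, and change threshold are assumptions of mine, not an existing format:

```python
import numpy as np

def export_param_image(params, side):
    """Pack a list of parameter arrays into a square single-channel "image".

    Extra channels (activations, gradients, update magnitudes) could be
    stacked along a third axis, going beyond what plain RGB can hold.
    """
    flat = np.concatenate([p.ravel() for p in params]).astype(np.float32)
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: flat.size] = flat            # zero-pad to fill the square
    return padded.reshape(side, side)

def diff_image(baseline, current, threshold=1e-6):
    """Git-style delta: keep only pixels that changed appreciably."""
    delta = (current - baseline).ravel()
    idx = np.flatnonzero(np.abs(delta) > threshold)
    return idx, delta[idx]                # sparse patch: indices + values

def apply_diff(baseline, idx, values):
    """Reconstruct a later state from the baseline plus one diff."""
    out = baseline.ravel().copy()
    out[idx] += values
    return out.reshape(baseline.shape)
```

Successive diffs would compose like Git commits: the current state of the network is the baseline plus the chain of applied patches.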

So, these artificial neural images would contain the various weights and biases of the MLPs and attention heads. That would form the base layer that we’re working from. During training (which could conceivably be ongoing), we would use a combination of the external input signal, the activation states within the network (favoring only the portions that appreciably activate for the output), the output of the network, the evaluation of that output, and the updates to the network made as a result of all of it. All of that information would be fed into a supervisory neural network that functions in a manner not too dissimilar from current image recognition systems; I’m thinking of inspiration from systems such as AlphaGo and DALL-E.

The supervisory network would analyze the changes in the network and, with enough training, should be able to predict exactly which portions of the network would be updated to learn from any given input. Once it can accurately predict these updates, updating the network could become significantly cheaper than the usual methods, since updates could be directed to just the nodes that need them. It might even aggregate multiple updates, predicting not only the next state of the network needed to achieve the desired output but the final state needed to do so. Such directed updates may also limit the risk of continuous training eventually corrupting the network.

If we feed the output of this supervisory network back into the ensemble, we could reach the point where an AI could legibly and verifiably describe its “thinking” process in concrete, human-understandable terms. It could also describe that process by referencing the particular neural pathways involved and the actual “why” of how it reaches various conclusions. If the images included the supervisory network as well as the other portions of the overall system, then the network might be able to describe every part of itself in terms humans could understand. We could potentially interact with the system in the following ways (a rough sketch of the supervisory network itself follows this list):

  • “Which parts of your network are involved in evaluating prompt X?”

  • “Which parts of your network will be changed when learning X?”

  • “What tertiary effects are predicted to occur as a result of learning X?”

  • “Where is the concept of X → Y stored?”

  • “Change relationship X → Y to X → Z.”

  • “Summarize the knowledge contained in portion X of your network.”

  • “Appropriately label the various portions of your network.”

    • This could be done at increasingly fine-grained levels of hierarchy.

  • “Produce a human comprehensible corpus of knowledge (perhaps in database form) that represents the entirety of the knowledge and rules contained in your network.”
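As promised above, here is one way the supervisory network could be framed: an image-to-image model, in the spirit of segmentation networks, that takes the stacked parameter/activation “views” and predicts which pixels (parameters) a training step will change. This is a sketch under my own assumptions, written in PyTorch; the channel layout and architecture are invented for illustration, not taken from any existing system:

```python
import torch
import torch.nn as nn

class SupervisoryNet(nn.Module):
    """Predicts, per parameter-pixel, whether (and by how much) a training
    step on a given input will update that parameter."""

    def __init__(self, in_channels=4):
        # Assumed channels: base weights, activations, output-error signal,
        # and the previous diff. This layout is entirely hypothetical.
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.update_mask = nn.Conv2d(32, 1, kernel_size=1)  # which pixels change
        self.update_size = nn.Conv2d(32, 1, kernel_size=1)  # predicted deltas

    def forward(self, views):
        h = self.body(views)
        return torch.sigmoid(self.update_mask(h)), self.update_size(h)

# Each real gradient step yields a ground-truth diff, so ordinary training
# doubles as supervision: the net learns to anticipate updates.
model = SupervisoryNet()
views = torch.randn(1, 4, 256, 256)   # stacked "views" of the target network
mask, delta = model(views)
```

The appealing part of this framing is that the training signal comes for free: every real update to the target network produces exactly the diff the supervisor is trying to predict.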

If those kinds of things were possible, especially the last few, we could end up with something that is understandable, and even searchable. We could deeply inspect the network to see what it has learned and whether there are things within it that we don’t like. Obviously, the output would be tremendously large, but if it were organized in the manner I’m thinking of, it could be significantly more compact than the body of information used to create that knowledge. Additionally, that output might be useful for significantly optimizing the functionality of the network and even offloading some of its complexity. It may even allow us to significantly compress the knowledge by eliminating duplication within it. The database might be searchable and queryable so that we could get summaries of knowledge seen from various perspectives.
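As a toy illustration of what “searchable and queryable” might mean, the exported corpus could land in something as plain as SQLite, with each row tying a piece of knowledge to the network region that encodes it. The schema and the example row below are invented purely for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE knowledge (
    concept  TEXT,  -- e.g. 'Eiffel Tower'
    relation TEXT,  -- e.g. 'located_in'
    value    TEXT,  -- e.g. 'Paris'
    region   TEXT,  -- where the network encodes it, e.g. 'layer 17 MLP'
    summary  TEXT)""")
con.execute("INSERT INTO knowledge VALUES (?, ?, ?, ?, ?)",
            ("Eiffel Tower", "located_in", "Paris",
             "layer 17 MLP", "a geographic fact"))

# "Where is the concept of X -> Y stored?"
for (region,) in con.execute(
        "SELECT region FROM knowledge WHERE concept = ? AND relation = ?",
        ("Eiffel Tower", "located_in")):
    print(region)
```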

This kind of organization may allow insights heretofore inaccessible to us due to the disparate and compartmentalized nature of knowledge in our culture. We might even be able to train a subsequent network on that output, or train one to interact with it, producing a much smaller network with similar capabilities that works with the corpus far more efficiently than traditional neural networks allow. It could also let us separate empirical knowledge from the more creative portions of the knowledge. That way we could actually separate “hallucinations” from factual knowledge, and allow the network to explicitly do so as well, better grounding it in reality and hopefully producing a more trustworthy and deterministic system.

We could actually check for inner-alignment failures and be more confident that what we produced was what we intended to produce. We could check that the system actually worked the way we intended and that undesirable goals weren’t being surreptitiously hidden from us. We could do this by exporting the analysis systems out of the network before it gained enough capability to formulate plans we would find objectionable, and we could constantly analyze such a system for signs of it heading in directions we didn’t like. If this were public, the system could be searched at large scale by interested third parties to help verify that it has been thoroughly inspected from every possible angle; it’s amazing what the internet can find when you let people dig into something. With the stakes we are facing, we are going to want that level of transparency.

We could validate that the insight we think we have gained into the system is real by using it to predict the output of the system given a particular input. If we find suspected flaws within the system, we could probe them explicitly and repair or excise them.
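One simple form such validation might take is an ablation probe: if our analysis claims that region R implements behavior B, then zeroing R should change the output on B-related prompts and little else. A hedged sketch, with the function names and masking scheme being my own assumptions:

```python
import torch

def validate_insight(model, prompts, region_mask, forward_fn):
    """Probe a claimed insight by ablating the suspected region.

    `region_mask` is a list of 0/1 tensors (one per parameter tensor)
    flagging the parameters our analysis says encode the behavior.
    """
    with torch.no_grad():
        baseline = forward_fn(model, prompts)
        saved = [p.detach().clone() for p in model.parameters()]
        for p, m in zip(model.parameters(), region_mask):
            p.mul_(1.0 - m)                   # zero out the suspect region
        ablated = forward_fn(model, prompts)
        for p, s in zip(model.parameters(), saved):
            p.copy_(s)                        # restore the original weights
    return (baseline - ablated).abs().mean()  # how much the region mattered
```

A large effect on B-prompts together with a small effect elsewhere would support the claimed insight; anything else would suggest our map of the network is wrong.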

All of this is separate from other ideas I have about minimizing the impact of agentic systems in carrying out a user’s directions. I intend to go into those at a later date.

Well, that’s it. Let me know what you think. I’m interested to hear others’ thoughts on this, both positive and negative (though I prefer constructive criticism to mere negativity).