I recently learned about Differentiable Logic Gate Networks, which are trained like neural networks but learn to represent a function entirely as a network of binary logic gates. See the original paper about DLGNs, and the “Recap—Differentiable Logic Gate Networks” section of this blog post from Google, which does an especially good job of explaining it.
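If you're wondering what a "differentiable" logic gate actually is mechanically, here is my understanding of the core trick as a minimal PyTorch sketch. This is a toy illustration rather than the authors' code, and the class and function names are my own. The key idea: each gate holds a learnable softmax distribution over the 16 possible two-input boolean functions, each replaced by a real-valued relaxation so gradients can flow, and at inference time you keep only the most probable function per gate.

```python
# Toy sketch of a differentiable logic gate layer (my own illustration, not the
# paper's implementation). Inputs are relaxed bits in [0, 1].
import torch
import torch.nn as nn

def all_16_gates(a, b):
    """Real-valued relaxations of the 16 two-input boolean functions,
    e.g. AND -> a*b, OR -> a+b-a*b, XOR -> a+b-2ab."""
    return torch.stack([
        torch.zeros_like(a),      # FALSE
        a * b,                    # AND
        a - a * b,                # a AND NOT b
        a,                        # a
        b - a * b,                # NOT a AND b
        b,                        # b
        a + b - 2 * a * b,        # XOR
        a + b - a * b,            # OR
        1 - (a + b - a * b),      # NOR
        1 - (a + b - 2 * a * b),  # XNOR
        1 - b,                    # NOT b
        1 - b + a * b,            # a OR NOT b
        1 - a,                    # NOT a
        1 - a + a * b,            # NOT a OR b
        1 - a * b,                # NAND
        torch.ones_like(a),       # TRUE
    ], dim=-1)

class SoftLogicGateLayer(nn.Module):
    def __init__(self, num_gates):
        super().__init__()
        # one 16-way logit vector per gate: "which boolean function am I?"
        self.logits = nn.Parameter(torch.randn(num_gates, 16))

    def forward(self, a, b):
        # a, b: (batch, num_gates) relaxed binary inputs in [0, 1]
        probs = torch.softmax(self.logits, dim=-1)   # (num_gates, 16)
        candidates = all_16_gates(a, b)              # (batch, num_gates, 16)
        return (candidates * probs).sum(dim=-1)      # expected gate output

    def discretize(self):
        # after training, each gate collapses to its most probable boolean function
        return self.logits.argmax(dim=-1)
```

Once discretized, the whole network is just a fixed boolean circuit, which is what makes the checkerboard example below so easy to read.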
It looks like DLGNs could be much more interpretable than standard neural networks, since they learn a very sparse representation of the target function. Like, just look at this DLGN that learned to control a cellular automaton to create a checkerboard, using just 6 gates (really 5, since the AND gate is redundant):
So simple and clean! Of course, this is a very simple problem. But what’s exciting to me is that in principle, it’s possible for a human to understand literally everything about how the network works, given some time to study it.
What would happen if you trained a neural network on this problem and did mech interp on it? My guess is you could eventually figure out an outline of the network’s functionality, but it would take a lot longer, and there would always be some ambiguity as to whether there’s any additional cognition happening in the “error terms” of your explanation.
It appears that DLGNs aren’t yet well-studied, and it might be intractable to use them to train an LLM end-to-end anytime soon. But there are a number of small research projects you could try, for example:
Can you distill small neural networks into a DLGN (see the sketch after this list)? Does this let you interpret them more easily?
What kinds of functions can DLGNs learn? Is it possible to learn decent DLGNs in settings as noisy and ambiguous as e.g. simple language modeling?
Can you identify circuits in a larger neural network that would be amenable to DLGN distillation and distill those parts automatically?
Are there other techniques that don’t rely on binary gates but still add more structure to the network, in the same spirit as a DLGN?
Can you train a DLGN to encourage extra interpretability, like by disentangling different parts of the network to be independent of one another, or making groups of gates form abstractions that get reused in different parts of the network (like how an 8-bit adder is composed of many 1-bit adders)?
Can you have a mixed approach, where some aspects of the network use a more “structured” format and others are more reliant on the fuzzy heuristics of traditional NNs? (E.g. the “high-level” is structured and the “low-level” is fuzzy.)
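To make the first idea in this list concrete, here is a hedged sketch of what distillation could look like, reusing the SoftLogicGateLayer sketched above. The fixed random wiring loosely follows the original paper's setup; the readout (averaging the final layer's bits) and the training loop are illustrative assumptions on my part, not anything from the DLGN papers.

```python
# Hedged sketch: fit a small DLGN to imitate a trained "teacher" network on
# binary inputs. Assumes the teacher maps (batch, in_bits) -> (batch, 1) logits.
import torch
import torch.nn as nn

class TinyDLGN(nn.Module):
    def __init__(self, in_bits, width, depth):
        super().__init__()
        self.layers = nn.ModuleList(SoftLogicGateLayer(width) for _ in range(depth))
        # fixed random wiring: each gate reads two randomly chosen outputs of the
        # previous layer (or two input bits, for the first layer)
        self.wiring = [
            (torch.randint(0, in_bits if i == 0 else width, (width,)),
             torch.randint(0, in_bits if i == 0 else width, (width,)))
            for i in range(depth)
        ]

    def forward(self, x):  # x: (batch, in_bits), values in [0, 1]
        for layer, (ia, ib) in zip(self.layers, self.wiring):
            x = layer(x[:, ia], x[:, ib])
        return x           # (batch, width) relaxed output bits

def distill(teacher, student, in_bits, steps=2000, lr=1e-2):
    """Train the student DLGN to match the teacher's sigmoided output."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randint(0, 2, (256, in_bits)).float()
        with torch.no_grad():
            target = torch.sigmoid(teacher(x))
        pred = student(x).mean(dim=-1, keepdim=True)  # crude readout of the bits
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```

The interesting question is then whether the discretized student is actually easier to read than the teacher, and how much accuracy the distillation loses.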
I’m unlikely to do this myself, since I don’t consider myself much of a mechanistic interpreter, but would be pretty excited to see others do experiments like this!
I think logic gate networks are not substantially more interpretable than neural networks, simply because of their size. Both are complex networks with millions of nodes. Interpretability approaches have to work at a higher level of abstraction in either case.
Regarding language models: the original paper presents a simple feedforward network. The follow-up paper, by mostly the same authors, came out a few months ago and extends DLGNs to convolutions, analogous to CNNs. That means they have not yet been extended to more complex architectures like transformers, so language models are not yet possible, even setting aside the training compute cost.
In the follow-up paper they also discuss various efficiency improvements, not directly related to convolutions, made since the original paper. These speed up training relative to the original implementation and enable much deeper networks (the original implementation was limited to around six layers). But they don’t say how much slower training still is compared to ordinary neural networks. The inference speed-up, though, is extreme: they report improvements of up to 160x on one benchmark and up to 1900x on another, over the previously fastest neural networks at equivalent accuracy. On another benchmark they report models that are 29x to 56x smaller (in terms of required logic gates) than the previously smallest models with similar accuracy. So the models could more realistically be implemented as an ASIC, which would probably yield another order of magnitude of inference speed-up.
But again, they don’t really talk about how much slower they are to train than neural networks, which is likely crucial for whether they will be employed in future frontier LLMs, assuming they are ever extended to transformers. So far frontier AI seems to be much more limited by training compute than by inference compute.
Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc
Interesting, strong-upvoted for being very relevant.
My response would be that identifying accurate “labels” like “this is a tree-detector” or “this is the Golden Gate Bridge feature” is one important part of interpretability, but understanding causal connections is also important. The latter is pretty much useless without the former, but having both is much better. And sparse, crisply-defined connections make the latter easier.
Maybe you could do this by combining DLGNs with some SAE-like method.
Do you know if there are scaling laws for DLGNs?
It could be good to look into!
This is just capabilities stuff. I expect that people will use this to train larger networks, as much larger as they can. If your method shrinks the model, it likely induces demand proportionately. In this case it’s not new capabilities stuff by you, so it’s less concerning, but still. This paper is popular because of bees.
I’d be pretty surprised if DLGNs became the mainstream way to train NNs, because although they make inference faster they apparently make training slower. Efficient training is arguably more dangerous than efficient inference anyway, because it lets you get novel capabilities sooner. To me, DLGN seems like a different method of training models but not necessarily a better one (for capabilities).
Anyway, I think it can be legitimate to try to steer the AI field towards techniques that are better for alignment/interpretability even if they grant non-zero benefits to capabilities. If you research a technique that could reduce x-risk but can’t point to any particular way it could be beneficial in the near term, it can be hard to convince labs to actually implement it. Of course, you want to be careful about this.
What do you mean?
I buy that training slower is a sufficiently large drawback to break scaling. I still think bees are why the paper got popular. But if intelligence depends on clean representation, then interpretability due to clean representation is natively and unavoidably bees. We might need some interpretable-bees insights in order to succeed; it does seem like we could get better regret-bound proofs (or heuristic arguments) that go through a particular trained model with better (reliable, clean) interp. But the whole deal is that the AI gets to exceed us in ways that make human interpretation inherently (as opposed to transiently or fixably) too slow. To be useful durably, interp must become a component in scalably constraining an ongoing training/optimization process. Which means it’s gonna be partly bees in order to be useful. Which means it’s easy to accidentally advance bees more than durable alignment. Not a new problem, and not one with an obvious solution, but occasionally I see something I feel like I wanna comment on.
I was a big disagree vote because of induced demand. You’ve convinced me this paper induces less demand in this version than I worried (I had just missed that it trained slower), but my concern that something like this scales and induces demand remains.
Capabilities → capabees → bees
I’ve been working on pure combinational-logic LLMs for the past few years, and have a (fairly small) byte-level pure combinational-logic FSM RNN language model quantized to and-inverter graph (AIG) form. I’m currently building the tooling to simplify the logic DAG and analyze it.
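For context, "AIG form" just means the whole model is expressed as two-input AND gates plus inversion flags on the edges. Roughly this kind of structure, shown here as a generic sketch of the standard data structure rather than my actual tooling:

```python
# Generic and-inverter graph (AIG) sketch: nodes are two-input ANDs, edges can
# carry an inversion flag, and structural hashing reuses identical gates.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass(frozen=True)
class Lit:
    """A literal: a node id plus an inversion flag."""
    node: int
    inverted: bool = False

    def inv(self) -> "Lit":
        return Lit(self.node, not self.inverted)

TRUE = Lit(0)        # node 0 is the constant-true node by convention
FALSE = TRUE.inv()

class AIG:
    def __init__(self):
        self.nodes: List[Tuple[Lit, Lit]] = [(TRUE, TRUE)]  # node 0 = constant
        self.strash: Dict[Tuple[Lit, Lit], Lit] = {}        # structural-hashing table

    def new_input(self) -> Lit:
        self.nodes.append((TRUE, TRUE))  # inputs carry no fanin
        return Lit(len(self.nodes) - 1)

    def AND(self, a: Lit, b: Lit) -> Lit:
        # local simplifications: x&0=0, x&~x=0, x&1=x, x&x=x
        if a == FALSE or b == FALSE or a == b.inv():
            return FALSE
        if a == TRUE:
            return b
        if b == TRUE or a == b:
            return a
        key = tuple(sorted((a, b), key=lambda l: (l.node, l.inverted)))
        if key not in self.strash:       # reuse an identical existing gate if present
            self.nodes.append(key)
            self.strash[key] = Lit(len(self.nodes) - 1)
        return self.strash[key]

    def OR(self, a: Lit, b: Lit) -> Lit:
        return self.AND(a.inv(), b.inv()).inv()  # De Morgan: a|b = ~(~a & ~b)
```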
Are you, or others, interested in talking with me about it?
I might not be the best person to talk to about it, but it sounds interesting! Maybe post about it on the mechanistic interpretability Discord?
Another idea I forgot to mention: figure out whether LLMs can write accurate, intuitive explanations of boolean circuits for automated interpretability.
Curious about the disagree-votes: are these because DLGN or DLGN-inspired methods seem unlikely to scale, because they won’t be much more interpretable than traditional NNs, or for some other reason?