Hey, I really like the ideas you’re putting down here. From what I’m seeing, this agenda looks something like “train the world’s most powerful autoencoder, with the requirement that the intermediate representation be human-decodable.” Which is a very cool idea!
In fact, I’m hopeful that the IR being decodable doesn’t even require it to be expressed in something approximating symbolic language. For an intuition pump, consider that “I put the large box inside of the smaller box” is a valid sentence, but we intuitively know it’s not physically valid, based on a much higher-dimensional “physics-based” world model that doesn’t involve constructing an exhaustive symbolic proof about the volume and carrying capacity of the two cuboids in question. So the IR can be a dense high-level representation so long as it can be decoded by some system into human-readable or viewable symbols/data, and that would not in itself be damning to the project (unless we suspect that the decoding is partial or incomplete).
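To gesture at what I mean, here is a toy sketch. Everything in it (the module names, dimensions, and the `explainer` readout) is my own illustrative assumption rather than anything from the post: the bottleneck `z` is trained for reconstruction, while a separate readout decodes it into tokens a human could inspect.

```python
# Illustrative sketch only: an "autoencoder with a human-decodable IR".
# All names and sizes are assumptions for the sake of the example.
import torch
import torch.nn as nn

class DecodableAutoencoder(nn.Module):
    def __init__(self, obs_dim=1024, ir_dim=64, vocab_size=5000, expl_len=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, ir_dim))
        self.decoder = nn.Sequential(nn.Linear(ir_dim, 256), nn.ReLU(),
                                     nn.Linear(256, obs_dim))
        # The "human-decodable" part: a readout from the dense IR into
        # token logits (symbols/text a person could inspect).
        self.explainer = nn.Linear(ir_dim, expl_len * vocab_size)
        self.expl_len, self.vocab_size = expl_len, vocab_size

    def forward(self, obs):
        z = self.encoder(obs)            # dense, high-level IR
        recon = self.decoder(z)          # compression/prediction objective
        expl = self.explainer(z).view(-1, self.expl_len, self.vocab_size)
        return recon, expl

obs = torch.randn(8, 1024)
model = DecodableAutoencoder()
recon, expl = model(obs)
loss = nn.functional.mse_loss(recon, obs)   # reconstruction loss as a stand-in
```

The only point is the separation of concerns: the compression objective shapes `z`, and “decodability” just means there exists some readout that lets us inspect `z`.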
My main thought/caution against this proposal, however, would be that this agenda requires moving the capabilities needle forward for supervised/self-supervised learning. Even if the world model is not a neural network, it would seem to have predictive power and capabilities surpassing the best SL systems. I’m not against that per se, but any such advances might then be coupled into a model-based RL system, which would be… not great, and definitely much more risky. Would love to discuss this more, let me know what you think!
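To make the model-based-RL worry concrete, here is a hedged sketch of why a good predictive world model is a drop-in component for a planner. `world_model` and `reward_fn` are hypothetical stand-ins (nothing from the post); the planner itself is just generic random-shooting MPC.

```python
# Illustrative only: a learned predictive model queried as a simulator
# inside a generic planner. `world_model` and `reward_fn` are hypothetical.
import numpy as np

def plan_one_step(world_model, reward_fn, state,
                  horizon=10, n_candidates=256, action_dim=4):
    """Random-shooting MPC over a predictive world model."""
    best_return, best_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            s = world_model(s, a)      # next-state prediction is all it needs
            total += reward_fn(s)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action
```

Note that the planner asks nothing of the model beyond next-state prediction, which is why advances on the prediction side transfer so directly into agentic competence.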
> Hey, I really like the ideas you’re putting down here
Thanks!
> So the IR can be a dense high-level representation so long as it can be decoded by some system into human-readable or viewable symbols/data
If I understand correctly, that’s the “symbolic enough” case from footnote 2:
> if we could generate an external world-model that’s as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the “interpretability” criterion.
We also don’t have full interpretability into our abstractions down to the neurons, after all.
I don’t think it’d be necessary per se, though. I think if we can get it to produce an explanation like this, we can then just iterate to “explain the explanation”, et cetera, until everything’s been reduced to symbolics. Or it can be achieved by turning some other “crank” controlling the “explanation fidelity”.
But yeah, “symbolic-enough” may be satisfactory.
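Purely as an illustration of the iteration I have in mind (both `explain` and `is_symbolic_enough` are hypothetical callables, and the “fidelity crank” here is just the recursion depth):

```python
def reduce_to_symbols(explain, is_symbolic_enough, artifact, max_depth=5):
    """Keep asking for a lower-level account of the previous explanation
    until it passes some symbolic-ness check (or we hit the depth limit)."""
    explanation = explain(artifact)
    for _ in range(max_depth):
        if is_symbolic_enough(explanation):
            return explanation
        explanation = explain(explanation)   # explain the explanation
    return explanation   # best effort: may still be only "symbolic enough"
```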
> My main thought/caution against this proposal, however, would be that this agenda requires moving the capabilities needle forward for supervised/self-supervised learning
Yep. As I’d briefly mentioned, the actual gears-level sketches of “sufficiently powerful compression algorithms” are obviously dual-use, and shouldn’t be openly published.
Glad to see we’re basically in agreement. However, how would you take safety precautions around your own work on such algorithms, given that our last big similar breakthrough (transformers for language modelling) was almost instantly co-opted into RL and “agentified”? Unless you’re literally doing this alone (with a very strong will), wouldn’t that be the natural path for any company/group once the simulator is finished?
“Share the dual-use stuff only with specific people who are known to properly understand the AGI risk, can avoid babbling about it in public, and would be useful contributors” seems like the straightforward approach here.
Like, groups of people are able to maintain commercial secrets. This is kind of like that, except with somewhat higher stakes.
I mean, AI people are notoriously bad at doing these kinds of things xD I would expect the people running OpenAI or Anthropic to have said similar things back when their orgs were just starting out. So I hope you can see why I wanted to ask this. None of this is to cast any doubt on your ability or motives, just noting the minefield that is unfortunately next to the park where we’re having this conversation.
For what it’s worth, I’m painfully aware of all the skulls lying around, yep.