My general crux here is that I’m much, much more pessimistic about these two particular properties holding together:
“Sufficiently powerful”: contains or can be used to generate knowledge sufficient to resolve our AGI-doom problem, such as recipes for comprehensive mechanistic interpretability, mind uploading, or adult intelligence enhancement, or for robust solutions to alignment directly.
“Easily interpretable”: written in some symbolic language, such that interpreting it is in the reference class of “understand a vast complex codebase” combined with “learn new physics from a textbook”, not “solve major philosophical/theoretical problems”.
And much of my reason here is this particular theorem from Shane Legg, which shows that, unlike with Solomonoff Induction or other non-computable reasoners, for computable predictors being more capable directly means being more complicated:
Is there an Elegant Universal Theory of Prediction?, Shane Legg (2006)
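Roughly, as I remember the statement (my paraphrase, not the paper’s exact wording): if a computable predictor $p$ eventually predicts every computable sequence $\omega$ with Kolmogorov complexity $K(\omega) \le n$, then $K(p) \gtrsim n$ up to lower-order terms. The predictor’s own description length has to grow roughly in step with the complexity of the sequences it can learn, which is why arbitrarily capable computable predictors can’t stay simple.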
And this remains true even if the Natural Abstractions agenda does work out like people hope it does.
I’m not sure how you plan to square the circle, but in general I’m substantially more pessimistic about “interpretable models/AIs that are also powerful” than a lot of people on here (the closest perspective to mine is Cole Wyeth’s in the post Glass box learners want to be black box).
Yep, I know of this result. I haven’t looked into it in depth, but my understanding is that it only says that powerful predictors have to be “complex” in the sense of high Kolmogorov complexity, right? But “high K-complexity” doesn’t mean “is a monolithic, irreducibly complex mess”. In particular, it doesn’t rule out this property:
“Well-structured”: has an organized top-down hierarchical structure, learning which lets you quickly navigate to specific information in it.
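As a toy illustration of how the two notions come apart (my own example, not anything from Legg’s paper): take $m$ mutually independent random strings $x_1, \dots, x_m$, each only a few hundred bits long. Up to lower-order terms, $K(x_1 x_2 \cdots x_m) \approx \sum_i K(x_i)$, so the total complexity grows without bound as $m$ grows, yet every piece is individually short and separately describable. High K-complexity is compatible with being a pile of small, independently understandable parts.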
Wikipedia has pretty high K-complexity, well beyond what the human mind can hold in its working memory. But it’s still usable, because you’re not trying to cram all of it into your brain at once. Its structure is navigable, and you only retrieve the information you want.
Similarly, the world’s complexity is high, but it seems decomposable into small modules that can be understood separately and navigated to locate specific knowledge.
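To gesture at what that buys you, here’s a toy sketch (purely illustrative; the names and structure are made up, not anyone’s actual proposal): a “world model” whose total description can grow without bound, but which is organized as a hierarchy of small modules, so answering one query only means walking a short path of summaries rather than holding the whole thing in your head.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """One node in the hierarchy: a short summary plus named sub-modules and local facts."""
    summary: str
    children: dict = field(default_factory=dict)
    facts: dict = field(default_factory=dict)


def lookup(root: Module, path: list[str], key: str):
    """Walk top-down along `path`, narrowing to one small module at each step,
    then read a single fact from the leaf. Nothing outside the path is touched."""
    node = root
    for name in path:
        node = node.children[name]
    return node.facts[key]


# A tiny fake "world model". Its total size can grow arbitrarily large,
# but any single query only touches a handful of small pieces.
world = Module(
    summary="everything",
    children={
        "physics": Module(
            summary="physical laws and constants",
            children={
                "constants": Module(
                    summary="numeric constants",
                    facts={"speed_of_light_m_per_s": 299_792_458},
                ),
            },
        ),
    },
)

print(lookup(world, ["physics", "constants"], "speed_of_light_m_per_s"))  # 299792458
```

The point isn’t the particular data structure, just that “total K-complexity” and “effort needed to extract one piece of knowledge” are very different quantities.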