interpretability didn’t progress at all, or that we know nothing about AI internals at all
No to the former, yes to the latter—which is noteworthy because Eliezer only claimed the latter. That’s not a knock on interpretability research; in fact, Eliezer has repeatedly and publicly praised e.g. the work of Chris Olah and Distill. The choice to read the claim that we “know nothing about AI internals” as the claim that “no interpretability work has been done” was, it should be pointed out, imposed by ShardPhoenix (and subsequently by you).
And in fact, we do still have approximately zero idea how large neural nets do what they do, interpretability research notwithstanding, as evinced by the fact that not a single person on this planet could code by hand whatever internal algorithms the models have learned. (The same is true of the brain, incidentally, which is why you sometimes hear people say “we have no idea how the brain works”, despite an insistently literal interpretation of this statement being falsified by the existence of neuroscience as a field.)
But it does, in fact, matter whether the research into neural net interpretability translates to us knowing, in a real sense, what kind of work is going on inside large language models! That, ultimately, is the metric by which reality will judge us—not how many publications on interpretability were made, or how cool the results of those publications were (which, for the record, I think are very cool). And in light of that, I think it’s disingenuous to interpret Eliezer’s remark the way you and ShardPhoenix seem to be insisting on interpreting it in this thread.
And in fact, we do still have approximately zero idea how large neural nets do what they do, interpretability research notwithstanding, as evinced by the fact that not a single person on this planet could code by hand whatever internal algorithms the models have learned.
I now see where the problem lies. The basic issues I see with this argument are as follows:
The implied argument is that if you can’t create something in a field by hand yourself, then you know nothing at all about the thing you are focusing on. This is straightforwardly untrue for a lot of fields.
For example, I know quite a lot about Borderlands 3. Not everything, but quite a bit—I could even use save editors or cheat tools by following video tutorials. Yet under almost no circumstances could I actually create Borderlands 3, even with the game and its code already in existence, and even with a team.
This likely generalizes: neuroscience has real knowledge of the brain, but it is nowhere near the point of being able to reliably build a human brain from scratch; knowing something about what cars do is not enough to build a working car; and so on.
In general, I think the error is that you and Eliezer expect too much from partial knowledge. It helps, but in virtually no case will knowledge alone let you create the thing you are studying.
It’s possible that our knowledge of AI internals isn’t enough, and that progress is too slow. I might agree or disagree with that, but at least it would be a rational claim. Right now I’m seeing basic locally invalid arguments, and I notice that part of the problem is that you and Eliezer take too binary a view of knowledge, in which you either have functionally perfect knowledge or no knowledge at all. Usually our knowledge is neither functionally perfect nor zero.
Edit: This seems conceptually similar to the P vs. NP question, where verifying a solution and generating one are conjectured to have very different difficulties. Essentially, my claim is that being able to verify something is not the same as being able to generate it.
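By way of illustration (a minimal sketch of my own, not anything from the discussion above): Boolean satisfiability is the canonical case where checking a candidate answer is cheap while producing one is, as far as anyone knows, expensive. The formula and function names below are made up for the example.

```python
from itertools import product

# A tiny CNF formula as a list of clauses; each clause is a list of ints,
# where a positive int means "variable is true" and negative means "false".
# This encodes (x1 or not x2) and (x2 or x3) and (not x1 or not x3).
formula = [[1, -2], [2, 3], [-1, -3]]
num_vars = 3

def verify(formula, assignment):
    """Checking a proposed assignment is fast: linear in formula size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def generate(formula, num_vars):
    """Finding an assignment by brute force: exponential in num_vars."""
    for bits in product([False, True], repeat=num_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(formula, assignment):
            return assignment
    return None  # formula is unsatisfiable

solution = generate(formula, num_vars)
print(solution, verify(formula, solution))
```

The asymmetry is the whole point: `verify` does a single linear pass, while `generate` may have to try all 2^n assignments. Interpretability, on this analogy, is in the "verify" business—inspecting what a trained network does—which is a genuinely easier task than writing those algorithms down from scratch.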