What I want right now is a basic understanding of combustion engines. I want to understand the key internal gears of LLMs that are currently completely mysterious to me, the parts where I don’t have any functional model at all for how they even could work. What I ultimately want to get out of Interpretability at the moment is a sketch of Python code I could write myself, without a numeric optimizer as an intermediary, that would be able to talk.
How would you operationalize this in ML terms? E.g. how much loss in performance would you consider acceptable, on how wide a distribution of e.g. GPT-4's capabilities, how many lines of Python code, etc.? Would you consider existing rough theoretical explanations acceptable, e.g. An Information-Theoretic Analysis of In-Context Learning? (I suspect not, since they don't make a 'sketch of Python code' feasible.)
(I'll note that by default I'm highly skeptical of any current-day human producing anything like a comprehensible, not-extremely-long 'sketch of Python code' of GPT-4 in a reasonable amount of time. For comparison, how hopeful would you be about producing the same for a smart human's brain? And on some dimensions—e.g. knowledge—GPT-4 is vastly superhuman.)
I think OP just wanted some declarative code (I don't think Python is the ideal choice of language, but basically anything that's not a Turing tarpit is fine) that could speak fairly coherent English. I suspect that if you had a functional transformer decompiler, the results of applying it to even a TinyStories-size model would be tens to hundreds of megabytes of spaghetti, so understanding that in detail would be a huge slog. On the other hand, this is an actual operationalization of the Chinese Room argument (or in this case, English Room)! I agree it would be fascinating, provided the decompiled code recovers a significant fraction of the model's perplexity score. If it is, as people seem to suspect, mostly or entirely a pile of spaghetti, understanding even a representative (frequency-of-importance biased) statistical sample of it (say, enough for generating a few specific sentences) would still be fascinating.
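To make the "English Room" framing concrete, here is a minimal toy sketch of what hand-written, optimizer-free declarative code that "talks" could look like: every rule is an explicit, human-readable table entry. The tables below are entirely made up for illustration and are not extracted from any real model; actual decompiler output for a TinyStories-scale model would presumably be many orders of magnitude larger and far messier.

```python
# Toy "English Room": a generator whose every rule is explicit and
# human-readable, with no numeric optimizer anywhere in the loop.
# All weights and word lists are hand-chosen, purely illustrative.
import random

# Hand-written successor tables: for each word, plausible next words
# with hand-assigned weights (no gradients, no training).
SUCCESSORS = {
    "<start>": [("the", 5), ("a", 3), ("one", 1)],
    "the":     [("cat", 3), ("dog", 3), ("girl", 2), ("robot", 1)],
    "a":       [("cat", 2), ("dog", 2), ("story", 1)],
    "one":     [("day", 4)],
    "day":     [("the", 3), ("a", 2)],
    "cat":     [("sat", 3), ("slept", 2), ("ran", 2)],
    "dog":     [("ran", 3), ("barked", 2), ("slept", 1)],
    "girl":    [("smiled", 3), ("ran", 1)],
    "robot":   [("talked", 2), ("slept", 1)],
    "story":   [("ended", 3)],
    "sat":     [("quietly", 2), ("<end>", 3)],
    "slept":   [("<end>", 5)],
    "ran":     [("home", 3), ("<end>", 2)],
    "barked":  [("loudly", 2), ("<end>", 3)],
    "smiled":  [("<end>", 5)],
    "talked":  [("quietly", 1), ("<end>", 3)],
    "ended":   [("<end>", 5)],
    "quietly": [("<end>", 5)],
    "loudly":  [("<end>", 5)],
    "home":    [("<end>", 5)],
}

def generate(max_words=12, seed=None):
    """Walk the hand-written table from <start> until <end>."""
    rng = random.Random(seed)
    word, out = "<start>", []
    for _ in range(max_words):
        words, weights = zip(*SUCCESSORS[word])
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == "<end>":
            break
        out.append(word)
    return " ".join(out).capitalize() + "."

if __name__ == "__main__":
    for i in range(3):
        print(generate(seed=i))
```

The point of the toy is only to fix intuitions about the artifact type: a real decompiled model would replace these few dozen legible table entries with megabytes of interacting rules, which is exactly where the "pile of spaghetti" worry comes from.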