I do think that we don’t really understand anything in a language model at this level of detail. Like, how does a language model count? How does a language model think about the weather? How does a language model do ROT13 encoding? How does a language model answer physics questions? We don’t have an answer to any of these at any acceptable level of detail, so why single out our confusion about language agnosticity? Even if we predicted it, we just don’t really understand the details, the same way we don’t really understand the details of anything going on in a large language model.
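(For anyone who hasn’t run into it: ROT13 is a simple substitution cipher that shifts each letter 13 places in the alphabet, which arguably makes it an interesting probe, since the transformation is character-level while the model only ever sees tokens. A minimal Python sketch of the cipher itself, not of how a model implements it:)

```python
import codecs

# ROT13 shifts each letter 13 places; since 13 + 13 = 26,
# applying it twice recovers the original text.
plaintext = "Attack at dawn"
ciphertext = codecs.encode(plaintext, "rot13")  # 'Nggnpx ng qnja'
assert codecs.decode(ciphertext, "rot13") == plaintext
```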
Later in the thread Jan asks, “is this interpretability complete?”, which I think implies that his intuition is that this should be easier to figure out than other questions (perhaps because it seems so simple). But yeah, it’s kind of unclear why he’s calling out this question in particular.
I agree we don’t really understand anything in LLMs at this level of detail, but I liked Jan highlighting this confusion anyway, since I think it’s useful to promote particular weird behaviors to attention. I would be quite thrilled if more people got nerd-sniped into trying to explain such things!