Is this coming just from the models having geographic data in their training? Much less impressive if so, but still cool.
I can’t be sure what’s in the data, but we have a few hints:
The exact question (“is this land or water?”) is, of course, very unlikely to appear verbatim in the training corpus, so at the very least the models contain some multi-purpose map of the world that they can apply to it. Further experimentation I’ve done with embedding models confirms that we can extract maps of biomes and country borders from embedding space too (a sketch of that kind of probe follows below).
There’s definitely compression. In smaller models, the ways in which the representations are inaccurate actually tell us a lot: instead of spikes of “land” around population centers (which are more likely to appear in the training set), we see massive smooth elliptical blobs of land. This indicates that there’s some internal notion of geographical distance, and that the model is identifying continents as a natural abstraction.
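To make the “extract a map from embedding space” idea concrete, here is a minimal sketch of how such a probe could look. It assumes the sentence-transformers package, an arbitrary model name, and a ground-truth land/water lookup (`is_land`) left as a placeholder for the reader to supply; it illustrates the general linear-probing technique, not the exact setup used for the maps described above.

```python
# Sketch of a land/water linear probe over embedding space -- an illustration,
# not the author's exact code. `is_land` is a hypothetical placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def is_land(lat: float, lon: float) -> bool:
    """Placeholder ground truth -- swap in a real land mask or coastline dataset."""
    raise NotImplementedError


# Build a coarse lat/lon grid and phrase each point as text the model can embed.
lats = np.arange(-60, 61, 2.0)
lons = np.arange(-180, 180, 2.0)
points = [(lat, lon) for lat in lats for lon in lons]
texts = [f"latitude {lat}, longitude {lon}" for lat, lon in points]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here
X = model.encode(texts)                          # shape: (n_points, embedding_dim)
y = np.array([is_land(lat, lon) for lat, lon in points])

# A linear probe: if land vs. water is linearly decodable from the embeddings,
# the model encodes more geography than the raw coordinate strings contain.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", probe.score(X_te, y_te))

# Rendering probe.predict_proba over the full grid as an image gives the "map":
# sharper coastlines from bigger models, smooth blobs from smaller ones.
```

The same recipe generalizes to biomes or country borders by swapping the labels; the probe stays linear, so whatever structure shows up has to already be present in the embeddings.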