>They find functions that fit the results. Most such functions are simple and therefore generalize well. But that doesn’t mean they generalize arbitrarily well.
You have no idea how simple the functions they are learning are.
>Not really any different from the human language LLM, it’s just trained on stuff evolution has figured out rather than stuff humans have figured out. This wouldn’t work if you used random protein sequences instead of evolved ones.
It would work just fine. The model would learn to predict those random sequences, and the structure would still be there.
>They try to predict the results. This leads to predicting the computation that led to the results, because the computation is well-approximated by a simple function and they are also likely to pick a simple function.
Models don’t care about “simple”. They care about what works. “Simple” is an arbitrary label with no precise meaning here. There are plenty of examples of interpretability research uncovering convoluted functions.
>Inverting relationships like this is a pretty good use-case for language models. But here you’re still relying on having an evolutionary ecology to give you lots of examples of proteins.
So? The point is that they’re limited by the data and the causal processes that generated it, not by the intelligence or knowledge of the humans providing the data. Models like this can and often do eclipse human ability.
If you train a predictor on games along with text describing their outcomes, then a good enough predictor should be able to eclipse even the best games in its training set by conditioning generation on the outcome text.
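To make that concrete, here’s a toy sketch of outcome-conditioned generation. Everything below is made up for illustration (the games, the result tags, and the bigram counts standing in for a real language model); the point is only that the result tag is ordinary training data, and fixing it at inference steers generation toward sequences that produce that result.

```python
from collections import Counter, defaultdict
import random

# Hypothetical training data: move lists tagged with their result.
games = [
    ("WIN",  ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]),
    ("WIN",  ["e4", "c5", "Nf3", "d6", "d4", "cxd4"]),
    ("LOSS", ["f3", "e5", "g4", "Qh4#"]),
]

# Toy conditional next-move model: P(next move | result tag, previous move).
# A real system would use a neural language model; the conditioning idea is the same.
counts = defaultdict(Counter)
for result, moves in games:
    prev = "<START>"
    for move in moves:
        counts[(result, prev)][move] += 1
        prev = move

def sample_game(result, max_len=12):
    """Generate a game conditioned on the outcome we *want*."""
    prev, generated = "<START>", []
    while len(generated) < max_len and (result, prev) in counts:
        dist = counts[(result, prev)]
        prev = random.choices(list(dist), weights=list(dist.values()))[0]
        generated.append(prev)
    return generated

# Fixing the tag to "WIN" at inference asks the model for play that co-occurs
# with winning, not for an average imitation of the training set.
print(sample_game("WIN"))
```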
No. Language Models aren’t relying on humans figuring anything out. How could they? They only see results, not processes.
You can train a Language Model on protein sequences. Just the sequences alone, nothing else, and see it represent biological structure and function in its inner layers. No one taught it this. It was learnt from the data.
https://www.pnas.org/doi/full/10.1073/pnas.2016239118
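For anyone who wants to poke at this themselves, here’s a minimal sketch using the authors’ companion fair-esm package (this matches its published interface to the best of my knowledge; the sequence is just an example):

```python
import torch
import esm  # pip install fair-esm

# Load ESM-1b, trained on raw sequences alone via masked-token prediction.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33], return_contacts=True)

# Per-residue embeddings from the final layer: this is where structural and
# functional organisation shows up, despite never being in the training signal.
embeddings = results["representations"][33]
# Residue-residue contact map read off the attention heads: 3D structure,
# recovered by a model that only ever saw 1D strings.
contacts = results["contacts"]
```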
The point here is that Language Models see results and try to predict the computation that led to those results. This is not imitation. That’s a crucial difference, because it means you aren’t bound by the knowledge of the people supplying the data.
You can take this protein language model further. Train on described function alongside the sequences, and you get a language model that can take a supplied use case and generate novel, functional protein sequences to match.
https://www.nature.com/articles/s41587-022-01618-2
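A hedged sketch of what that conditioning looks like in practice. The checkpoint name and the `<lysozyme>` control tag below are placeholders, not what the paper ships; the pattern is just a causal LM whose training examples prefix each sequence with a function/family tag.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: stands in for any tag-conditioned protein LM.
model_name = "example-org/protein-clm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Training examples look like "<lysozyme> MKALIVLGL..." so the tag alone is
# enough, at inference time, to steer generation toward that function.
inputs = tokenizer("<lysozyme>", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```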
Have humans figured this out? Can we go from function to protein just like that? No way! Not even close.