Have you met Mistral, Phi-2, Falcon, MPT, etc.? There are plenty of freely remixable models out there; some even link to their datasets and the recipes used to process them (though I wouldn’t be surprised if some relevant detail got left out because no one had yet researched that it was relevant).
Though I’m reasonably sure Llama license isn’t preventing viewing the source (though of course not the training data), modifying it, understanding it, and remixing it. It’s a less open license than others, but Facebook didn’t just do a free-as-in-beer release of a compiled black box that you put on your computer and can never change; research was part of the purpose, and research needs that access. It’s not the best open source license, but I’m not sure being a good example of something is required to meet the definition.
“Freely remixable” models don’t generally have open datasets used for training. If you know of one, that’s great, and it would be closer to open source. (Not Mistral. And Phi-2 was trained on synthetic data from other LLMs; I don’t know what they released about the methods used to generate or select that text, but it’s not open.)
But the entire point is that weights are not the source code for an LLM; they are the compiled program. Yes, the weights are modifiable via LoRA and similar techniques, but that’s not open source! Open source would mean I could replicate the model from the ground up. For Facebook’s models, at least, the details of the training methods, the RLHF they do, and where they get the data are all secrets. But they call it “Open Source AI” anyway.
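To make the compiled-program analogy concrete, here’s a minimal sketch of what LoRA-style modification looks like, assuming the Hugging Face transformers and peft libraries (the model name is only illustrative). The released weights stay frozen; you train small adapter matrices bolted on top, which is patching the binary rather than rebuilding from source:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the released weights -- the "compiled" artifact.
# The model name here is illustrative, not prescriptive.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# LoRA attaches small trainable matrices to chosen layers of the
# frozen base model; nothing upstream of the weights is needed.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)

# Typically well under 1% of parameters end up trainable -- the base
# weights are patched, never regenerated.
model.print_trainable_parameters()
```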
Oh, I do, they’re just generally not quite the best available or the most popular with hobbyists. Some I can find quickly enough are Pythia and OpenLLaMA, and some of the RedPajama models Together.ai trained on their own RedPajama dataset (which is freely available and described). (Also the already-mentioned Falcon and MPT, as well as StableLM. You might have to get into the weeds to find out how much of the data processing step is replicable.)
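As a quick illustration of what “freely available” means here, the RedPajama data can be pulled straight off the Hugging Face Hub. A minimal sketch, assuming the datasets library; the dataset ID and the “text” field are taken from the Hub listing, so double-check the dataset card:

```python
from datasets import load_dataset

# Stream the published sample rather than downloading the full
# ~1T-token corpus. Dataset ID and field name are assumptions
# based on the Hub listing; verify against the dataset card.
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample",
                  split="train", streaming=True)

doc = next(iter(ds))
print(doc["text"][:200])  # inspect a raw training document
```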
(It’s going to be expensive to replicate any big pretrained model, though, and possibly not deterministic enough to do it perfectly, especially since datasets sometimes change as unsafe data gets removed, the data processing recipes involve random selection and shuffling from the datasets, etc. That said, in smaller cases people who fine-tuned using the same recipe, coincidentally or intentionally, have ended up with identical model weights.)
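For a sense of how many knobs have to line up before a run is even a candidate for replication, here’s a minimal PyTorch-flavored sketch of pinning the obvious sources of randomness. Even with all of this, bit-exact results across different hardware, drivers, and library versions aren’t guaranteed:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    random.seed(seed)        # data selection / shuffling in the recipe
    np.random.seed(seed)     # NumPy-based preprocessing
    torch.manual_seed(seed)  # weight init, dropout, sampling
    # Fail loudly instead of silently falling back to
    # nondeterministic GPU kernels.
    torch.use_deterministic_algorithms(True)

seed_everything(42)
```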
Thanks! RedPajama definitely looks like it fits the bill, but it shouldn’t need to bill itself as making “fully-open, reproducible models,” since that’s what “open source” is already supposed to mean. (Unfortunately, the largest model they have is 7B.)
Though I’m reasonably sure Llama license (sic) isn’t preventing viewing the source
This is technically correct but irrelevant. Meta doesn’t provide any source code, by which I mean the full set of precursor steps (including the data and how to process it).
Generally speaking, a license defines usage rights; it has nothing to do with whether or how the thing (e.g. source code) is made available.
As a weird example, one could publish a repository containing a license but no source code. Odd, but possible: the license would have no power to mandate that the code be released; that is a separate concern.
To put it another way, a license does not obligate the owner to release or share anything, whether it be compiled software, source code, weights, etc. A license simply outlines the conditions under which the thing (e.g. source code), once released, can be used or modified.