Been exploring local models lately, and I might be interested in working on a 7B–13B model version of this, potentially scaling up to preferred models later, if I could find the time and compensation.
A $1000 computer would be low for what you want, and $10,000 would be high, unless you need the kind of self-tuned reasoning capability only truly accessible on enterprise graphics cards. A 7B model, or a quantization of one, works best with at least 8GB of VRAM (or unified memory; I’d go higher on unified memory rather than buying dedicated hardware), and 34B 4-bit quantizations look like they should fit in 24GB of VRAM, which is about the limit for consumer-grade Nvidia cards (best for compatibility in all respects). The most efficient pricing specifically for GGUF inference is currently found in Macs: the unified memory works perfectly well for it, and they go up to 192GB of shared memory (of which Metal can use about 75%, or roughly 145GB). So if you maximize memory and storage, an 8TB, 192GB Mac Studio will run you about $8000 and should theoretically be able to run even the largest current open-weight models (though training and so forth is a greater challenge).
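To make the sizing concrete, here’s the back-of-envelope math behind those numbers; the ~20% overhead for context cache and runtime buffers is my own rough assumption, and real 4-bit quants tend to average closer to 4.5 bits per weight:

```python
# Back-of-envelope memory estimate for running a quantized model.
# The ~20% overhead for KV cache and runtime buffers is a rough
# assumption; real usage varies with context length and runtime.

def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead: float = 0.20) -> float:
    """Approximate memory needed to load a model for inference."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

# Common 4-bit quants average about ~4.5 bits per weight in practice.
for label, params, bits in [("7B @ 8-bit", 7, 8.0),
                            ("13B @ 4-bit", 13, 4.5),
                            ("34B @ 4-bit", 34, 4.5),
                            ("70B @ 4-bit", 70, 4.5)]:
    print(f"{label}: ~{model_memory_gb(params, bits):.0f} GB")
```

By that math a 34B 4-bit quant lands around 23GB, which is why 24GB cards are the practical ceiling for it, while a 70B one wants roughly 47GB and pushes you into unified-memory territory.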
Cool! Unfortunately I’m not really sure if the idea itself is compatible with turning a profit—modern business models would push for it to leak data or include ads in ways that would defeat the purpose.
I’ll eventually get one of the good Macs if I have to, but I’m giving that decision another year or so, to see whether it’ll really be necessary in the long run.
I’ve also heard some very promising things about eventually being able to make a one-time investment: rent fancy compute for the initial training, then compress the trained model to run on smaller hardware.
Yeah, there’s a reason I specified ‘compensation’ rather than ‘profit’. :) Executive function assistants of some kind could be useful for me too, but whether it’d be useful enough to put the work into it as its own reward … well, that’s a question.
And, yeah, if you want to either rent the GPU yourself or have someone do training for you, and you don’t mind the training data going into the cloud, that’s the best way to do it. Tuning takes more compute than inference, in general.
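If you did go the rented-GPU route, parameter-efficient tuning like LoRA is the usual way to keep the bill down: you train small adapter matrices instead of the whole model. A minimal sketch with Hugging Face transformers + peft; the model name and hyperparameters here are placeholders, not recommendations:

```python
# Rough sketch of parameter-efficient tuning (LoRA) on a rented GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder: any 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# LoRA trains small low-rank adapter matrices instead of all weights,
# so tuning fits in far less memory than a full fine-tune would need.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of weights

# ...train with transformers.Trainer or a custom loop, then:
model.save_pretrained("my-assistant-lora")  # adapter is only a few MB
```

The saved adapter is only a few megabytes, so what you bring home from the cloud is tiny compared to the base model itself.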
(I don’t think personally identifying training data is particularly helpful for tuning; you’re trying to get methods, approach, and formatting down, not so much memory, though it may pick up on a few things. That also matters if you ever felt like letting it out of the box and sharing your helpful assistant. Retrieval-augmented context is better for memory outside pretraining.)
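A minimal sketch of what I mean by retrieval-augmented context, assuming sentence-transformers for the embeddings (the model name is just a common default, and a real notes store would be larger and persisted):

```python
# Minimal retrieval-augmented context: embed your notes once, then pull
# the closest ones into the prompt at query time instead of tuning them in.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

notes = ["Dentist appointment moved to Thursday.",
         "Project deadline is the 14th.",
         "Prefers task lists broken into 15-minute chunks."]
note_vecs = embedder.encode(notes, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k notes most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = note_vecs @ q  # dot product of unit vectors = cosine similarity
    return [notes[i] for i in np.argsort(scores)[::-1][:k]]

question = "What's due this week?"
context = "\n".join(retrieve(question))
prompt = f"Relevant notes:\n{context}\n\nUser: {question}"
# ...pass `prompt` to whatever local model you're running.
```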
Quantized models have good quality/speed tradeoffs: a 4-, 5-, or 6-bit quantization of a larger model still captures most of its quality advantage over an otherwise-equivalent smaller model that would fit in the same memory unquantized. You can indeed run inference on much larger models than you can train.
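Running inference on a quant like that is pretty mundane at this point. A sketch with llama-cpp-python, assuming you already have a GGUF file (the path is a placeholder):

```python
# Loading a 4-bit GGUF for local inference with llama-cpp-python.
# The model path is a placeholder; n_gpu_layers=-1 offloads all layers
# to the GPU (Metal on a Mac, CUDA on an Nvidia card).
from llama_cpp import Llama

llm = Llama(model_path="models/34b-q4_k_m.gguf",
            n_ctx=4096,        # context window
            n_gpu_layers=-1)   # offload everything that fits

out = llm("Q: What should I do first today?\nA:",
          max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```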