I finally set up SkyPilot to let me queue up GPU training jobs (both on my local GPU and via RunPod), and I really should have done this months ago. Claude had written me some bash scripts to spin up remote pods, run training, and tear them down, but this version is much easier, and it has a nice UI.
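For anyone curious what the setup looks like: a minimal sketch of a SkyPilot task file, with illustrative names (the script, requirements file, and accelerator choice are placeholders, not my actual config):

```yaml
# train.yaml -- hypothetical task; adjust script/GPU to taste.
resources:
  accelerators: RTX3090:1   # accelerator names per `sky show-gpus`
  cloud: runpod             # omit to let SkyPilot pick any enabled infra
workdir: .
setup: |
  pip install -r requirements.txt
run: |
  python train.py
```

Then `sky jobs launch train.yaml` queues it as a managed job, or `sky launch train.yaml` runs it directly on a cluster.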
It also sounds like I can easily extend this to Vast.ai, which would let me parallelize experiments for 5 cents/hour on RTX 3060s[1]. I’m interested in understanding the algorithms learned by tiny toy models, and fancy GPUs don’t really help since I can’t fully utilize them.
Anyway, if you’re also queueing up local experiments or trying to use remote GPUs efficiently, this is totally worth spending an hour to set up.
FYI: Claude really wanted to set this up in a way that would give every account on my machine root, but you can run the API server as a sudoer and let other users submit jobs without giving them root access. This matters to me because I use user accounts to sandbox dangerously-skip-permissions-mode Claude Code.
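A sketch of the split I mean, assuming the `sky api` subcommands from recent SkyPilot releases (the port and exact flags here are assumptions worth checking against the docs):

```shell
# On the account with sudo (this is the only process that provisions anything):
sky api start                             # starts the shared API server (default port 46580, I believe)

# On a sandboxed user account (no root needed):
sky api login -e http://localhost:46580   # point the client at the shared server
sky jobs launch train.yaml                # jobs are submitted through the server
```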
Update: SkyPilot is very opinionated about which GPUs I’m allowed to use on vast.ai, and simultaneously won’t let me add any filtering of my own, so this is less useful than I hoped it would be.
Beware, vast.ai is very much ‘airbnb for gpus’, which is to say it has the same security story as airbnb: the host can do whatever they want and you basically don’t know who they are.
Yeah that’s definitely important to be aware of. I think the security story should be fine in my case, since I’m submitting containerized jobs and uploading results to S3, and nothing is particularly secret (I’m training easy-to-train models so I can inspect the algorithms they learn).
One annoying thing about SkyPilot though is that it treats all GPUs on vast.ai equally and doesn’t let you pass additional filters besides “give me an RTX 5090”. The vastai CLI has a lot more options, including datacenter-only if you want.
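For comparison, a hedged sketch of what the vastai CLI's offer search can express; the query field names below are from memory and worth double-checking against `vastai search offers --help`:

```shell
# Single RTX 5090 offers, datacenter hosts only, reasonably reliable, cheapest first.
# Field names (gpu_name, datacenter, reliability) and the ordering flag are assumptions.
vastai search offers 'gpu_name=RTX_5090 num_gpus=1 datacenter=true reliability>0.98' -o 'dph'
```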
I have mostly switched from using vast.ai/runpod/lambda labs to modal for my experiments.
That does seem like a much nicer interface, although I think it would be a lot more expensive for my purposes.