Towards commercially useful interpretability
I’ve lately been frustrated with Suno (AI music) and Midjourney, where I get something that has some nice vibes I want, but then it’s wrong in some way.
Generally, the way these have improved has been via getting better prompting, presumably via straightforwardish training.
Recently, I found myself wishing I could get Suno to copy the vibe from one song (which had the wrong melodies but the right atmosphere) into a cover of another song that had the correct melodies. I wanted some combination of interpretability/activation steering/something.
Like, go to a particular range of a song, have some auto-interpretability tools pop out the major features like “violin”, “the key”, “vocal style”, etc., and then let me somehow insert those into another song.
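To make the shape of it concrete, here’s a rough sketch of the kind of tool I’m imagining. Everything in it is hypothetical: the audio model, the sparse autoencoder, the layer, and the feature names are stand-ins for illustration, not anything Suno actually exposes. It’s just standard activation steering, “extract a vibe from a reference clip, then nudge generation toward it”:

```python
# Hypothetical sketch: "pop out the major features of a clip, then insert them
# into another song" via plain activation steering. `model` and `sae` are
# stand-ins for an audio transformer and a sparse autoencoder trained on it.
import torch

def capture_activations(model, layer, audio_tokens):
    """Run the model on a reference clip and grab the residual stream at one layer."""
    captured = {}
    def hook(module, inputs, output):
        captured["acts"] = output.detach()
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(audio_tokens)
    handle.remove()
    return captured["acts"]                      # (batch, seq, d_model)

def vibe_direction(sae, acts, top_k=10):
    """Keep only the clip's strongest SAE features (the ones a UI would label
    "violin", "vocal style", ...) and decode them back into one direction."""
    feature_acts = sae.encode(acts).mean(dim=(0, 1))
    top = feature_acts.topk(top_k).indices
    kept = torch.zeros_like(feature_acts)
    kept[top] = feature_acts[top]
    return sae.decode(kept)                      # (d_model,)

def generate_with_vibe(model, layer, prompt_tokens, direction, alpha=4.0):
    """Generate the cover while nudging that layer's activations toward the vibe."""
    def hook(module, inputs, output):
        return output + alpha * direction        # returning a value overrides the output
    handle = layer.register_forward_hook(hook)
    try:
        return model.generate(prompt_tokens)
    finally:
        handle.remove()
```

Whether that turns into a product is basically whether alpha, top_k, and “which layer” can become sliders a musician would actually want to touch.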
I’m not sure of the tractability of getting this to be useful enough to be worthwhile. But if you got over some minimum hump of “it’s at least a useful tool to have in combination with normal prompting”, you might be able to get into a flywheel of “now it’s easier to funnel commercial dollars into interpretability research, and there’s a feedback loop of ‘was it actually useful?’”.
And, doing it for visual or musical art would add a few steps between “improved interpretability” and “directly accelerating capabilities.”
This is presumably difficult because:
Interpretability just isn’t quite there yet, and you need a really good team to get it there and keep making progress.
Converting it into a product is a lot of work and skill – you need a particularly good UI and product design team.
This came up because I noticed the recent Suno Studio app was actually quite good, UI-wise (they had some nice little innovations over the usual sound-mixing-app suite), and I bet Suno could figure out how to get from Chris Olah-style interfaces to something actually usable.
I don’t think it even has to be that useful in order to become commercially interesting – it could start as a fun toy that’s mostly delightful in how it produces weird shit that doesn’t quite make sense.
I’m a researcher at Suno, and interpretability and control are things we are very interested in!
In general, I think music is a very challenging, low-stakes test bed for alignment approaches. Everyone has wildly varied and specific tastes in music, which often can’t be described in words. Feedback is also relatively expensive compared to language and images, since you need to spend time listening to the audio.
Any advances in controllability do get released quickly to an eager audience, as with Studio. The commercial incentives align well. We’re looking for people and ideas to push further in this direction.
Oh great news!
I’m curious what the raw state of this is – like, what metadata do you currently have about a given song or slice-of-a-song?
Have you experimented with ComfyUI? Image generation for “power users”—many more knobs to turn and control output generation. There’s actually a lot of room for steering during the diffusion/generation process, but it’s hard to expose it in the same intuitive/general way as a prompt in language. Comfy feels more like learning Adobe or Figma.
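For a concrete flavor of one such knob (a sketch using Hugging Face diffusers and the standard public Stable Diffusion checkpoint rather than ComfyUI itself): you can blend the text embeddings of two prompts into a single conditioning vector, which a plain prompt box has no way to express.

```python
# Sketch: steer generation by mixing two prompts' embeddings, a knob that
# doesn't exist at the prompt level. Uses the public SD 1.5 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def embed(prompt):
    """Encode a prompt into CLIP text embeddings (the pipeline's conditioning)."""
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to("cuda")
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

# 70% of one vibe, 30% of another, as a single conditioning vector.
mix = 0.7 * embed("a quiet rainy jazz bar, film grain") + 0.3 * embed(
    "bright synthwave city at night"
)
image = pipe(prompt_embeds=mix, guidance_scale=7.5).images[0]
image.save("blend.png")
```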
Hadn’t heard of it. Will take a look. Curious if you have any tips for getting over the initial hump of grokking its workflow.