Why are you putting your six singers into two groups of three? The ideal, from the perspective of everyone hearing as many people as possible, is to order your singers a, b, c, d, e, f, so that each person hears the audio from those ahead of them. If you have a very large number of people, such that arranging them in a full chain gives an end-to-end latency that is too high, then you can use some sort of chain of groups, for example a, b + c, d + e, f + g.
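As a rough illustration of the chain ordering, here is a minimal sketch of what each singer would hear, assuming every track is already time-aligned and stored as a NumPy array of samples. The function name `chain_mixes` and the toy constant-valued tracks are made up for the example; this is not the project's actual code.

```python
import numpy as np

def chain_mixes(tracks):
    """For singers listed in chain order, return what each one hears:
    the sum of every track ahead of them in the chain."""
    heard = []
    running = np.zeros_like(tracks[0])
    for track in tracks:
        # The singer hears everything accumulated so far, but not themselves.
        heard.append(running.copy())
        # Their own voice is then added for the people behind them.
        running = running + track
    return heard

# Toy tracks (hypothetical data): singer a is all 1s, b all 2s, c all 3s.
tracks = [np.full(4, float(i + 1)) for i in range(3)]
heard = chain_mixes(tracks)
# heard[0] is silence, heard[1] is a alone, heard[2] is a + b.
```

The first singer hears nothing and the last singer hears everyone else, which is why the ordering matters.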
If you have any sort of chain that is reasonably long, then you want to be resilient to losing a link. That’s much easier to do when you have a server that everyone is sending and receiving audio from. Our current design can recover smoothly from someone having a network hiccup because all that happens is you lose a bit of audio data and then resume. Key to this is that people downstream from the one having a network problem don’t have their audio interrupted, beyond losing audio from the person who is no longer connected.
In theory a peer-to-peer approach could offer slightly lower latency, but I expect the gain there is minimal. Sending a packet from a to b, versus sending a packet from a to a high-connectivity well-placed central server to b, isn’t actually that different.
With the FFT, I think you may be effectively reinventing lossy audio compression? I think we’ll likely get much better results using Opus or another modern codec.
Everyone hears audio from everyone.
Suppose you have loads of singers. The task of averaging all the signals together may be too much for any one computer, or might require too much bandwidth.
So you split the task of averaging into 3 parts.
np.sum(a) == np.sum(a[:x]) + np.sum(a[x:])
One computer can average the signals from Alice and Bob, a second can average the signals from Carol and Dave. These both send their signals to a third computer, which averages the two signals into a combination of all 4 singers, and sends everyone the results.
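A minimal sketch of that split, assuming each singer's audio is an equal-length NumPy array of samples. The `mix` helper and the random stand-in signals are invented for the illustration. Each mixer passes on a sum and the division by the number of singers happens once at the end, which avoids having to weight partial averages when groups are of different sizes.

```python
import numpy as np

def mix(signals):
    """Sum a list of equal-length sample arrays into one signal."""
    return np.sum(signals, axis=0)

# Hypothetical stand-in audio: 480 random samples per singer.
rng = np.random.default_rng(0)
alice, bob, carol, dave = rng.standard_normal((4, 480))

# Two mixers each handle half of the singers.
partial_1 = mix([alice, bob])
partial_2 = mix([carol, dave])

# A third mixer combines the partial sums.
combined = mix([partial_1, partial_2])

# Up to floating-point rounding, this matches mixing everyone on one machine.
direct = mix([alice, bob, carol, dave])
matches = np.allclose(combined, direct)

# Divide once at the end to get the average of all four singers.
average = combined / 4
```

The comparison uses `np.allclose` rather than `==` because floating-point sums taken in a different order can differ in the last few bits.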
I think that’s where you’re imagining this differently than I am. In the approach I am describing, everything is real time. The only time you hear something is while you are singing along to it. You never hear a version of the audio that includes your own voice or the voices of anyone after you in the chain. The goal is not to create something and have everyone listen back to it; the goal is to sing together in the moment.
By “send everyone the results” I was thinking of doing this with a block of audio lasting a few milliseconds.
Everyone hears everyone’s voices with a few milliseconds of delay.
If you don’t want to echo people’s own voices back to them, then keep track of the timestamps, so that every computer can subtract its own signal from the total.
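The subtraction step could look roughly like this, assuming audio travels in fixed-size blocks tagged with a timestamp and each client keeps its recently sent blocks in a dictionary keyed by that timestamp. The names `mix_minus` and `sent_blocks` are hypothetical. It also assumes the total is a plain sum; if the server sends an average or a lossily compressed mix, the subtraction would no longer cancel exactly.

```python
import numpy as np

def mix_minus(total_block, sent_blocks, timestamp):
    """Remove this client's own contribution from the mixed block.

    total_block: the sum of everyone's audio for this timestamp.
    sent_blocks: dict mapping timestamp -> block this client sent.
    """
    own = sent_blocks.get(timestamp)
    if own is None:
        # Nothing recorded for this timestamp (e.g. the block was dropped),
        # so there is nothing to subtract.
        return total_block
    return total_block - own

# Toy example: this client's block is all 1s, the others sum to all 5s.
own_block = np.ones(4)
others = np.full(4, 5.0)
sent_blocks = {1000: own_block}
total = own_block + others

without_self = mix_minus(total, sent_blocks, timestamp=1000)
unchanged = mix_minus(total, sent_blocks, timestamp=2000)
```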
The amount of latency, even in an extremely efficient implementation, will be high enough to keep that approach from working. Unless everyone has a very low latency audio setup (roughly the default on Macs, somewhat difficult elsewhere, impossible on Android), a wired internet connection, and relatively low physical distance, you just can’t get low enough latency to keep everything feeling simultaneous. A good target there is about 30 ms.
The goal with this project is to make something feel like group singing, even though people are not actually singing at the exact same time as each other.