The goal here is for this to feel like group singing as much as possible. I think in your proposal there will be many pairs of users where a doesn’t hear b and b also doesn’t hear a? And not just because they’re singing at the same wall-clock time?
I also don’t see why you’re proposing the FFT? Yes, that lets you accumulate frequency information, but you lose timing information in the process. Since our goal is to transmit full audio, I’m not sure how you see the FFT fitting in.
The idea involved everyone hearing everyone. If you have 6 singers, arranged into 2 groups of 3, then each group of 3 can have one person combine its 3 streams of audio into one, and then send that off to be combined into a single stream. Instead of Alice sending Alice’s singing to the server, and Bob sending Bob’s singing, Alice can send her singing to Bob, and Bob can send (Alice + Bob) to the server.
The idea with Fourier transforms was to use them as a form of compression in which two compressed signals can be summed quickly into a single compressed signal.
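The property this idea leans on is that the Fourier transform is linear: the transform of a sum equals the sum of the transforms. A minimal sketch of that, assuming tiny made-up sample blocks and a naive DFT standing in for a real FFT (the `dft` helper is written here for illustration only):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a list of samples."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical 4-sample blocks from two singers.
a = [1.0, 0.0, -1.0, 0.0]
b = [0.5, 0.5, -0.5, -0.5]

# Transform of the summed signal...
spectrum_of_sum = dft([p + q for p, q in zip(a, b)])
# ...versus the sum of the two separate transforms.
sum_of_spectra = [p + q for p, q in zip(dft(a), dft(b))]

# The two agree up to floating-point rounding.
max_error = max(abs(p - q) for p, q in zip(spectrum_of_sum, sum_of_spectra))
print(max_error < 1e-9)
```

Linearity holds equally for the raw samples, so on its own it does not make the data any smaller. Any size reduction would have to come from discarding or quantizing coefficients, which is a lossy step.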
Why are you putting your six singers into two groups of three? The ideal, from the perspective of everyone hearing as many people as possible, is to order your singers a, b, c, d, e, f. Each person hears the audio from those ahead of them. If you have a very large number of people, such that arranging them in a full chain gives an end-to-end latency that is too high, then you can use some sort of chain of groups, for example a, b + c, d + e, f + g.
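To make the chain ordering concrete, here is a rough sketch of who hears whom. It assumes each singer's block is already aligned to the same position in the song and that mixing is a plain sample-wise sum; `heard_by_each` and the sample values are hypothetical, not taken from the actual project.

```python
def heard_by_each(chain):
    """chain: list of groups in chain order; each group maps a singer's
    name to that singer's block of samples.
    Returns a dict mapping each singer to the mix of all earlier groups."""
    block_len = len(next(iter(chain[0].values())))
    running = [0] * block_len          # mix of everyone ahead in the chain
    heard = {}
    for group in chain:
        # Everyone in this group hears only the groups ahead of it.
        for name in group:
            heard[name] = list(running)
        # Then this group's voices join the running mix for later groups.
        for samples in group.values():
            running = [r + s for r, s in zip(running, samples)]
    return heard

# Full chain: a, b, c (one singer per position).
full_chain = [{"a": [1, 1]}, {"b": [2, 2]}, {"c": [4, 4]}]
# Chain of groups: a, b + c (b and c share a position).
grouped_chain = [{"a": [1, 1]}, {"b": [2, 2], "c": [4, 4]}]

full = heard_by_each(full_chain)
grouped = heard_by_each(grouped_chain)
print(full)
print(grouped)
```

In the full chain, c hears both a and b. In the grouped chain, b and c each hear only a, which is the cost of the shorter chain.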
If you have any sort of chain that is reasonably long, then you want to be resilient to losing a link. That’s much easier to do when you have a server that everyone is sending and receiving audio from. Our current design can recover smoothly from someone having a network hiccup because all that happens is you lose a bit of audio data and then resume. Key to this is that people downstream from the one having a network problem don’t have their audio interrupted, beyond losing audio from the person who is no longer connected.
In theory a peer-to-peer approach could offer slightly lower latency, but I expect the gain there is minimal. Sending a packet from a to b, versus sending a packet from a to a high-connectivity well-placed central server to b, isn’t actually that different.
With the FFT, I think you may be effectively reinventing lossy audio compression? I think we’ll likely get much better results using Opus or another modern codec.
Everyone hears audio from everyone.
Suppose you have loads of singers. The task of averaging all the signals together may be too much for any one computer, or might require too much bandwidth.
So you split the task of averaging into 3 parts.
np.sum(a) == np.sum(a[:x]) + np.sum(a[x:])
One computer can average the signals from Alice and Bob, a second can average the signals from Carol and Dave. These both send their signals to a third computer, which averages the two signals into a combination of all 4 singers, and sends everyone the results.
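The three-way split can be sketched as follows, assuming uncompressed integer samples and made-up values; `mix` is a hypothetical helper. The sketch sums at each stage and divides once at the end, since averaging partial averages only matches the overall average when the groups are the same size.

```python
def mix(streams):
    """Sum equal-length blocks of samples element by element."""
    return [sum(samples) for samples in zip(*streams)]

# Hypothetical 3-sample blocks, one per singer.
alice = [100, -200, 300]
bob = [10, 20, -30]
carol = [1, 2, 3]
dave = [-11, -22, -33]

# First computer combines Alice and Bob; second combines Carol and Dave.
first = mix([alice, bob])
second = mix([carol, dave])

# Third computer combines the two partial mixes.
combined = mix([first, second])

# Same result as one machine summing all four streams directly.
direct = mix([alice, bob, carol, dave])
print(combined == direct)

# Divide once by the total number of singers to get the average.
average = [s / 4 for s in combined]
print(average)
```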
I think that’s where you’re imagining this differently than I am. In the approach I am describing, everything is real time. The only time you hear something is when you are singing along to it. You never hear a version of the audio that includes your own voice or the voices of anyone after you in the chain. The goal is not to create something and have everyone listen back to it; the goal is to sing together in the moment.
By “send everyone the results” I was thinking of doing this with a block of audio lasting a few milliseconds.
Everyone hears everyone’s voices with a few milliseconds of delay.
If you don’t want to echo people’s own voices back to them, then keep track of the timestamps so that every computer can subtract its own signal from the total.
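A minimal sketch of the subtraction step, assuming the total is a plain sample-wise sum of uncompressed blocks and that each block carries a timestamp; `remove_own_voice`, `own_history`, and the values are hypothetical.

```python
def remove_own_voice(total_mix, own_history, timestamp):
    """Subtract this computer's own block from the mixed block.

    total_mix:   samples of the mixed block received for `timestamp`
    own_history: dict mapping timestamp -> the block this computer sent
    """
    own = own_history[timestamp]
    return [t - o for t, o in zip(total_mix, own)]

# Blocks this computer sent, keyed by timestamp (made-up values).
own_history = {0: [5, 5, 5], 1: [7, -7, 7]}

# Suppose the other singers' mix for timestamp 1 was [1, 2, 3], so the
# total that comes back is own block + others.
others = [1, 2, 3]
total_at_1 = [o + s for o, s in zip(others, own_history[1])]

without_self = remove_own_voice(total_at_1, own_history, timestamp=1)
print(without_self == others)
```

The subtraction is only exact when the mix is uncompressed. If the total has been through a lossy codec, some residue of the singer's own voice would remain.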
The amount of latency, even in an extremely efficient implementation, will be high enough to keep that approach from working. Unless everyone has a very low latency audio setup (roughly the default on Macs, somewhat difficult elsewhere, impossible on Android), a wired internet connection, and relatively low physical distance, you just can’t get low enough latency to keep everything feeling simultaneous. A good target there is about 30ms.
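A rough way to see how quickly a 30ms budget gets used up is to add the pieces of a one-way path. Only the 30ms target comes from the discussion above. The capture, playback, and buffering figures below are illustrative placeholders, not measurements, and the propagation speed is the usual approximation of about 200 km per millisecond for light in optical fiber.

```python
TARGET_MS = 30.0           # target from the discussion above
FIBER_KM_PER_MS = 200.0    # approximate speed of light in optical fiber

def one_way_latency_ms(capture_ms, playback_ms, distance_km, buffering_ms):
    """Sum the delays between one singer's microphone and another's speaker."""
    propagation_ms = distance_km / FIBER_KM_PER_MS
    return capture_ms + playback_ms + propagation_ms + buffering_ms

# Placeholder values for a low-latency audio setup over 1000 km.
low_latency_setup = one_way_latency_ms(
    capture_ms=5.0, playback_ms=5.0, distance_km=1000.0, buffering_ms=10.0
)

# Placeholder values for a setup with slower audio hardware, same distance.
slow_audio_setup = one_way_latency_ms(
    capture_ms=20.0, playback_ms=20.0, distance_km=1000.0, buffering_ms=10.0
)

print(low_latency_setup, low_latency_setup <= TARGET_MS)
print(slow_audio_setup, slow_audio_setup <= TARGET_MS)
```

With these placeholder numbers, the audio hardware alone decides whether the target is reachable, before any routing or Wi-Fi delay is counted.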
The goal with this project is to make something feel like group singing, even though people are not actually singing at the exact same time as each other.