In our use case, however, we’re not talking about sending one large recording up to the server. Instead, we’d be sending off a batch of samples every 200ms or so. Many compression systems do better if you give them a lot to work with; how efficient is Opus if we give it such short windows?
One way to test is to break the input file up into 200ms files, encode each one with Opus, and then measure the total size. The default Opus file format, however, includes what I measure as ~850 bytes of header, and since we control both the client and the server we don’t need to send any header. So I count for my test file...
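The chunking and measurement can be scripted; here’s a rough sketch of the approach. It’s untested, assumes opusenc (from opus-tools) is installed, and uses made-up file names and my ~850-byte header figure:

```python
# Rough sketch: split input.wav into 200ms chunks, encode each with opusenc,
# and total the sizes with the per-file container header subtracted.
# Untested; file names and the ~850-byte header figure are assumptions.
import os
import subprocess
import wave

CHUNK_MS = 200
HEADER_BYTES = 850  # approximate Opus container overhead per file

with wave.open("input.wav", "rb") as w:
    frames_per_chunk = w.getframerate() * CHUNK_MS // 1000
    params = w.getparams()
    total = 0
    n = 0
    while True:
        frames = w.readframes(frames_per_chunk)
        if not frames:
            break
        with wave.open(f"chunk-{n:04d}.wav", "wb") as out:
            out.setparams(params)  # wave fixes up the frame count on close
            out.writeframes(frames)
        subprocess.run(
            ["opusenc", "--quiet", f"chunk-{n:04d}.wav", f"chunk-{n:04d}.opus"],
            check=True,
        )
        total += max(0, os.path.getsize(f"chunk-{n:04d}.opus") - HEADER_BYTES)
        n += 1

print(f"{n} chunks, {total} bytes of payload total")
```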
Based on my understanding of what you are building, this splitting is not a good model for how you would actually implement it. If you have a sender that is generating uncompressed audio, you can feed it into the compressor as you produce it and get a stream of compressed output frames that you can send and decode on the other end, without resetting the compressor in between.
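Roughly this shape (a toy sketch; `OpusEncoder` here is a hypothetical stand-in for whatever real libopus binding you end up using):

```python
# Sketch of the streaming shape: one long-lived encoder, fed audio as it is
# produced, never reset between frames. OpusEncoder is a hypothetical
# stand-in for a real libopus binding.
import itertools

class OpusEncoder:
    """Stand-in: a real binding would hold libopus encoder state."""
    def encode(self, pcm: bytes) -> bytes:
        return pcm  # placeholder; a real encoder would return an Opus frame

def capture_frames():
    # Stand-in for the microphone: yields fixed-size PCM frames forever.
    while True:
        yield b"\x00" * 960 * 2  # 20ms of 16-bit mono at 48kHz

encoder = OpusEncoder()  # created once, reused for the whole session
for pcm in itertools.islice(capture_frames(), 5):
    frame = encoder.encode(pcm)  # encoder state carries over between frames
    # ...send frame to the server here...
```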
Coauthor here: FWIW I also favor eventually switching to the (more reasonable, IMO) streaming approach. But this does require a lot more complexity and state on the server side, so I have not yet attempted to implement it to see how much of an improvement it is. Right now the server is an extremely dumb single-threaded Python program with nginx in front of it, which is performant enough to scale to at least 200 clients. (This is using larger than 200ms windows.) Switching to a WebSocket (or even WebRTC) approach will probably add an order of magnitude in complexity on the server end. (For WebRTC, maybe closer to two orders, from my experiments so far.)
You can do that, but then you need the server to retain per-client state. Everything stays much simpler if we don’t!
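Concretely, the server would need to keep something like this around (a toy sketch; `OpusDecoder` is a hypothetical stand-in for a real libopus binding):

```python
# Toy sketch of the per-client state a streaming server would have to manage.
# OpusDecoder is a hypothetical stand-in for a real libopus binding.
class OpusDecoder:
    """Stand-in: a real binding would hold libopus decoder state."""
    def decode(self, packet: bytes) -> bytes:
        return packet  # placeholder; a real decoder would return PCM

# One long-lived decoder per connected client, keyed by client id.
decoders: dict[str, OpusDecoder] = {}

def handle_packet(client_id: str, packet: bytes) -> bytes:
    # Decoder state is created on first contact; a real server would also
    # need to expire it on disconnect or timeout.
    return decoders.setdefault(client_id, OpusDecoder()).decode(packet)
```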
I agree with @cata here. You also introduce a forced 200ms of latency with batch sending. I would also suggest a proper transport protocol like RTP, which helps with handling problems like packet loss and reordering in transit.
To support fully streaming operation in the browser, I think we would need to switch to using WebSockets. Doable, but complicated, and maybe not worth it?
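The receive loop itself might look something like this minimal sketch (using recent versions of the third-party `websockets` package; the port and handler logic are made up):

```python
# Minimal sketch of a WebSocket receive loop, using recent versions of the
# third-party "websockets" package. Port and handler logic are made up.
import asyncio
import websockets

async def handle_client(ws):
    # Each message would be one compressed audio frame from the browser.
    async for message in ws:
        pass  # feed into this client's decoder here

async def main():
    async with websockets.serve(handle_client, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())
```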
200ms of latency isn’t ideal, but also isn’t that bad. The design does not require minimizing latency, just keeping it reasonably small.