This is a read-only archive of the Mumble forums.

This website archives historical state and makes it accessible. It receives no updates or corrections. It is provided only to keep the information available as-is, at its old address.

For up-to-date information please refer to the Mumble website and its linked documentation and other resources. For support please refer to one of our other community/support channels.


Mixing audio streams from several users


yuiu


I'm extending my Mumble library. I just realized that Murmur doesn't mix audio from different users, and the client has to do it itself :-) I'm having trouble pinpointing how this problem is solved in the original Mumble client.


Let's assume I have decoded audio packets from several users. The packets can have different lengths (10, 20, or 60 ms). What are the optimal strategies for mixing them in real time? How does Mumble do it?


  • 1 month later...
  • Administrators

We have one jitter buffer (we use the one from Speex DSP) per speaking user, into which the packets from the network are inserted (AudioOutputSpeech::addFrameToBuffer). The important information for this is the sequence number as well as the number of samples in the packet. The sequence number is part of the framing and tells us the time this packet belongs to in that user's stream. The number of samples tells us how long the audio in that packet is. This jitter buffer also handles reordering.


AudioOutputSpeech.cpp AudioOutputSpeech::addFrameToBuffer

	jbp.data = const_cast<char *>(qbaPacket.constData()); // still-encoded payload bytes
	jbp.len = qbaPacket.size();                            // payload length in bytes
	jbp.span = samples;                                     // audio length of this packet, in samples
	jbp.timestamp = iFrameSize * iSeq;                      // position in the stream, from the sequence number
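
In case the Speex DSP jitter buffer API is unfamiliar, a minimal sketch of the same per-user insertion step might look like the following. Only the speex_jitter.h calls (jitter_buffer_init, jitter_buffer_put) and the JitterBufferPacket fields are the real library API; the surrounding names (userBuffers, onAudioPacket, SAMPLES_PER_FRAME) and the 48 kHz / 10 ms framing are assumptions made for this example, not Mumble's actual code.

	// Sketch: one Speex DSP jitter buffer per speaking user.
	// Helper names and constants are invented for illustration.
	#include <speex/speex_jitter.h>

	#include <cstdint>
	#include <map>
	#include <vector>

	static const int SAMPLES_PER_FRAME = 480; // assuming 10 ms frames at 48 kHz

	// One jitter buffer per session/user id.
	static std::map<unsigned int, JitterBuffer *> userBuffers;

	// Called for every incoming voice packet of one user.
	void onAudioPacket(unsigned int session, std::int64_t seq,
	                   const std::vector<char> &payload, int samples) {
		JitterBuffer *&jb = userBuffers[session];
		if (!jb)
			jb = jitter_buffer_init(SAMPLES_PER_FRAME); // step size in samples

		JitterBufferPacket jbp = {};
		jbp.data = const_cast<char *>(payload.data());  // still-encoded payload
		jbp.len = static_cast<spx_uint32_t>(payload.size());
		jbp.span = static_cast<spx_uint32_t>(samples);  // audio length of this packet
		jbp.timestamp = static_cast<spx_uint32_t>(SAMPLES_PER_FRAME * seq); // position from sequence number
		jitter_buffer_put(jb, &jbp);                    // reordering is handled inside the buffer
	}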

 

Mumble's output then simply relies on the output device trying to keep its output buffer filled. We have a mix function (AudioOutput::mix) that you tell how many samples you plan to output; it then goes to all currently active speaker objects and asks each of them for that amount of samples (AudioOutputSpeech::needSamples). The needSamples function decodes as many samples from the jitter buffer as are needed to fulfill that request. If the jitter buffer does not have the right packets (missing or late), this is recognized, the packet loss correction of the codec is used, and the jitter buffer size may get adjusted to prevent future underruns or misses. End of speech is signaled by a terminator flag in the packet framing.
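
To make that pull model concrete, here is a stripped-down sketch of such a mixing step. It is not Mumble's actual AudioOutput::mix; the Speaker interface, the hard clipping, and the erase-on-finish handling are assumptions of this sketch, and the real code additionally deals with resampling, volume, positional audio, and so on.

	// Simplified pull-style mixer: the sound card callback asks for frameCount
	// samples and every currently speaking user contributes that many, decoded
	// on demand from its own jitter buffer (with loss concealment when packets
	// are missing). Speaker stands in for AudioOutputSpeech in this sketch.
	#include <algorithm>
	#include <vector>

	struct Speaker {
		// Fill `buffer` with `count` mono float samples. Returns false once the
		// terminator has been reached and the stream has drained.
		virtual bool needSamples(float *buffer, unsigned int count) = 0;
		virtual ~Speaker() {}
	};

	// Returns true while at least one speaker is still active.
	bool mix(std::vector<Speaker *> &speakers, float *output, unsigned int frameCount) {
		std::fill(output, output + frameCount, 0.0f);
		std::vector<float> scratch(frameCount);

		for (auto it = speakers.begin(); it != speakers.end();) {
			const bool alive = (*it)->needSamples(scratch.data(), frameCount);
			for (unsigned int i = 0; i < frameCount; ++i)
				output[i] += scratch[i];              // plain additive mix
			it = alive ? it + 1 : speakers.erase(it); // drop speakers that finished
		}

		// Keep the sum in range; simple hard clipping for the sketch.
		for (unsigned int i = 0; i < frameCount; ++i)
			output[i] = std::max(-1.0f, std::min(1.0f, output[i]));

		return !speakers.empty();
	}

The key point is the direction of flow: nothing is pushed to the output; the output callback pulls, and each speaker decodes exactly as many samples as the mixer asks for.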


TL;DR: Thanks to the dynamically sized jitter buffer we don't really care about how long the packets are. We assume the client streams enough of them and decode them when we need to mix them. No absolute alignment is attempted.


