Telephony / Voice over IP (VoIP)
Responsiveness/sensitivity to round-trip time
In its most common use for two-party conversations, telephony is very sensitive to responsiveness as determined by round-trip time (RTT). It is commonly said that the perception of a normal conversation breaks down once the RTT exceeds some value between 150 and 400 ms. This is why telephone conversations over geostationary satellites feel (felt) painful to many people. In addition, it is disturbing to hear one's own voice echoed back through a phone connection. Here too there are RTT limits beyond which the echo becomes perceptible and may have to be addressed by active echo suppression or cancellation.
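To see why geostationary links fall outside that range, a rough lower bound on their round-trip time can be computed from propagation delay alone. The sketch below assumes a single satellite hop with both ground stations directly beneath the satellite, and ignores all processing, coding and terrestrial delays, so real figures are higher.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
GEO_ALTITUDE_KM = 35_786  # altitude of geostationary orbit above the equator


def geo_round_trip_ms(altitude_km=GEO_ALTITUDE_KM):
    """Minimum RTT over one satellite hop, propagation delay only."""
    # A round trip crosses the ground-to-satellite distance four times:
    # up and down for the outgoing voice, up and down for the reply.
    return 4 * altitude_km / SPEED_OF_LIGHT_KM_S * 1000


print(f"{geo_round_trip_ms():.0f} ms")
```

Even this idealized figure lies above the upper end of the 150 to 400 ms range quoted above.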
Sound quality degradation
On the other hand, human perception seems to deal well with degraded sound quality as induced by background noise or brief signal loss. Audio quality is determined by several factors, including:
- end-system equipment such as handsets/headsets and their arrangement
- codec (coding/decoding) algorithms used
- bit rate (often fixed by the codec choice)
- impediments resulting from lost packets etc.
Disturbances from the network don't necessarily translate directly into audible impediments. For example, jitter is typically compensated for by a jitter buffer, which can also undo the effects of reordered packets (both only within the limits of the buffer's depth). The effect of lost packets can be partly concealed by interpolating or replaying from neighboring samples (total silence would be intrusive), or even by spreading information across multiple packets using forward error correction (FEC) schemes.
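As an illustration of the jitter buffer idea, here is a minimal sketch of a playout buffer that reorders frames by sequence number and conceals a missing frame by replaying the previous one. The class name `JitterBuffer`, its methods and the `depth` parameter are made up for this example; real implementations adapt the depth dynamically, handle sequence-number wraparound and use far better concealment.

```python
class JitterBuffer:
    """Toy playout buffer: reorders frames by sequence number and conceals
    gaps by replaying the previous frame. Sequence wraparound is ignored."""

    def __init__(self, depth=2):
        self.depth = depth      # frames to collect before playout starts
        self.pending = {}       # sequence number -> frame
        self.next_seq = None    # next sequence number due for playout
        self.last_frame = None  # most recently played frame

    def push(self, seq, frame):
        """Store an arriving frame; return False if it arrived too late."""
        if self.next_seq is not None and seq < self.next_seq:
            return False
        self.pending[seq] = frame
        return True

    def pop(self):
        """Call once per frame interval. Returns None while prefilling,
        otherwise a (frame, concealed) pair."""
        if self.next_seq is None:
            if len(self.pending) < self.depth:
                return None
            self.next_seq = min(self.pending)
        frame = self.pending.pop(self.next_seq, None)
        concealed = frame is None
        if concealed:
            frame = self.last_frame  # naive concealment: replay last frame
        self.last_frame = frame
        self.next_seq += 1
        return frame, concealed


buf = JitterBuffer(depth=2)
buf.push(2, "B")         # arrives out of order, ahead of frame 1
buf.push(1, "A")
print(buf.pop())         # frame 1 is played first
print(buf.pop())         # then frame 2
print(buf.pop())         # frame 3 has not arrived: frame 2 is replayed
print(buf.push(3, "C"))  # frame 3 shows up after its playout time
```

A deeper buffer tolerates more jitter and reordering, but every buffered frame adds to the end-to-end delay, which conflicts with the round-trip time sensitivity described earlier.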
Codecs
(TODO: Describe what codecs are, how they determine rates and sample sizes, how sample sizes impact end-to-end delay, the tininess of samples/packets when using modern codecs, that newer codecs support variable rates...)
Reaction to congestion
A golden rule on the Internet says that applications should back off their sending rate when they detect congestion, typically through packet loss or increased queuing delay. TCP does this in a proven and scalable way, but is not typically used for voice because it cannot maintain the real-time requirements of the application. Most current interactive voice (and video) applications use UDP and real-time framing protocols such as RTP, neither of which includes mechanisms that respond to congestion (UDP's service model is the sending of individual datagrams, so that would indeed be difficult). In order to satisfy the golden rule, voice applications should monitor congestion (this requires closing the feedback loop between the receiver and the sender through something like RTCP) and vary the sending rate, which would likely involve further signaling in order to re-negotiate codecs and/or frame rates.
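The feedback loop described here could look roughly like the following sketch, in which the sender steps between a few codec operating points based on the loss fraction reported by the receiver. The rate ladder, the thresholds and the function name are all hypothetical values chosen for illustration; this is not TFRC or any standardized algorithm, and it leaves out the signaling needed to actually switch codecs.

```python
# Hypothetical codec operating points in bit/s, lowest first.
RATE_LADDER_BPS = [8000, 16000, 32000, 64000]


def next_rate_index(index, loss_fraction, high=0.05, low=0.01):
    """Pick the next operating point from one receiver loss report.

    The thresholds `high` and `low` are illustrative assumptions.
    """
    if loss_fraction > high and index > 0:
        return index - 1  # congestion suspected: back off one step
    if loss_fraction < low and index < len(RATE_LADDER_BPS) - 1:
        return index + 1  # path looks clean: probe one step up
    return index          # otherwise hold the current rate


index = len(RATE_LADDER_BPS) - 1  # start at the highest rate
for loss in [0.0, 0.08, 0.10, 0.03, 0.0]:
    index = next_rate_index(index, loss)
    print(f"loss {loss:.0%}: send at {RATE_LADDER_BPS[index]} bit/s")
```

The gap between the two thresholds keeps the sender from oscillating between adjacent rates on every report.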
The Datagram Congestion Control Protocol (DCCP) has been developed to support applications with real-time requirements that want to be congestion-responsive. It has "pluggable" congestion control modules, because real-time applications have different constraints within which they are able to adapt their rate. The protocol is still under development, but has been implemented in recent versions of the Linux kernel.
Because the typical modern codec generates many packets with tiny payloads (short voice samples efficiently compressed), the most useful response to congestion may be not to switch to even more efficient voice compression, but rather to increase the sample size, that is, the amount of audio carried per packet, so that fewer packets are generated. The payloads will be bigger, but the bandwidth use is often dominated by headers anyway. Reducing the packet rate also helps in those cases where congestion is due to forwarding limitations.
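The effect of header overhead can be made concrete with a small calculation. The sketch below assumes RTP over UDP over IPv4 (12 + 8 + 20 = 40 bytes of headers per packet), no header compression, no link-layer framing, and an illustrative 8 kbit/s codec; the function name `voice_stream` is made up for this example.

```python
# Per-packet headers for RTP over UDP over IPv4, without header
# compression and ignoring link-layer framing.
IP_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12
HEADER_BYTES = IP_HEADER + UDP_HEADER + RTP_HEADER  # 40 bytes


def voice_stream(codec_bps, frame_ms):
    """Return (packets/s, total bit/s, header share) for one direction."""
    packets_per_s = 1000 / frame_ms
    payload_bytes = codec_bps / 8 * frame_ms / 1000
    packet_bytes = payload_bytes + HEADER_BYTES
    total_bps = packet_bytes * 8 * packets_per_s
    return packets_per_s, total_bps, HEADER_BYTES / packet_bytes


for frame_ms in (10, 20, 40, 60):
    pps, bps, share = voice_stream(8000, frame_ms)
    print(f"{frame_ms:2d} ms per packet: {pps:5.1f} pkt/s, "
          f"{bps / 1000:4.1f} kbit/s, {share:.0%} headers")
```

Under these assumptions, going from 20 ms to 40 ms of audio per packet halves the packet rate and cuts the total bit rate from 24 to 16 kbit/s without touching the codec, at the price of 20 ms of additional packetization delay.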
References
- draft-ietf-dccp-tfrc-media-02.txt, Strategies for Streaming Media Applications Using TCP-Friendly Rate Control, T. Phelan, July 2007 (Internet-Draft, work in progress)
- comments on draft-ietf-dccp-tfrc-media-02.txt, I. Johansson, August 2007
– Main.SimonLeinen - 23 Aug 2007

