Voice over IP: Speech Transmission over Packet Networks
ABSTRACT The emergence of packet networks for both data and voice traffic has introduced new challenges for speech transmission designs
that differ significantly from those encountered and handled in traditional circuit-switched telephone networks, such as the
public switched telephone network (PSTN). In this chapter, we present the many aspects that affect speech quality in
a voice over IP (VoIP) conversation. We also present design techniques for coding systems that aim to overcome the deficiencies
of the packet channel. By properly utilizing speech codecs tailored for packet networks, VoIP can in fact produce a quality
higher than that possible with PSTN.
Part C: Speech Coding, Chapter 15
J. Skoglund, E. Kozica, J. Linden, R. Hagen, W. B. Kleijn
15.1 Voice Communication
15.1.1 Limitations of PSTN
15.1.2 The Promise of VoIP
15.2 Properties of the Network
15.2.1 Network Protocols
15.2.2 Network Characteristics
15.2.3 Typical Network Characteristics
15.2.4 Quality-of-Service Techniques
15.3 Outline of a VoIP System
15.3.1 Echo Cancelation
15.3.2 Speech Codec
15.3.3 Jitter Buffer
15.3.4 Packet Loss Recovery
15.3.5 Joint Design of Jitter Buffer and Packet Loss Concealment
15.3.6 Auxiliary Speech Processing Components
15.3.7 Measuring the Quality of a VoIP System
15.4 Robust Encoding
15.4.1 Forward Error Correction
15.4.2 Multiple Description Coding
15.5 Packet Loss Concealment
15.5.1 Nonparametric Concealment
15.5.2 Parametric Concealment
15.6 Conclusion
References
15.1 Voice Communication
Voice over internet protocol (IP), known as VoIP, represents
a new voice communication paradigm that is rapidly
establishing itself as an alternative to traditional
telephony solutions. While VoIP generally leads to cost
savings and facilitates improved services, its quality has
not always been competitive. For over a century, voice
communication systems have used virtually exclusively
circuit-switched networks and this has led to a high level
of maturity. The end-user has been accustomed to a telephone
conversation that has consistent quality and low
delay. Further, the user expects a signal that has a narrow-band
character and, thus, accepts the limitations present
in traditional solutions, limitations that VoIP systems
lack.
A number of fundamental differences exist between
traditional telephony systems and the emerging VoIP
systems. These differences can severely affect voice
quality if not handled properly. This chapter will dis-
cuss the major challenges specific to VoIP and show
that, with proper design, the quality of a VoIP solu-
tion can be significantly better than that of the public
switched telephone network (PSTN). We first provide
a broad overview of the issues that affect end-to-end
quality. We then present some general techniques for
designing speech coders that are suited for the challenges
imposed by VoIP. We emphasize multiple description
coding, a powerful paradigm that has shown promising
performance in practical systems, and also facilitates
theoretical analysis.
15.1.1 Limitations of PSTN
Legacy telephony solutions are narrow-band. This
property imposes severe limitations on the achievable
quality. In fact, in traditional telephony applications, the
speech bandwidth is restricted more than the inherent
limitations of narrow-band coding at an 8 kHz sampling
rate. Typical telephony speech is band-limited to
300–3400 Hz. This bandwidth limitation explains why
we have come to expect telephony speech to sound weak
and unnatural and to lack crispness. The final connection to
most households (the so-called local loop) is generally
analog, by means of two-wire copper cables, while entirely
digital connections are typically only found in
enterprise environments. Due to poor connections or
old wires, significant distortion may be generated in the
analog part of the phone connection, a type of distortion
that is entirely absent in VoIP implementations. Cordless
phones also often generate significant analog distortion
due to radio interference and other implementation
issues.
15.1.2 The Promise of VoIP
It is clear that significant sources of quality degradation
exist in the PSTN. VoIP can be used to avoid this dis-
tortion and, moreover, to remove the basic constraints
imposed by the analog connection to the household.
As mentioned above, even without changing the
sampling frequency, the bandwidth of the speech sig-
nal can be enhanced over telephony band speech. It
is possible to extend the lower band down to about
50 Hz, which improves the base sound of the speech
signal and has a major impact on the naturalness, presence,
and comfort in a conversation. Extending the upper
band to almost 4 kHz (a slight margin for sampling filter
roll-off is necessary) improves the naturalness and
crispness of the sound. All in all, a fuller, more-natural
voice and higher intelligibility can be achieved just
by extending the bandwidth within the limitations of
narrow-band speech. This is the first step towards face-
to-face communication quality offered by wide-band
speech.
In addition to having an extended bandwidth, VoIP
has fewer sources of analog distortion, resulting in the
possibility to offer significantly better quality than PSTN
within the constraint of an 8 kHz sampling rate. Even
though this improvement is often clearly noticeable, far
better quality can be achieved by taking the step to
wide-band coding.
One of the great advantages of VoIP is that there is
no need to settle for narrow-band speech. In principle,
compact disc (CD) quality is a reasonable alternative,
allowing for the best possible quality. However,
a high sampling frequency results in a somewhat higher
transmission bandwidth and, more importantly, imposes
tough requirements on hardware components. The bandwidth
of speech is around 10 kHz [15.1], implying
a sampling frequency of 20 kHz for good quality. However,
16 kHz has been chosen in the industry as the
best trade-off between bit rate and speech quality for
wide-band speech coding.
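
The relation between sampling frequency and usable audio bandwidth
used in this discussion is the Nyquist criterion. The following is
a minimal sketch of that arithmetic; the function names are
illustrative and not part of any codec or library.

```python
def min_sampling_rate_hz(bandwidth_hz: float) -> float:
    """Lowest sampling rate that can represent a signal of this bandwidth."""
    return 2.0 * bandwidth_hz


def max_bandwidth_hz(sampling_rate_hz: float) -> float:
    """Theoretical upper band edge (Nyquist frequency) for a sampling rate."""
    return sampling_rate_hz / 2.0


if __name__ == "__main__":
    # Speech bandwidth of about 10 kHz implies a 20 kHz sampling rate.
    print(min_sampling_rate_hz(10_000))
    # Narrow-band (8 kHz) and wide-band (16 kHz) sampling rates.
    for fs in (8_000, 16_000):
        print(fs, "Hz sampling allows at most", max_bandwidth_hz(fs), "Hz")
```

In practice the usable band ends slightly below the Nyquist frequency
because of the sampling filter roll-off mentioned above.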
By extending the upper band to 8 kHz, significant
improvements in intelligibility and quality can be achieved.
Most notably, fricative sounds such as [s] and [f], which
are hard to distinguish in telephony band situations,
sound natural in wide-band speech.
Many hardware factors in the design of VoIP devices
affect speech quality as well. Obvious examples are mi-
crophones, speakers, and analog-to-digital converters.
These issues are also faced in regular telephony, and as
such are well understood. However, since the limited
signal bandwidth imposed by the traditional network is
the main factor affecting quality, most regular phones
do not offer high-quality audio. Hence, this is another
area of potential improvement over the current PSTN
experience.
There are other important reasons why VoIP is
rapidly replacing PSTN. These include cost and flex-
ibility. VoIP extends the usage scenarios for voice
communications. The convergence of voice, data, and
other media presents a field of new possibilities. An
example is web collaboration, which combines appli-
cation sharing, voice, and video conferencing. Each of
the components, transported over the same IP network,
enhances the experience of the others.
15.2 Properties of the Network
15.2.1 Network Protocols
Internet communication is based on the internet pro-
tocol (IP) which is a network layer (layer 3) protocol
according to the seven-layer open systems interconnec-
tion (OSI) model [15.2]. The physical and data link
layers reside below the network layer. On top of the
network layer protocol, a transport layer (OSI layer 4)
protocol is deployed for the actual data transmission.
Most internet applications are using the transmission
control protocol (TCP) [15.3] as the transport protocol.
TCP is very robust since it allows for retransmission
in the case that a packet has been lost or has not arrived
within a specific time. However, there are obvious
disadvantages of deploying this protocol for real-time,
two-way communication. First and foremost, delays can
become very long due to the retransmission process. Another
major disadvantage of TCP is the increased traffic
load due to transmission of acknowledgements and
retransmitted packets. A better choice of transport layer
protocol for real-time communication such as VoIP is
the user datagram protocol (UDP) [15.4]. UDP does not
implement any mechanism for retransmission of packets
and is thus more efficient than TCP for real-time applications.
On top of UDP, another Internet Engineering
Task Force (IETF) protocol, the real-time transport protocol
(RTP) [15.5], is typically deployed. This protocol
includes all the necessary mechanisms to transport data
generated by both standard codecs as well as proprietary
codecs.
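
To make the RTP-over-UDP layering concrete, the sketch below packs
the 12-byte fixed RTP header (version, marker, payload type, sequence
number, timestamp, SSRC) in front of a codec payload and sends it in
a UDP datagram. It is an illustration only: the function names, the
destination address, and the choice of payload type 0 with 160-sample
frames (20 ms at 8 kHz) are assumptions, and header extensions, CSRC
lists, and RTCP are omitted.

```python
import socket
import struct

RTP_VERSION = 2


def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
                     payload_type: int = 0, marker: bool = False) -> bytes:
    """Prefix a codec payload with the 12-byte fixed RTP header."""
    byte0 = RTP_VERSION << 6                      # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed header fields written by build_rtp_packet."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": byte0 >> 6, "marker": bool(byte1 >> 7),
            "payload_type": byte1 & 0x7F, "seq": seq,
            "timestamp": timestamp, "ssrc": ssrc}


def send_frames(frames, address, ssrc=0x12345678, samples_per_frame=160):
    """Send each encoded frame as one RTP packet in one UDP datagram.

    No acknowledgement or retransmission takes place: a datagram that
    is lost in the network is simply never seen by the receiver.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq, frame in enumerate(frames):
            packet = build_rtp_packet(frame, seq, seq * samples_per_frame, ssrc)
            sock.sendto(packet, address)
```

The sequence number lets the receiver detect lost or reordered
packets, and the timestamp lets it place each frame correctly in the
playout timeline.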
It should be mentioned that recently it has become
common to transmit VoIP data over TCP to facilitate
communication through firewalls that would normally
not allow VoIP traffic. This is a good solution from
a connectivity point of view, but introduces significant
challenges for the VoIP software designer due to the
disadvantages of deploying TCP for VoIP.
15.2.2 Network Characteristics
Three major factors associated with packet networks
have a significant impact on perceived speech qual-
ity: delay, jitter, and packet loss. All three factors stem
from the nature of a packet network, which provides
no guarantee that a packet of speech data will arrive at
the receiving end in time, or even that it will arrive at
all. This contrasts with traditional telephony networks
where data are rarely, or never, lost and the transmission
delay is usually a fixed parameter that does not vary
over time. These network effects are the most impor-
tant factors distinguishing speech processing for VoIP
from traditional solutions. If the VoIP device cannot
address network degradation in a satisfactory manner,
the quality can never be acceptable. Therefore, it is
of utmost importance that the characteristics of the IP
network are taken into account in the design and im-
plementation of VoIP products as well as in the choice
of components such as the speech codec. In the fol-
lowing sub-sections delay, jitter, and packet loss are
discussed and methods to deal with these challenges
are covered.
A fact often overlooked is that both sides of a call
need to have robust solutions even if only one side is connected
to a poor network. A typical example is a wireless
device that has been properly designed to be able to
cope with the challenges in terms of jitter and packet
loss typical of a wireless (WiFi) network which is con-
necting through an enterprise PSTN gateway. Often the
gateway has been designed and configured to handle
network characteristics typical of a well-behaved wired
local-area network (LAN) and not a challenging wire-
less LAN. The result can be that the quality is good in
the wireless device but poor on the PSTN side. There-
fore, it is crucial that all devices in a VoIP solution are
designed to be robust against network degradation.
Delay
Many factors affect the perceived quality in two-way
communication. An important parameter is the transmission
delay between the two end-points. If the latency
is high, it can severely affect the quality and ease of conversation.
The two main effects caused by high latency
are annoying talker overlap and echo, which both can
cause significant reduction of the perceived conversation
quality.
In traditional telephony, long delays are experienced
only for satellite calls, other long-distance calls, and
calls to mobile phones. This is not true for VoIP. The
effects of excessive delay have often been overlooked
in VoIP design, resulting in significant conversational
quality degradation even in short-distance calls. Wireless
VoIP, typically over a wireless LAN (WLAN), is becoming
increasingly popular, but increases the challenges of
delay management further.
The impact of latency on communication quality is
not easily measured and varies significantly with the usage
scenario. For example, long delays are not perceived
to be as annoying in a cell-phone environment as with a regular
wired phone because of the added value of mobility.
The presence of echo also has a significant impact on our
sensitivity to delay: the higher the latency, the lower the
perceived quality. Hence, it is not possible to list a single
number for how high a latency is acceptable, but only
some guidelines.
If the overall delay is more than about 40 ms, an
echo is audible [15.6]. For lower delays, the echo is
only perceived as an expected side-tone. For longer delays
a well-designed echo canceler can remove the echo.
For very long delays (greater than 200 ms), even if echo
cancelation is used, it is hard to maintain a two-way
conversation without talker overlap. This effect is often
accentuated by shortcomings of the echo canceler design.
If no echo is generated, a slightly higher delay is
acceptable.
[Fig. 15.1 Main delay sources in VoIP]
[Fig. 15.2 Effect of delay on conversational quality from ITU-T G.114]
The International Telecommunication Union –
Telecommunication Standardization Sector (ITU-T)
recommends in standard G.114 [15.7] that the one-way
delay should be kept below 150 ms for acceptable
conversation quality (Fig. 15.2 is from G.114 and
shows the perceived effect on quality as a function of
delay). Delays between 150 and 400 ms may be acceptable,
but have an impact on the perceived quality
of user applications. A latency larger than 400 ms is
unacceptable.
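
The G.114 guideline quoted above can be written as a small
classification rule. The sketch below is illustrative only: the
function names, the handling of the exact boundary values, and the
delay components in the example budget are assumptions, not figures
from G.114 or from this chapter.

```python
def classify_one_way_delay(delay_ms: float) -> str:
    """Map a one-way delay to the three regions described in the text."""
    if delay_ms < 150:
        return "acceptable"
    if delay_ms <= 400:
        return "may be acceptable"
    return "unacceptable"


def total_one_way_delay_ms(components: dict) -> float:
    """Sum the individual delay contributions of a call path."""
    return sum(components.values())


if __name__ == "__main__":
    # Hypothetical delay budget; every number here is an assumption.
    budget = {
        "codec framing and look-ahead": 25.0,
        "packetization": 20.0,
        "network transit": 60.0,
        "jitter buffer": 40.0,
        "playout and processing": 10.0,
    }
    total = total_one_way_delay_ms(budget)
    print(total, classify_one_way_delay(total))
```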
Packet Loss
Packet losses often occur in the routers, either due to
high router load or to high link load. In both cases,
packets in the queues may be dropped. Packet loss also
occurs when there is a breakdown in a transmission link.
The result is a data link layer error and the incomplete
packet is dropped. Configuration errors and collisions
may also result in packet loss. In non-real-time applications,
packet loss is solved at the transport layer by
retransmission using TCP. For telephony, this is not a viable
solution since retransmitted packets would arrive too
late for use.
When a packet loss occurs some mechanism for
filling in the missing speech must be incorporated.
Such solutions are usually referred to as packet loss
concealment (PLC) algorithms (Sect. 15.5). For best performance,
these algorithms have to accurately predict the
speech signal and make a smooth transition between the
previous decoded speech and inserted segment.
Since packet losses occur mainly when the network
is heavily loaded, it is not uncommon for packet losses to
appear in bursts. A burst may consist of a series of consecutive
lost packets or a period of high packet loss rate.
When several consecutive packets are lost, even good
PLC algorithms have problems producing acceptable
speech quality.
To save transmission bandwidth, multiple speech
frames are sometimes carried in a single packet, so a single
lost packet may result in multiple lost frames. Even
if the packet losses occur more spread out, the listening
experience is then similar to that of having the packet
losses occur in bursts.
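
The effect of carrying several frames per packet can be illustrated
with a short sketch that expands a set of lost packets into a
per-frame loss pattern and measures the longest run of consecutive
lost frames. The function names and the example loss pattern are
hypothetical.

```python
def frame_loss_pattern(lost_packets, n_packets, frames_per_packet):
    """Return one boolean per speech frame; True means the frame was lost."""
    lost = set(lost_packets)
    return [packet in lost
            for packet in range(n_packets)
            for _ in range(frames_per_packet)]


def longest_loss_run(pattern):
    """Length of the longest run of consecutive lost frames."""
    longest = run = 0
    for is_lost in pattern:
        run = run + 1 if is_lost else 0
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    # Two isolated packet losses out of ten packets (assumed example).
    for frames_per_packet in (1, 3):
        pattern = frame_loss_pattern({3, 7}, 10, frames_per_packet)
        print(frames_per_packet, "frame(s) per packet: longest gap of",
              longest_loss_run(pattern), "frame(s)")
```

With one frame per packet the concealment algorithm has to bridge
single-frame gaps; with three frames per packet the same two packet
losses produce three-frame gaps, which resemble burst losses.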
Network Jitter
The latency in a voice communication system can be
attributed to algorithmic, processing, and transmission
delays. All three delay contributions are constant in
a conventional telephone network. In VoIP, the algo-
rithmic and processing delays are constant, but the
transmission delay varies over time. The transit time
of a packet through an IP network varies due to queuing
effects. The transmission delay is interpreted as consisting
of two parts, one being the constant or slowly varying
network delay and the other being the rapid variations
on top of the basic network delay, usually referred to as
jitter.
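
One common way to quantify jitter at the receiver is the running
interarrival jitter estimate defined for RTP in RFC 3550, which
smooths the change in transit time between consecutive packets with
a gain of 1/16. The sketch below applies that estimator and, as
a separate simplifying assumption, splits the measured transit times
into a base delay (taken here as the minimum observed value) plus
a variation. The function names are illustrative.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate in the style of RFC 3550.

    For consecutive packets, D is the difference between the spacing
    at the receiver and the spacing at the sender; the estimate moves
    towards |D| by 1/16 of the gap for each new packet.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = ((recv_times[i] - recv_times[i - 1])
             - (send_times[i] - send_times[i - 1]))
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def split_delay(send_times, recv_times):
    """Split transit times into a base delay and per-packet variation.

    Assumption: the base delay is approximated by the minimum observed
    transit time. Because sender and receiver clocks are not aligned,
    the base value includes an unknown clock offset.
    """
    transit = [r - s for s, r in zip(send_times, recv_times)]
    base = min(transit)
    return base, [t - base for t in transit]
```

Only differences of transit times enter the jitter estimate, so the
unknown offset between the two clocks cancels out.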
The jitter present in packet networks complicates
the decoding process in the receiver device because the
decoder needs to have packets of data available at the
right time instants. If the data is not available, the de-
coder cannot produce continuous speech. A jitter buffer
is normally used to make sure that packets are available
when needed.
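
A minimal sketch of a fixed-delay jitter buffer follows. It assumes
that playout of the first packet starts a fixed time after its
arrival and that every later packet must arrive before its own
playout deadline; a packet that arrives after the deadline is as
useless to the decoder as one lost in the network. The function
names and the example timings are hypothetical, and practical jitter
buffers usually adapt their delay over time.

```python
def on_time_flags(send_times_ms, arrival_times_ms, buffer_delay_ms):
    """Return True for each packet that is available at its playout time.

    arrival_times_ms holds None for packets lost in the network.
    Assumption: the first packet arrives and anchors the playout clock.
    """
    first_send = send_times_ms[0]
    first_playout = arrival_times_ms[0] + buffer_delay_ms
    flags = []
    for send, arrival in zip(send_times_ms, arrival_times_ms):
        deadline = first_playout + (send - first_send)
        flags.append(arrival is not None and arrival <= deadline)
    return flags


def effective_loss_rate(flags):
    """Fraction of packets the decoder must conceal (lost or late)."""
    return flags.count(False) / len(flags)


# Hypothetical trace: 20 ms packets, one late arrival, one network loss.
SEND = [0, 20, 40, 60, 80]
ARRIVAL = [50, 72, 130, 111, None]

if __name__ == "__main__":
    for delay in (20, 40):
        flags = on_time_flags(SEND, ARRIVAL, delay)
        print(delay, "ms buffer:", flags, effective_loss_rate(flags))
```

In this trace a larger buffer delay turns the late packet into
a usable one, at the cost of adding the same amount to the end-to-end
delay.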
Clock Drift
Whether the communication end-points are gateways
or other devices, low-frequency clock drift between the
two can cause receiver buffer overflow or underflow.
Simply speaking, this effect can be described as the two
devices talking to each other having different time references.
For example, the transmitter might send packets
every 20 ms according to its perception of time, while
the receiver’s perception is that the packets arrive every
20.5 ms. In this case, for every 40th packet, the receiver
has to perform a packet loss concealment to avoid buffer
underflow. If the clock drift is not detected accurately,
delay builds up during a call, so clock drift can have a significant
impact on the speech quality. This is particularly
difficult to mitigate in VoIP.
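
The figure of one concealment per 40 packets follows from simple
arithmetic: each arriving packet carries 20 ms of speech, but the
receiver consumes 20.5 ms of its own time per arrival, so it falls
0.5 ms short per packet and is a full 20 ms frame short after
40 packets. The sketch below reproduces this calculation with exact
fractions; the function name is illustrative.

```python
from fractions import Fraction


def packets_per_frame_slip(frame_ms, perceived_interval_ms):
    """Number of received packets after which one whole frame is missing
    (receiver clock fast) or surplus (receiver clock slow).

    Returns None when the two clocks agree.
    """
    frame = Fraction(frame_ms)
    drift_per_packet = abs(Fraction(perceived_interval_ms) - frame)
    if drift_per_packet == 0:
        return None
    return frame / drift_per_packet


if __name__ == "__main__":
    # Values from the example in the text; strings keep the fractions exact.
    print(packets_per_frame_slip("20", "20.5"))
```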
The traditional approach to address clock drift is to
deploy a clock synchronization mechanism at the re-
ceiver to correct for clock drift by comparing the time
stamps of the received RTP packets with the local clock.
It is hard to obtain reliable clock drift estimates in VoIP
because the estimates are based on averaging packet arrivals
at a rate of typically 30–50 per second and because
of the jitter in their arrival times. Consider for comparison
the averaging on a per-sample basis at a rate of
8000 per second that is done in time-division multiplexing
(TDM) networks [15.8]. In practice many algorithms
designed to mitigate the clock drift effect fail to perform
adequately.
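
As an illustration of comparing RTP time stamps with the local
clock, the sketch below fits a least-squares line to arrival time as
a function of sender time stamp and reports the slope minus one as
the relative clock skew. This is one possible estimator chosen for
illustration, not a method prescribed by the chapter; the function
name is hypothetical, and with only 30–50 jittery observations per
second the estimate converges slowly.

```python
def estimate_clock_skew(send_ts_ms, arrival_ms):
    """Least-squares estimate of relative clock skew.

    send_ts_ms: sender time stamps converted to milliseconds.
    arrival_ms: arrival times on the receiver's local clock.
    Returns slope - 1; for example, 0.025 means the receiver sees
    20.5 ms between packets that the sender spaced 20 ms apart.
    """
    n = len(send_ts_ms)
    if n < 2:
        raise ValueError("need at least two packets")
    mean_s = sum(send_ts_ms) / n
    mean_a = sum(arrival_ms) / n
    var_s = sum((s - mean_s) ** 2 for s in send_ts_ms)
    if var_s == 0:
        raise ValueError("sender time stamps must differ")
    cov = sum((s - mean_s) * (a - mean_a)
              for s, a in zip(send_ts_ms, arrival_ms))
    return cov / var_s - 1.0


if __name__ == "__main__":
    # Jitter-free example matching the 20 ms versus 20.5 ms case.
    send = [20.0 * i for i in range(100)]
    arrival = [100.0 + 20.5 * i for i in range(100)]
    print(estimate_clock_skew(send, arrival))
```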
Wireless Networks
Traditionally, packet networks consisted of wired Ethernet
solutions that are relatively straightforward to
manage. However, the rapid growth of wireless LAN
(WLAN) solutions is quickly changing the network
landscape. WLAN, in particular the IEEE 802.11 family
of standards [15.9], offers mobility for computer
access and also the flexibility of wireless IP phones,
and is hence of great interest for VoIP systems. Jitter
and effective packet loss rates are significantly higher
in WLAN than in a wired network, as discussed in
Sect. 15.2.3. Furthermore, the network characteristics
often change rapidly over time. In addition, as the
user moves physically, coverage and interference from
other radio sources (such as cordless phones, Bluetooth [15.10]
devices, and microwave ovens) vary. The
result is that a high level of voice quality is significantly
harder to guarantee in a wireless network than in a typical
wired LAN.
WLANs are advertised as having very high throughput
capacity (11 Mb/s for 802.11b and 54 Mb/s for
802.11a and 802.11g). However, field studies show that
actual throughput is often only half of this, even when the
client is close to the access point. It has been shown that
these numbers are even worse for VoIP due to the high
packet rate, with typical throughput values of 5–10%
(Sect. 15.2.3).
When several users are connected to the same wire-
less access point, congestion is common. The result is
jitter that can be significant, particularly if large data
packets are sent over the same network. The efficiency
of the system quickly deteriorates when the number of
users increases.
When roaming in a wireless network, the mobile device
has to switch between access points. In a traditional
WLAN, it is common that such a hand-off introduces
a 500 ms transmission gap, which has a clearly audible
impact on the call quality. However, solutions are now
available that cut that delay to about 20 to 50 ms,
if the user is not switching between two IP subnets. In
the case of subnet roaming the handover is more complicated
and no really good solutions exist currently.
Therefore, it is common to plan the network in such
a way that the likelihood of subnet roaming is minimized.
Sensitivity to congestion is only one of the limita-
tions of 802.11 networks. Degraded link quality, and
consequently reduced available bandwidth, occurs due
to a number of reasons. Some 802.11 systems operate
in the unlicensed 2.4 GHz frequency range and share
this spectrum with other wireless technologies, such as
Bluetooth and cordless phones. This causes interference
with potentially severe performance degradation since
a lower connection speed than the maximum is chosen.
Poor link quality also leads to an increased number of
retransmissions, which directly affects the delay and jitter.
The link quality varies rapidly when moving around
in a coverage area. This is a severe drawback, since
a WLAN is introduced to add mobility and a wireless
VoIP user can be expected to move around the coverage
area. Hence, the introduction of VoIP into a WLAN environment
puts higher requirements on network planning
than for an all-data WLAN.
The result of the high delays that occur due to access-point
congestion and bad link quality is that the packets
often arrive too late to be useful. Therefore, the effective
packet loss rate after the jitter buffer is typically
significantly higher for WLANs than for wired LANs.