
Using data center TCP (DCTCP) in the Internet

Authors:
Mirja Kühlewind, Communication Systems Group, ETH Zurich, Switzerland (mirja.kuehlewind@tik.ee.ethz.ch)
David P. Wagner, Institute of Communication Networks and Computer Engineering, University of Stuttgart, Germany (david.wagner@ikr.uni-stuttgart.de)
Juan Manuel Reyes Espinosa
Bob Briscoe, BT Research, Ipswich, UK (bob.briscoe@bt.com)
Abstract
Data Center TCP (DCTCP) is an Explicit Congestion
Notification (ECN)-based congestion control and Active Queue
Management (AQM) scheme. It has provoked widespread
interest because it keeps queuing delay and delay variance
very low. There is no theoretical reason why Data Center TCP
(DCTCP) cannot scale to the size of the Internet, resulting
in greater absolute reductions in delay than achieved in data
centres. However, no way has yet been found for DCTCP
traffic to coexist with conventional TCP without being starved.
This paper introduces a way to deploy DCTCP incremen-
tally on the public Internet that could solve this coexistence
problem. Using the widely deployed Weighted Random Early
Detection (WRED) scheme, we configure a second AQM that
is applied solely to ECN-capable packets. We focus solely
on long-running flows, not because they are realistic, but as
the critical gating test for whether starvation can occur. For
the non-ECN traffic we use TCP New Reno; again not to
seek realism, but to check for safety against the prevalent
reference. We report the promising result that not only does the proposed AQM always avoid starvation, but it can also achieve equal rates. We also derive how the sharing ratio between DCTCP and conventional TCP traffic depends on the various AQM parameters. The next step beyond this gating test
will be to quantify the reduction in queuing delay and variance
in dynamic scenarios. This will support the standardization
process needed to define new ECN semantics for DCTCP
deployment that the authors have started at the IETF.
I. INTRODUCTION
Data Center TCP (DCTCP) [1] has provoked widespread
interest because it keeps queuing delay and delay variance very
low relative to the propagation delay across the data centre.
Alizadeh et al. [2] show that the DCTCP approach can scale to
networks with much larger propagation delay. Queuing delay
and delay variance grow proportionately in absolute terms.
Nonetheless they remain low relative to the propagation delay.
However, DCTCP requires changes to all three main parts of
the system: network buffers, receivers and senders. Therefore
DCTCP deployments have been confined to environments like
private data centers where a single administration can upgrade
all the parts at once.
The reason for the need to change all three parts is a
different interpretation of the network’s congestion signals than
the usual one. The difference concerns both the amount of
congestion signaled in one round of feedback and the conges-
tion’s extent in time: DCTCP’s design is based on signaling
congestion immediately when a rather small queue builds up,
leaving smoothing to the sender. In contrast, today’s Internet
is based on smoothing in the network and then signaling only
when a severe queue has already built up. Based on the interest in DCTCP as well as other congestion management systems, standardization activity to change the semantics of Explicit Congestion Notification (ECN) is under way in the Internet Engineering Task Force (IETF) [3], [4]. This change
to congestion semantics allows DCTCP to provide low latency
even at large buffers, if all involved players respect these new
semantics. This is feasible given there has been no effective
ECN deployment on the public Internet.
While it is straightforward to coordinate senders and re-
ceivers [4] even in mixed environments such as the Internet,
network nodes need to cope with heterogeneous traffic. This
paper introduces a way to configure existing switches and
routers that allows DCTCP and non-DCTCP to share a queue.
The Active Queue Management (AQM) of this queue must support both congestion semantics: the fast and fine-grained feedback needed by DCTCP, and the more coarse-grained and slower signal needed by conventional Transmission Control Protocol (TCP) traffic. Since both types of traffic fill the shared queue, low delay and high throughput can only be achieved if the traffic is dominated by DCTCP. We aimed at finding configurations that allow an evolution from today's 100% conventional TCP traffic to DCTCP-dominated traffic. Network
operators may initially configure the dual AQM close to the
ideal for loss-based congestion control, but they can shift closer
to the ideal of low delay as the proportion of DCTCP traffic
grows.
So we target three main questions: First, how do Random
Early Detection (RED)-based AQM configurations influence
the sharing between Reno and DCTCP flows; can starvation
always be avoided; and can at least roughly equal sharing
be achieved? Second, for which Internet scenarios and con-
figurations do DCTCP users benefit in terms of low delay?
And third: what utilization can be achieved for the promising
configurations?
We evaluated these questions by simulations integrating
patched Linux kernels under constant conditions in simple
scenarios. Next, since this initial ‘gating test’ shows promise,
we plan to evaluate the idea in a wide range of more realistic
scenarios. In this paper we show the general feasibility of
our approach and give hints for how to evolve configuration
control assuming the benefits lead to an increasing proportion
of DCTCP flows in the future Internet.
We give an overview of DCTCP in Section II. In Section III we present modifications to the algorithm and implementation of DCTCP and introduce our proposed dual AQM scheme. In Section IV we present our results and show that stable operation with lower latency, including equal sharing if desired, is possible. We also derive rules for setting AQM parameters that achieve both lower latency and equal sharing.
II. OVERVIEW OF DCTCP
DCTCP is a combination of a congestion control scheme
and AQM scheme that is based on new semantics of congestion
notification using ECN signaling. DCTCP implements three
changes: a different reaction to congestion in the sender, a
specific RED configuration in the network nodes, and a more
accurate congestion feedback mechanism from the receiver to
the sender.
A. Simple Marking Scheme
The AQM scheme for DCTCP operation is deceptively simple: if the current instantaneous queue occupancy is larger than a certain threshold K, every arriving packet is marked. This mechanism can be implemented as a specific parameterization of RED [5]. RED probabilistically decides about the marking of arriving packets based on the average queue occupancy, calculated as a weighted moving average with the weighting factor w. Only above a minimum threshold (Min_Thresh) are arriving packets marked, with linearly increasing probability (as also displayed in Figure 3), reaching a maximum marking probability (Max_Prob) at the maximum threshold (Max_Thresh) and a probability of 1 above it. The proposed AQM scheme for DCTCP can be realized by using RED with Min_Thresh = Max_Thresh = K and w = 1.
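For concreteness, the following sketch (ours, with illustrative names and units, not taken from any router implementation) shows the per-packet RED marking decision and how the degenerate setting Min_Thresh = Max_Thresh = K with w = 1 reduces RED to DCTCP's instantaneous step marking:

    import random

    def red_mark(queue, state, min_thresh, max_thresh, max_prob, w):
        """One RED marking decision per arriving packet (illustrative sketch).

        state['avg'] holds the weighted moving average of the queue occupancy;
        w = 1 disables smoothing, so the instantaneous queue is used directly.
        """
        state['avg'] = (1 - w) * state['avg'] + w * queue
        avg = state['avg']
        if avg < min_thresh:
            return False                 # below Min_Thresh: never mark
        if avg >= max_thresh:
            return True                  # at or above Max_Thresh: always mark
        # linear ramp from 0 at Min_Thresh up to Max_Prob at Max_Thresh
        p = max_prob * (avg - min_thresh) / (max_thresh - min_thresh)
        return random.random() < p

    # DCTCP's step marking is the degenerate configuration: with
    # Min_Thresh = Max_Thresh = K and w = 1, every packet is marked as soon
    # as the instantaneous queue reaches K.
    K = 5                                # threshold in packets; illustrative
    state = {'avg': 0.0}
    red_mark(queue=7, state=state, min_thresh=K, max_thresh=K, max_prob=1.0, w=1.0)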
B. Congestion Control Algorithm
A TCP sender maintains a congestion window (cwnd), giving the allowed number of packets in flight during one Round Trip Time (RTT). With DCTCP, when an ECN-Echo arrives, cwnd is updated to reflect not just the existence of some congestion within a round trip, but the exact proportion of congestion marks. This is achieved according to the following equation

cwnd ← (1 − α/2) · cwnd    (1)

where α is the moving average of the fraction of marked packets in the last RTT. Its calculation is given by

α ← (1 − g) · α + g · F    (2)

where F is the fraction of marked packets in the last RTT and g is a weighting factor; [1] recommends setting g to 1/2⁴. α is updated once per RTT. This congestion control algorithm allows the sender to reduce its congestion window gently when the fraction of markings is low, whereas strong reductions are performed in case of a high degree of congestion.
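As a minimal sketch of equations (1) and (2) (our illustration, with hypothetical names, assuming one call per RTT):

    G = 1.0 / 2**4  # weighting factor g recommended in [1]

    class DctcpSender:
        """Per-RTT DCTCP window update following equations (1) and (2)."""

        def __init__(self, cwnd):
            self.cwnd = cwnd    # congestion window in packets
            self.alpha = 0.0    # moving average of the marked fraction

        def on_rtt_end(self, acked, marked):
            """Call once per RTT with the number of ACKed and ECN-marked packets."""
            F = marked / acked if acked else 0.0        # fraction marked this RTT
            self.alpha = (1 - G) * self.alpha + G * F   # equation (2)
            if marked:
                # equation (1): cut in proportion to the congestion extent;
                # alpha = 1 (every packet marked) halves cwnd, like Reno
                self.cwnd = max(1.0, (1 - self.alpha / 2) * self.cwnd)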
C. Enhanced ECN Feedback
ECN allows network nodes to notify end hosts of congestion by setting a flag in the Internet Protocol (IP) header (the Congestion Experienced (CE) codepoint), with no need to drop packets. A host receiving a CE-marked packet will send ECN-Echo (ECE) flags in every TCP acknowledgement until it receives a Congestion Window Reduced (CWR)-flagged TCP packet. With this mechanism only one congestion feedback signal can be sent per RTT, which is appropriate for conventional TCP congestion control but not for DCTCP. Thus DCTCP changes the ECN feedback mechanism: it aims to get exactly one ECN-Echo for each CE-marked packet. However, to be able to use delayed acknowledgements, Alizadeh et al. define in [1] a two-state machine for handling ECN feedback. Note that there is no negotiation, as DCTCP assumes that the receiver is DCTCP-enabled. For wider use in the Internet, the authors are standardizing a negotiation phase [3].
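The following sketch illustrates the idea of that two-state machine as described in [1]; the structure and names are ours, not the actual implementation. With delayed ACKs, an ACK is emitted immediately whenever the CE state of arriving packets changes, so each ACK's ECE flag covers a run of packets with the same marking:

    class DctcpReceiver:
        """Sketch of the two-state ECN feedback machine with delayed ACKs."""

        DELAYED_ACK_EVERY = 2   # ACK every second data packet, as usual

        def __init__(self):
            self.ce_state = False   # CE flag of the current run of packets
            self.pending = 0        # packets received but not yet acknowledged

        def on_data_packet(self, ce_marked):
            """Return the list of ACKs to emit for one arriving data packet."""
            acks = []
            if ce_marked != self.ce_state:
                # CE state changed: immediately acknowledge the previous run so
                # the sender can count marked and unmarked packets exactly
                if self.pending:
                    acks.append({'ece': self.ce_state, 'acked': self.pending})
                self.ce_state = ce_marked
                self.pending = 0
            self.pending += 1
            if self.pending >= self.DELAYED_ACK_EVERY:
                acks.append({'ece': self.ce_state, 'acked': self.pending})
                self.pending = 0
            return acks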
III. MODIFICATIONS
Our evaluation is based on a Linux patch provided by Stanford University [6], applied to Linux kernel version 3.2.18. In the initial phase of our investigations we observed unexpected and undesired behavior of that implementation, which we fixed by the minor modifications described below. Furthermore, we implemented two algorithmic modifications to provide a faster adaptation to the current congestion level. Finally, we present our dual AQM scheme in Subsection III-C.
A. Implementation in Linux
a) Finer resolution for the α value: In the provided implementation [6] the resolution of α was limited to a minimum value of 1/2¹⁰. Because of this, in our simulation scenario the congestion window converged to a fixed value in situations with very few ECN markings. For our investigations we changed the minimum resolution to 1/2²⁰. It should be noted that for large congestion windows and very low marking rates, an even finer resolution might be necessary.
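To see why the resolution matters, consider α kept in fixed-point arithmetic; the toy calculation below is our illustration of the effect (in the style of, but not identical to, the kernel code). With 10 fractional bits, a single mark in a window of 100 packets contributes nothing to α, while 20 fractional bits capture it:

    ALPHA_BITS_OLD, ALPHA_BITS_NEW = 10, 20   # fractional bits of alpha
    G_SHIFT = 4                               # g = 1/2^4

    def update_alpha_fixed(alpha_fx, marked, acked, bits):
        """Equation (2) in fixed point: alpha += g * (F - alpha)."""
        F_fx = (marked << bits) // acked          # F scaled to fixed point
        return alpha_fx - (alpha_fx >> G_SHIFT) + (F_fx >> G_SHIFT)

    # One mark in a window of 100 packets (F = 0.01):
    assert update_alpha_fixed(0, 1, 100, ALPHA_BITS_OLD) == 0    # the mark is lost
    assert update_alpha_fixed(0, 1, 100, ALPHA_BITS_NEW) == 655  # alpha ~ 0.000625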
Setting of the Slow Start threshold: In the provided DCTCP implementation the Slow Start threshold (ssthresh) is incorrectly set to the current cwnd value after a reduction. In our implementation we instead correctly reset ssthresh to cwnd − 1. With the original patch, a DCTCP sender was in Slow Start (cwnd <= ssthresh) after each decrease and thus immediately increased its window (by one packet) on arrival of the first ACK, only then leaving Slow Start and correctly entering Congestion Avoidance. As with DCTCP not every window recalculation causes a window reduction, this error caused a noticeable non-linear increase.
b) Allow the congestion window to grow in CWR state: While the Linux congestion control implementation in general does not allow any further window increase during roughly one RTT after the reception of a congestion signal, this does not seem to be appropriate for DCTCP. Thus in our implementation we allow the congestion window to grow even during this so-called CWR state. Moreover, if no reduction was performed, we do not reset snd_cwnd_cnt, which maintains when to increase the window next, in order to preserve the linear increase behavior.
Fig. 1. Congestion window of single flows (congestion window in MSS over time in seconds, for DCTCP, modified DCTCP, and TCP Reno)
Fig. 2. Congestion window and mark/drop events for one TCP Reno and one DCTCP host sharing an accordingly configured queue (congestion window in MSS over time in seconds, with TCP Reno drops and DCTCP markings indicated)
B. Algorithmic Modifications
c) Continuous update of α: As mentioned in Section II-B, α is updated only once per RTT. With such a periodic update scheme, α might not capture the maximum congestion level and, even worse, might still reflect an old value when the congestion window reduction is performed. To avoid this, we update α on the reception of each acknowledgement. It must be mentioned that with this modification the weighting factor g must also be chosen differently, because α is recalculated more often. Therefore we set g to 1/2⁸ instead of 1/2⁴ to compensate for this effect, making the behavior similar to the original DCTCP patch in our rather static evaluation scenarios. However, the right choice of g depends on the absolute number of markings, and thus on the number of recalculations performed, and therefore actually depends on the current number of packets in flight. This dependency could be compensated by normalizing the fraction of marked packets F with the current number of packets in flight, or simply with the current congestion window value.
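A sketch of this modification (ours, with illustrative names), applying the smaller weighting factor on every acknowledgement:

    G_PER_ACK = 1.0 / 2**8   # smaller g compensates for per-ACK recalculation

    def update_alpha_on_ack(state, acked_pkts, marked_pkts):
        """Update alpha on every acknowledgement instead of once per RTT.

        Normalizing F by the packets in flight (or by cwnd), as discussed
        above, would remove the remaining dependency on the number of
        recalculations per RTT.
        """
        F = marked_pkts / acked_pkts if acked_pkts else 0.0
        state['alpha'] = (1 - G_PER_ACK) * state['alpha'] + G_PER_ACK * F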
d) Progressive congestion window reduction: In the original implementation the congestion window is recalculated as soon as the CWR state is entered. But, as explained above, the actual congestion level would need to be determined over the following RTT, in which further congestion signals are expected to be received. We cannot wait one whole RTT to perform any window reduction, as this would cause further unnecessary congestion. Thus we decrease the congestion window progressively on reception of each ECN-Echo. For each recalculation we use the congestion window value cwnd_max from the start of the CWR state, and we reset the congestion window only if the resulting value is lower than the current value.
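A sketch of the progressive reduction (ours, illustrative names): every ECN-Echo recalculates the target from the window value captured when the CWR state was entered, and the window is only ever reduced within that round:

    def on_ecn_echo(state):
        """Progressive window reduction within one CWR round (sketch).

        state['cwnd_max'] is the congestion window captured on entering the
        CWR state; state['alpha'] is kept up to date on every acknowledgement.
        """
        target = (1 - state['alpha'] / 2) * state['cwnd_max']
        if target < state['cwnd']:
            state['cwnd'] = target   # only ever shrink within the round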
Figure 1 shows the congestion window of one DCTCP flow, either using the original patch or our modification, in comparison to one TCP Reno flow. It can be seen that after the Slow Start phase our implementation adapts faster, but otherwise the behavior is similar, as desired.
Fig. 3. Packet mark probability calculation
C. Dual AQM Scheme
The packets of DCTCP and other TCP flows need to be handled differently by the AQM scheme of the bottleneck network node, according to the different congestion signal semantics. We propose an AQM scheme based on one shared queue, but applying two differently parameterized instances of the RED algorithm, one for non-ECN traffic and one for ECN traffic. Our scheme classifies the traffic based on ECN capability, so packets are respectively marked or dropped to notify of congestion. This approach would probably result in low throughput for ECN-enabled end-systems that still use conventional congestion control such as Reno. However, given that this ECN standard was defined in 2001 and has hardly seen any active use, this is unlikely to be an important factor.
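A compact sketch of the dual AQM (ours; parameter names and packet fields are illustrative), reusing the RED logic of Section II-A with two parameter sets selected by the packet's ECN capability:

    import random

    def red_decision(queue, state, min_thresh, max_thresh, max_prob, w):
        """Same RED logic as the Section II-A sketch; True means signal."""
        state['avg'] = (1 - w) * state['avg'] + w * queue
        if state['avg'] < min_thresh:
            return False
        if state['avg'] >= max_thresh:
            return True
        p = max_prob * (state['avg'] - min_thresh) / (max_thresh - min_thresh)
        return random.random() < p

    def dual_aqm_enqueue(pkt, queue_len, states, cfg_ecn, cfg_nonecn):
        """One shared queue, two RED instances selected by ECN capability."""
        if pkt['ect']:
            # ECN-capable traffic (assumed DCTCP): mark instead of dropping,
            # using the immediate, fine-grained parameterization
            if red_decision(queue_len, states['ecn'], **cfg_ecn):
                pkt['ce'] = True
            return True                   # the packet is enqueued either way
        # non-ECN traffic (conventional TCP): smoothed RED that drops to signal
        return not red_decision(queue_len, states['nonecn'], **cfg_nonecn)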
Instead we propose to standardize an ECN signal that indicates congestion immediately, allowing the end hosts to distinguish between smoothed and immediate congestion notification. DCTCP together with more accurate ECN feedback, as already under standardization [3], [4], could be re-implemented and turned on after ECN capability negotiation with the server. The much greater performance benefits of DCTCP could then incentivize OS developers to deploy DCTCP with ECN turned on by default.
Figure 2 shows an example of the resulting congestion signals, along with the congestion windows of one Reno and one DCTCP flow equally sharing the bandwidth. The parameter set used is derived from our investigations described later, in the second part of Section IV-C.
IV. PRELIMINARY EVALUATION
In this evaluation we investigated DCTCP with the proposed dual AQM scheme in a simplified scenario to show feasibility. Our parameter study shows that a large range of configurations can be used to achieve different operating points in link utilization, queue occupancy (and thus latency) and bandwidth sharing between multiple flows. We investigated two approaches to RED parameterization for the DCTCP traffic, as illustrated in Figure 3: i) (left) a degenerate configuration with Min_Thresh_DCTCP = Max_Thresh_DCTCP = K, creating a simple marking threshold as originally proposed for DCTCP, or ii) (right) Min_Thresh_DCTCP < Max_Thresh_DCTCP, as in standard RED configurations and as described in Section III-C; i.e. either a marking threshold K or a marking slope. The selected parameterization covers only a limited range of the large parameter set but presents the two most interesting cases. Other scenarios need to be investigated before applying our approach to the Internet, to cover corner cases.

Fig. 4. Simulation scenario (only forward direction)
A. Simulation Environment and Scenario
We evaluated our approach based on simulations using the
IKR SimLib [7], an event-driven simulation library with an
extension to integrate virtual machines [8] running a Linux
kernel with our modified DCTCP implementation.
As shown in Figure 4, the simulation scenario consists of four hosts, two senders and two receivers, connected by a shared 10 Mbps link with a single bottleneck queue, plus the corresponding return path, with an RTT of 25 ms, resulting in a Bandwidth Delay Product (BDP) of 31250 bytes. One pair of hosts uses DCTCP with ECN, while the other uses TCP Reno without ECN support. Each sender has data to send at all times and uses one or more long-lived connections. The queue implements the dual AQM scheme described in Section III-C.
To limit the parameter space, we fixed most of the RED parameters for non-ECN traffic to the values recommended in [9], using a weighting factor w_Reno of 0.002 and setting Max_Prob_Reno to 0.1. Moreover, we configured the maximum threshold to three times the minimum threshold: Max_Thresh_Reno = 3 · Min_Thresh_Reno. Min_Thresh_Reno is the parameter we vary, because it determines the queuing-induced latency (if non-DCTCP traffic exists).
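For concreteness, the fixed part of this configuration can be written as follows (a sketch in the conventions of the earlier snippets; only Min_Thresh_Reno varies):

    BDP = 31250  # bytes, from the 10 Mbps / 25 ms scenario above

    def reno_red_config(min_thresh_bytes):
        """RED parameters for the non-ECN traffic, fixed as described above."""
        return {
            'min_thresh': min_thresh_bytes,       # the one parameter we vary
            'max_thresh': 3 * min_thresh_bytes,   # Max_Thresh_Reno = 3 Min_Thresh_Reno
            'max_prob': 0.1,                      # Max_Prob_Reno, from [9]
            'w': 0.002,                           # weighting factor w_Reno, from [9]
        }

    cfg_nonecn = reno_red_config(min_thresh_bytes=BDP // 8)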
B. Using a Marking Threshold
For this approach both DCTCP thresholds are set to the same value K, and smoothing is turned off by setting w_DCTCP = 1, as originally proposed for DCTCP. We vary the step threshold K in relation to Min_Thresh_Reno for several values of Min_Thresh_Reno smaller than or equal to the BDP. Note that the buffer size needed by one Reno flow to fully utilize the link is one BDP. K must be between the minimum and maximum threshold of the Reno traffic to avoid one flow getting almost all of the capacity; thus K/Min_Thresh_Reno is varied between 1 and 3.
TABLE I. Min_Thresh_Reno (M, in BDPs), K/M at maximum fairness, and link utilization (U) and queue occupancy (O) for one DCTCP flow competing with one Reno flow (DR) and, for comparison, for two Reno flows (2R)

M      K/M    U_DR    O_DR    U_2R    O_2R
1/8    2.3    0.98    0.207   0.946   0.216
1/4    1.73   0.987   0.335   0.968   0.336
1/2    1.34   0.996   0.548   0.986   0.524
1/√2   1.26   0.999   0.724   0.995   0.679
1      1.16   1.000   0.99    0.999   0.914
Fig. 5. Results when using marking threshold (throughput ratio Reno/DCTCP, link utilization, and queue occupancy in BDPs, each over K/Min_Thresh_Reno from 1 to 3, for Min_Thresh_Reno of BDP/8, BDP/4, BDP/2, BDP/√2, and BDP)
We aim at equal shares of DCTCP and non-DCTCP at high utilization and lower delay, so in Figure 5 we plot the ratio between Reno and DCTCP throughput, the utilization of the bottleneck link and the average queue occupancy. The steps that can be seen for the smaller thresholds, most prominently for Min_Thresh_Reno = BDP/8, are a consequence of the fixed Maximum Transmission Unit (MTU) and the small marking threshold: if all packets are 1500 bytes, it makes no difference whether the marking threshold is 7812 B = 2 · BDP/8 or 8203 B = 2.1 · BDP/8; both can hold 5 packets without marking.
As can be seen in the throughput plot of Figure 5, for any Min_Thresh_Reno there is a K that results in equal bandwidth sharing (dotted line) between the DCTCP and the Reno flow. Table I lists these values and the respective utilization and queue occupancy for either one DCTCP and one Reno flow competing or, for comparison, two Reno flows only. Especially when the minimum threshold is chosen as low as BDP/8, it can be seen that DCTCP increases the utilization (by up to 4%) while the average queue remains about the same.

These results show that with this configuration scheme a lower delay can be traded for an only slightly lower utilization while sharing equally with Reno flows. More specifically, the queue occupancy can be reduced by a factor of four while maintaining equal bandwidth sharing and high (98%) utilization.
Several DCTCP flows: We also investigated the effect of an increased proportion of DCTCP traffic on the utilization, for a Min_Thresh_Reno as low as BDP/8. For that purpose, we ran experiments where one Reno flow competes with N DCTCP flows. In Figure 6 the throughput ratio between Reno and the average of the N DCTCP flows is shown for N = 1…5, along with the corresponding utilization and queue occupancy. As expected, the utilization increases with the number of DCTCP flows. Equal sharing can only be achieved for
settings with large values of K/Min_Thresh_Reno, which also increase the average queue occupancy. Thus, with a larger proportion of DCTCP in a (future) traffic mix, the AQM thresholds could be lowered while maintaining high utilization.

Fig. 6. Results with multiple DCTCP flows for Min_Thresh_Reno = BDP/8 (throughput ratio Reno / (DCTCP/N) for N = 1…5, utilization, and queue occupancy in BDPs, over K/Min_Thresh_Reno)

Fig. 7. Results when using marking slope (throughput ratio Reno/DCTCP and queue occupancy in BDPs over Max_Prob_DCTCP, for Min_Thresh_DCTCP of BDP/8, BDP/4, BDP/2, BDP/√2, and BDP, with a TCP Reno reference)
C. Using a Marking Slope
The alternative parameterization for the DCTCP traffic uses a conventional RED configuration, forming a slope of increasing marking probability depending on the average queue occupancy (in contrast to using a step function of the instantaneous queue length). We also studied the influence of the weighting factor w_DCTCP. As expected for a non-dynamic scenario with just long-running flows, we found that it has only minor influence on bandwidth sharing, and thus chose the same value for w_DCTCP as for w_Reno, 0.002. For this study we fixed Min_Thresh_Reno to BDP, resulting in Max_Thresh_Reno = 3 BDP. We investigate values for Min_Thresh_DCTCP smaller than BDP and shifted Max_Thresh_DCTCP to Min_Thresh_DCTCP + 2 BDP to keep the same distance between the thresholds and thus the same slope, which again is a simplification to narrow our parameter set.
Fig. 8. Jain's fairness index over Min_Thresh_DCTCP (in BDPs) and Max_Prob_DCTCP

Fig. 9. Maximum fairness: Min_Thresh_Reno / Min_Thresh_DCTCP over Max_Prob_DCTCP / Max_Prob_Reno; the experimental points closely follow y = x/(x − 1)
Figure 7 shows the throughput ratio between Reno and DCTCP and the queue occupancy when varying Max_Prob_DCTCP from 0 to 1. The plot shows that equal sharing is possible for many parameterizations, except for Min_Thresh_DCTCP = BDP. This is expected, since both flows then get the same feedback rate, but Reno reacts by halving its sending rate while DCTCP usually decreases less (depending on the number of markings). Figure 8 shows a 3-dimensional plot of Jain's fairness index [10] depending on Min_Thresh_DCTCP and Max_Prob_DCTCP. The highlighted ridge marks a fairness index of one. Figure 9 shows the parameter combinations of maximum fairness. The function y = x/(x − 1) is overlaid to illustrate that it fits quite closely to the measured maximum-fairness configurations. These results suggest that we can achieve equal sharing for a parameterization according to the following rule:

Min_Thresh_DCTCP / Min_Thresh_Reno = (Max_Prob_DCTCP / Max_Prob_Reno − 1) / (Max_Prob_DCTCP / Max_Prob_Reno)    (3)

That means that for a given Reno configuration, there is just one free parameter left to choose: Min_Thresh_DCTCP or Max_Prob_DCTCP.
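As a worked check of rule (3) (our illustration): with Max_Prob_Reno = 0.1 as fixed above, choosing Max_Prob_DCTCP = 0.2 gives a probability ratio of 2, and therefore Min_Thresh_DCTCP = Min_Thresh_Reno / 2:

    def min_thresh_dctcp(min_thresh_reno, max_prob_dctcp, max_prob_reno):
        """Equal-sharing rule (3): derive the DCTCP threshold from the ratio."""
        x = max_prob_dctcp / max_prob_reno
        return min_thresh_reno * (x - 1) / x

    # With Max_Prob_Reno = 0.1 and Max_Prob_DCTCP = 0.2 the ratio is x = 2,
    # so Min_Thresh_DCTCP is half of Min_Thresh_Reno; this is exactly the
    # configuration used in the equal-sharing evaluation below.
    assert min_thresh_dctcp(1.0, 0.2, 0.1) == 0.5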
Equal Sharing Configurations: Since this formula provides configurations that implement (about) equal sharing, we scale the maximum marking probabilities with the minimum threshold and alter only the minimum threshold for Reno traffic in this evaluation. That means we set Max_Prob_Reno to 0.1/(Min_Thresh_Reno/BDP) and Max_Prob_DCTCP to 0.2/(Min_Thresh_Reno/BDP). We show results for Min_Thresh_DCTCP = 1/2 · Min_Thresh_Reno.

Fig. 10. Equal sharing: Jain's fairness index, utilization, and queue occupancy (in BDPs) over Min_Thresh_Reno (in BDPs)
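Putting the scaling together, the whole equal-sharing family can be generated from the single free parameter Min_Thresh_Reno (a sketch under the assumptions just stated):

    def equal_sharing_config(min_thresh_reno_bdp):
        """Equal-sharing configuration family of Section IV-C (illustrative).

        min_thresh_reno_bdp is Min_Thresh_Reno expressed in BDPs; keeping the
        ratio Max_Prob_DCTCP / Max_Prob_Reno = 2 satisfies rule (3) with
        Min_Thresh_DCTCP = Min_Thresh_Reno / 2 for every threshold setting.
        """
        return {
            'max_prob_reno': 0.1 / min_thresh_reno_bdp,
            'max_prob_dctcp': 0.2 / min_thresh_reno_bdp,
            'min_thresh_dctcp_bdp': 0.5 * min_thresh_reno_bdp,
        }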
Figure 10 shows Jain's fairness index, utilization and queue occupancy. As can be seen, such configurations achieve almost maximum fairness in terms of equal sharing, and the queue occupancy depends roughly linearly on Min_Thresh_Reno. The achieved utilization is close to 100% with a Min_Thresh_Reno of only 0.3 BDP. For the very simple scenario considered, this finding allows a trade-off to be defined between delay and utilization while keeping the sharing equal and thus being "TCP-friendly".
V. CONCLUSIONS
In this paper we propose a new dual AQM scheme that
can be implemented based on Weighted RED (WRED) to
incrementally deploy DCTCP with its different congestion
semantics in the public Internet. Today ECN sees only minimal deployment, but standardization activities are under way to re-define the congestion feedback mechanism and its semantics. We argue that a classification based solely on the ECN capability of the traffic provides an opportunity for actual DCTCP deployment in the Internet. Therefore, we evaluated the possibility of concurrent usage of DCTCP alongside conventional TCP congestion control. We evaluated two RED
configurations for providing ECN-based feedback for DCTCP
traffic: i) a marking threshold Kas proposed by the original
DCTCP approach and ii) a marking slope as in standard
RED configurations. We showed that both approaches can
be configured for stable operation, where the proportions of
DCTCP and Reno traffic converge to a certain ratio or even
to an equal rate, if desired. Moreover, we found a formula for
RED parameters that always results in equal sharing between
DCTCP and non-DCTCP. This relation allows high utilization
to be traded off against low delay. We showed that, even with
the minimum threshold set very low to maintain low latency,
utilization increases with a larger fraction of DCTCP traffic.
This study is only a first step, showing that the proposed way to deploy DCTCP in the Internet would at least give a reasonable share of capacity to long-running flows, while still reducing latency and maintaining high utilization. Further evaluation is needed using scenarios with all kinds of traffic models, e.g. with more flows, not only long-running ones, and with different shares of DCTCP and conventional TCP flows. We are also interested in a wider parameter study focusing on scenarios with ECN marking based on the instantaneous queue length only, as DCTCP already implements smoothing itself. We expect further advantages from DCTCP's reaction to congestion when flows with very small and very large RTTs share the same bottleneck. We also need to show that endpoints and network nodes with the new semantics can safely coexist with any legacy ECN endpoints or network nodes, in case these are deployed without an update.
The proposed way to deploy DCTCP in the Internet re-
quires instantaneous and more accurate ECN feedback. Today
ECN is defined as a “drop equivalent” and therefore provides
only small performance gains and consequently has not seen
wide deployment. With a change in semantics, ECN could
be used as an enabler for new low latency services also
implementing a different response to congestion, similar to
DCTCP.
Apart from a more accurate ECN signal, for which a proposal by the authors has already been adopted onto the IETF's agenda, we also see a need to standardize a change to the semantics of ECN to provide immediate congestion information without any further delay in the network. This work provides further input on the needs of a future, immediate, and therefore more beneficial ECN-based congestion control loop, and proposes an approach for how congestion control could react to such a signal.
VI. ACKNOWLEDGMENTS
This work was performed while the first author was still with IKR, University of Stuttgart, Germany. This work is
part-funded by the European Community under its Seventh
Framework Programme through the ETICS project and the
Reducing Internet Transport Latency (RITE) project (ICT-
317700). The views expressed here are solely those of the
authors.
REFERENCES
[1] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, "DCTCP: Efficient packet transport for the commoditized data center," Microsoft Research publications, January 2010.
[2] M. Alizadeh, A. Javanmard, and B. Prabhakar, "Analysis of DCTCP: stability, convergence, and fairness," in SIGMETRICS, ACM, 2011.
[3] B. Briscoe, R. Scheffenegger, and M. Kuehlewind, "More Accurate ECN Feedback in TCP: draft-kuehlewind-tcpm-accurate-ecn-03," IETF, Internet-Draft, Jul. 2014 (Work in Progress).
[4] M. Kühlewind, R. Scheffenegger, and B. Briscoe, "Problem Statement and Requirements for a More Accurate ECN Feedback: draft-ietf-tcpm-accecn-reqs-06," IETF, Internet-Draft, Jul. 2014 (Work in Progress).
[5] S. Floyd and V. Jacobson, "Random Early Detection gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, pp. 397–413, Aug. 1993.
[6] "DCTCP patch for Linux 3.2," https://github.com/mininet/mininet-tests/blob/master/dctcp/0001-Updated-DCTCP-patch-for-3.2-kernels.patch, 2014.
[7] "IKR Simulation Library," http://www.ikr.uni-stuttgart.de/Content/IKRSimLib/, 2014.
[8] T. Werthmann, M. Kaschub, M. Kühlewind, S. Scholz, and D. Wagner, "VMSimInt: A Network Simulation Tool Supporting Integration of Arbitrary Kernels and Applications," in Proceedings of the 7th ICST Conference on Simulation Tools and Techniques (SIMUTools), 2014.
[9] S. Floyd, "RED: Discussions of setting parameters," http://www.icir.org/floyd/REDparameters.txt, November 1997.
[10] R. Jain, D. Chiu, and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared computer systems," DEC Research Report TR-301, September 1984.