Internet Congestion Control Research Group N. Cardwell
Internet-Draft Y. Cheng
Intended status: Experimental S. Hassas Yeganeh
Expires: January 4, 2018 V. Jacobson
Google, Inc
July 03, 2017
BBR Congestion Control
draft-cardwell-iccrg-bbr-congestion-control-00
Abstract
This document specifies the BBR congestion control algorithm. BBR
uses recent measurements of a transport connection's delivery rate
and round-trip time to build an explicit model that includes both the
maximum recent bandwidth available to that connection, and its
minimum recent round-trip delay. BBR then uses this model to control
both how fast it sends data and the maximum amount of data it allows
in flight in the network at any time. Relative to loss-based
congestion control algorithms such as Reno [RFC5681] or CUBIC
[draft-ietf-tcpm-cubic], BBR offers substantially higher throughput
for bottlenecks with shallow buffers or random losses, and
substantially lower queueing delays for bottlenecks with deep buffers
(avoiding "bufferbloat"). This algorithm can be implemented in any
transport protocol that supports packet-delivery acknowledgment (thus
far, open source implementations are available for TCP [RFC793] and
QUIC [draft-ietf-quic-transport-00]).
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 4, 2018.
Cardwell, et al. Expires January 4, 2018 [Page 1]
Internet-Draft BBR July 2017
Copyright Notice
Copyright (c) 2017 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Design Overview . . . . . . . . . . . . . . . . . . . . . . . 6
3.1. Network Path Model . . . . . . . . . . . . . . . . . . . 6
3.2. Target Operating Point . . . . . . . . . . . . . . . . . 6
3.3. Control Parameters . . . . . . . . . . . . . . . . . . . 6
3.4. State Machine Design Overview . . . . . . . . . . . . . . 7
3.4.1. State Transition Diagram . . . . . . . . . . . . . . 7
3.5. Algorithm Organization . . . . . . . . . . . . . . . . . 7
3.5.1. Initialization . . . . . . . . . . . . . . . . . . . 8
3.5.2. Per-ACK Steps . . . . . . . . . . . . . . . . . . . . 8
3.5.3. Per-Transmit Steps . . . . . . . . . . . . . . . . . 8
3.6. Environment and Usage . . . . . . . . . . . . . . . . . . 8
4. Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . 9
4.1. Maintaining the Network Path Model . . . . . . . . . . . 9
4.1.1. BBR.BtlBw . . . . . . . . . . . . . . . . . . . . . . 9
4.1.2. BBR.RTprop . . . . . . . . . . . . . . . . . . . . . 12
4.2. BBR Control Parameters . . . . . . . . . . . . . . . . . 14
4.2.1. Pacing Rate . . . . . . . . . . . . . . . . . . . . . 14
4.2.2. Send Quantum . . . . . . . . . . . . . . . . . . . . 15
4.2.3. Congestion Window . . . . . . . . . . . . . . . . . . 16
4.3. State Machine . . . . . . . . . . . . . . . . . . . . . . 20
4.3.1. Initialization Steps . . . . . . . . . . . . . . . . 21
4.3.2. Startup . . . . . . . . . . . . . . . . . . . . . . . 22
4.3.3. Drain . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3.4. ProbeBW . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.5. ProbeRTT . . . . . . . . . . . . . . . . . . . . . . 27
5. Implementation Status . . . . . . . . . . . . . . . . . . . . 29
6. Security Considerations . . . . . . . . . . . . . . . . . . . 30
7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 30
8. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 30
Cardwell, et al. Expires January 4, 2018 [Page 2]
Internet-Draft BBR July 2017
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 31
9.1. Normative References . . . . . . . . . . . . . . . . . . 31
9.2. Informative References . . . . . . . . . . . . . . . . . 31
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 33
1. Introduction
The Internet has largely used loss-based congestion control
algorithms like Reno ([Jac88], [Jac90], [WS95] [RFC5681]) and CUBIC
([HRX08], [draft-ietf-tcpm-cubic]), which assume that packet loss is
equivalent to congestion. These algorithms worked well for many
years, not due to first principles but rather because Internet
switches' and routers' queues were generally well-matched to the
bandwidth of Internet links. As a result, queues tended to fill up
and drop excess packets close to the moment when senders had really
begun sending data too fast.
But packet loss is not equivalent to congestion. Congestion can be
thought of as a condition that occurs when a network path operates on
a sustained basis with more data in flight than the bandwidth-delay
product (BDP) of the path. As the Internet has evolved, loss-based
congestion control is increasingly problematic because packet loss
increasingly happens at moments that are divorced from the onset of
congestion:
1. Shallow buffers: in shallow buffers, packet loss happens before
congestion. With today's high-speed, long-haul links employing
commodity switches with shallow buffers, loss-based congestion
control can result in abysmal throughput because it overreacts,
multiplicatively decreasing the sending rate upon packet loss,
even if the packet loss comes from transient traffic bursts (this
kind of packet loss can be quite frequent even when the link is
mostly idle). Because of this dynamic, it is difficult to
achieve full utilization with loss-based congestion control in
practice: sustaining 10Gbps over 100ms RTT needs a packet loss
rate below 0.000003%, and over a 100ms RTT path a more feasible
loss rate like 1% can only sustain at most 3 Mbps (no matter what
the bottleneck link is capable of) [draft-ietf-tcpm-cubic].
2. Deep buffers: at bottleneck links with deep buffers, congestion
happens before packet loss. At the edge of today's Internet,
loss-based congestion control causes the "bufferbloat" problem,
by repeatedly filling the deep buffers in many last-mile links
and causing up to seconds of needless queuing delay.
The BBR congestion control algorithm takes a different approach:
rather than assuming that packet loss is equivalent to congestion,
Cardwell, et al. Expires January 4, 2018 [Page 3]
Internet-Draft BBR July 2017
BBR builds a model of the network path in order to avoid and respond
to actual congestion.
The BBR algorithm has been described previously at a high level
[CCGHJ16][CCGHJ17], and active work on the BBR algorithm is
continuing. This document describes the current BBR algorithm in
detail.
This document is organized as follows. Section 2 provides various
definitions that will be used throughout this document. Section 3
provides an overview of the design of the BBR algorithm, and section
4 describes the BBR algorithm in detail, including BBR's network path
model, control parameters, and state machine. Section 5 describes
the implementation status, section 6 describes security
considerations, section 7 notes that there are no IANA
considerations, and section 8 closes with Acknowledgments.
2. Terminology
This document defines the following state variables and constants for
the BBR algorithm:
BBR.pacing_rate: The current pacing rate for a BBR flow, which
controls inter-packet spacing.
BBR.send_quantum: The maximum size of a data aggregate scheduled and
transmitted together.
cwnd: The transport sender's congestion window, which limits the
amount of data in flight.
BBR.BtlBw: BBR's estimated bottleneck bandwidth available to the
transport flow, estimated from the maximum delivery rate sample in a
sliding window.
BBR.BtlBwFilter: The max filter used to estimate BBR.BtlBw.
BtlBwFilterLen: A constant specifying the length of the BBR.BtlBw max
filter window for BBR.BtlBwFilter, BtlBwFilterLen is 10 packet-timed
round trips.
BBR.RTprop: BBR's estimated two-way round-trip propagation delay of
the path, estimated from the windowed minimum recent round-trip delay
sample.
BBR.rtprop_stamp: The wall clock time at which the current BBR.RTProp
sample was obtained.
Cardwell, et al. Expires January 4, 2018 [Page 4]
Internet-Draft BBR July 2017
BBR.rtprop_expired: A boolean recording whether the BBR.RTprop has
expired and is due for a refresh with an application idle period or a
transition into ProbeRTT state.
RTpropFilterLen: A constant specifying the length of the RTProp min
filter window, RTpropFilterLen is 10 secs.
BBR.pacing_gain: The dynamic gain factor used to scale BBR.BtlBw to
produce BBR.pacing_rate.
BBR.cwnd_gain: The dynamic gain factor used to scale the estimated
BDP to produce a congestion window (cwnd).
BBRHighGain: A constant specifying the minimum gain value that will
allow the sending rate to double each round (2/ln(2) ~= 2.89), used
in Startup mode for both BBR.pacing_gain and BBR.cwnd_gain.
BBR.filled_pipe: A boolean that records whether BBR estimates that it
has ever fully utilized its available bandwidth ("filled the pipe").
BBRMinPipeCwnd: The minimal cwnd value BBR tries to target using: 4
packets, or 4 * SMSS
BBR.round_count: Count of packet-timed round trips.
BBR.round_start: A boolean that BBR sets to true once per packet-
timed round trip, on ACKs that advance BBR.round_count.
BBR.next_round_delivered: packet.delivered value denoting the end of
a packet-timed round trip.
BBRGainCycleLen: the number of phases in the BBR ProbeBW gain cycle:
8.
ProbeRTTInterval: A constant specifying the minimum time interval
between ProbeRTT states: 10 secs.
ProbeRTTDuration: A constant specifying the minimum duration for
which ProbeRTT state holds inflight to BBRMinPipeCwnd or fewer
packets: 200 ms.
SMSS: The Sender Maximum Segment Size.
The variables starting with C, P, or rs are defined in
[draft-cheng-iccrg-delivery-rate-estimation].
Cardwell, et al. Expires January 4, 2018 [Page 5]
Internet-Draft BBR July 2017
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
3. Design Overview
3.1. Network Path Model
BBR is a model-based congestion control algorithm: its behavior is
based on an explicit model of the network path over which a transport
flow travels. BBR's model includes explicit estimates of two
parameters:
1. BBR.BtlBw: the estimated bottleneck bandwidth available to the
transport flow, estimated from the maximum delivery rate sample
from a moving window.
2. BBR.RTprop: the estimated two-way round-trip propagation delay of
the path, estimated from the the minimum round-trip delay sample
from a moving window.
3.2. Target Operating Point
BBR uses its model to seek an operating point with high throughput
and low delay. To operate near the optimal operating point, the
point with maximum throughput and minimum delay [K79] [GK81], the
system needs to maintain two conditions:
1. Rate balance: the bottleneck packet arrival rate equals the
bottleneck bandwidth available to the transport flow.
2. Full pipe: the total data in in fight along the path is equal to
the BDP.
BBR uses its model to maintain an estimate of the optimal operating
point for the connection. To aim for rate balance BBR paces packets
at or near BBR.BtlBw, and to aim for a full pipe it modulates the
pacing rate to maintain an amount of data in flight near the
estimated bandwidth-delay product (BDP) of the path, or BBR.BtlBw *
BBR.RTprop.
3.3. Control Parameters
BBR uses its model to control the connection's sending behavior, to
keep it near the target operating point. Rather than using a single
control parameter, like the cwnd parameter that limits the volume of
in-flight data in the Reno and CUBIC congestion control algorithms,
BBR uses three distinct control parameters:
Cardwell, et al. Expires January 4, 2018 [Page 6]
Internet-Draft BBR July 2017
1. pacing rate: the rate at which BBR sends data.
2. send quantum: the maximum size of any aggregate that the
transport sender implementation may need to transmit in order to
amortize per-packet transmission overheads.
3. cwnd: the maximum volume of data BBR allows in-flight in the
network at any time.
3.4. State Machine Design Overview
BBR varies its three control parameters with a simple state machine.
The state machine aims for high throughput, low latency, and an
approximately fair sharing of bandwidth by alternating sequentially
between probing first for BBR.BtlBw and then for BBR.RTprop.
A BBR flow starts in the Startup state, and ramps up its sending rate
quickly. When it estimates the pipe is full, it enters the Drain
state to drain the queue. In steady state a BBR flow only uses the
ProbeBW state, to periodically briefly raise inflight to probe for
higher BBR.BtlBw samples, and (if needed) the ProbeRTT state, to
briefly lower inflight to probe for lower BBR.RTprop samples.
3.4.1. State Transition Diagram
The following state transition diagram summarizes the flow of control
and the relationship between the different states:
|
V
+---> Startup ----+
| | |
| V |
| Drain ----+
| | |
| V |
+---> ProbeBW -----+
| ^ | |
| | | |
| +----+ |
| |
+---- ProbeRTT <---+
3.5. Algorithm Organization
The BBR algorithm executes steps upon connection initialization, upon
each ACK, and upon each transmission. All of the sub-steps invoked
referenced below are described below.
Cardwell, et al. Expires January 4, 2018 [Page 7]
Internet-Draft BBR July 2017
3.5.1. Initialization
Upon transport connection initialization, BBR executes its
initialization steps:
BBROnConnectionInit():
BBRInit()
3.5.2. Per-ACK Steps
On every ACK, the BBR algorithm executes the following
BBRUpdateOnACK() steps in order to update its network path model,
update its state machine, and adjust its control parameters to adapt
to the updated model:
BBRUpdateOnACK():
BBRUpdateModelAndState()
BBRUpdateControlParameters()
BBRUpdateModelAndState():
BBRUpdateBtlBw()
BBRCheckCyclePhase()
BBRCheckFullPipe()
BBRCheckDrain()
BBRUpdateRTprop()
BBRCheckProbeRTT()
BBRUpdateControlParameters()
BBRSetPacingRate()
BBRSetSendQuantum()
BBRSetCwnd()
3.5.3. Per-Transmit Steps
When transmitting, BBR merely needs to check for the case where the
flow is restarting from idle:
BBROnTransmit():
BBRHandleRestartFromIdle()
3.6. Environment and Usage
BBR is a congestion control algorithm that is agnostic to transport-
layer and link-layer technologies, requires only sender-side changes,
and does not require changes in the network. Open source
implementations of BBR are available for the TCP [RFC793] and QUIC
[draft-ietf-quic-transport-00] transport protocols, and these
Cardwell, et al. Expires January 4, 2018 [Page 8]
Internet-Draft BBR July 2017
implementations have been used in production for a large volume of
Internet traffic.
4. Detailed Algorithm
4.1. Maintaining the Network Path Model
As noted above, BBR is a model-based congestion control algorithm: it
is based on an explicit model of the network path over which a
transport flow travels. This model includes two estimated
parameters: BBR.BtlBw, and BBR.RTprop.
4.1.1. BBR.BtlBw
BBR.BtlBw is BBR's estimate of the bottleneck bandwidth available to
data transmissions for the transport flow. At any time, a transport
connection's data transmissions experience exactly one slowest link
or bottleneck. The bottleneck's delivery rate determines the
connection's maximum data-delivery rate. BBR tries to closely match
its sending rate to this bottleneck delivery rate to help seek "rate
balance", where the flow's packet arrival rate at the bottleneck
equals BBR.BtlBw. The bottleneck rate varies over the life of a
connection, so BBR continually estimates BBR.BtlBw using recent
delivery rate samples.
4.1.1.1. Delivery Rate Samples for Estimating BBR.BtlBw
Since calculating delivery rate samples is subtle, and the samples
are useful independent of congestion control, the approach BBR uses
for measuring each single delivery rate sample is specified in a
separate Internet Draft [draft-cheng-iccrg-delivery-rate-estimation].
4.1.1.2. BBR.BtlBw Max Filter
Delivery rate samples tend to be below the bottleneck bandwidth
available to the flow, due to "noise" introduced by random variation
in physical transmission processes (e.g. radio link layer noise) or
queues along the network path. Thus to filter out these effects BBR
uses a max filter: BBR estimates BBR.BtlBw using the windowed maximum
recent delivery rate sample seen by the connection over that past
BtlBwFilterLen round trips.
The length of the BBR.BtlBw max filter window is BtlBwFilterLen = 10
round trips. This length is driven by trade-offs among several
considerations:
o The BtlBwFilterLen is long enough to cover an entire ProbeBW gain
cycle (see the "ProbeBW" section below). This ensures that the
Cardwell, et al. Expires January 4, 2018 [Page 9]
Internet-Draft BBR July 2017
window contains at least some delivery rate samples that are the
result of data transmitted with a super-unity pacing_gain (a
pacing_gain larger than 1.0). Such super-unity delivery rate
samples are instrumental in revealing the path's underlying
available bandwidth even when there is noise from delivery rate
shortfalls due to aggregation delays, queuing delays from cross-
traffic, lossy link layers with uncorrected losses, or non-
congestive buffer drops (e.g., brief coincident bursts in a
shallow buffer).
o The BtlBwFilterLen aims to be long enough to cover short-term
fluctuations in the network's delivery rate due to the
aforementioned sources of noise. In particular, the delivery rate
for radio link layers (e.g., wifi and cellular technologies) can
be highly variable, and the filter window needs to be long enough
to remember "good" delivery rate samples in order to be robust to
such variations.
o The BtlBwFilterLen aims to be short enough to respond in a timely
manner to real reductions in the bandwidth available to a flow,
whether because new flows have started sharing the bottleneck, or
the bottleneck link service rate has reduced due to physical
changes, policy changes, or routing changes. In any of these
cases, existing BBR flows traversing the bottleneck should, in a
timely manner, reduce their BBR.BtlBw estimates and thus pacing
rate and in-flight data, in order to match the sending behavior to
the new available bandwidth.
4.1.1.3. Tracking Time for the BBR.BtlBw Max Filter
BBR tracks time for the BBR.BtlBw filter window using a virtual (non-
wall-clock) time tracked by BBR.round_count, a count of "packet-
timed" round-trips. BBR uses this virtual BBR.round_count because it
is more robust than using wall clock time. In particular, arbitrary
intervals of wall clock time can elapse due to variations in RTTs or
timer delays for retransmission timeouts, causing wall-clock-timed
bandwidth estimates to "time out" too quickly.
The BBR.round_count counts packet-timed round trips by recording
state about a sentinel packet, and waiting for an ACK of any data
packet that was sent after that sentinel packet, using the following
pseudocode:
Upon connection initialization:
Cardwell, et al. Expires January 4, 2018 [Page 10]
Internet-Draft BBR July 2017
BBRInitRoundCounting():
BBR.next_round_delivered = 0
BBR.round_start = false
BBR.round_count = 0
Upon sending each packet transmission:
packet.delivered = BBR.delivered
Upon receiving an ACK for a given data packet:
BBRUpdateRound():
BBR.delivered += packet.size
if (packet.delivered >= BBR.next_round_delivered)
BBR.next_round_delivered = BBR.delivered
BBR.round_count++
BBR.round_start = true
else
BBR.round_start = false
4.1.1.4. BBR.BtlBw and Application-limited Delivery Rate Samples
Transmissions can be application-limited, meaning the transmission
rate is limited by the application rather than the congestion control
algorithm. This is quite common because of request/response traffic.
When there is a transmission opportunity but no data to send, the
delivery rate sampler marks the corresponding bandwidth sample(s) as
application-limited [draft-cheng-iccrg-delivery-rate-estimation].
The BBR.BtlBw estimator carefully decides which samples to include in
the bandwidth model to ensure that BBR.BtlBw reflects network limits,
not application limits. By default, the estimator discards
application-limited samples, since by definition they reflect
application limits. However, the estimator does use application-
limited samples if the measured delivery rate happens to be larger
than the current BBR.BtlBw estimate, since this indicates the current
BBR.BtlBw estimate is too low.
4.1.1.5. Updating the BBR.BtlBw Max Filter
For every ACK that acknowledges some data packets as delivered, BBR
invokes BBRUpdateBtlBw() to update the BBR.BtlBw estimator as follows
(here packet.delivery_rate is the delivery rate sample obtained from
the "packet" that has just been ACKed, as specified in
[draft-cheng-iccrg-delivery-rate-estimation]):
Cardwell, et al. Expires January 4, 2018 [Page 11]
Internet-Draft BBR July 2017
BBRUpdateBtlBw()
BBRUpdateRound()
if (rs.delivery_rate >= BBR.BtlBw || ! rs.is_app_limited)
BBR.BtlBw = update_windowed_max_filter(
filter=BBR.BtlBwFilter,
value=rs.delivery_rate,
time=BBR.round_count,
window_length=BtlBwFilterLen)
4.1.2. BBR.RTprop
BBR.RTprop is BBR's estimate of the round-trip propagation delay of
the path over which a transport connection is sending. The path's
round-trip propagation delay determines the minimum amount of time
over which the connection must be willing to sustain transmissions at
the BBR.BtlBw rate, and thus the minimum amount of data needed in-
flight, for the connection to reach full utilization (a "Full Pipe").
The round-trip propagation delay can vary over the life of a
connection, so BBR continually estimates RTProp using recent round-
trip delay samples.
4.1.2.1. Round-Trip Time Samples for Estimating BBR.RTprop
For every data packet a connection sends, BBR calculates an RTT
sample that measures the time interval from sending a data packet
until that packet is acknowledged.
For the most part, the same considerations and mechanisms that apply
to RTT estimation for the purposes of retransmission timeout
calculations [RFC6298] apply to BBR RTT samples. Namely, BBR does
not use RTT samples based on the transmission time of retransmitted
packets, since these are ambiguous, and thus unreliable. Also, BBR
calculates RTT samples using both cumulative and selective
acknowledgments (if the transport supports [RFC2018] SACK options or
an equivalent mechanism), or transport-layer timestamps (if the
transport supports [RFC7323] TCP timestamps or an equivalent
mechanism).
The only divergence from RTT estimation for retransmission timeouts
is in the case where a given acknowledgment ACKs more than one data
packet. In order to be conservative and schedule long timeouts to
avoid spurious retransmissions, the maximum among such potential RTT
samples is typically used for computing retransmission timeouts;
i.e., SRTT is typically calculated using the data packet with the
earliest transmission time. By contrast, in order for BBR to try to
reach the minimum amount of data in flight to fill the pipe, BBR uses
the minimum among such potential RTT samples; i.e., BBR calculates
the RTT using the data packet with the latest transmission time.
Cardwell, et al. Expires January 4, 2018 [Page 12]
Internet-Draft BBR July 2017
4.1.2.2. BBR.RTprop Min Filter
RTT samples tend to be above the round-trip propagation delay of the
path, due to "noise" introduced by random variation in physical
transmission processes (e.g. radio link layer noise), queues along
the network path, the receiver's delayed ack strategy, ack
aggregation, etc. Thus to filter out these effects BBR uses a min
filter: BBR estimates BBR.RTprop using the minimum recent RTT sample
seen by the connection over that past RTpropFilterLen seconds. (Many
of the same network effects that can decrease delivery rate
measurements can increase RTT samples, which is why BBR's min-
filtering approach for RTTs is the complement of its max-filtering
approach for delivery rates.)
The length of the RTProp min filter window is RTpropFilterLen = 10
secs. This is driven by trade-offs among several considerations:
o The RTpropFilterLen is longer than ProbeRTTInterval, so that it
covers an entire ProbeRTT cycle (see the "ProbeRTT" section
below). This helps ensure that the window can contain RTT samples
that are the result of data transmitted with inflight below the
estimated BDP of the flow. Such RTT samples are important for
helping to reveal the path's underlying two-way propagation delay
even when the aforementioned "noise" effects can often obscure it.
o The RTpropFilterLen aims to be long enough to avoid needing to cut
in-flight and throughput often. Measuring two-way propagation
delay requires in-flight to be at or below BDP, which risks some
amount of underutilization, so BBR uses a filter window long
enough that such underutilization events can be rare.
o The RTpropFilterLen aims to be long enough that many applications
have a "natural" moment of silence or low utilization that can cut
in-flight below BDP and naturally serve to refresh the BBR.RTprop,
without requiring BBR to force an artificial cut in in-flight.
This applies to many popular applications, including Web, RPC,
chunked audio or video traffic.
o The RTpropFilterLen aims to be short enough to respond in a timely
manner to real increases in the two-way propagation delay of the
path, e.g. due to route changes, which are expected to typically
happen on time scales of 30 seconds or longer.
4.1.2.3. Updating the BBR.RTprop Min Filter
Upon transmitting each packet, BBR (or the associated transport
protocol) stores in per-packet data the wall-clock transmission time
of the packet (Now() returns the current wall-clock time):
Cardwell, et al. Expires January 4, 2018 [Page 13]
Internet-Draft BBR July 2017
packet.send_time = Now()
For every ACK that acknowledges some data packets as delivered, BBR
(or the associated transport protocol) calculates an RTT sample "rtt"
as follows:
packet.rtt = Now() - packet.send_time
A BBR implementation MAY use a generic windowed min filter to track
BBR.RTprop. However, a significant savings in space can be achieved
by using the same state to track BBR.RTprop and ProbeRTT timing, so
this document describes this combined approach. With this approach,
on every ACK that provides an RTT sample BBR updates the BBR.RTprop
estimator as follows:
BBRUpdateRTprop()
BBR.rtprop_expired =
Now() > BBR.rtprop_stamp + RTpropFilterLen
if (packet.rtt >= 0 and
(packet.rtt <= BBR.RTprop or BBR.rtprop_expired))
BBR.RTprop = packet.rtt
BBR.rtprop_stamp = Now()
Here BBR.rtprop_expired is a boolean recording whether the BBR.RTprop
has expired and is due for a refresh, via either an application idle
period or a transition into ProbeRTT state.
4.2. BBR Control Parameters
BBR uses three distinct but interrelated control parameters: pacing
rate, send quantum, and congestion window (cwnd).
4.2.1. Pacing Rate
To help match the packet-arrival rate to the bottleneck link's
departure rate, BBR paces data packets. Pacing enforces a maximum
rate at which BBR schedules packets for transmission. Pacing is the
primary mechanism that BBR uses to control its sending behavior; BBR
implementations MUST implement pacing.
The sending host implements pacing by maintaining inter-packet
spacing at the time each packet is scheduled for transmission,
calculating the next transmission time for a packet for a given flow
(here "next_send_time") as a function of the most recent packet size
and the current pacing rate, as follows:
next_send_time = Now() + packet.size / pacing_rate
Cardwell, et al. Expires January 4, 2018 [Page 14]
Internet-Draft BBR July 2017
To adapt to the bottleneck, in general BBR sets the pacing rate to be
proportional to BBR.BtlBw, with a dynamic gain, or scaling factor of
proportionality, called pacing_gain.
When a BBR flow starts it has no BBR.BtlBw estimate. So in this case
it sets an initial pacing rate based on the transport sender
implementation's initial congestion window ("InitialCwnd", e.g. from
[RFC6298]), the initial SRTT (smoothed round-trip time) after the
first non-zero RTT sample, and the initial pacing_gain:
BBRInitPacingRate():
nominal_bandwidth = InitialCwnd / (SRTT ? SRTT : 1ms)
BBR.pacing_rate = BBR.pacing_gain * nominal_bandwidth
After initialization, on each data ACK BBR updates its pacing rate to
be proportional to BBR.BtlBw, as long as it estimates that it has
filled the pipe (BBR.filled_pipe is true; see the "Startup" section
below for details), or doing so increases the pacing rate. Limiting
the pacing rate updates in this way helps the connection probe
robustly for bandwidth until it estimates it has reached its full
available bandwidth ("filled the pipe"). In particular, this
prevents the pacing rate from being reduced when the connection has
only seen application-limited samples. BBR updates the pacing rate
on each ACK by executing the BBRSetPacingRate() step as follows:
BBRSetPacingRateWithGain(pacing_gain):
rate = pacing_gain * BBR.BtlBw
if (BBR.filled_pipe || rate > BBR.pacing_rate)
BBR.pacing_rate = rate
BBRSetPacingRate():
BBRSetPacingRateWithGain(BBR.pacing_gain)
4.2.2. Send Quantum
In order to amortize per-packet host overheads involved in the
sending process, high-performance transport sender implementations
often schedule an aggregate containing multiple packets (multiple
MSS) worth of data as a single quantum (using TSO, GSO, or other
offload mechanisms [DC13]). The BBR congestion control algorithm
makes this control decision explicitly, dynamically calculating a
BBR.send_quantum control parameter that specifies the maximum size of
these transmission aggregates. This decision is based on a trade-
off: a smaller BBR.send_quantum is preferred a lower data rates
because it results in shorter packet bursts, shorter queues, lower
queueing delays, and lower rates of packet loss; a bigger
BBR.send_quantum can be required at higher data rates because it
results in lower CPU overheads at the sending and receiving hosts,
Cardwell, et al. Expires January 4, 2018 [Page 15]
Internet-Draft BBR July 2017
who can ship larger amounts of data with a single trip through the
networking stack.
On each ACK, BBR runs BBRSetSendQuantum() to update BBR.send_quantum
as follows:
BBRSetSendQuantum():
if (BBR.pacing_rate < 1.2 Mbps)
BBR.send_quantum = 1 * MSS
else if (BBR.pacing_rate < 24 Mbps)
BBR.send_quantum = 2 * MSS
else
BBR.send_quantum = min(BBR.pacing_rate * 1ms, 64KBytes)
A BBR implementation MAY use alternate approaches to select a
BBR.send_quantum, as appropriate for the CPU overheads anticipated
for senders and receivers, and buffering considerations anticipated
in the network path. However, for the sake of the network and other
users, a BBR implementation SHOULD attempt to use the smallest
feasible quanta.
4.2.3. Congestion Window
The congestion window, or cwnd, controls the maximum volume of data
BBR allows in flight in the network at any time. BBR's adapts the
cwnd based on its model of the network path and the state machine's
decisions about how to probe that path.
By default, BBR grows its cwnd to meet its target cwnd,
BBR.target_cwnd, which is scaled to adapt to the estimated BDP
computed from its path model. But BBR's selection of cwnd is
designed to explicitly trade off among competing considerations that
dynamically adapt to various conditions. So in loss recovery BBR
more conservatively adjusts its sending behavior based on more recent
delivery samples, and if BBR needs to re-probe the current BBR.RTprop
of the path then it cuts its cwnd accordingly. The following
sections describe the various considerations that impact cwnd.
4.2.3.1. Initial cwnd
At its core, BBR is about using measurements to build a model of the
network path and then adapting control decisions to the path based on
that model. As such, the selection of the initial cwnd is considered
to be outside the scope of the BBR algorithm, since at initialization
there are no measurements yet upon which BBR can operate. Thus, at
initialization, BBR uses the transport sender implementation's
initial congestion window (e.g. from [RFC6298] for TCP).
Cardwell, et al. Expires January 4, 2018 [Page 16]
Internet-Draft BBR July 2017
4.2.3.2. Target cwnd
The BBR BBR.target_cwnd is the upper bound on the volume of data BBR
allows in flight. This bound is always in place, and dominates when
all other considerations have been satisfied: the flow is not in loss
recovery, does not need to probe BBR.RTprop, and has accumulated
confidence in its model parameters by receiving enough ACKs to
gradually grow the current cwnd to meet the BBR.target_cwnd.
On each ACK, BBR calculates the BBR.target_cwnd as follows:
BBRInflight(gain):
if (BBR.RTprop == Inf)
return InitialCwnd /* no valid RTT samples yet */
quanta = 3*BBR.send_quantum
estimated_bdp = BBR.BtlBw * BBR.RTprop
return gain * estimated_bdp + quanta
BBRUpdateTargetCwnd():
BBR.target_cwnd = BBRInflight(BBR.cwnd_gain)
The "estimated_bdp" term allows enough packets in flight to fully
utilize the estimated BDP of the path, by allowing the flow to send
at BBR.BtlBw for a duration of BBR.RTprop. Scaling up the BDP by
cwnd_gain, selected by the BBR state machine to be above 1.0 at all
times, bounds in-flight data to a small multiple of the BDP, in order
to handle common network and receiver pathologies, such as delayed,
stretched, or aggregated ACKs [A15]. The "quanta" term allows enough
quanta in flight on the sending and receiving hosts to reach full
utilization even in high-throughput environments using offloading
mechanisms.
4.2.3.3. Minimum cwnd for Pipelining
Normally (except in the immediate aftermath of entering loss
recovery), BBR imposes a floor of BBRMinPipeCwnd (4 packets, i.e. 4 *
MSSS). This floor helps ensure that even at very low BDPs, and with
a transport like TCP where a receiver may ACK only every alternate
MSS of data, there are enough packets in flight to maintain full
pipelining. In particular BBR tries to allow at least 2 data packets
in flight and ACKs for at least 2 data packets on the path from
receiver to sender.
4.2.3.4. Modulating cwnd in Loss Recovery
BBR interprets loss as a hint that there may be recent changes in
path behavior that are not yet fully reflected in its model of the
path, and thus it needs to be more conservative.
Cardwell, et al. Expires January 4, 2018 [Page 17]
Internet-Draft BBR July 2017
Upon a retransmission timeout, meaning the sender thinks all in-
flight packets are lost, BBR conservatively reduces cwnd to one
packet (1 MSS) and sends a single packet. Then BBR gradually
increases cwnd using the normal approach outlined below in "Core cwnd
Adjustment Mechanism".
When a BBR sender detects packet loss but there are still packets in
flight, on the first round of the loss-repair process BBR temporarily
reduces the cwnd to match the current delivery rate as ACKs arrive.
On second and later rounds of loss repair, it ensures the sending
rate never exceeds twice the current delivery rate as ACKs arrive.
When BBR exits loss recovery it restores the cwnd to the "last known
good" value that cwnd held before entering recovery. This applies
equally whether the flow exits loss recovery because it finishes
repairing all losses or because it executes an "undo" event after
inferring that a loss recovery event was spurious.
There are several ways to implement this high-level design for
updating cwnd in loss recovery. One is as follows:
Upon retransmission timeout (RTO):
BBR.prior_cwnd = BBRSaveCwnd()
cwnd = 1
Upon entering Fast Recovery, set cwnd to the number of packets still
in flight (allowing at least one for a fast retransmit):
BBR.prior_cwnd = BBRSaveCwnd()
cwnd = packets_in_flight + max(packets_delivered, 1)
BBR.packet_conservation = true
Upon every ACK in Fast Recovery, run the following
BBRModulateCwndForRecovery() steps, which help ensure packet
conservation on the first round of recovery, and sending at no more
than twice the current delivery rate on later rounds of recovery
(given that "packets_delivered" packets were newly marked ACKed or
SACKed and "packets_lost" were newly marked lost):
BBRModulateCwndForRecovery():
if (packets_lost > 0)
cwnd = max(cwnd - packets_lost, 1)
if (BBR.packet_conservation)
cwnd = max(cwnd, packets_in_flight + packets_delivered)
After one round-trip in Fast Recovery:
Cardwell, et al. Expires January 4, 2018 [Page 18]
Internet-Draft BBR July 2017
BBR.packet_conservation = false
Upon exiting loss recovery (RTO recovery or Fast Recovery), either by
repairing all losses or undoing recovery, BBR restores the best-known
cwnd value we had upon entering loss recovery:
BBR.packet_conservation = false
BBRRestoreCwnd()
The BBRSaveCwnd() and BBRRestoreCwnd() helpers help remember and
restore the last-known good cwnd (the latest cwnd unmodulated by loss
recovery or ProbeRTT), and is defined as follows:
BBRSaveCwnd():
if (not InLossRecovery() and BBR.state != ProbeRTT)
return cwnd
else
return max(BBR.prior_cwnd, cwnd)
BBRRestoreCwnd():
cwnd = max(cwnd, BBR.prior_cwnd)
4.2.3.5. Modulating cwnd in ProbeRTT
If BBR decides it needs to enter the ProbeRTT state (see the
"ProbeRTT" section below), its goal is to quickly reduce the volume
of in-flight data and drain the bottleneck queue, thereby allowing
measurement of BBR.RTprop. To implement this mode, BBR bounds the
cwnd to BBRMinPipeCwnd, the minimal value that allows pipelining (see
the "Minimum cwnd for Pipelining" section, above):
BBRModulateCwndForProbeRTT():
if (BBR.state == ProbeRTT)
cwnd = min(cwnd, BBRMinPipeCwnd)
4.2.3.6. Core cwnd Adjustment Mechanism
The network path and traffic traveling over it can make sudden
dramatic changes. To adapt to these changes smoothly and robustly,
and reduce packet losses in such cases, BBR uses a conservative
strategy. When cwnd is above the BBR.target_cwnd derived from BBR's
path model, BBR cuts the cwnd immediately to the target. When cwnd
is below BBR.target_cwnd, BBR raises the cwnd gradually and
cautiously, increasing cwnd by no more than the amount of data
acknowledged (cumulatively or selectively) upon each ACK.
Cardwell, et al. Expires January 4, 2018 [Page 19]
Internet-Draft BBR July 2017
Specifically, on each ACK that acknowledges "packets_delivered"
packets as newly ACKed or SACKed, BBR runs the following BBRSetCwnd()
steps to update cwnd:
BBRSetCwnd():
BBRUpdateTargetCwnd()
BBRModulateCwndForRecovery()
if (not BBR.packet_conservation) {
if (BBR.filled_pipe)
cwnd = min(cwnd + packets_delivered, BBR.target_cwnd)
else if (cwnd < BBR.target_cwnd || BBR.delivered < InitialCwnd)
cwnd = cwnd + packets_delivered
cwnd = max(cwnd, BBRMinPipeCwnd)
}
BBRModulateCwndForProbeRTT()
There are several considerations here. If BBR has measured enough
samples to achieve confidence that it has filled the pipe (see the
description of BBR.filled_pipe in the "Startup" section below), then
it increases its cwnd based on the number of packets delivered, while
bounding its cwnd to be no larger than the BBR.target_cwnd adapted to
the estimated BDP. Otherwise, if the cwnd is below the target, or
the sender has marked so little data delivered (less than
InitialCwnd) that it does not yet judge its BBR.BtlBw estimate and
BBR.target_cwnd as useful, then it increases cwnd without bounding it
to be below the target. Finally, BBR imposes a floor of
BBRMinPipeCwnd in order to allow pipelining even with small BDPs (see
the "Minimum cwnd for Pipelining" section, above).
4.3. State Machine
BBR implements a simple state machine that uses the network path
model described above to guide its decisions, and the control
parameters described above to enact its decisions. At startup, BBR
probes to build a model of the network path; to adapt to later
changes to the path or its traffic, BBR must continue to probe to
update its model. If the available bottleneck bandwidth increases,
BBR must send faster to discover this. Likewise, if the actual
round-trip propagation delay changes, this changes the BDP, and thus
BBR must send slower to get inflight below the new BDP in order to
measure the new BBR.RTprop. Thus, BBR's state machine runs periodic,
sequential experiments, sending faster to check for BBR.BtlBw
increases or sending slower to check for BBR.RTprop decreases. The
frequency, magnitude, duration, and structure of these experiments
differ depending on what's already known (startup or steady-state)
and sending application behavior (intermittent or continuous).
This state machine has several goals:
Cardwell, et al. Expires January 4, 2018 [Page 20]
Internet-Draft BBR July 2017
o Achieve high throughput by efficiently utilizing available
bandwidth.
o Achieve low latency by keeping queues bounded and small.
o Periodically send multiplicatively faster to probe the network to
quickly discover newly-available bandwidth (in time proportional
to the logarithm of the newly-available bandwidth).
o If needed, send slower in order to drain the bottleneck queue and
probe the network to try to estimate the network's two-way
propagation delay.
o Share bandwidth with other flows in an approximately fair manner.
o Feed samples to the BBR.BtlBw and BBR.RTprop estimators to refresh
and update the model.
BBR's state machine uses two control mechanisms. First and foremost,
it uses the pacing_gain (see the "Pacing Rate" section), which
controls how fast packets are sent relative to BBR.BtlBw and is thus
key to BBR's ability to learn. A pacing_gain > 1 decreases inter-
packet time time and increases inflight. A pacing_gain < 1 has the
opposite effect, increasing inter-packet time and generally
decreasing inflight. Second, if the state machine needs to quickly
reduce inflight to a particular absolute value, it uses the cwnd.
The following sections describe each state in turn.
4.3.1. Initialization Steps
Upon transport connection initialization, BBR executes the following
steps:
BBRInit():
init_windowed_max_filter(filter=BBR.BtlBwFilter, value=0, time=0)
BBR.rtprop = SRTT ? SRTT : Inf
BBR.rtprop_stamp = Now()
BBR.probe_rtt_done_stamp = 0
BBR.probe_rtt_round_done = false
BBR.packet_conservation = false
BBR.prior_cwnd = 0
BBR.idle_restart = false
BBRInitRoundCounting()
BBRInitFullPipe()
BBRInitPacingRate()
BBREnterStartup()
Cardwell, et al. Expires January 4, 2018 [Page 21]
Internet-Draft BBR July 2017
4.3.2. Startup
4.3.2.1. Startup Dynamics
When a BBR flow starts up, it performs its first (and most rapid)
sequential probe/drain process. Network link bandwidths currently
span a range of at least 11 orders of magnitude, from a few bps to
100 Gbps. To quickly learn BBR.BtlBw, given this huge range to
explore, BBR's Startup state does an exponential search of the rate
space, doubling the sending rate each round. This finds BBR.BtlBw in
O(log_2(BDP)) round trips.
To achieve this rapid probing in the smoothest possible fashion, upon
entry into Startup state BBR sets BBR.pacing_gain and BBR.cwnd_gain
to BBRHighGain = 2/ln(2) ~= 2.89, the minimum gain value that will
allow the sending rate to double each round. Hence, when
initializing a connection, or upon any later entry into Startup mode,
BBR executes the following BBREnterStartup() steps:
BBREnterStartup():
BBR.state = Startup
BBR.pacing_gain = BBRHighGain
BBR.cwnd_gain = BBRHighGain
As BBR grows its sending rate rapidly, it obtains higher delivery
rate samples, BBR.BtlBw increases, and the pacing rate and cwnd both
adapt by smoothly growing in proportion. Once the pipe is full, a
queue generally forms, but the cwnd_gain bounds any queue to
(cwnd_gain - 1) * BDP, and the immediately following Drain state is
designed to quickly drain that queue.
4.3.2.2. Estimating When Startup has Filled the Pipe
During Startup, BBR estimates whether the pipe is full by looking for
a plateau in the BBR.BtlBw estimate. The output of this "full pipe"
estimator is tracked in BBR.filled_pipe, a boolean that records
whether BBR estimates that it has ever fully utilized its available
bandwidth ("filled the pipe"). If BBR notices that there are several
(three) rounds where attempts to double the delivery rate actually
result in little increase (less than 25 percent), then it estimates
that it has reached BBR.BtlBw, sets BBR.filled_pipe to true, exits
Startup and enters Drain.
Upon connection initialization the full pipe estimator runs:
Cardwell, et al. Expires January 4, 2018 [Page 22]
Internet-Draft BBR July 2017
BBRInitFullPipe():
BBR.filled_pipe = false
BBR.full_bw = 0
BBR.full_bw_count = 0
Once per round trip, upon an ACK that acknowledges new data, and when
the delivery rate sample is not application-limited (see
[draft-cheng-iccrg-delivery-rate-estimation]), BBR runs the "full
pipe" estimator, if needed:
BBRCheckFullPipe():
if BBR.filled_pipe or
not BBR.round_start or rs.is_app_limited
return // no need to check for a full pipe now
if (BBR.BtlBw >= BBR.full_bw * 1.25) // BBR.BtlBw still growing?
BBR.full_bw = BBR.BtlBw // record new baseline level
BBR.full_bw_count = 0
return
BBR.full_bw_count++ // another round w/o much growth
if (BBR.full_bw_count >= 3)
BBR.filled_pipe = true
BBR waits three rounds in order to have solid evidence that the
sender is not detecting a delivery-rate plateau that was temporarily
imposed by the receive window. Allowing three rounds provides time
for the receiver's receive-window autotuning to open up the receive
window and for the BBR sender to realize that BBR.BtlBw should be
higher: in the first round the receive-window autotuning algorithm
grows the receive window; in the second round the sender fills the
higher receive window; in the third round the sender gets higher
delivery-rate samples. This three-round threshold was validated by
YouTube experimental data.
4.3.3. Drain
In Startup, when the BBR "full pipe" estimator estimates that BBR has
filled the pipe, BBR switches to its Drain state. In Drain, BBR aims
to quickly drain any queue created in Startup by switching to a
pacing_gain well below 1.0; specifically, it uses a pacing_gain that
is the inverse of the value used during Startup, which drains the
queue in one round:
BBREnterDrain():
BBR.state = Drain
BBR.pacing_gain = 1/BBRHighGain // pace slowly
BBR.cwnd_gain = bbr_high_gain // maintain cwnd
Cardwell, et al. Expires January 4, 2018 [Page 23]
Internet-Draft BBR July 2017
In Drain, when the number of packets in flight matches the estimated
BDP, meaning BBR estimates that the queue has been fully drained but
the pipe is still full, then BBR leaves Drain and enters ProbeBW. To
implement this, upon every ACK BBR executes:
BBRCheckDrain():
if (BBR.state == Startup and BBR.filled_pipe)
BBREnterDrain()
if (BBR.state == Drain and packets_in_flight <= BBRInflight(1.0))
BBREnterProbeBW() // we estimate queue is drained
4.3.4. ProbeBW
4.3.4.1. Gain Cycling Dynamics
BBR flows spend the vast majority of their time in ProbeBW state,
probing for bandwidth using an approach called gain cycling, which
helps BBR flows reach high throughput, low queuing delay, and
convergence to a fair share of bandwidth. With gain cycling, BBR
cycles through a sequence of values for the pacing_gain. It uses an
eight-phase cycle with the following pacing_gain values: 5/4, 3/4, 1,
1, 1, 1, 1, 1. Each phase normally lasts for roughly BBR.RTprop.
In phase 0 of the gain cycle, BBR probes for more bandwidth by using
a pacing_gain of 5/4, which gradually raises inflight. It does this
until the elapsed time in the phase has been at least BBR.RTprop and
either inflight has reached 5/4 * estimated_BDP (which may take
longer than BBR.RTprop if BBR.RTprop is low) or some packets have
been lost (suggesting that perhaps the path cannot actually hold 5/4
* estimated_BDP in flight).
Next, phase 1 of the gain cycle is designed to drain any queue at the
bottleneck (the likely outcome of phase 0 if the pipe was full) by
using a pacing_gain of 3/4, chosen to be the same distance below 1
that 5/4 is above 1. This phase lasts until either a full BBR.RTprop
has elapsed or inflight drops below estimated_BDP (suggesting that
the queue has drained earlier than expected, perhaps because of
application-limited behavior).
Finally, phases 2 through 7 of the gain cycle are designed to cruise
with a short queue and full utilization by using a pacing_gain of
1.0. Each of these phases lasts for the BBR.RTprop.
The average gain across all phases is 1.0 because ProbeBW aims for
its average pacing rate to equal BBR.BtlBw, the estimated available
bandwidth, and thus maintain high utilization, while maintaining a
small, well-bounded queue.
Cardwell, et al. Expires January 4, 2018 [Page 24]
Internet-Draft BBR July 2017
Note that while gain cycling varies the pacing_gain value, during
ProbeBW the cwnd_gain stays constant at two, since delayed,
stretched, or aggregated acks can strike at any time (see the "Target
cwnd" section).
4.3.4.2. Gain Cycling Randomization
To improve mixing and fairness, and to reduce queues when multiple
BBR flows share a bottleneck, BBR randomizes the phases of ProbeBW
gain cycling by randomly picking an initial phase, from among all but
the 3/4 phase, when entering ProbeBW.
Why not start cycling with 3/4? The main advantage of the 3/4
pacing_gain is to drain any queue that can be created by running a
5/4 pacing_gain when the pipe is already full. When exiting Drain or
ProbeRTT and entering ProbeBW, there is typically no queue to drain,
so the 3/4 gain does not provide that advantage. Using 3/4 in those
contexts only has a cost: a link utilization for that round of 3/4
instead of 1. Since starting with 3/4 would have a cost but no
benefit, and since entering ProbeBW happens at the start of any
connection long enough to have a Drain, BBR uses this small
optimization.
4.3.4.3. Gain Cycling Algorithm
BBR's ProbeBW gain cycling algorithm operates as follows.
Upon entering ProbeBW, BBR (re)starts gain cycling with the
following:
BBREnterProbeBW():
BBR.state = ProbeBW
BBR.pacing_gain = 1
BBR.cwnd_gain = 2
BBR.cycle_index = BBRGainCycleLen - 1 - random_int_in_range(0..6)
BBRAdvanceCyclePhase()
On each ACK BBR runs BBRCheckCyclePhase(), to see if it's time to
advance to the next gain cycle phase:
Cardwell, et al. Expires January 4, 2018 [Page 25]
Internet-Draft BBR July 2017
BBRCheckCyclePhase():
if (BBR.sate == ProbeBW and BBRIsNextCyclePhase()
BBRAdvanceCyclePhase()
BBRAdvanceCyclePhase():
BBR.cycle_stamp = Now()
BBR.cycle_index = (BBR.cycle_index + 1) % BBRGainCycleLen
pacing_gain_cycle = [5/4, 3/4, 1, 1, 1, 1, 1, 1]
BBR.pacing_gain = pacing_gain_cycle[BBR.cycle_index]
BBRIsNextCyclePhase():
is_full_length = (Now() - BBR.cycle_stamp) > BBR.RTprop
if (BBR.pacing_gain == 1)
return is_full_length
if (BBR.pacing_gain > 1)
return is_full_length and
(packets_lost > 0 or
prior_inflight >= BBRInflight(BBR.pacing_gain))
else // (BBR.pacing_gain < 1)
return is_full_length or
prior_inflight <= BBRInflight(1)
Here, "prior_inflight" is the amount of data that was in flight
before processing this ACK.
4.3.4.4. Restarting From Idle
When restarting from idle, BBR leaves its cwnd as-is and paces
packets at exactly BBR.BtlBw, aiming to return as quickly as possible
to its target operating point of rate balance and a full pipe.
Specifically, if the flow's BBR.state is ProbeBW, and the flow is
application-limited, and there are no packets in flight currently,
then at the moment the flow sends one or more packets BBR sets
BBR.pacing_rate to exactly BBR.BtlBw. More precisely, the BBR
algorithm takes the following steps in BBRHandleRestartFromIdle()
before sending a packet for a flow:
BBRHandleRestartFromIdle():
if (packets_in_flight == 0 and C.app_limited)
BBR.idle_start = true
if (BBR.state == ProbeBW)
BBRSetPacingRateWithGain(1)
The "Restarting Idle Connections" section of [RFC5681] suggests
restarting from idle by slow-starting from the initial window.
However, this approach was assuming a congestion control algorithm
that had no estimate of the bottleneck bandwidth and no pacing, and
thus resorted to relying on slow-starting driven by an ACK clock.
Cardwell, et al. Expires January 4, 2018 [Page 26]
Internet-Draft BBR July 2017
The long (log_2(BDP)*RTT) delays required to reach full utilization
with that "slow start after idle" approach caused many large
deployments to disable this mechanism, resulting in a "BDP-scale
line-rate burst" approach instead. Instead of these two approaches,
BBR restarts by pacing at BBR.BtlBw, typically achieving approximate
rate balance and a full pipe after only one BBR.RTprop has elapsed.
4.3.5. ProbeRTT
To help probe for BBR.RTprop, BBR flows cooperate to periodically
drain the bottleneck queue using a state called ProbeRTT, when
needed. In any state other than ProbeRTT itself, if the RTProp
estimate has not been updated (i.e., by getting a lower RTT
measurement) for more than ProbeRTTInterval = 10 seconds, then BBR
enters ProbeRTT and reduces the cwnd to a minimal value,
BBRMinPipeCwnd (four packets). After maintaining BBRMinPipeCwnd or
fewer packets in flight for at least ProbeRTTDuration (200 ms) and
one round trip, BBR leaves ProbeRTT and transitions to either Startup
or ProbeBW, depending on whether it estimates the pipe was filled
already.
BBR is designed to spend the vast majority of its time (about 98
percent) in ProbeBW and the rest in ProbeRTT, based on a set of
tradeoffs. ProbeRTT lasts long enough (at least ProbeRTTDuration =
200 ms) to allow flows with different RTTs to have overlapping
ProbeRTT states, while still being short enough to bound the
throughput penalty of ProbeRTT's cwnd capping to roughly 2 percent
(200 ms/10 seconds).
As discussed in the "BBR.RTprop Min Filter" section above, BBR's
BBR.RTprop filter window, RTpropFilterLen, and time interval between
ProbeRTT states, ProbeRTTInterval, work in concert. A BBR
implementation MUST use a RTpropFilterLen equal to or longer than
ProbeRTTInterval, and to allow coordination with other BBR flows MUST
use the standard ProbeRTTInterval of 10 secs. It is RECOMMENDED to
use 10 secs for both RTpropFilterLen and ProbeRTTInterval. An
RTpropFilterLen of 10 secs is short enough to allow quick convergence
if traffic levels or routes change, but long enough so that
interactive applications (e.g., Web pages, remote procedure calls,
video chunks) often have natural silences or low-rate periods within
the window where the flow's rate is low enough or long enough to
drain its queue in the bottleneck. Then the BBR.RTprop filter
opportunistically picks up these BBR.RTprop measurements, and RTProp
refreshes without requiring ProbeRTT. This way, flows typically need
only pay the 2 percent throughput penalty if there are multiple bulk
flows busy sending over the entire ProbeRTTInterval window.
Cardwell, et al. Expires January 4, 2018 [Page 27]
Internet-Draft BBR July 2017
On every ACK BBR executes BBRCheckProbeRTT() to handle the steps
related to the ProbeRTT state as follows:
BBRCheckProbeRTT():
if (BBR.state != ProbeRTT and
BBR.rtprop_expired and
not BBR.idle_restart)
BBREnterProbeRTT()
BBRSaveCwnd()
BBR.probe_rtt_done_stamp = 0
if (BBR.state == ProbeRTT)
BBRHandleProbeRTT()
BBR.idle_restart = false
BBREnterProbeRTT():
BBR.state = ProbeRTT
BBR.pacing_gain = 1
BBR.cwnd_gain = 1
BBRHandleProbeRTT():
/* Ignore low rate samples during ProbeRTT: */
C.app_limited =
(BW.delivered + packets_in_flight) ? : 1
if (BBR.probe_rtt_done_stamp == 0 and
packets_in_flight <= BBRMinPipeCwnd)
BBR.probe_rtt_done_stamp =
Now() + ProbeRTTDuration
BBR.probe_rtt_round_done = false
BBR.next_round_delivered = BBR.delivered
else if (BBR.probe_rtt_done_stamp != 0)
if (BBR.round_start)
BBR.probe_rtt_round_done = true
if (BBR.probe_rtt_round_done and
Now() > BBR.probe_rtt_done_stamp)
BBR.rtprop_stamp = Now()
BBRRestoreCwnd()
BBRExitProbeRTT()
BBRExitProbeRTT():
if (BBR.filled_pipe)
BBREnterProbeBW()
else
BBREnterStartup()
When restarting from idle (BBR.idle_restart is true) and finding that
the BBR.RTprop has expired, BBR does not enter ProbeRTT; the idleness
is deemed a sufficient attempt to coordinate to drain the queue.
Cardwell, et al. Expires January 4, 2018 [Page 28]
Internet-Draft BBR July 2017
The critical point is that before BBR raises its BBR.RTprop estimate
(which will in turn raise its cwnd), it must first enter ProbeRTT to
make a concerted and coordinated effort to drain the bottleneck queue
and measure the BBR.RTprop. This allows the BBR.RTprop estimates of
ensembles of BBR flows to stay well-anchored in reality, avoiding
feedback loops of ever-increasing queues and RTT samples.
5. Implementation Status
This section records the status of known implementations of the
algorithm defined by this specification at the time of posting of
this Internet-Draft, and is based on a proposal described in
[RFC7942]. The description of implementations in this section is
intended to assist the IETF in its decision processes in progressing
drafts to RFCs. Please note that the listing of any individual
implementation here does not imply endorsement by the IETF.
Furthermore, no effort has been spent to verify the information
presented here that was supplied by IETF contributors. This is not
intended as, and must not be construed to be, a catalog of available
implementations or their features. Readers are advised to note that
other implementations may exist.
According to [RFC7942], "this will allow reviewers and working groups
to assign due consideration to documents that have the benefit of
running code, which may serve as evidence of valuable experimentation
and feedback that have made the implemented protocols more mature.
It is up to the individual working groups to use this information as
they see fit".
As of the time of writing, the following implementations of BBR have
been publicly released:
o Linux TCP
* Source code URL:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin
ux.git/tree/net/ipv4/tcp_bbr.c
* Source: Google
* Maturity: production
* License: dual-licensed: GPLv2 / BSD
* Contact: https://groups.google.com/d/forum/bbr-dev
* Last updated: June 30, 2017
Cardwell, et al. Expires January 4, 2018 [Page 29]
Internet-Draft BBR July 2017
o QUIC
* Source code URLs:
+ https://chromium.googlesource.com/chromium/src/net/+/master/
quic/core/congestion_control/bbr_sender.cc
+ https://chromium.googlesource.com/chromium/src/net/+/master/
quic/core/congestion_control/bbr_sender.h
* Source: Google
* Maturity: production
* License: BSD-style
* Contact: https://groups.google.com/d/forum/bbr-dev
* Last updated: June 30, 2017
6. Security Considerations
This proposal makes no changes to the underlying security of
transport protocols or congestion control algorithms. BBR shares the
same security considerations as the existing standard congestion
control algorithm [RFC5681].
7. IANA Considerations
This document has no IANA actions. Here we are using that phrase,
suggested by [RFC5226], because BBR does not modify or extend the
wire format of any network protocol, nor does it add new dependencies
on assigned numbers. BBR involves only a change to the congestion
control algorithm of a transport sender, and does not involve changes
in the network, the receiver, or any network protocol.
Note to RFC Editor: this section may be removed on publication as an
RFC.
8. Acknowledgments
The authors are grateful to Len Kleinrock for his work on the theory
underlying congestion control. We are indebted to Larry Brakmo for
pioneering work on the Vegas [BP95] and New Vegas [B15] congestion
control algorithms, which presaged many elements of BBR, and for
Larry's advice and guidance during BBR's early development. The
authors would also like to thank C. Stephen Gunn, Eric Dumazet, Ian
Swett, Jana Iyengar, Victor Vasiliev, Nandita Dukkipati, Pawel
Cardwell, et al. Expires January 4, 2018 [Page 30]
Internet-Draft BBR July 2017
Jurczyk, Biren Roy, David Wetherall, Amin Vahdat, Leonidas
Kontothanassis, and the YouTube, google.com, Bandwidth Enforcer, and
Google SRE teams for their invaluable help and support.
9. References
9.1. Normative References
[RFC2018] Mathis, M. and J. Mahdavi, "TCP Selective Acknowledgment
Options", RFC 2018, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", RFC 2119, March 1997.
[RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an
IANA Considerations Section in RFCs", May 2008.
[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
Control", RFC 5681, September 2009.
[RFC6298] Paxson, V., "Computing TCP's Retransmission Timer",
RFC 6298, June 2011.
[RFC7323] Borman, D., Braden, B., Jacobson, V., and R.
Scheffenegger, "TCP Extensions for High Performance",
September 2014.
[RFC793] Postel, J., "Transmission Control Protocol", September
1981.
[RFC7942] Sheffer, Y. and A. Farrel, "Improving Awareness of Running
Code: The Implementation Status Section", July 2016.
9.2. Informative References
[A15] Abrahamsson, M., "TCP ACK suppression", IETF AQM mailing
list , November 2015, <https://www.ietf.org/mail-
archive/web/aqm/current/msg01480.html>.
[B15] Brakmo, L., "TCP-NV: An Update to TCP-Vegas", , August
2015, <https://docs.google.com/document/d/1o-53jbO_xH-
m9g2YCgjaf5bK8vePjWP6Mk0rYiRLK-U/edit>.
[BP95] Brakmo, L. and L. Peterson, "TCP Vegas: end-to-end
congestion avoidance on a global Internet", IEEE Journal
on Selected Areas in Communications 13(8): 1465-1480 ,
October 1995.
Cardwell, et al. Expires January 4, 2018 [Page 31]
Internet-Draft BBR July 2017
[CCGHJ16] Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S., and
V. Jacobson, "BBR: Congestion-Based Congestion Control",
ACM Queue Oct 2016, September-October 2016,
<http://queue.acm.org/detail.cfm?id=3022184>.
[CCGHJ17] Cardwell, N., Cheng, Y., Gunn, C., Hassas Yeganeh, S., and
V. Jacobson, "BBR: Congestion-Based Congestion Control",
Communications of the ACM Feb 2017, February 2017,
<https://cacm.acm.org/magazines/2017/2/212428-bbr-
congestion-based-congestion-control/pdf>.
[DC13] Dumazet, E. and Y. Cheng, "TSO, fair queuing, pacing:
three's a charm", IETF 88 , November 2013,
<https://www.ietf.org/proceedings/88/slides/slides-88-
tcpm-9.pdf>.
[draft-cheng-iccrg-delivery-rate-estimation]
Cheng, Y., Cardwell, N., Hassas Yeganeh, S., and V.
Jacobson, "Delivery Rate Estimation", draft-cheng-iccrg-
delivery-rate-estimation-00 (work in progress), June 2017,
<https://tools.ietf.org/html/draft-cheng-iccrg-delivery-
rate-estimation-00>.
[draft-ietf-quic-transport-00]
Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed
and Secure Transport", draft-ietf-quic-transport-00 (work
in progress), Nov 2016, <https://tools.ietf.org/html/
draft-ietf-quic-transport-00>.
[draft-ietf-tcpm-cubic]
Rhee, I., Xu, L., Ha, S., Zimmermann, A., Eggert, L., and
R. Scheffenegger, "CUBIC for Fast Long-Distance Networks",
draft-ietf-tcpm-cubic-04 (work in progress), February
2017, <https://tools.ietf.org/html/draft-ietf-tcpm-cubic-
04>.
[GK81] Gail, R. and L. Kleinrock, "An Invariant Property of
Computer Network Power", Proceedings of the International
Conference on Communications June, 1981,
<http://www.lk.cs.ucla.edu/data/files/Gail/power.pdf>.
[HRX08] Ha, S., Rhee, I., and L. Xu, "CUBIC: A New TCP-Friendly
High-Speed TCP Variant", ACM SIGOPS Operating System
Review , 2008.
Cardwell, et al. Expires January 4, 2018 [Page 32]
Internet-Draft BBR July 2017
[Jac88] Jacobson, V., "Congestion Avoidance and Control", SIGCOMM
1988, Computer Communication Review, vol. 18, no. 4, pp.
314-329 , August 1988, <ftp://ftp.ee.lbl.gov/papers/
congavoid.ps.Z>.
[Jac90] Jacobson, V., "Modified TCP Congestion Avoidance
Algorithm", end2end-interest mailing list , April 1990,
<ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail>.
[K79] Kleinrock, L., "Power and deterministic rules of thumb for
probabilistic problems in computer communications",
Proceedings of the International Conference on
Communications 1979.
[WS95] Wright, G. and W. Stevens, "TCP/IP Illustrated, Volume 2:
The Implementation", Addison-Wesley , 1995.
Authors' Addresses
Neal Cardwell
Google, Inc
76 Ninth Avenue
New York, NY 10011
USA
Email: ncardwell@google.com
Yuchung Cheng
Google, Inc
1600 Amphitheater Parkway
Mountain View, California 94043
USA
Email: ycheng@google.com
Soheil Hassas Yeganeh
Google, Inc
76 Ninth Avenue
New York, NY 10011
USA
Email: soheil@google.com
Cardwell, et al. Expires January 4, 2018 [Page 33]
Internet-Draft BBR July 2017
Van Jacobson
Google, Inc
1600 Amphitheater Parkway
Mountain View, California 94043
Email: vanj@google.com
Cardwell, et al. Expires January 4, 2018 [Page 34]