
# Scanning the Internet for Liveness


## Abstract and Figures

Internet-wide scanning depends on a notion of liveness: does a target IP address respond to a probe packet? However, the interpretation of such responses, or lack of them, is nuanced and depends on multiple factors, including: how we probed, how different protocols in the network stack interact, the presence of filtering policies near the target, and temporal churn in IP responsiveness. Although often neglected, these factors can significantly affect the results of active measurement studies. We develop a taxonomy of liveness which we employ to develop a method to perform concurrent IPv4 scans using ICMP, five TCP-based, and two UDP-based protocols, comprehensively capturing all responses to our probes, including negative and cross-layer responses. Leveraging our methodology, we present a systematic analysis of liveness and how it manifests in active scanning campaigns, yielding practical insights and methodological improvements for the design and the execution of active Internet measurement studies.
Shehar Bano (University College London)
Philipp Richter (MIT)
Mobin Javed (LUMS Pakistan, ICSI Berkeley)
Srikanth Sundaresan (Princeton University)
Zakir Durumeric (Stanford University)
Steven J. Murdoch (University College London)
Richard Mortier (University of Cambridge)
Vern Paxson (UC Berkeley, ICSI Berkeley)
CCS Concepts
Networks → Signaling protocols; Transport protocols; Application
layer protocols; Network dynamics; Cross-layer protocols;
Keywords
Active Measurement, Scanning, Cross-protocol, Census
1. INTRODUCTION
Internet-wide scanning has emerged as a key measurement tech-
nique to study a diverse set of the Internet’s properties, including
address space utilization [5, 16], host reachability [4], topology [6,
36, 18], service availability [20, 21], vulnerabilities [11, 17, 23],
and service discrimination [19]. In simplest terms, active scanning
campaigns involve the sending of one or more probe packets to
a target IP address and observing a response (or absence thereof)
from the targeted host. If a host replies to a probe packet, we refer
to it as alive. Individual measurement campaigns (see above) are
typically crafted to elucidate individual properties of the Internet
and its host population. Yet despite the widespread use of active
scanning and its critical importance for Internet measurement, we
still lack a systematic framework that allows us to understand IP
liveness and, more importantly, how it manifests in the form of
host replies to active probing. What type of probe packets should
we send if we, for example, want to maximize the responding host
population? What type of responses can we expect and which fac-
tors determine such responses? What degree of consistency can we
expect when probing the same host with different probe packets?
Fundamentally, liveness is not a straightforward binary matter,
but varies depending on: (i) probe type and the target or target
network’s policies related to ﬁrewalling and ﬁltering, (ii) temporal
churn due to targets going up and down, and (iii) protocol inter-
dependencies that result in probes to one target eliciting responses
from another (e.g., ICMP Error responses to TCP probes). While
seemingly nuanced, such characteristics have the potential to sig-
niﬁcantly affect the result of active measurement campaigns. We
argue that developing a systematic understanding of these issues
yields methodological improvements and practical implications for
active measurements at large. Towards achieving this goal, we
make the following contributions in this paper.
First, we propose a taxonomy of liveness, examining what it
means to say that a target IP address is alive and how liveness can
be inferred considering responses to active probing (§2). This tax-
onomy develops our understanding of liveness at different layers,
and covers responses across protocols and from non-targets (e.g.,
middleboxes). Informed by our taxonomy, we introduce a method-
ology for performing Internet-wide scans concurrently across a
set of different protocols at various layers, including ICMP, pop-
ular TCP services and popular UDP services (§3). Our diverse
set of probe packets allows us to study the responsiveness or non-
responsiveness of individual host populations to speciﬁc protocols.
Our scans are comprehensive in that we capture all replies to our
probe packets, including negative (e.g., TCP Rst packets) as well
as cross-layer replies (e.g., ICMP error replies). This enables us
to uncover otherwise invisible host populations and to study cross-
layer protocol interactions. Based on our gathered data, we present
an in-depth view of liveness (§4), slicing our analysis along two
dimensions: (i) probe type (i.e., what type of packet we send), and
(ii) response type (i.e., what type of reply we capture).
Our analysis yields important insights for both the design of ac-
tive scanning campaigns as well as the interpretation of scanning
results. Our key ﬁndings include: (i) TCP and UDP probes in-
crease the population responsive over ICMP by 18%, (ii) compre-
hensively capturing reply trafﬁc (i.e., taking into account negative
reply packets) increases the responsive population by more than
13%, (iii) TCP stacks do not consistently respond with a TCP Rst
for non-available services—in our measurements only 24% of hosts
with an active TCP stack respond to all the probes, (iv) our concur-
rent scans allow us to identify nearly 2M tarpits that would bias
measurements that do not take them into account, and (v) we report
on the correlation of responsiveness across protocols uncovering
potential ﬁltering practices.
We believe that our measurements paint the most comprehensive
and least noisy picture of the state of Internet liveness available to
date. Our taxonomy of IP liveness can serve as a basis for designing
and executing future measurement studies, particularly when it
comes to decisions such as what type of probe packets should be
employed and what type of responses should be captured, how to
interpret responses, as well as whether it is appropriate to use the
output of one scan as input for subsequent measurements. We release
the code and data of this work as open source to allow for
reproducibility of the results, and to enable further research.
Figure 1: Flow chart showing liveness inference. We consider liveness at different layers based on responses to our probes.
2. TAXONOMIZING IP LIVENESS
To systematize our understanding of IP liveness and its inference
using active probing, we introduce the following terminology.
Network-layer liveness. An IP address is network-layer alive if
it responds to a probe with an IP packet. This is the most basic
liveness criterion.
Transport-layer liveness. An IP address is considered transport-
layer active if it is capable of sending TCP packets (whether
SynAck or Rst) from at least one port. That is, the transport pro-
tocol stack of the IP address is operational, sending packets with
transport layer semantics. An IP address is transport-layer alive if
it accepts TCP connections on a speciﬁc port number, indicated by
a SynAck response to a probe.
Application-layer liveness. Application-layer active means that
the IP address sends a payload, valid or invalid, for at least one
application protocol. An IP address is application-layer alive if it
speaks the probed application protocol.
Figure 1 shows a ﬂowchart of our methodology for inferring live-
ness, which depends both on the probe type and the reply, per our
taxonomy above. Observe that probes of different types have vary-
ing degrees of speciﬁcity (e.g., a TCP Syn probe targets a speciﬁc
port number, whereas an ICMP Echo probe targets a vanilla IP ad-
dress), as well as different inference power: ICMP Echo requests
can only infer network layer liveness, while TCP probes can infer
transport layer liveness in addition. The dependency on the probe
type is important to realize, as different probes can reveal differ-
ent views of liveness at a given layer. For example, IP addresses
might not directly reveal network layer liveness (e.g., we might not
receive a reply to an ICMP Echo probe),
while we can observe them to be transport layer alive using TCP
probes, indirectly inferring their network liveness. Since UDP is
connectionless and relies on ICMP for negative replies, we do not
have an explicit notion of transport layer liveness for UDP. Our
UDP probes contain application-speciﬁc requests, but the corre-
sponding responses can indicate network- or application-layer live-
ness (i.e., ICMP Error replies indicate network layer liveness while
UDP replies can indicate application layer liveness and network
layer liveness).
In the remainder of this work, we employ this taxonomy to ana-
lyze liveness through active probing scans. We focus our analysis
on network and transport layers, but include application layer live-
ness in the taxonomy for completeness.
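The taxonomy can be read as a small decision procedure over a probe type and the reply it elicits. The Python sketch below is one possible reading of the definitions above, not the code released with the paper. The function name `infer_liveness` and the string labels for probes, replies, and liveness classes are hypothetical.

```python
def infer_liveness(probe, reply, speaks_protocol=False):
    """Return the liveness labels implied by a single probe/reply pair.

    probe: "icmp", "tcp" or "udp" (hypothetical labels).
    reply: None, "icmp_echo_reply", "icmp_error", "tcp_synack",
           "tcp_rst" or "udp_reply" (hypothetical labels).
    speaks_protocol: True if a returned payload is valid for the probed
           application protocol (only meaningful for "udp_reply").
    """
    labels = set()
    if reply is None:
        return labels  # no response, nothing can be inferred

    # Any IP packet coming back makes its source network-layer alive.
    # For ICMP errors that source may differ from the probed address.
    labels.add("network-alive")

    if probe == "tcp" and reply in ("tcp_synack", "tcp_rst"):
        # The TCP stack answered with transport-layer semantics.
        labels.add("transport-active")
        if reply == "tcp_synack":
            # The probed port accepts connections.
            labels.add("transport-alive")

    if probe == "udp" and reply == "udp_reply":
        # A payload came back, whether valid or not.
        labels.add("application-active")
        if speaks_protocol:
            labels.add("application-alive")

    return labels


print(sorted(infer_liveness("tcp", "tcp_rst")))
print(sorted(infer_liveness("udp", "icmp_error")))
```

The sketch deliberately has no transport-layer label for UDP, mirroring the statement that UDP has no explicit notion of transport layer liveness.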
3. SCANNING IP LIVENESS
In this section, we discuss our experimental setup, and scanning
methodology and considerations. We focus on intra-scan liveness,
that is how the visible IP population varies when considering differ-
ent probe types and captured responses across the same scan cam-
paign. Characterization of long-term temporal churn (i.e., the birth
and death of responsive IP addresses) and diurnal behaviour, as
well as spatial churn (i.e., difference in responses due to diverse
scan locations) is outside the scope of this work.
Overview. We collect data from simultaneous scans covering 8
protocols performed on September 5, 2017. For validation, we per-
formed the same scans also on 4th and 7th September, ﬁnding con-
sistent results. We target the entire IPv4 address space, less a black-
list covering 14.7% of the total IPv4 address space. The blacklist
includes private and reserved space (covering 14% of IPv4) and
users opting out of measurements (0.7% of IPv4 at the time of mea-
surement). The entire scan takes less than 24 hours to complete and
generates 2.3 TB of data. We select ports that run well-documented
applications, and cover both the server as well as (at least partially)
the client space. We restrict the analysis to eight concurrent scans
for feasibility. We perform one network layer ICMP Echo scan, ﬁve
transport layer TCP Syn scans covering popular ports 22 (SSH), 23
(Telnet), 80 (HTTP), 443 (HTTPS), and 7547 (CPE WAN Man-
agement Protocol, CWMP [3]) and two popular UDP-based appli-
cations, DNS and NTP.
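As a quick sanity check on the size of the probed address space, the figures quoted above can be combined as follows. This is a back-of-the-envelope calculation, not part of the released analysis; the variable names are illustrative.

```python
TOTAL_IPV4 = 2 ** 32            # 4,294,967,296 addresses
RESERVED_FRACTION = 0.14        # private and reserved space (from the text)
OPT_OUT_FRACTION = 0.007        # opt-out requests (from the text)

blacklist_fraction = RESERVED_FRACTION + OPT_OUT_FRACTION
probed = TOTAL_IPV4 * (1 - blacklist_fraction)

print(f"blacklisted share: {blacklist_fraction:.1%}")
print(f"probed addresses:  {probed / 1e9:.2f}B")
```

The result, about 3.66 billion addresses, is in line with the roughly 3.6B probed addresses reported in Section 4.1.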
Tools. We use ZMap [40] for the network, transport, and UDP-
based application layer scans. ZMap uses raw sockets, crafting
ICMP Echo, TCP Syn and UDP packets embedded directly into
Ethernet frames. In contrast to earlier studies using ZMap, we
tightly synchronize simultaneous scans of several protocols and
probe each IP address for all protocols within a short time window,
minimizing the effect of temporal churn. Moreover, we customize
ZMap so as to capture any reply to our probes including negative
responses such as ICMP errors and TCP Rsts. We use SiLK [38] for
scan data analysis. We convert all IP sets of interest to the SiLK IPset
data structure, which stores IPs in a compressed binary tree. We mainly
use the relevant SiLK tools (such as rwsettool) to perform fast set
operations on this data.
Cross-protocol scans. Internet liveness legitimately changes over
time due to temporal churn, for example caused by hosts going up
and down and dynamic IP address assignment [32]. To minimize
the effect of temporal churn, we probe each IP address for all of
our protocols within a short time window. To do so, we partition
the IPv4 space into /3 blocks, and conﬁgure parallel scans to probe
IPs in the same order by using the same seed value for the ZMap
address generator within each block. This block-scanning strat-
egy provides an opportunity to synchronize at the start of each new
block. We quantify the resulting lag by recording the timestamp of
every millionth packet sent by ZMap to measure the maximum time
between different protocol probes sent to the same target. While it
can be up to 25 minutes, over 80% of probes are sent to the target
within a 10 minute window.
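The idea of block-wise scanning with a shared seed can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: it shuffles a small list with Python's `random.Random`, which is not how ZMap's address generator works, and the helper names `split_blocks` and `block_order` are hypothetical. It only shows that the same seed yields the same probing order for every protocol within a block.

```python
import ipaddress
import random


def split_blocks(prefix, new_prefix):
    """Partition a network into equally sized sub-blocks."""
    network = ipaddress.ip_network(prefix)
    return [str(block) for block in network.subnets(new_prefix=new_prefix)]


def block_order(block, seed):
    """Return all addresses of a block in a seed-determined order."""
    addresses = [int(address) for address in ipaddress.ip_network(block)]
    random.Random(seed).shuffle(addresses)  # same seed, same permutation
    return addresses


# The eight /3 blocks of the IPv4 space.
print(split_blocks("0.0.0.0/0", 3))

# Toy block (a /28) so that the example runs instantly.
icmp_order = block_order("192.0.2.0/28", seed=42)
tcp80_order = block_order("192.0.2.0/28", seed=42)
print(icmp_order == tcp80_order)
```

Because every scan restarts at each block boundary, any drift between scans is bounded by the time it takes to finish one block rather than the whole address space.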
Reply capture completeness. To provide a comprehensive picture
of liveness across different layers, we must capture all replies to our
probe packets, not just those expected for a successful response.
However, by default ZMap does not record ICMP error messages
for TCP scans, and only records ICMP PortUnreachable responses
for UDP. We modiﬁed ZMap to capture all ICMP error messages
in response to our probes. We link these error responses back to the
probe that generated them by checking the header of the original
packet, included in the payload of the error-response packets. Since
stock ZMap only records TCP Rst packets with the Ack bit set
(as others fail ZMap’s validation checks for a deterministic Ack
number), we also modiﬁed it to record all TCP Rst packets.
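Linking an ICMP error back to its probe relies on the fact that ICMP error messages quote the IP header of the offending packet followed by at least the first 8 bytes of its payload, which for TCP and UDP contain the port numbers. The following Python sketch illustrates that parsing step under simplifying assumptions (IPv4 only, input starting at the ICMP header). It is not the modified ZMap code, and `quoted_probe` is a hypothetical name.

```python
import socket
import struct

ICMP_HEADER_LEN = 8  # type, code, checksum, 4 further bytes


def quoted_probe(icmp_message):
    """Extract the probe quoted inside an ICMP error message.

    icmp_message: bytes starting at the ICMP header.
    Returns a dict, or None if no IPv4 header is quoted.
    """
    quoted = icmp_message[ICMP_HEADER_LEN:]
    if len(quoted) < 20 or quoted[0] >> 4 != 4:
        return None
    ihl = (quoted[0] & 0x0F) * 4  # quoted IP header length in bytes
    if ihl < 20 or len(quoted) < ihl + 8:
        return None
    proto = quoted[9]
    info = {
        "icmp_type": icmp_message[0],
        "icmp_code": icmp_message[1],
        "proto": proto,
        "probe_src": socket.inet_ntoa(quoted[12:16]),
        "probe_dst": socket.inet_ntoa(quoted[16:20]),  # the probed IP
    }
    if proto in (6, 17):  # TCP or UDP: ports are the first four bytes
        sport, dport = struct.unpack("!HH", quoted[ihl:ihl + 4])
        info["sport"], info["dport"] = sport, dport
    return info


# Usage: a hand-built Port Unreachable (type 3, code 3) quoting a UDP probe.
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                        socket.inet_aton("198.51.100.7"),
                        socket.inet_aton("192.0.2.1"))
udp_header = struct.pack("!HHHH", 40000, 53, 8, 0)
message = struct.pack("!BBHI", 3, 3, 0, 0) + ip_header + udp_header
print(quoted_probe(message))
```

The quoted destination address identifies the probed target even when the error was sent by a different device, such as a router on the path.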
Packet loss mitigation. Executing multiple concurrent scans on
shared hardware increases the risk of packet loss, at the host and
due to transient network problems. To minimize in-network loss,
we probe redundantly. Since network losses often occur in bursts,
we modify ZMap to perform delayed retransmissions. We make
ZMap store each IP address it scans for the ﬁrst time in a queue of
size N. When the queue is full, ZMap begins de-queuing and re-
transmitting, interleaving with new probes. The delay between the
original and retransmitted probes depends on the size of the queue
and ZMap’s packet-sending rate. We use N= 1M and a scan rate
of 100kpps, giving a delay of 20 seconds. Our probe redundancy
increases the population of active IP addresses (i.e., responsive to
ICMP Echo probes) by 2.2%. We determine this percentage by run-
ning two ZMap instances (with no retransmission) in parallel, with
a delay of 1 minute between them. To calculate the retransmission
hit-rate, we aggregate responsive IPs across both scans, while we
only consider the ﬁrst scan to determine the hit-rate of single-packet
ZMap and adjust host buffer sizes to avoid local packet loss when
running multiple ZMap instances on the same host. We inspected
scan monitoring logs to conﬁrm that no scans experienced drops at
the NIC or in pcap.
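The delayed retransmission scheme can be sketched with a FIFO queue. The Python code below is an illustration of the described behaviour, not the modified ZMap implementation; `probe_schedule` and `retransmission_delay` are hypothetical names. The factor of two in the delay estimate is an assumption: it treats the sending rate as shared evenly between new and repeated probes, which reproduces the 20 seconds quoted above for N = 1M and 100kpps.

```python
from collections import deque


def probe_schedule(targets, queue_size):
    """Yield (target, is_retransmission) pairs.

    Each target is sent once, parked in a FIFO queue, and sent again
    once the queue has filled up, interleaved with new probes.
    """
    pending = deque()
    for target in targets:
        if len(pending) == queue_size:
            yield pending.popleft(), True   # delayed retransmission
        yield target, False                 # first transmission
        pending.append(target)
    while pending:                          # drain the queue at the end
        yield pending.popleft(), True


def retransmission_delay(queue_size, packets_per_second):
    """Approximate delay, assuming new and repeated probes alternate."""
    return 2 * queue_size / packets_per_second


print(list(probe_schedule("abcd", queue_size=2)))
print(retransmission_delay(1_000_000, 100_000))
```

Spacing the two probes to a target by many seconds, rather than sending them back to back, makes it less likely that both fall into the same loss burst.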
4. CHARACTERIZING IP LIVENESS
We next analyze our scan data to answer a number of practical
questions on how the probe type and the corresponding responses
affect the view of liveness at the network and transport layers.
4.1 Network Layer Liveness
Overall, our scans recorded 487M network alive IPs (IPall) out
of 3.6B probed.
What is the coverage of different probe types?
Reachability, performance, and topology studies often employ
ICMP Echo and traceroute to scan the Internet for network alive-
ness. Here, we investigate the effect of the probe type on the mea-
sured network alive population: Figure 2a shows the coverage of
IPall over different scans by protocol. Overlapping IPs poten-
tially respond over multiple probe types while unique IPs respond
to only one probe type. As others have found [16, 21, 22, 4], we
see that ICMP Echo probes are most effective in discovering net-
work active IPs, revealing 79% of IPall, followed by TCP probes.
UDP probes, however, offer a much narrower view. Furthermore,
we find that 16% of IPall can be discovered exclusively via
TCP, and a small but significant 2% can be discovered only via
UDP probes. The high percentage of exclusive coverage from TCP
is somewhat surprising, suggesting widespread ﬁltering/ﬁrewalling
of ICMP trafﬁc within networks and at target hosts. A number of
studies measure network aliveness at the granularity of /24 address
blocks [13, 6, 36, 18]. Figure 2b shows the aliveness breakdown
for /24 blocks, where we find that the effect of the probe type is
much less pronounced. The set of /24 blocks discovered by
individual probes is far more uniform in its coverage of /24all (the set
of all discovered /24s). Surprisingly, our TCP scans show the
highest coverage, discovering some 5M active /24 blocks, slightly more
(3%) than ICMP Echo.
Figure 2: Network layer aliveness inferred by scan types. (a) Network layer alive IP addresses. (b) Network layer alive /24 blocks. Both panels show ICMP, TCP (all), UDP (all), and UNION, split into unique and overlapping.
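The notions of unique and overlapping coverage, and the aggregation to /24 blocks, amount to a handful of set operations. The sketch below uses plain Python sets on toy data rather than the SiLK tooling used for the actual analysis; the function names and the example addresses are illustrative assumptions.

```python
import ipaddress


def coverage(responders):
    """Per probe type: total, unique and overlapping responder counts.

    responders: dict mapping probe type to a set of responsive IPs.
    """
    union = set().union(*responders.values())
    stats = {}
    for name, ips in responders.items():
        others = set().union(*(s for n, s in responders.items() if n != name))
        unique = ips - others           # seen by this probe type only
        stats[name] = {
            "total": len(ips),
            "unique": len(unique),
            "overlapping": len(ips) - len(unique),
            "share_of_union": len(ips) / len(union) if union else 0.0,
        }
    return union, stats


def to_slash24(ips):
    """Map a set of IPv4 addresses to the /24 blocks that contain them."""
    return {str(ipaddress.ip_network(f"{ip}/24", strict=False)) for ip in ips}


responders = {
    "icmp": {"192.0.2.1", "192.0.2.2", "198.51.100.1"},
    "tcp": {"192.0.2.2", "192.0.2.3"},
    "udp": {"198.51.100.1"},
}
union, stats = coverage(responders)
print(len(union), stats)
print({name: sorted(to_slash24(ips)) for name, ips in responders.items()})
```

In the toy data, the TCP-only address 192.0.2.3 adds a unique IP but no new /24 block, which is the kind of effect that makes per-block coverage more uniform across probe types.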
What is the coverage of different probe responses?
A probe can trigger multiple types of responses, for example
TCP Syn can trigger TCP SynAck, TCP Rst, or ICMP Error re-
sponses. Interpretation of network aliveness depends on what re-
sponses are captured. In the previous analysis, we aggregated all
replies per scan, for example an ICMP response to a TCP Syn
probe is treated equivalently to a SynAck response. Figure 3a
decomposes scan replies to characterize their contribution to the
overall scan coverage of IPall. Replies to ICMP Echo probes are
dominated by ICMP EchoReply as we would expect. However,
ICMP Error responses comprise a sizable portion of TCP, and are
the dominant means of inferring network aliveness via UDP. We
ﬁnd that 2.3% of IPall are discoverable only through ICMP Error
responses, with ICMP probes lighting up 20% of such IPs, TCP
probes some 76%, and UDP probes 35%. This might be due to
routers and middleboxes that are conﬁgured to ignore direct probes
but indirectly reveal activity via ICMP Error packets [15], as well
as due to ﬁltering and ﬁrewalling in networks and end hosts.
What do ICMP error responses reveal?
ICMP Error messages, even though often neglected, not only in-
crease the visible population of network alive IP addresses, but can
also reveal characteristics of the target host and network. In Fig-
ure 3b, we break down ICMP Error messages into four categories:
(i) ICMP PortUnreachable is a type-3 (Destination unreachable)
message typically generated by end hosts when a port is not active,
(ii) ICMP HostUnreachable is a type-3 message sent by gateway
devices (e.g., routers) when the host is unreachable, (iii) ICMP
OtherUnreachable represents all other type-3 messages sent by
gateway devices when the destination is unreachable (e.g., protocol
unreachable [27]), and (iv) ICMP Other represents the remaining
three ICMP Error packets (ICMP TimeExceeded, ICMP Redirect,
and ICMP SourceQuench) that we observed in our data. We consider
the source IP of any of the above packets as network alive
even if it is generated by an IP other than the one we probed.
Figure 3: Breakdown of responses to scan types. (a) Breakdown of responses to scan types (ICMP scan, TCP scans, UDP scans). (b) Breakdown of ICMP Error responses to scan types, as a fraction of ICMP errors (port unreach, host unreach, other unreach, other, multiple).
Our TCP and UDP scans generate the majority of ICMP Port-
Unreachable messages, typically generated directly by the end host.
Indeed, it is expected behavior (RFC 1122) that hosts generate
such messages if no service is available on a given port number.
However, our scans also resulted in a large number of ICMP
Error messages that were generated by intermediate devices on the
path towards the target. The majority of such messages are indica-
tive of network misconﬁgurations, or ﬁrewalling. The latter is very
prominent: among the ICMP OtherUnreachable messages for TCP
and UDP, we ﬁnd that code 13 “Communication administratively
prohibited” dominates (representing about 80% of such messages),
hinting towards either routers or gateways (e.g., Carrier-Grade NAT
deployments [33, 37]) on the path towards the target. Another 15%
of ICMP OtherUnreachable messages correspond to code 0 “Des-
tination network unreachable” and code 10 “Host administratively
prohibited”. In future work, we plan to inspect ICMP messages
more closely to reason about ﬁrewalling.
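The four categories above map directly onto ICMP type and code values. The following Python sketch shows one way to express that mapping. The function name and category labels are hypothetical, and the numeric values are the standard ICMP assignments (type 3 Destination Unreachable with code 1 for host and code 3 for port, type 4 Source Quench, type 5 Redirect, type 11 Time Exceeded).

```python
from collections import Counter


def icmp_error_category(icmp_type, icmp_code):
    """Map an ICMP type/code pair to one of the four error categories."""
    if icmp_type == 3:                      # Destination Unreachable
        if icmp_code == 3:
            return "port_unreachable"
        if icmp_code == 1:
            return "host_unreachable"
        return "other_unreachable"          # e.g., codes 0, 2, 10, 13
    if icmp_type in (4, 5, 11):             # SourceQuench, Redirect, TimeExceeded
        return "other"
    return None                             # not an ICMP error considered here


observed = [(3, 3), (3, 13), (3, 1), (11, 0), (3, 13), (0, 0)]
tally = Counter(icmp_error_category(t, c) for t, c in observed)
print(tally)
```

Code 13 ("Communication administratively prohibited") falls into the other_unreachable bucket here, matching the grouping used in the text.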
4.2 Transport Layer Liveness
Recall that we measure transport layer liveness by conducting
TCP Syn scans on five different ports. In total, we find 262M
transport active IPs (i.e., those responding to TCP Syn with a
TCP Rst or TCP SynAck) representing 53.8% of IPall.
Figure 4: Transport layer liveness. (a) TCP stack completeness/consistency: number of responsive TCP ports per IP address (1 to 5), split into only SynAck, both, and only Rst. (b) Breakdown of transport layer responses (SynAck, Rst, both) for HTTP, SSH, HTTPS, CWMP, and Telnet.
How does the probed port affect the responsive population?
If hosts responded consistently across TCP ports, we would ex-
pect to see the same number of transport active IPs across all
ﬁve scans since an IP would respond with either TCP SynAck or
TCP Rst for each probe, in accordance with RFC standards. We
ﬁnd, however, that the number of active hosts varies vastly when
probed on different port numbers. Figure 4a breaks down trans-
port active IPs into 5 classes: IPs that responded only on one probed
port, IPs that responded on exactly 2 ports, and so forth. We also
show whether the responses were TCP SynAck, TCP Rst, or both,
per class. Only 24% of active hosts respond to probe packets on
all ﬁve ports. This, in turn, shows that the vast majority of hosts
selectively suppress responses for particular application protocols,
due to ﬁrewalling and/or ﬁltering. Their visibility or non-visibility
in active scanning campaigns heavily depends on the choice of the
probed port numbers. Following up on this observation, we next
look at the coverage of the 262M TCP active IP address population
by protocol and response type (Figure 4b). HTTPS is the most ac-
tive port number, with 180M IPs, surprisingly followed closely by
CWMP with 159M IPs. We ﬁnd that the HTTP port is surprisingly
less active than the HTTPS port, and the Telnet port shows the least
activity of all the probed protocols. Each of the probed ports con-
tributes a unique set of otherwise unresponsive hosts: some 11.5%
of all TCP activity can exclusively be found by probing the CWMP
port. SSH, HTTP, and HTTPS provide unique coverage of 3–6%
of active IPs, while the exclusive coverage of Telnet is low (0.8%).
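The per-IP breakdown by number of responsive ports and response kind can be computed with a simple grouping step. The Python sketch below works on toy tuples rather than the real scan output. The record layout `(ip, port, reply)` and the function name `ports_per_ip` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ports_per_ip(responses):
    """Histogram over (number of responsive ports, kind of replies).

    responses: iterable of (ip, port, reply), reply in {"synack", "rst"}.
    """
    per_ip = defaultdict(dict)
    for ip, port, reply in responses:
        per_ip[ip][port] = reply

    histogram = Counter()
    for ports in per_ip.values():
        kinds = set(ports.values())
        kind = "both" if len(kinds) == 2 else "only " + kinds.pop()
        histogram[(len(ports), kind)] += 1
    return histogram


responses = [
    ("a", 22, "rst"), ("a", 80, "synack"),
    ("b", 443, "synack"),
    ("c", 22, "rst"), ("c", 23, "rst"),
]
print(ports_per_ip(responses))
```

Hosts counted under a single responsive port are the ones a scan would miss entirely if that port were not among those probed.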
What is the coverage by probe response type?
As introduced in our taxonomy (§2), we make a distinction be-
tween transport layer activity and aliveness per RFC 793: TCP
stacks should respond to TCP Syn probes with a TCP SynAck if a
service is listening on the probed port, or TCP Rst otherwise [28].
We term the subset of the transport active population that responds
with a TCP SynAck as ‘TCP alive’, indicating a service is running
on that port. Figure 4b shows that except for HTTP, for a given
protocol, the TCP alive population is vastly smaller than the TCP
active population on that port. Hence, negative replies (TCP Rst)
are crucial for capturing the population of TCP active hosts com-
prehensively. We also ﬁnd surprising results regarding the TCP
alive (i.e., replying with a TCP SynAck) population: the size of the
CWMP alive population is surprisingly large and, as a point of
comparison, greater than the SSH alive population. CWMP
provides means for remote management of end-user devices such
as modems, routers, gateways, set-top boxes, and VoIP-phones [7].
A possible explanation could be related to widespread distribution
of CWMP-speaking CPE devices by ISPs.
How do fabricated responses affect the measured population?
One consideration in enumerating the TCP alive population of
any given protocol is network tarpits: IPs masquerading as fake
hosts, responding positively with a TCP SynAck to all TCP Syn
probes [1]. We discovered 1.9M transport alive IPs that appear in
all TCP scans. To conﬁrm that these are tarpits, we scanned these
IPs on a random high port six days after the original scan—89%
responded positively, strengthening our belief that these are tarpits.
(Further analysis of the identiﬁed potential tarpits would require
studying the application-layer behaviour—or absence thereof—of
the concerned hosts, which we will undertake in future work.) If,
as is common practice, transport alive IPs are taken as a proxy for
the service population (e.g., IPs that respond to TCP Syn probe on
port 80 with SynAck represent Web servers), then the 1.9M tarpit
IPs inﬂate HTTP, HTTPS, and CWMP footprints by 3–4% of their
original size, SSH by 10% and Telnet as high as 23%. To mitigate
bias due to tarpits in studies conducting transport-layer measure-
ments from which to make application-layer inferences, simultane-
ously probing a high random port number for liveness can aid in
identifying such instances.
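The tarpit heuristic described above, and its effect on per-service footprints, can be sketched with set operations. The Python code below is illustrative only and uses toy data. The function names are hypothetical, and the inflation is computed relative to the footprint without tarpits, which is one possible reading of "original size".

```python
def tarpit_candidates(synack_by_port, high_port_synack=None):
    """IPs that answered with SynAck on every probed port.

    high_port_synack: optional set of IPs that also answered SynAck on
    a random high port; if given, candidates are restricted to it.
    """
    sets = list(synack_by_port.values())
    candidates = set.intersection(*sets) if sets else set()
    if high_port_synack is not None:
        candidates &= high_port_synack
    return candidates


def inflation(synack_ips, tarpits):
    """Share by which tarpits inflate a footprint, relative to the rest."""
    genuine = synack_ips - tarpits
    if not genuine:
        return float("inf")
    return len(synack_ips & tarpits) / len(genuine)


synack_by_port = {
    22: {"t1", "a"},
    80: {"t1", "b", "c"},
    443: {"t1", "c"},
}
tarpits = tarpit_candidates(synack_by_port)
print(tarpits)
print({port: inflation(ips, tarpits) for port, ips in synack_by_port.items()})
```

The same number of tarpit IPs inflates a small footprint far more than a large one, consistent with Telnet being affected most in the figures reported above.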
4.3 Cross-protocol Liveness
In this section, we investigate what fraction of the host popula-
tion that responds to a certain probe/protocol can also be captured
when probed for different protocols. Understanding these interde-
pendencies is vital for designing multi-stage scanning campaigns,
as well as for understanding consistency in ﬁltering behavior across
protocols. Figure 5 shows the conditional probabilities for activity
(which includes both positive and negative responses) of our probed
TCP- and UDP-based protocols. For ICMP, we consider network-
layer aliveness (i.e., IPs that respond with ICMP EchoReply). We
make several observations: the bottom-most row shows that a sig-
niﬁcant fraction of transport active hosts (26% on average for TCP
services and 12% for UDP) cannot be discovered via ICMP. This
is an important consideration, given that it is common practice to
use the subset of ICMP-alive IP addresses for further scans, e.g.,
to measure service availability. Correlation across TCP and UDP
protocols is generally lower when contrasted to protocols within
each family. Secondly, the TCP and UDP blocks indicate varying
degrees of correlation in ﬁltering behavior across services, when
seen pair-wise. On one hand, we ﬁnd consistent ﬁltering practices:
for example, a Telnet-active host is very likely to elicit a response
from both SSH and HTTPS. Put another way, if a given host is active
for Telnet, then with high probability (≥ 0.9), it is active per
SSH and HTTPS. On the other hand, for CWMP only 56% of active
hosts respond to HTTP probes, indicating an underlying filtering
pattern of the CWMP-active population. We plan to investigate
cross-protocol filtering practices in more depth in future work.
Figure 5: Conditional activity per probe type. (For ICMP, we consider
network-layer aliveness.)
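A conditional activity matrix of this kind can be derived from per-protocol sets of active IPs. The following Python sketch computes P(active on B | active on A) for every ordered pair on toy data. The function name and the example sets are assumptions and do not reproduce the measured values.

```python
def conditional_activity(active):
    """Return {(a, b): P(active on b | active on a)} for all pairs.

    active: dict mapping protocol name to the set of IPs active on it.
    """
    matrix = {}
    for a, ips_a in active.items():
        for b, ips_b in active.items():
            if ips_a:
                matrix[(a, b)] = len(ips_a & ips_b) / len(ips_a)
            else:
                matrix[(a, b)] = float("nan")  # undefined without any host
    return matrix


active = {
    "telnet": {1, 2},
    "ssh": {1, 2, 3, 4},
    "icmp": {1, 3, 5},
}
matrix = conditional_activity(active)
print(matrix[("telnet", "ssh")])   # share of Telnet-active hosts active on SSH
print(matrix[("ssh", "telnet")])   # the reverse direction differs
```

The matrix is not symmetric, so the direction of conditioning matters when deciding which scan to use as input for another.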
5. RELATED WORK
Measurement of Internet liveness has received considerable at-
tention in contexts including network topology, performance and
reachability [16, 30, 36, 35, 6], outages [31, 30], service char-
acterization [25, 20, 29], security vulnerability tracking [11, 17,
23], and service discrimination [19]. Measurement studies relied
on passive vantage points [8, 9, 32], active probing [21, 12, 16,
24], and both in combination [2]. We focus here on inference of
liveness via active probing. Internet-wide scanning has historically
taken signiﬁcant time and resources. Early work limited its scope to
BGP preﬁxes [39, 21, 36], though subsequent work demonstrated
the inadequacy of control-plane data to measure data-plane con-
ditions [4, 34]. Another way to limit the scope of active prob-
ing is to use “hitlists” to target active hosts. Early hitlists com-
prised IP addresses selected from various passive sources [26, 18].
get IP addresses offering greater coverage and higher likelihood
of liveness [6, 14], and using Internet censuses to derive respon-
sive, complete, and stable hitlists [13]. Heidemann et al. conducted
the ﬁrst full IPv4 scan over the course of 2–3 months in 2007 [16].
a RouteViews BGP dump and the local border router, in approxi-
mately 24 hours [21]. Recently, ZMap [12] and its application layer
counterpart ZGrab [10] dramatically reduced the time to complete
a full IPv4 scan to a few hours. These tools operate on commodity
hardware, and data from regular Internet scans using these tools is
made publicly available at scans.io. These data enabled a large
number of follow-on Internet-wide security-modeling and perfor-
mance studies. All these studies discuss practical considerations in
active probing, such as temporal churn, the types of probes, ﬁre-
walls, and the scanning tool itself triggering blocking.
Our work adds to this rich body of literature by systematically
examining how liveness manifests over different protocols and
across layers with active probing, the factors affecting these views,
and how they are correlated.
6. CONCLUSION
Liveness, i.e., whether or not a target IP address responds to a probe
packet, is a nuanced concept without a simple yes/no answer.
Responsiveness depends directly on the probe type, on the configuration
of the targeted host, and on firewalling and filtering behaviors
at the edge of or within networks. The interpretation of responses
(positive, negative, absent) in turn allows us to draw conclusions
about liveness at different layers. Toward the goal of systematically
understanding these issues, we presented a taxonomy of
liveness that encapsulates the inherent dependencies between
different protocols and layers. We developed and evaluated a
methodology for performing concurrent Internet-wide scans across
multiple protocols, comprehensively capturing positive, negative, and
cross-layer responses to our probes. We find that responsive host
populations are highly sensitive to the choice of probe: while ICMP
discovers the highest raw number of IPs, about a fifth of the total
population of responsive hosts is discovered exclusively by our TCP
and UDP measurements. Collecting ICMP error messages for TCP and UDP
scans significantly improves coverage and provides new opportunities
to interpret scan results. At the transport layer, our concurrent
measurements reveal that the majority of hosts exhibit inconsistent
behavior when probed on different ports and that capturing
negative responses significantly improves scanning completeness. Our
study of cross-protocol liveness shows that, while responsiveness
across protocols is correlated, using the result of one scan to
bootstrap another should be done with care, since every probe type
introduces an individual bias.
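As an illustration of how positive, negative, cross-layer, and absent responses can be told apart, the following minimal Python sketch classifies a single probe result. It is not the paper's implementation: the names `Response`, `Verdict`, `Liveness`, and `classify` are hypothetical, and the rule that only a reply sourced from the target address itself is taken as evidence of network-layer liveness is a simplifying assumption made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Response(Enum):
    """Hypothetical labels for what came back after one probe."""
    SYN_ACK = "tcp syn-ack"
    RST = "tcp rst"
    UDP_DATA = "udp payload"
    ECHO_REPLY = "icmp echo reply"
    ICMP_ERROR = "icmp error"
    NONE = "no response"


class Verdict(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    CROSS_LAYER = "cross-layer"
    ABSENT = "absent"


@dataclass(frozen=True)
class Liveness:
    verdict: Verdict
    # True: the target itself answered. None: this probe cannot tell.
    network_live: Optional[bool]


def classify(response: Response, target_ip: str,
             responder_ip: Optional[str] = None) -> Liveness:
    """Map one probe outcome to a verdict (illustrative rules only)."""
    if response is Response.NONE or responder_ip is None:
        # Silence is ambiguous: filtering, packet loss, or an unused address.
        return Liveness(Verdict.ABSENT, None)
    # Assumption: only a reply from the target address shows the host is up.
    network_live = True if responder_ip == target_ip else None
    if response is Response.ICMP_ERROR:
        # An ICMP error answering a TCP/UDP probe is a cross-layer response;
        # it may originate from a router or firewall on the path.
        return Liveness(Verdict.CROSS_LAYER, network_live)
    if response is Response.RST:
        # A TCP RST is a negative response at the transport layer.
        return Liveness(Verdict.NEGATIVE, network_live)
    # SYN-ACK, UDP payload, and ICMP echo reply are positive responses.
    return Liveness(Verdict.POSITIVE, network_live)


if __name__ == "__main__":
    # 192.0.2.0/24 and 198.51.100.0/24 are documentation-only prefixes.
    print(classify(Response.RST, "192.0.2.10", "192.0.2.10"))
    print(classify(Response.ICMP_ERROR, "192.0.2.10", "198.51.100.1"))
```

Keeping the verdict separate from the network-layer inference reflects the point made above: a negative or cross-layer response can still carry liveness information, whereas an absent response carries none.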
In the future, we plan to deepen our understanding of active
scanning along multiple dimensions, looking at: (i) liveness at the
application layer, (ii) how liveness varies over time and IP space, and
(iii) the multivariate probability distributions of transport-layer
liveness. We also plan to explore using existing results and their
correlations to reduce scan traffic.
Acknowledgments
Shehar Bano was supported by the EU H2020 DECODE project
under grant agreement number 732546 and in part by EPSRC grant
EP/N028104/1 ‘Glass Houses’. Steven J. Murdoch and Shehar
Bano (for part of this work) were supported by The Royal Society
[grant number UF110392] and the Engineering and Physical Sciences
Research Council [grant number EP/L003406/1]. Philipp Richter was
supported by the MIT Internet Policy Research Initiative, William
and Flora Hewlett Foundation grant 2014-1601. Richard Mortier
was supported by EPSRC grant EP/K031724/2. Vern Paxson
was supported by NSF grants CNS-1237265 and CNS-1518921.
Thanks to Jonathan Spring for his valuable feedback. Thanks to
the SysAdmins at University College London, especially John
Andrews, for their support.
Source code and data release
The source code of our modifications to ZMap, scripts to run
block-wise scans, and analysis can be found at
https://github.com/sheharbano/scan_liveness (and also
https://doi.org/10.5281/zenodo.1209947). Data created during this
research is available at https://doi.org/10.5281/zenodo.1068899.
7. REFERENCES
[1] Lance Alt, Robert Beverly, and Alberto Dainotti. Uncovering
Network Tarpits with Degreaser. In Proceedings of the 30th
Annual Computer Security Applications Conference, ACSAC
’14, New Orleans, Louisiana, USA, 2014.
[2] Genevieve Bartlett, John Heidemann, and Christos
Papadopoulos. Understanding Passive and Active Service
Discovery. In Proceedings of ACM IMC 2007, San Diego,
California, USA, 2007.
[3] John Blackford and Mike Digdon. CPE WAN Management
Protocol. Technical Report TR-069, Broadband Forum,
November 2013. Issue 1 Amendment 5. CWMP v1.4.
[4] Randy Bush, Olaf Maennel, Matthew Roughan, and Steve
Uhlig. Internet Optometry: Assessing the Broken Glasses in
Internet Reachability. In Proceedings of ACM IMC 2009,
Chicago, Illinois, USA, 2009.
[5] Xue Cai and John Heidemann. Understanding Block-level
Address Usage in the Visible Internet. In Proceedings of
ACM SIGCOMM 2010, New Delhi, India, 2010.
[6] k. claffy, Y. Hyun, K. Keys, M. Fomenkov, and D. Krioukov.
Internet Mapping: from Art to Science. In IEEE DHS
Cybersecurity Applications and Technologies Conference for
Homeland Security (CATCH), pages 205–211, Waltham,
MA, Mar 2009.
[7] TR-069 CPE WAN Management Protocol.
TR-069_Amendment-5.pdf.
[8] A. Dainotti, K. Benson, A. King, k. claffy, M. Kallitsis,
E. Glatz, and X. Dimitropoulos. Estimating Internet address
space usage through passive measurements. ACM CCR,
44(1):42–49, Jan 2014.
[9] A. Dainotti, K. Benson, A. King, B. Huffaker, E. Glatz,
X. Dimitropoulos, P. Richter, A. Finamore, and A. Snoeren.
Lost in Space: Improving Inference of IPv4 Address Space
Utilization. IEEE Journal on Selected Areas in
Communications (JSAC), 34(6):1862–1876, Jun 2016.
[10] Zakir Durumeric, David Adrian, Ariana Mirian, Michael
Bailey, and J. Alex Halderman. A Search Engine Backed by
Internet-Wide Scanning. In Proceedings of the 22nd ACM
Conference on Computer and Communications Security,
October 2015.
[11] Zakir Durumeric, James Kasten, Michael Bailey, and J. Alex
Halderman. Analysis of the HTTPS Certificate Ecosystem.
In Proceedings of ACM IMC 2013, Barcelona, Spain, 2013.
ACM.
[12] Zakir Durumeric, Eric Wustrow, and J. Alex Halderman.
ZMap: Fast Internet-wide Scanning and Its Security
Applications. In Proceedings of the 22nd USENIX
Conference on Security, SEC’13, pages 605–620, Berkeley,
CA, USA, 2013. USENIX Association.
[13] Xun Fan and John Heidemann. Selecting Representative IP
Addresses for Internet Topology Studies. In Proceedings of
ACM IMC 2010, Melbourne, Australia, 2010.
[14] Ramesh Govindan and Hongsuda Tangmunarunkit.
Heuristics for Internet map discovery. In Proceedings of
INFOCOM 2000, Tel Aviv, Israel, 2000.
[15] M. H. Gunes and K. Saraç. Analyzing router responsiveness
to active measurement probes. In Proceedings of PAM 2009,
2009.
[16] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos
Papadopoulos, and Joseph Bannister. Exploring Visible
Internet Hosts through Census and Survey. Technical Report
ISI-TR-2007-640, USC/Information Sciences Institute, May
2007.
[17] Nadia Heninger, Zakir Durumeric, Eric Wustrow, and J. Alex
Halderman. Mining Your Ps and Qs: Detection of
Widespread Weak Keys in Network Devices. In Proceedings
of the 21st USENIX Conference on Security Symposium,
Security’12, Berkeley, CA, USA, 2012.
[18] B. Huffaker, M. Fomenkov, D. Moore, and k. claffy.
Macroscopic analyses of the infrastructure: measurement
and visualization of Internet connectivity and performance.
In PAM 2001, Amsterdam, Netherlands, 2001.
[19] Sheharbano Khattak, David Fifield, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven J.
Murdoch, and Damon McCoy. Do You See What I See?:
Differential Treatment of Anonymous Users. In Proceedings
of NDSS 2016, San Diego, CA, United States, 2016.
[20] Marc Kührer, Thomas Hupperich, Jonas Bushart, Christian
Rossow, and Thorsten Holz. Going Wild: Large-Scale
Classification of Open DNS Resolvers. In Proceedings of
ACM IMC 2015, Tokyo, Japan, 2015.
[21] Derek Leonard and Dmitri Loguinov. Demystifying Service
Discovery: Implementing an Internet-wide Scanner. In
Proceedings of ACM IMC 2010, Melbourne, Australia, 2010.
[22] M. Luckie, Y. Hyun, and B. Huffaker. Traceroute Probe
Method and Forward IP Path Inference. In Proceedings of
ACM IMC 2008, Vouliagmeni, Greece, 2008.
[23] Antonio Nappa, Zhaoyan Xu, Juan Caballero, and Guofei
Gu. CyberProbe: Towards Internet-Scale Active Detection of
Malicious Servers. In Proceedings of NDSS 2014, San
Diego, CA, USA, 2014.
[24] Ramakrishna Padmanabhan, Amogh Dhamdhere, Emile
Aben, kc claffy, and Neil Spring. Reasons Dynamic
Addresses Change. In Proceedings of ACM IMC 2016, Santa
Monica, California, USA, 2016.
[25] Jeffrey Pang, James Hendricks, Aditya Akella, Roberto
De Prisco, Bruce Maggs, and Srinivasan Seshan.
Availability, Usage, and Deployment Characteristics of the
Domain Name System. In Proceedings of ACM IMC 2004,
Taormina, Sicily, Italy, 2004.
[26] Jean-Jacques Pansiot and Dominique Grad. On Routes and
Multicast Trees in the Internet. ACM CCR, 28(1):41–50,
January 1998.
[27] J. Postel. Internet Control Message Protocol. RFC 792,
September 1981. https://tools.ietf.org/html/rfc792.
[28] J. Postel. Transmission Control Protocol. RFC 793,
September 1981. https://tools.ietf.org/html/rfc793.
[29] N. Provos and P. Honeyman. ScanSSH - Scanning the
Internet for SSH Servers. In 16th USENIX Systems
Administration Conference (LISA), New York, NY, USA,
2001.
[30] Lin Quan and John Heidemann. Detecting Internet Outages
with Active Probing (extended). Technical Report
ISI-TR-2011-672, USC/Information Sciences Institute, May
2010.
[31] Lin Quan, John Heidemann, and Yuri Pradkin. When the
Internet Sleeps: Correlating Diurnal Networks With External
Factors (extended). Technical Report ISI-TR-2014-691b,
USC/Information Sciences Institute, May 2014. (updated
August 2014).
[32] Philipp Richter, Georgios Smaragdakis, David Plonka, and
Arthur Berger. Beyond Counting: New Perspectives on the
Active IPv4 Address Space. In Proceedings of ACM IMC
2016, Santa Monica, California, USA, 2016.
[33] Philipp Richter, Florian Wohlfart, Narseo Vallina-Rodriguez,
Mark Allman, Randy Bush, Anja Feldmann, Christian
Kreibich, Nicholas Weaver, and Vern Paxson. A
Multi-perspective Analysis of Carrier-Grade NAT
Deployment. In Proceedings of ACM IMC 2016, Santa
Monica, California, USA, 2016.
[34] Matthew Roughan, Walter Willinger, Olaf Maennel, Debbie
Perouli, and Randy Bush. 10 Lessons from 10 Years of
Measuring and Modeling the Internet’s Autonomous
Systems. IEEE Journal on Selected Areas in
Communications, 29(9):1810–1821, 2011.
[35] Yuval Shavitt and Eran Shir. DIMES: Let the Internet
Measure Itself. ACM CCR, 35(5):71–74, October 2005.
[36] Neil Spring, Ratul Mahajan, and David Wetherall. Measuring
ISP Topologies with Rocketfuel. In Proceedings of ACM
SIGCOMM 2002, New York, NY, USA, 2002.
[37] P. Srisuresh, B. Ford, S. Sivakumar, and S. Guha. NAT
Behavioral Requirements for ICMP. RFC 5508 (Best
Current Practice), April 2009. Updated by RFC 7857.
[38] Mark Thomas, Leigh Metcalf, Jonathan M. Spring, Paul
Krystosek, and Katherine Prevost. SiLK: A tool suite for
unsampled network flow analysis at scale. In IEEE BigData
Congress, pages 184–191, Anchorage, Jul 2014.
[39] Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao,
and Randy Bush. A Measurement Study on the Impact of
Routing Events on End-to-end Internet Path Performance. In
Proceedings of ACM SIGCOMM 2006, Pisa, Italy, 2006.
[40] ZMap. https://github.com/zmap/zmap/.