Journal of Grid Computing manuscript No.
(will be inserted by the editor)
A journey into Bitcoin metadata
Massimo Bartoletti · Bryn Bellomy · Livio Pompianu
Received: date / Accepted: date
Abstract Besides recording transfers of currency, the
Bitcoin blockchain is being used to save metadata — i.e.
arbitrary pieces of data which do not affect transfers of
bitcoins. This can be done by using different techniques,
and for different purposes. For instance, a growing num-
ber of protocols embed metadata in the blockchain to
certify and transfer the ownership of a variety of as-
sets beyond cryptocurrency. A point of debate in the
Bitcoin community is whether metadata negatively affect
the effectiveness of Bitcoin with respect to its
primary function. This paper is a systematic analysis
of the usage of Bitcoin metadata over the years. We
discuss all the known techniques to embed metadata in
the Bitcoin blockchain; we then extract metadata, and
analyse them from different angles.
Keywords Bitcoin, blockchain, measurements
1 Introduction
The last few years have witnessed an increasing interest
in Bitcoin, the first and most widespread decentralised
cryptocurrency [40,48]. Bitcoin records currency trans-
actions in a public, append-only data structure — the
so-called blockchain. The blockchain is maintained by
a peer-to-peer network, following a consensus protocol
which ensures that tampering with past transactions
is computationally infeasible [42].

M. Bartoletti
Università degli Studi di Cagliari, Dipartimento di Matematica
e Informatica, Via Ospedale, 72, 09124 Cagliari, Italy
E-mail: bart@unica.it
B. Bellomy
ConsenSys, 49 Bogart St., Brooklyn, NY 11206
E-mail: bryn.bellomy@consensys.net
L. Pompianu
Università degli Studi di Cagliari, Dipartimento di Matematica
e Informatica
E-mail: livio.pompianu@unica.it

The immutability of the Bitcoin blockchain, together
with its openness, has inspired the development of
new applications that go beyond transfers of currency:
they certify the existence of documents [16,21,25],
track the ownership of assets [12,18,19], run smart
contracts [14,29,36] or perform other useful tasks. These
applications exploit Bitcoin transactions to “piggy-back”
their metadata, i.e., pieces of data which are not inherent
to currency transfers, but are needed to implement
their application logic.
A debate about scalability has been taking place in
the Bitcoin community over the last few years [1,22,23].
In particular, users argue over whether the blockchain
should allow for storing these spurious data. Besides
quantifying the impact of metadata on the effectiveness
of Bitcoin, many other aspects of the usage of
metadata are still worth investigating.
This paper is a systematic survey on the usage of
metadata in Bitcoin, based on the analysis of the first
480,000 blocks in the blockchain, i.e. 245M transactions
collected until 2017/08/10. Our main contributions
can be summarised as follows:
1. We survey the existing techniques for embedding
metadata in the Bitcoin blockchain, identifying 11
distinct ones. We compare these techniques, dis-
cussing their side effects, and we compare their evo-
lution over time, quantifying the amount of meta-
data embedded through them.
2. We search the blockchain for metadata, and we parse
them to infer the intended usage. To this purpose,
we consider both metadata as single units of infor-
mation, and as aggregates of pieces scattered through
the blockchain (e.g., images). Overall, we recognise 7
different types of metadata. We quantify the amount
and size of metadata by type.
3. We identify 45 distinct protocols which are used by
applications to embed metadata in the blockchain.
We classify them according to their application do-
main, and we measure the amount of metadata they
produced. We analyse the correlation between em-
bedding techniques, metadata types and protocols.
4. We compare the size of the extracted metadata with
the overall size of the blockchain, and we investigate
peaks of metadata that occurred over the years.
5. We make available a public dataset of metadata ex-
tracted from the blockchain [35], as well as the tools
we have developed for our analyses [33,38].
Structure of the paper. Section 2 is a minimalistic introduction
to Bitcoin, containing all the technical background
needed in the subsequent sections. In Section 3
we show and compare the techniques to embed metadata
in the blockchain, presenting several statistics on
their usage. In Section 4 we illustrate some techniques
to parse metadata and reconstruct the original content;
then, we categorize and quantify the metadata
extracted from the blockchain. In Section 5 we investigate
the protocols which embed metadata, we classify
them, and quantify the volume of metadata produced
by each protocol. Section 6 discusses how the usage of
metadata impacts the Bitcoin blockchain. Finally,
Section 7 discusses some related work, and Section 8
draws some conclusions.
2 Background on Bitcoin
Bitcoin [48] is a decentralized infrastructure to exchange
virtual currency — the bitcoins. Users interact with Bit-
coin through addresses, by publishing transactions that
transfer bitcoins from one address to another. The log
of all transactions is recorded on the blockchain, a pub-
lic and immutable data structure maintained by the
nodes of the Bitcoin network. A subset of nodes, called
miners, gather the transactions sent by users, aggregate
them in blocks, and try to append these blocks to the
blockchain. A consensus protocol based on moderately-
hard “proof-of-work” puzzles is used to resolve conflicts
that may happen when different miners concurrently
try to extend the blockchain, or when some miner at-
tempts to append a block with invalid transactions. Ide-
ally, the blockchain is globally agreed upon, and free
from invalid transactions, unless the adversary controls
the majority of the computational power of the net-
work [31,42,44]. The security of the consensus proto-
col relies on the assumption that miners are rational,
i.e. that following the protocol is more convenient than
trying to attack it. To make this assumption hold, min-
ers receive some economic incentives for performing the
time-consuming computations required by the protocol.
Part of these incentives is given by the fees paid by users
upon each transaction.
2.1 Transactions
To illustrate how transfers of bitcoins work, we consider
two transactions T0 and T1, which we represent
graphically as follows:

T0
  previous transaction: ···
  in-script: ···
  value: v0
  out-script(T, σ): ver_k(T, σ)

T1
  previous transaction: T0
  in-script: sig_k()
  value: v1
  out-script: ···

The transaction T0 contains v0 Satoshis (1 bitcoin
= 10^8 Satoshis). A user can redeem this amount by
publishing a transaction (e.g., T1), whose previous
transaction field contains the identifier of T0 (displayed
just as T0 in the figure), and whose in-script
field makes the out-script¹ of T0 evaluate to true.
When this happens, the value of T0 is transferred to
the new transaction T1, and T0 becomes unredeemable.
A subsequent transaction can then redeem T1 likewise.
In the transaction T0 above, out-script checks that
σ is a valid signature (made with the key k) of the
redeeming transaction. We denote with ver_k(T, σ) the
signature verification function, and with sig_k() the signature
of the enclosing transaction, including all the
parts of the transaction except its in-script (since
the in-script contains the signature itself).
Now, assume that T0 is redeemable on the blockchain
when someone tries to append T1. This is possible if
v1 ≤ v0, and the out-script of T0, applied to T1 and
to the signature sig_k(), evaluates to true.
The previous example shows the simple case of trans-
actions with only one input and one output. In general,
transactions have the form displayed in Figure 1. First,
there can be multiple inputs and outputs (denoted with
array notation in the figure): in-counter specifies the
number of inputs, and out-counter that of outputs.
Each input (resp. output) has its own in-script (resp.
out-script). The two fields in-script length and
out-script length denote the size of in-script and
out-script, respectively. Since each output can be redeemed
independently, previous transaction fields
must specify which one they are redeeming (in the figure,
previous out-index). A transaction with multiple
inputs redeems all the (outputs of) transactions in
its previous transaction fields, by providing a suitable
in-script for each of them. The lock time field
specifies the earliest moment in time when the transaction
can appear on the blockchain. The version no
field is currently set to 1. Transaction inputs also contain
a 4-byte field called sequence no. Normally its
value is 0xFFFFFFFF, and it is ignored unless the transaction
lock time is greater than 0 [11].

T
  version no: k
  in-counter: n
  previous transaction[0]: T0
  previous out-index[0]: i0
  in-script length[0]: ···
  in-script[0]: ···
  sequence no[0]: ···
  ⋮
  out-counter: m
  value[0]: v0
  out-script length[0]: ···
  out-script[0]: ···
  ⋮
  lock time: s
Fig. 1: A Bitcoin transaction.

¹ The fields in-script and out-script are called, respectively,
scriptSig and scriptPubKey in the Bitcoin wiki.

In order for T to be appended to the blockchain, a
few conditions must be satisfied: for instance, for each
k < n, the out-script of the i_k-th output of the transaction
T_k must evaluate to true when fed with the value
in in-script[k]; further, none of the inputs must have
been redeemed yet, and the sum of the values of all the
redeemed outputs must be greater than or equal to the
sum of the values of all outputs in T (see [30] for a
formal specification).
The Unspent Transaction Output set (in short, UTXO
set) is the set of redeemable outputs of all transactions
in the blockchain.
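The validity conditions above can be illustrated with a toy model. This is a minimal sketch and not Bitcoin's actual data structures: the names Output, Input, Tx, is_valid and apply_tx are hypothetical, and an out-script is modelled as a Python predicate over a "witness" that stands in for the in-script.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: an out-script is modelled as a predicate over the witness
# supplied by the redeeming input (real scripts are stack programs).
OutScript = Callable[[object], bool]

@dataclass
class Output:
    value: int             # amount in Satoshis
    out_script: OutScript

@dataclass
class Input:
    prev_tx: str           # identifier of the redeemed transaction
    prev_index: int        # which output of that transaction
    witness: object        # stands in for the in-script

@dataclass
class Tx:
    txid: str
    inputs: List[Input]
    outputs: List[Output]

UTXO = Dict[Tuple[str, int], Output]

def is_valid(tx: Tx, utxo: UTXO) -> bool:
    """Check the conditions listed above against the UTXO set."""
    seen = set()
    total_in = 0
    for i in tx.inputs:
        key = (i.prev_tx, i.prev_index)
        if key in seen or key not in utxo:    # redeemed twice or not redeemable
            return False
        seen.add(key)
        if not utxo[key].out_script(i.witness):
            return False
        total_in += utxo[key].value
    return total_in >= sum(o.value for o in tx.outputs)

def apply_tx(tx: Tx, utxo: UTXO) -> None:
    """Append tx: redeemed outputs leave the UTXO set, new outputs enter it."""
    for i in tx.inputs:
        del utxo[(i.prev_tx, i.prev_index)]
    for n, o in enumerate(tx.outputs):
        utxo[(tx.txid, n)] = o
```

In this model, the difference between the redeemed value and the sum of the new outputs plays the role of the fee.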
2.2 Scripts
Bitcoin scripts [10,11] are programs in a stack-based
language featuring a limited set of logic, arithmetic,
and cryptographic operators (but without loops). In the
rest of this section we illustrate the pairs of in-script
and out-script which are considered standard by the
Bitcoin network [39]. In Section 3 we will then show
how these scripts are commonly used for embedding
metadata.
We use the typewriter font for denoting opcodes
(e.g., OP_CHECKSIG), and italic for denoting bitstrings
(e.g., sig). In Bitcoin scripts, these bitstrings are always
preceded by a suitable OP_PUSHDATA opcode, which
pushes the bitstring onto the stack. For the sake of simplicity,
hereafter we omit these OP_PUSHDATA opcodes.
2.2.1 Pay to public key (P2PK)
# pay-to-PubKey (P2PK)
in-script: sig
out-script: pubKey OP_CHECKSIG
This pair of scripts implements the signature verification
function outlined in Section 2.1. The evaluation of
the output script starts with sig on top of the stack,
and then proceeds by pushing also pubKey. The opcode
OP_CHECKSIG performs the signature verification, popping
the two top elements of the stack, and evaluating
to true if the verification succeeds.
2.2.2 Pay to public key hash (P2PKH)
# pay-to-PubKeyHash (P2PKH)
in-script: sig pubKey
out-script: OP_DUP OP_HASH160 pubKeyHash
            OP_EQUALVERIFY OP_CHECKSIG
This pair of scripts performs signature verification, similarly
to the previous pair. The main difference with respect
to P2PK is that the output script now contains
the double hash of a public key (OP_HASH160 applies
SHA-256 and then RIPEMD-160), rather than the public
key itself.
More in detail, the in-script contains a signature
sig and a public key pubKey. The evaluation of the
output script starts with sig and pubKey on the stack.
The opcode OP_DUP duplicates the top element of the
stack (i.e., pubKey), and OP_HASH160 replaces the top
element with its double hash. Then, pubKeyHash (the
double hash of a key) is pushed onto the stack. The opcode
OP_EQUALVERIFY checks if the two top elements
of the stack are equal: if so, they are popped, otherwise
the script fails. Finally, OP_CHECKSIG performs the
signature verification.
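The stack evaluation just described can be traced with a few lines of Python. This is only a sketch under loud assumptions: hash160_standin is NOT the real OP_HASH160 (RIPEMD-160 is not available in every hashlib build, so a truncated double SHA-256 is used instead), and check_sig is a caller-supplied stub rather than ECDSA verification. The function names are hypothetical.

```python
import hashlib

def hash160_standin(data: bytes) -> bytes:
    # ASSUMPTION: stand-in for OP_HASH160 (RIPEMD-160 of SHA-256).
    # A double SHA-256 truncated to 20 bytes is NOT Bitcoin-compatible.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()[:20]

def run_p2pkh(sig, pub_key, pub_key_hash, check_sig) -> bool:
    """Evaluate in-script `sig pubKey` followed by out-script
    `OP_DUP OP_HASH160 pubKeyHash OP_EQUALVERIFY OP_CHECKSIG`."""
    stack = [sig, pub_key]                       # effect of the in-script
    stack.append(stack[-1])                      # OP_DUP
    stack.append(hash160_standin(stack.pop()))   # OP_HASH160
    stack.append(pub_key_hash)                   # push pubKeyHash
    if stack.pop() != stack.pop():               # OP_EQUALVERIFY
        return False
    key, signature = stack.pop(), stack.pop()    # OP_CHECKSIG pops two items
    return bool(check_sig(signature, key))
```

Writing arbitrary bytes in place of pub_key_hash makes the OP_EQUALVERIFY step fail for every known key, which is the property exploited in Section 3.3.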
2.2.3 Multi-signature
# multi-signature
in-script: OP_0 sig1 ... sigM
out-script: M pubKey1 ... pubKeyN N
            OP_CHECKMULTISIG
This pair of scripts performs a multi-signature verification:
the output can be redeemed if the in-script
provides M signatures verified against N public keys,
where M ≤ N. The opcode OP_CHECKMULTISIG tries to
verify the last signature with the last public key. If they
match, it proceeds to verify the previous signature in
the sequence, otherwise it tries to verify the signature
with the previous key. Notably, OP_CHECKMULTISIG uses
each key only once, therefore the order of the signatures
in the in-script matters.
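One way to read the matching rule above is the following loop. It is a sketch of the rule as stated in the text, not the Bitcoin Core implementation; verify(sig, key) is a hypothetical caller-supplied stub for signature checking.

```python
def check_multisig(sigs, pub_keys, verify) -> bool:
    """Match signatures to keys from last to first.

    `verify(sig, key)` is an assumed stub returning True when `sig`
    was made with `key`.
    """
    s, k = len(sigs) - 1, len(pub_keys) - 1
    while s >= 0:
        if k < s:                 # fewer keys left than signatures: fail
            return False
        if verify(sigs[s], pub_keys[k]):
            s -= 1                # matched: move to the previous signature
        k -= 1                    # each key is tried only once
    return True
```

Because keys are consumed in one pass, signatures supplied in an order different from that of their keys are rejected even when each of them is individually valid.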
2.2.4 Pay to script hash (P2SH)
# pay-to-ScriptHash (P2SH)
in-script: v1 ... vN bitstring
out-script: OP_HASH160 hash OP_EQUAL
In this pair, firstly the out-script is evaluated with
bitstring on top of the stack. The script checks that
the hash of bitstring is equal to hash. If so, the script
obtained by interpreting bitstring as a sequence of opcodes
is executed (on the parameters v1 ... vN). Summing up,
the output can be redeemed if the in-script
of the redeeming transaction provides a script whose
hash coincides with the hash contained in out-script,
and whose evaluation (on the parameters v1 ... vN)
yields true.
2.2.5 OP_RETURN
# op_return
out-script: OP_RETURN bitstring
An out-script containing OP_RETURN always evaluates
to false, regardless of the value of the bitstring. Therefore
the corresponding output is unspendable, and it
can be safely removed from the UTXO set.
Currently, standard transactions can have only one
occurrence of OP_RETURN: more precisely, if a transaction
has more than one out-script with OP_RETURN, or
an out-script with more than one OP_RETURN, or an
OP_RETURN with more than one OP_PUSHDATA, then the
transaction is not standard.
3 Embedding metadata in the blockchain
In Sections 3.1 to 3.7 we illustrate various techniques
(as far as we know, all those used in practice) to embed
metadata in the blockchain. In Section 3.8 we show how
to split large pieces of metadata into smaller pieces that
can be distributed among sets of transactions. In Sec-
tion 3.9 we discuss how to commit to specific values
without explicitly writing them in the blockchain. Sec-
tion 3.10 gives some statistics on embedding techniques.
3.1 Value field
Transaction outputs specify the amount of Satoshis to
send through the value field, of 8 bytes size. A first way
to encode a message m in the blockchain is to build a
transaction with an output whose value is the number
that represents m. For instance, the BitcoinTimestamp
protocol (see Table 5) exploits this method for saving
a SHA256 hash. The hash is first split into 16 pieces,
which are then translated into amounts of Satoshis. Finally,
the protocol builds a transaction containing an
output for each amount (e.g. see rows 1-2 of Table 1).
Although users can easily recover the moved funds
(e.g. by specifying their own address as receiver), the
disadvantage of this technique is that it requires owning
at least the amount of Satoshis needed to represent m.
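As an illustration of the idea, a 32-byte SHA256 digest split into 16 pieces gives 2-byte pieces, each of which fits in an output value. The sketch below assumes a big-endian reading of each piece; the exact encoding used by BitcoinTimestamp is not specified here, so the function names and the byte order are assumptions.

```python
import hashlib

def hash_to_values(digest: bytes, pieces: int = 16) -> list:
    """Split a digest into equal pieces and read each one as an amount.

    ASSUMPTION: big-endian byte order; the real protocol may differ.
    """
    if len(digest) % pieces:
        raise ValueError("digest length must be a multiple of pieces")
    size = len(digest) // pieces
    return [int.from_bytes(digest[i:i + size], "big")
            for i in range(0, len(digest), size)]

def values_to_hash(values, size: int = 2) -> bytes:
    """Inverse of hash_to_values for pieces of `size` bytes."""
    return b"".join(v.to_bytes(size, "big") for v in values)

digest = hashlib.sha256(b"example document").digest()
amounts = hash_to_values(digest)       # 16 amounts, each below 65,536
```

Under this encoding the funds needed are bounded by 16 × 65,535 = 1,048,560 Satoshis, which makes the cost mentioned above concrete.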
3.2 Input sequence
Some users exploited the 4 bytes in the sequence no
field for appending their own metadata. Although this
technique does not have negative effects on the Bitcoin
system, as far as we know no protocol uses
it (as shown in Table 5). We conjecture that
protocols rely on other techniques because 4 bytes are
not enough for implementing any relevant use case (see
Section 5 for a description of protocols and use cases).
3.3 Pay to public key / Pay to public key hash
In P2PK (Section 2.2.1) the output script specifies the
recipient of some bitcoins through a public key, encoded
in 65 bytes (or 33 bytes when the key is compressed).
Similarly, P2PKH (Section 2.2.2) uses the hash of the
key, which is encoded in 20 bytes.
Users can embed an arbitrary message m (of suitable
size) in the P2PK or P2PKH script of some output
of a transaction T, by writing m in place of the
expected key or hash. Assuming that Bitcoin uses secure
signatures and collision-resistant hash functions,
it is computationally infeasible, in general, to craft a
transaction which redeems such an output of T.
A downside of this approach is that these unspendable
outputs are indistinguishable from the spendable
ones: actually, a Bitcoin node has no way of knowing
whether or not a user exists who possesses the needed
hash preimage (nor does it know that the data was
never intended to represent an address in the first place).
As a result, the nodes of the Bitcoin network must keep
these outputs in their UTXO set indefinitely. Since
the UTXO set is usually stored in RAM for efficiency
reasons [52], the bloating of the UTXO set negatively
affects the memory consumption of nodes [32].

#   Technique       Transaction identifier / Bitcoin address
1   Value           f6f89da0b22ca49233197e072a39554147b55755be0c7cdf139ad33cc973ec46
2   Value           49a130ce4255fc91061c3d1170cbc256f51ed671256df837500d59183cfdd64f
3   OP_RETURN       d84f8cf06829c7202038731e5444411adc63a6d4cbf8d4361b86698abad3a68a
4   Vanity Address  1ponziUjuCVdB167ZmTWH48AURW1vE64q
5   Vanity Address  1CounterpartyXXXXXXXXXXXXXXXUWLpVr
6   Vanity Address  72162e9224dbadefb84834046ee8b4706af77f57fa4e8fd5aaf3255abf516807
7   Vanity Address  1WeRe3jh9XiaAyabyiE2Mz4v8bbcB52Gy
8   Vanity Address  1FineoW99TYAAZuRSkbZLrx65iTXELqHhv
9   Vanity Address  18chaNzLXvAbYvkad7MH2LNrQmBzeXbWLo
10  Vanity Address  1PoStJBYu49Ezqcwh1VeMWZgRopwcYwksY
11  Vanity Address  1FAke1neYErMQLebVPYBAToTLvafr5ZPF6
12  Coinbase        4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b
13  P2PKH           5970ae129d1141663bd5e441a1555c16fb1c0586dd05f40c1db3d3e81218ee41
14  P2SH            1e47936f37e71b98e8bafe51ddc902d59c1318bc556329ba4ab1996981785292
Table 1: Some transactions and addresses containing metadata.

3.4 Pay to script hash
By using P2SH scripts (Section 2.2.4), metadata can
be embedded in various ways. Similarly to the P2PKH
technique (Section 3.3), one can embed metadata in
the output script, in place of the hash. An alternative
technique is to embed metadata in the input script,
pushing them onto the stack with the OP_PUSHDATA instruction,
and immediately afterwards removing them
with OP_DROP. As long as the script completes its execution
successfully and there is some nonzero value on
the stack after completion, the transaction is valid [10]:
indeed, there is no rule specifying that the data accumulated
on the stack during the script execution must
be cleared. Stack items that are below the topmost item
at the end of execution are simply ignored. For instance,
consider the following scripts:
in-script: OP_PUSHDATA1 8 0xaabbccddeeff0011 sig pubKey
           v1 ... vN bitstring
out-script: OP_HASH160 hash OP_EQUAL
First, the in-script and the out-script are concatenated;
the result is a common P2SH script preceded
by an OP_PUSHDATA1. The evaluation starts by pushing
0xaabbccddeeff0011 onto the stack. This is done
through OP_PUSHDATA1, where the trailing 1 indicates
that the next 1 byte contains the number of bytes to be
pushed onto the stack (in the snippet above, 8 bytes).
At the end of the execution, the top item on the stack
is true, resulting from the OP_EQUAL. Underneath this
value there is the metadata 0xaabbccddeeff0011.
Note that transactions that make use of ignored
OP_PUSHDATA for embedding metadata do not bloat the
UTXO set: indeed, their outputs can be spent by valid
addresses, because they do not need to overwrite the
address fields in the transaction scripts. The work [26]
provides further details on the P2SH technique.
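The byte-level form of the leading push in the example can be sketched as follows. OP_PUSHDATA1 is opcode 0x4c followed by a 1-byte length; the helper names pushdata1 and leading_push are hypothetical and only cover this single opcode.

```python
OP_PUSHDATA1 = 0x4C  # opcode followed by a 1-byte length field

def pushdata1(data: bytes) -> bytes:
    """Serialise `OP_PUSHDATA1 <len> <data>`."""
    if len(data) > 0xFF:
        raise ValueError("OP_PUSHDATA1 carries at most 255 bytes")
    return bytes([OP_PUSHDATA1, len(data)]) + data

def leading_push(script: bytes) -> bytes:
    """Return the payload of an OP_PUSHDATA1 at the start of `script`."""
    if len(script) < 2 or script[0] != OP_PUSHDATA1:
        raise ValueError("script does not start with OP_PUSHDATA1")
    n = script[1]
    if len(script) < 2 + n:
        raise ValueError("truncated push")
    return script[2:2 + n]

metadata = bytes.fromhex("aabbccddeeff0011")
prefix = pushdata1(metadata)           # bytes 4c 08 aa bb cc dd ee ff 00 11
```

An extractor that scans input scripts can apply leading_push to recover such ignored pushes, since nothing in the script consumes them.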
3.5 OP_RETURN
Standard OP_RETURN transactions allow storing up to
80 bytes of arbitrary data (see e.g. row 3 of Table 1). An
out-script containing OP_RETURN always evaluates to
false, hence the output is unspendable, and it
can be safely removed from the UTXO set. In this
way, OP_RETURN overcomes the UTXO bloating issue
highlighted in Section 3.3.
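A minimal sketch of how such an out-script is laid out in bytes, assuming a single push after OP_RETURN (opcode 0x6a): payloads of up to 75 bytes use a direct push opcode equal to their length, longer ones use OP_PUSHDATA1. The function names are hypothetical and the 80-byte bound is the standardness limit quoted above.

```python
OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C
MAX_PAYLOAD = 80  # standardness limit quoted in the text

def op_return_script(payload: bytes) -> bytes:
    """Build `OP_RETURN <push> <payload>` for a payload of 1 to 80 bytes."""
    if not 1 <= len(payload) <= MAX_PAYLOAD:
        raise ValueError("payload must be 1..80 bytes")
    if len(payload) <= 75:
        push = bytes([len(payload)])            # direct push opcode
    else:
        push = bytes([OP_PUSHDATA1, len(payload)])
    return bytes([OP_RETURN]) + push + payload

def op_return_payload(script: bytes):
    """Return the payload of a single-push OP_RETURN script, else None."""
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    if 1 <= script[1] <= 75:
        start, n = 2, script[1]
    elif script[1] == OP_PUSHDATA1 and len(script) >= 3:
        start, n = 3, script[2]
    else:
        return None
    return script[start:start + n] if len(script) == start + n else None
```

Scripts with several pushes after OP_RETURN exist as non-standard transactions; this sketch deliberately reports them as None rather than guessing.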
3.6 Vanity address
Bitcoin addresses are the hash of ECDSA public keys.
By a brute force search during the generation of the key
pair, it is possible to obtain a so-called vanity address,
where a few bytes are equal to a given string (see e.g.
row 4 of Table 1)². One can embed these bytes as metadata
in transactions, e.g. by using the vanity address in
P2PK scripts.
Although, in theory, the maximum size of the metadata
corresponds to the size of an address (20 bytes),
this technique is practical only for metadata
of a few bytes. Longer metadata can be distributed
among different vanity addresses; for instance,
the transaction at row 6 of Table 1 transfers bitcoins
from 5 vanity addresses (displayed at rows 7-11). By
concatenating the first four characters (excluding the leading
1) of these addresses, we read the plain English words:
“We’re fine, 8chan post fake”.
² Open-source tools like Vanitygen generate vanity addresses
with user-defined patterns.
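The reading of rows 7-11 of Table 1 can be reproduced mechanically: drop the leading 1 of each address and keep the next four characters. The helper name vanity_prefixes is hypothetical.

```python
# Addresses from rows 7-11 of Table 1.
ADDRESSES = [
    "1WeRe3jh9XiaAyabyiE2Mz4v8bbcB52Gy",
    "1FineoW99TYAAZuRSkbZLrx65iTXELqHhv",
    "18chaNzLXvAbYvkad7MH2LNrQmBzeXbWLo",
    "1PoStJBYu49Ezqcwh1VeMWZgRopwcYwksY",
    "1FAke1neYErMQLebVPYBAToTLvafr5ZPF6",
]

def vanity_prefixes(addresses, length: int = 4):
    """Skip the leading '1' and keep the next `length` characters."""
    return [a[1:1 + length] for a in addresses]

print(" ".join(vanity_prefixes(ADDRESSES)))
# prints: WeRe Fine 8cha PoSt FAke
```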
A different use of vanity addresses is when all the
bytes of the address are fixed, like e.g. at row 5 of Table 1.
In this case, it is plausible that the corresponding
private key is not known to anybody, so a P2PK out-
put using such address cannot be spent. This fact is ex-
ploited to implement “Proof-of-Burn” protocols on top
of Bitcoin (like e.g. in Counterparty): in these proto-
cols, users receive some tokens in exchange for sending
some bitcoins to an unspendable address.
3.7 Coinbase transaction
Miners specify how to redeem the reward for the mined
block (and the fees of its transactions) through the first
transaction of the block. This transaction does not have
an input script, and it contains a field called coinbase,
that miners usually fill in with metadata. The coinbase
data size is between 2 and 100 bytes. Nevertheless, af-
ter block 227,835 the available space is reduced, since
the Bitcoin Improvement Proposal 34 (BIP 0034) [27]
requires the first bytes of the coinbase field to store the
block height index.
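The height prefix required by BIP 0034 can be read with a few lines. The sketch assumes the layout of that proposal, i.e. one length byte followed by the height in little-endian order, and it targets heights after the activation block; the function names are hypothetical.

```python
def encode_height(height: int) -> bytes:
    """Serialise a positive height as <length byte><little-endian bytes>."""
    if height <= 0:
        raise ValueError("height must be positive")
    # One extra bit of room keeps the most significant (sign) bit clear.
    n = (height.bit_length() + 8) // 8
    return bytes([n]) + height.to_bytes(n, "little")

def coinbase_height(coinbase: bytes) -> int:
    """Read the block height stored at the start of a coinbase field."""
    if not coinbase:
        raise ValueError("empty coinbase")
    n = coinbase[0]
    if not 1 <= n <= 8 or len(coinbase) < 1 + n:
        raise ValueError("no height prefix found")
    return int.from_bytes(coinbase[1:1 + n], "little")

prefix = encode_height(227_835)        # bytes 03 fb 79 03
```

For heights in this range the prefix takes 4 bytes, so roughly 96 of the 100 bytes remain available for arbitrary data.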
Usually, the coinbase field is used by miners for
identifying the mining pool, or for voting on BIPs (for instance,
when they voted to support either BIP 0016
or BIP 0017 [28]). The most famous message embedded
by using this technique is included in the genesis block
(see e.g. row 12 of Table 1): “The Times 03/Jan/2009
Chancellor on brink of second bailout for banks”.
3.8 Distributing metadata
The techniques discussed in Sections 3.1 to 3.7 embed
metadata within a single field of a single transaction.
Below we describe three techniques that allow splitting
metadata among multiple fields and transactions.
Multisignature. This technique uses the multisignature
script introduced in Section 2.2.3. Given N−1 pieces
of metadata v2, ..., vN, the scripts are the following:
in-script: OP_0 sig1
out-script: 1 pubKey1 v2 ... vN N
            OP_CHECKMULTISIG
This implements a 1-of-N multisignature: namely, only
one signature (sig1) needs to be verified against one
of the N public keys (pubKey1). The other “public keys”
in the out-script (actually, the N−1 pieces of metadata)
are irrelevant for the execution of the script.
Note that this technique bloats the UTXO set only
until the transaction is redeemed.
Multiple inputs / outputs. Although Bitcoin imposes a
hard limit of 10K bytes on the size of a single script [9],
there is no limit to the number of inputs or outputs a
transaction may contain. Hence, metadata bigger than
10K bytes can be split into smaller chunks, and dis-
tributed among many inputs or outputs within a single
transaction. To ensure that the original data can be
reconstructed from these fragments, one needs to fix
an encoding. The simplest one is to store the chunks
of metadata sequentially in N output scripts. Another
common encoding, used e.g. by the BIT-COMM protocol,
uses the amount of bitcoins transferred by each
output to order the chunks (see e.g. row 13 of Table 1).
Transactions using this technique may or may not bloat
the UTXO set — that is determined by the structure
of the individual transaction outputs.
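The value-based ordering can be sketched as follows. The exact rule used by BIT-COMM is not given here, so ascending order of the output values is an assumption, as are the function name and the (value, chunk) representation of an output.

```python
def reassemble_by_value(outputs) -> bytes:
    """Rebuild metadata from (value_in_satoshis, chunk) pairs.

    ASSUMPTION: chunks are ordered by ascending output value; the real
    protocol may use a different rule.
    """
    ordered = sorted(outputs, key=lambda o: o[0])
    values = [v for v, _ in ordered]
    if len(set(values)) != len(values):
        raise ValueError("duplicate values make the order ambiguous")
    return b"".join(chunk for _, chunk in ordered)

# Outputs may appear in any position inside the transaction.
outputs = [(3, b"data"), (1, b"meta"), (2, b"-")]
message = reassemble_by_value(outputs)
```

Using the values as an index keeps the encoding robust to any reordering of outputs, at the cost of tying up the corresponding amounts.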
Transaction chains. The previous techniques store meta-
data within a single transaction. However, this is not
always ideal, or even possible. For instance:
- If the size of metadata exceeds the maximum block
  size, the transaction containing the metadata will
  be rejected by the network.³
- Large transactions require large fees. Even though,
  in theory, one can send a transaction with zero fee,
  in practice a transaction with no fee (or a low fee)
  is unlikely to be mined. Depending on current fee
  market dynamics, it may be more cost-effective to
  split the metadata across multiple transactions.
- Transactions greater than a certain size, or which
  contain more than one OP_RETURN, are considered
  non-standard, with the consequence that most nodes
  refuse to relay them. This limit has varied over time,
  and is now replaced by the concept of “transaction
  weight” which is similar, but accounts for Segregated
  Witness data in a different manner.⁴
Due to these considerations, metadata are often split
into sets of transactions. A common technique is to
connect transactions containing related data in a chain
structure. When building a transaction chain, one of
the techniques from Sections 3.1 to 3.7 is chosen for en-
coding data into each individual transaction. Then, a
spendable transaction output is added to each transac-
tion, to be redeemed by the subsequent transaction in
the chain. The output must be spendable by an address
over which the user embedding the data has control.
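The chain structure can be sketched with a toy index in which each transaction records one chunk and the identifier of the transaction that redeems its spendable output. The index layout, the identifiers tx0, tx1, ... and the 80-byte chunk size are assumptions made for illustration; real extraction would follow previous transaction fields on the blockchain.

```python
def split_chunks(data: bytes, size: int) -> list:
    """Cut `data` into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def build_chain(data: bytes, size: int = 80) -> dict:
    """Return a hypothetical index txid -> {"data", "spent_by"} in which
    each transaction carries one chunk and is redeemed by the next one."""
    chunks = split_chunks(data, size)
    return {
        f"tx{i}": {
            "data": chunk,
            "spent_by": f"tx{i + 1}" if i + 1 < len(chunks) else None,
        }
        for i, chunk in enumerate(chunks)
    }

def read_chain(index: dict, first_txid: str) -> bytes:
    """Follow the spendable outputs from the first transaction onwards."""
    out, seen, txid = [], set(), first_txid
    while txid is not None:
        if txid in seen:
            raise ValueError("cycle in transaction chain")
        seen.add(txid)
        out.append(index[txid]["data"])
        txid = index[txid]["spent_by"]
    return b"".join(out)
```

Only the first transaction identifier has to be known in advance; the rest of the content is discovered by walking the chain.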
³ See github.com/bitcoin/.../src/main.cpp#L829.
⁴ See github.com/bitcoin/bitcoin/.../src/main.h#L56, and
github.com/bitcoin/bitcoin/.../src/main.cpp#L644-L648.
3.9 Embedding vs. committing metadata
Several Bitcoin-based protocols notarize documents by
embedding their hash on the blockchain. There are alternative
techniques, e.g. pay-to-contract [43] and
sign-to-contract⁵, which allow one to commit to a
hash without actually embedding it into a transaction.
For instance, Btproof applies RIPEMD-160 to the hash, to
obtain a Bitcoin address; similarly, Originstamp daily
aggregates hashes into a seed, hashes it into a private
key, and then derives from it the corresponding pub-
lic key and address. Both protocols pay a small fee
to the generated address in order to publish it in the
blockchain. ContractHashTool exploits elliptic curves
for building an in-script that cryptographically com-
mits to a given hash, without embedding the hash itself.
Unless a protocol keeps a public track of its transac-
tions (like, e.g., Originstamp), these techniques prevent
external observers from inferring any metadata.
3.10 Statistics on embedding techniques
Table 2 shows the amount of metadata embedded with
each technique. The leftmost column groups the tech-
niques in three categories: the Single category contains
the techniques which embed the whole piece of meta-
data into a single chunk; Multi contains the techniques
which split an element into multiple chunks, but store
all the pieces into a single transaction; finally, Chains
gathers the techniques which spread pieces across mul-
tiple transactions. The third and fourth columns show
the size (in bytes) of the field storing metadata, and
where the field is located. The fifth column lists the
techniques bloating the UTXO set. The sixth column displays
the date on which the first chunk of metadata appears,
the seventh one shows the number of times a technique
has been used⁶, and the last two columns show the total
and average size of the metadata embedded.
In the first 480,000 blocks we count 4,582,661 chunks
of metadata, for a total size of 99MB. Note that,
since the chunks of the Chains type are a subset of
those in the other categories, and the chunks in Multi
are a subset of those in Single, to avoid counting the
same piece of metadata multiple times, the totals only
consider the values of the Single type. This number
of chunks is a good indicator of the total number of
transactions with metadata, since the fraction of transactions
produced by Multi techniques is negligible. We
observe that 75% of the space used by metadata is due
to OP_RETURN transactions. The average size of transaction
chains is higher than that of other methods, since this
technique is used for embedding images and archives.
⁵ See bitcointalk.org/index.php?topic=915828.msg10056796
⁶ Elements of type Single are chunks; Multi elements are
transactions; Chains elements are chains of transactions.
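The totals and the OP_RETURN share quoted above can be re-derived from the Single rows of Table 2 with a short computation. The figures are copied from the table; the helper names are hypothetical.

```python
# (number of items, total size in bytes) for the Single rows of Table 2
# that report data; Value and Vanity Address are listed as N/A.
SINGLE = {
    "Input Sequence": (1_305_372, 5_221_488),
    "P2PK-P2PKH": (66_762, 1_335_240),
    "P2SH": (1_578, 31_560),
    "OP_RETURN": (2_903_186, 76_700_965),
    "Coinbase": (305_763, 18_442_641),
}

def totals(rows):
    """Sum items and bytes over all techniques."""
    items = sum(n for n, _ in rows.values())
    size = sum(s for _, s in rows.values())
    return items, size

def share(rows, technique: str) -> float:
    """Fraction of the total bytes due to one technique."""
    return rows[technique][1] / totals(rows)[1]

items, size = totals(SINGLE)           # 4,582,661 items, 101,731,894 bytes
op_return_share = share(SINGLE, "OP_RETURN")
```

The sums match the Total row of Table 2, and the OP_RETURN share comes out at about 0.754, consistent with the 75% stated in the text.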
4 Analysis of Bitcoin metadata
In this section we present our techniques to parse meta-
data and reconstruct the original content. Then, we cat-
egorize the reconstructed items, and we measure them.
4.1 Collecting metadata
One of the most effective techniques for recognising
chunks of metadata is to search strings for “suspicious”
byte patterns. For example, long strings of contiguous
ASCII characters are unlikely to occur in regular trans-
action data; similarly, the probability of finding spe-
cific bitstrings, like the Gzip header 0x1f9d9070, is ex-
tremely low. Finding such a bitstring is a trigger for fur-
ther investigation. We employed several types of these
searches, which we discuss below.
Frequency analysis. The GNU strings utility⁷ takes
a data source as input, and yields as output all of the
ASCII plaintext characters found in that source. It pro-
vides a flag for filtering out strings of contiguous ASCII
characters under a given length. It is possible to run
strings directly on Bitcoin Core’s .dat files, but care
must be taken when tuning the filter. Obviously, too low
a threshold will yield a huge number of false positives.
On the other hand, due to the way inputs and outputs
are encoded in transaction data, too high a threshold
eliminates plaintext that has been split across multiple
transactions or transaction scripts.
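The effect of the length threshold can be explored with a small scanner written in the spirit of strings. It is a sketch, not a reimplementation of the GNU tool: it only considers bytes 0x20 to 0x7e as printable, and the function name and default threshold are assumptions.

```python
import re

def ascii_strings(blob: bytes, min_len: int = 4):
    """Yield (offset, text) for each run of printable ASCII bytes
    that is at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(blob):
        yield match.start(), match.group().decode("ascii")

# A short run ("ab") is reported only when the threshold is low enough.
blob = b"\x00\x01hello world\xff\x02ab\x00"
long_runs = list(ascii_strings(blob, 5))
all_runs = list(ascii_strings(blob, 2))
```

Running the scanner with several thresholds on the same data gives a quick view of the trade-off between false positives and fragments lost to script opcodes.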
While this approach is quite simple, some of the
data that we encountered — particularly, the conversa-
tions and code embedded into the blockchain by Peter
Todd (one of the Bitcoin Core developers) — mention
that they are specifically intended to be discovered and
extracted via this method. For example, Todd’s plain-
text uploader is a Python script stored in the blockchain
(see row 14 of Table 1). It describes itself as a tool that
can “publish text in the blockchain, suitably padded
for easy recovery with strings”.⁸ The tool appears to
have been used to upload its own source code to the
blockchain. Peter Todd’s tool takes a text file as input,
and uses the P2SH ignored OP_PUSHDATA technique (see
Section 3.4) to embed the contents of the file into the
input scripts of a single transaction. The reason this encoding
lends itself well to strings-based extraction is that it
allows large amounts of arbitrary data to be stored with
minimal interruption by non-ASCII bytes. Input scripts
are stored contiguously in transaction data, meaning
that the only necessary interruptions will be the minimal
set of Bitcoin script opcodes required to ensure
that the transaction is considered valid by the network.
⁷ man7.org/linux/man-pages/man1/strings.1.html
⁸ github.com/petertodd/python-bitcoinlib/blob/master/examples/publish-text.py#L48

Type    Technique        Field size  Hosted in  UTXO bloating  1st item    # items    Tot. size    Avg. size
Single  Value            8           Tx output  No             N/A         N/A        N/A          N/A
Single  Input Sequence   4           Tx input   No             2011/02/25  1,305,372  5,221,488    4
Single  P2PK-P2PKH       20,33,65    Script     Yes            2013/03/16  66,762     1,335,240    20
Single  P2SH             520         Script     Yes            2013/04/10  1,578      31,560       20
Single  OP_RETURN        80          Script     No             2014/03/12  2,903,186  76,700,965   26
Single  Vanity Address   20          Script     No             N/A         N/A        N/A          N/A
Single  Coinbase         2–100       Tx input   No             2009/01/03  305,763    18,442,641   60.3
Single  Total                                                  2009/01/03  4,582,661  101,731,894  22
Multi   Multi-signature  Variable    Script     Transient      2013/04/06  15,067     2,926,590    194
Multi   Multi-in/out     Variable    Variable   Variable       2013/03/16  529        4,437,616    8,389
Chains  Tx chains        Variable    Variable   Variable       2013/04/06  60         3,470,870    57,848
Table 2: Statistics about embedding techniques (sizes are in bytes).

Compared to the other methods, strings-based extraction
offers the lowest barrier to entry. Thus, users
encoding large quantities of plaintext data that are
intended to be easily discoverable should prefer
encoding methods that lend themselves well to this technique.
File signature. Many file formats require the inclusion
of specific bytestrings that are common to all files of a
given format. For example, many JPEG images begin
with the bytestring 0xffd8ffe000104a464946000101.
Similarly, ASCII-armored PGP messages begin with
-----BEGIN PGP MESSAGE-----. These bytestrings of-
ten occur in the header or footer of the file, although
there are formats that place them elsewhere. The prob-
ability of finding such bytestrings in Bitcoin blocks is
exceedingly low, and as such, they provide a useful in-
dicator of embedded data.
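As an illustration, a signature search over a raw byte buffer can be sketched in a few lines of Python. The two signatures are the ones quoted above. The function name `find_signatures` and the test buffer are hypothetical, and this sketch does not reproduce the behaviour of any of the tools discussed in this section.

```python
# Signatures quoted in the text; a real scan would use a much larger database.
SIGNATURES = {
    "jpeg_jfif": bytes.fromhex("ffd8ffe000104a464946000101"),
    "pgp_armored": b"-----BEGIN PGP MESSAGE-----",
}

def find_signatures(data: bytes, signatures=SIGNATURES):
    """Return a sorted list of (byte offset, signature name) for every match."""
    hits = []
    for name, magic in signatures.items():
        start = 0
        while True:
            pos = data.find(magic, start)
            if pos < 0:
                break
            hits.append((pos, name))
            start = pos + 1  # keep searching after this match
    return sorted(hits)

# Hypothetical buffer: padding, a JPEG header, padding, a PGP armor line.
blob = (b"\x00" * 5 + SIGNATURES["jpeg_jfif"]
        + b"\x11" * 3 + SIGNATURES["pgp_armored"])
hits = find_signatures(blob)
```

Like any format-unaware scan, this only reports offsets in the raw buffer; mapping an offset back to a block and transaction requires knowledge of the Bitcoin block format.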
We used several tools to detect file signatures present
in Bitcoin transactions:
binwalk is an extensible tool for discovering valid files
embedded into other data [41]. It provides a lan-
guage for defining file signatures, as well as a large
database of pre-defined signatures for common file
formats. It also has the ability to carve detected files
out of the surrounding binary data. One can pro-
duce a number of valid results simply by running
binwalk on the Bitcoin Core .dat files. However,
since the tool is unaware of the Bitcoin block for-
mat, it is only suitable for recovering files embedded
into a single transaction script.
binary-grep searches a collection of input files for a
single bytestring specified by the user [37]. It out-
puts the byte offsets of any matches, and has a sim-
ple carving function.
local-blockchain-parser provides a grep command
that, unlike binwalk and binary-grep, is aware of
the Bitcoin block format [38], skipping the parts of
the transaction that cannot embed metadata. For
each match, it outputs the block and transaction
hashes, script type (input/output), and byte offset.
One of the most successful workflows we discovered
for recovering binary files based on file signatures was
the following. (i) We ran binwalk and/or binary-grep
on a .dat file, making note of any results that appeared
to be true positives. (ii) If there were promising re-
sults, we then ran the binary-grep subcommand of
local-blockchain-parser on that .dat file, obtaining
the transaction hashes where those results were found.
(iii) For each resulting transaction, we manually in-
spected the transaction graph around it. If it appeared
to be an isolated transaction, we ran the tx-info sub-
command of local-blockchain-parser. If it appeared
to be a part of a chain, we ran the tx-chain subcom-
mand instead. (iv) We inspected the binary output from
the previous step, performing manual carving where
necessary, and we attempted to ascertain the validity
of the results by opening them with applications ap-
propriate to their file type.
Protocol Identifier. Many protocols mark their metadata
by writing a specific string in the first few bytes of
each chunk, but the exact number of bytes may vary
from protocol to protocol. In Section 5 we take advantage
of this for associating metadata to protocols.
Furthermore, since protocols give a detailed description
of the format of the elements produced, in Section 4.2
we distinguish different types of metadata and classify
them. Hence, in order to associate metadata to protocols
we: (i) search the web for known associations
between identifiers and protocols; (ii) accordingly
classify strings beginning with one of the identifiers
obtained. In more detail, in the first step we query Google
to obtain public identifier/protocol bindings. For instance,
since several protocols use the OP_RETURN technique,
we execute the query “Bitcoin OP_RETURN”, which
returns 26,500 results, and we manually inspect the
first few pages of them. Note that a protocol can be
associated with more than one identifier (e.g., Stampery,
Blockstore), or have no identifier at all. In this
way we obtain 45 protocols associated to 39 identifiers;
further, we find several protocols that do not use any
identifier (e.g., Diploma, Chainpoint). We also distinguish
the main types of metadata produced by protocols
(e.g. Text, Hash and Record). The second step is
performed by our tool: it associates chunks of metadata
to a protocol. The full list of protocols discovered
is shown in Table 5; identifiers are listed in Table 4.
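The second step can be sketched as a prefix lookup. The Python fragment below is only an illustration and not the tool used in our experiments: it uses a subset of the identifiers of Table 4, and the function name `classify` is hypothetical. Longer identifiers are tried first, so that a short identifier such as FA cannot shadow a longer one.

```python
# A subset of the identifier/protocol bindings listed in Table 4.
IDENTIFIERS = {
    b"CNTRPRTY": "Counterparty",
    b"DOCPROOF": "Proof of Existence",
    b"Factom!!": "Factom",
    b"FA": "Factom",
    b"omni": "Omni",
    b"OA": "OpenAssets",
    b"EW": "Eternity Wall",
    bytes.fromhex("1f00"): "Blockai",
}

def classify(chunk: bytes, identifiers=IDENTIFIERS) -> str:
    """Associate a chunk of metadata to a protocol by its leading bytes."""
    if not chunk:
        return "Empty"
    # Try the longest identifiers first to avoid ambiguous short prefixes.
    for prefix in sorted(identifiers, key=len, reverse=True):
        if chunk.startswith(prefix):
            return identifiers[prefix]
    return "Unclassified"
```

For example, `classify(b"DOCPROOF" + bytes(32))` yields "Proof of Existence", while a chunk with no known prefix is reported as "Unclassified".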
Transaction chains. Although all spent transaction out-
puts in the Bitcoin blockchain naturally form a chain
structure, identifying chains containing embedded meta-
data is not entirely straightforward. A transaction may
have certain “giveaway” characteristics that suggest the
presence of a chain containing data, such as:
1. One or more unspendable outputs (i.e., OP_RETURN
outputs), plus a single spent output. The unspendable
output(s) would contain data, while the spent
output would be used to continue the chain.
2. One or more unspent outputs (possibly used for a
P2PKH embedding), plus a single spent output. The
unspent output(s) would contain data, while the
spent output would be used to continue the chain.
3. The unspent outputs, if any, contain a tiny amount
of Satoshis (such outputs are also known as dust).
Except for the Bit-Comm protocol, which uses out-
put values to order the data in the output scripts,
the funds included into outputs that can never be
spent are effectively “burnt”, and add no informa-
tion to the embedded data. This disincentivizes the
embedder from including any more value than is
strictly necessary to create a valid transaction.
4. The spent output contains a relatively large amount
of bitcoins, used to fund further dust outputs in sub-
sequent links in the chain.
5. Preceding or subsequent transactions share a simi-
lar structure with the transaction in question. Many
of the transaction chains we found appeared to have
been constructed with the help of software (e.g. the
Python source we extracted). The software we found
tends to create strings of transactions sharing a sim-
ilar format. While it is altogether possible to em-
bed data into chains of dissimilar transactions, they
would be difficult to find and complex to decode.
These are helpful clues, but not definitive criteria. In
fact, there are many other types of transactions which
possess the characteristics described above. For exam-
ple, payouts from mining pools and Bitcoin casinos of-
ten send small amounts of bitcoins to many users at
once. These payout transactions are often constructed
algorithmically (according to some set of “threshold”
rules intended to minimize the impact of the fee on the
payout), meaning that preceding and following trans-
actions share a similar structure.
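The first four clues can be expressed as simple predicates over the outputs of a transaction. The following Python sketch assumes a hypothetical in-memory model (the `Output` class and the `DUST_LIMIT` threshold are illustrative choices, not values taken from our experiments), and it does not cover the fifth clue, which requires inspecting neighbouring transactions.

```python
from dataclasses import dataclass

@dataclass
class Output:
    value: int               # amount in satoshis
    spent: bool              # True if a later transaction redeems this output
    op_return: bool = False  # True for provably unspendable OP_RETURN outputs

DUST_LIMIT = 1000  # hypothetical threshold in satoshis; tune as needed

def chain_clues(outputs, dust_limit=DUST_LIMIT):
    """Evaluate clues 1-4 on the outputs of a single transaction."""
    spent = [o for o in outputs if o.spent]
    data = [o for o in outputs if not o.spent]  # unspendable or unspent outputs
    return {
        "single_spent_output": len(spent) == 1,
        "has_data_outputs": len(data) >= 1,
        "data_outputs_are_dust": bool(data)
            and all(o.value <= dust_limit for o in data),
        "spent_output_is_large": len(spent) == 1 and bool(data)
            and spent[0].value > sum(o.value for o in data),
    }

def looks_like_chain_link(outputs, dust_limit=DUST_LIMIT):
    """True when every clue fires; a hint for manual inspection, not a proof."""
    return all(chain_clues(outputs, dust_limit).values())
```

As noted above, payout transactions may satisfy the same predicates, so a positive result only selects candidates for further investigation.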
Therefore, it is generally necessary to have some un-
derstanding of the embedded data in order to determine
whether a given chain is of interest. If a transaction con-
tains a file signature for a file type that is unlikely to fit
into the data provided by that transaction, it warrants
further investigation.
Extraction of data from transaction chains is rela-
tively easy when using the local-blockchain-parser
utility. This utility has a tx-chain subcommand that
takes a single transaction hash and crawls backwards
and forwards through the transaction graph, collect-
ing data from the transaction scripts. This data is fil-
tered and permuted to account for the various ways in
which transaction chains are constructed. Finally, the
data from each transaction are concatenated in the or-
der that they appear in the chain. This process yields a
collection of binary files corresponding to the different
ways in which data can be embedded into a chain.
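The crawl-and-concatenate step can be sketched as follows. This is a simplified Python illustration and not the code of local-blockchain-parser: it assumes a hypothetical in-memory map from transaction identifiers to records with `prev`, `next` and `data` fields, a single linear chain, and no filtering or permutation of the script data.

```python
def reassemble_chain(txs, start_txid):
    """Concatenate the data of a linear chain, starting from any of its links."""
    # Walk backwards until the first link of the chain is reached.
    head = start_txid
    while txs[head]["prev"] in txs:
        head = txs[head]["prev"]
    # Walk forwards, collecting the data in chain order.
    # (The spend graph is acyclic, so this loop terminates.)
    parts = []
    node = head
    while node in txs:
        parts.append(txs[node]["data"])
        node = txs[node]["next"]
    return b"".join(parts)

# Hypothetical three-link chain.
txs = {
    "a": {"prev": None, "next": "b", "data": b"Hel"},
    "b": {"prev": "a", "next": "c", "data": b"lo, "},
    "c": {"prev": "b", "next": None, "data": b"world"},
}
payload = reassemble_chain(txs, "b")
```

Starting from the middle link still recovers the whole payload, because the walk first rewinds to the head of the chain.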
4.2 Types of metadata
We associate the successfully reconstructed data items
to one of the following types:
Text Users have embedded a significant amount of text,
since the very first message by Satoshi Nakamoto.
This includes several birthday wishes, love state-
ments, prayers, greetings, developer conversations,
and magnet links. Besides user messages, miners
usually embed messages in coinbase transactions for
Bitcoin-related purposes: to identify their blocks,
vote on proposals, or announce which features they
support. We have also identified two PDF documents:
“Bitcoin: a peer-to-peer electronic cash system” [48],
and “The first collision for full SHA-1” [51].
Hash Many users notarize the ownership of documents
by embedding their hash on the blockchain (embed-
ding the whole document would be too expensive,
because of the required transaction fees). Some pro-
tocols notarize several documents with a single piece
of metadata; this could be the hash of the sequence
of document hashes, or the root of the Merkle tree
of the document hashes.
Financial record A common application of the Bit-
coin blockchain is to record the ownership and ex-
change of digital or physical assets. These assets are
represented as tokens, and users are identified by
their Bitcoin addresses.
Copyright Copyright records are produced by proto-
cols which act as marketplaces where artists publish
and sell their files to other users.
Script Developers have embedded several scripts in the
blockchain. We have found some Python scripts
(e.g., the Satoshi uploader, Satoshi downloader, and
Cryptograffiti uploader), Bash scripts (e.g., Password
script, and OpenSSL encoder), and also some
games (e.g., LinPyro, Bong ball and Lucifer).
Image The blockchain contains some small images, usu-
ally spread across chains of transactions. We have
found various file formats (PNG, JPEG, and GIF).
Archive This type includes compressed archives, e.g.
the WikiLeaks Cablegate gzipped archive.
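The two aggregation strategies mentioned for the Hash type can be sketched as follows. This is a minimal Python illustration under stated assumptions: it uses a single SHA-256 and duplicates the last node on odd levels (as Bitcoin does for its transaction trees), whereas each notarization protocol defines its own hash function and tree layout. The function names are hypothetical.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def hash_of_hashes(doc_hashes):
    """Hash of the concatenated sequence of document hashes."""
    return sha256(b"".join(doc_hashes))

def merkle_root(doc_hashes):
    """Root of a binary Merkle tree built over the document hashes."""
    if not doc_hashes:
        raise ValueError("at least one document hash is required")
    level = list(doc_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical documents; only 32 bytes end up in the blockchain either way.
hashes = [sha256(doc) for doc in (b"contract", b"diploma", b"photo")]
digest = hash_of_hashes(hashes)
root = merkle_root(hashes)
```

With a Merkle root, the inclusion of a single document can be proved by revealing a logarithmic number of sibling hashes, while the hash of hashes requires revealing the whole sequence.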
4.3 Statistics on types of metadata
Table 3 shows some statistics about the types of metadata
we reconstructed. The second column indicates the
date on which the first piece of metadata of the
corresponding type appeared in the blockchain. Next, we
show the total number of elements found, followed by
their total size in bytes, and their average size.
Note that the total size of reconstructed metadata
is less than the 101,731,894 bytes reported in Table 2: this
is because Table 2 also includes bitstrings that we did
not manage to decode. For instance, the bytes embedded
with OP_RETURN are always counted in Table 2,
but they appear in Table 3 only if we are also able
to recognize their type (e.g., because their prefix reveals
which protocol has produced them, among those
in Table 4). From the rightmost column we see that the
average size of scripts, images and archives exceeds the
maximum size of Bitcoin scripts; hence, these metadata
are embedded through the “Multi-in/out” or “Tx chains”
techniques. Finally, note that although the first financial
record appeared only in May 2014, this type of
metadata now constitutes 70% of the reconstructed
elements, and it uses the majority of the space.
5 Analysis of Bitcoin-based protocols
In this section we focus on protocols which embed metadata
on the Bitcoin blockchain. We first propose a rough
taxonomy of protocols, which categorizes them according
to the application domain. Then, considering the
collection of protocols reported in a previous paper [34],
we perform several analyses on their usage of metadata.
5.1 Types of protocols
Our taxonomy classifies protocols in five categories:
Financial includes protocols that manage assets, e.g.
for certifying their ownership, endorsing their value,
and keeping track of trades. Metadata in these trans-
actions are used to specify the value of the asset, the
amount of the asset transferred, the new owner, etc.
Notary includes protocols that certify the ownership
and timestamp of documents. These protocols allow
users to publish the hash of a document in a transac-
tion, thus proving its existence and integrity. Since
the transaction is signed with a private key, users
can also certify the ownership of the document.
DRM includes protocols for declaring access rights
and copyrights on digital art documents, e.g.
images or audio files.
Message groups protocols which record text messages.
Subchain gathers protocols which construct transac-
tion chains to record execution traces of third-party
smart contracts.
We now briefly comment on this taxonomy. Although
Notary and DRM protocols have the same overall
goal — certifying the ownership of documents — they
have some relevant differences. First, Notary protocols
do not usually require the original document (yet, they
ask that the document hash is provided by the owner);
further, their goal can be fulfilled also when their front-
end is no longer online. Conversely, DRM protocols
usually need to gather user documents, and have com-
plex front-ends to enable further interactions with users
(e.g., they often play the role of broker between media
producers and consumers). The ordering of metadata
embedded by Notary, DRM and Message protocols
is immaterial; instead, different orderings in Financial
and Subchain protocols usually imply different system
states. Indeed, transactions used by Financial proto-
cols are analogous to Bitcoin transactions, except that
they transfer tokens instead of bitcoins; depending on
the current balance of tokens, appending a transaction
may result in a state update, or even leave the state
unchanged (e.g., if an attacker attempts to sell assets
that she does not currently own). Subchain proto-
cols share the same mechanism, but they generalize to-
ken exchange to more complex computations, like those
arising from the execution of smart contracts (e.g., in
the RSK platform).
Type              1st item    # items    Tot. size   Avg. size
Text              2009/01/03  309,894    18,811,329  61
Hash              2013/12/18  200,832    7,617,392   38
Financial record  2014/05/03  1,430,071  37,699,809  26.36
Copyright         2014/12/19  116,406    3,503,170   30
Script            2013/04/06  10         138,149     13,815
Image             2013/03/17  108        1,523,529   14,107
Archive           2013/04/06  12         2,838,760   236,563
TOTAL             2009/01/03  2,054,575  72,132,138  35
Table 3: Statistics on types of metadata (sizes are in bytes).
5.2 Statistics on Bitcoin-based protocols
Table 5 shows some detailed statistics about protocols.
The first and second columns indicate, respectively, the
protocol type and name. We use an additional type,
called Empty, to gather the transactions which use
OP_RETURN without embedding any metadata. The third
and fourth columns show the type of metadata and the
embedding technique. The fifth column shows when the
protocol generated the first chunk of metadata; since
transactions do not carry a timestamp, for this purpose
we use the timestamp of the enclosing block. The next
two columns count the total number of elements produced
by a protocol, and the total size (in bytes) of
the embedded metadata (net of script instructions and
other transaction fields). The rightmost column shows
the average size of the metadata.
We were able to associate 53.7MB of metadata to
protocols, which is considerably less than the total amount
extracted (99MB). There are several reasons for this
difference. First, users often embed metadata not related
to any protocol; for instance, this is the case for several
images and text messages. Second, several protocols
make it impossible, for an external observer, to
recognize their chunks of metadata (unlike the protocols
in Table 4, which mark the metadata with an identifier):
indeed, we have discovered 19 protocols that embed
metadata without any identifier. Finally, our list of
protocols may be incomplete, so if some other protocols
embed metadata with OP_RETURN, we count their
items but we cannot classify them. We note a sizeable
share of Empty transactions (10% of the total
OP_RETURN transactions), which use OP_RETURN without
any data attached, so they are not associated to any
protocol. We estimate that 96% of these transactions
are related to the peaks discussed later on in Section 6.4.
The fifth column of Table 5 suggests that, originally,
the protocols were of Financial and Notary
type, while the other use cases were introduced
subsequently (indeed, the other types did not appear
before the end of 2014).
Type       Protocol            Identifiers
Financial  Colu                CC
           CoinSpark           SPK
           OpenAssets          OA
           Omni                omni
           Openchain           OC
           Helperbit           HB
           Counterparty        CNTRPRTY
Notary     Factom              Factom!!, FACTOM00, Fa, FA
           Stampery            S1, S2, S3, S4, S5, S6
           Proof of Existence  DOCPROOF
           Blocksign           BS
           CryptoCopyright     CryptoTests-, CryptoProof-
           Stampd              STAMPD##
           BitProof            BITPROOF
           ProveBit            ProveBit
           Remembr             RMBd, RMBe
           OriginalMy          ORIGMY
           LaPreuve            LaPreuve
           Nicosia             UNicDC
           SmartBit            SB.D
           Notary              Notary
DRM        Monegraph           MG
           Blockai             0x1f00
           Ascribe             ASCRIBE
Message    Eternity Wall       EW
           BitAlias            BALI
Subchain   Blockstore          id, 0x5888, 0x5808
Table 4: Protocol identifiers. Counterparty metadata
must be first deobfuscated with ARC4 encryption, using
the transaction identifier of the first unspent transaction
output as the encryption key.
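The deobfuscation step mentioned in the caption of Table 4 can be sketched with a textbook ARC4 implementation. The Python code below is a minimal illustration under stated assumptions: the helper name `deobfuscate` is hypothetical, and the key is taken to be the raw bytes of the hex-encoded transaction identifier (the exact key derivation and byte order should be checked against the Counterparty specification).

```python
def arc4(key: bytes, data: bytes) -> bytes:
    """Textbook ARC4; encryption and decryption are the same operation."""
    # Key-scheduling algorithm.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation, XORed with the data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

def deobfuscate(payload: bytes, key_txid_hex: str):
    """Return the plaintext if it carries the CNTRPRTY identifier, else None."""
    # Assumption: the key is the txid decoded from hex, without byte reversal.
    plain = arc4(bytes.fromhex(key_txid_hex), payload)
    return plain if plain.startswith(b"CNTRPRTY") else None
```

Since ARC4 is symmetric, applying `arc4` twice with the same key returns the original bytes, so the same routine can be used to build test payloads.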
From Table 5 we see that the large majority of protocols
use the OP_RETURN technique. Focussing on the
metadata embedded with this technique, Figure 2 displays
how metadata are distributed among the protocol
types, and Figure 3 shows the temporal evolution of
their usage, in terms of the number of metadata items
published per week. Comparing Table 5 with Figure 2
we see that although most protocols are Notary, their
transactions are a fraction of those produced by Financial
protocols.
Type       Protocol            Metadata           Technique             1st item    # items    Tot. size   Avg. size
Financial  Colu                Financial record   OP_RETURN             2015/07/09  244,411    4,425,702   18
           CoinSpark           Financial record   OP_RETURN             2014/07/02  28,120     960,664     34
           OpenAssets          Financial record   OP_RETURN             2014/05/03  207,132    3,255,499   16
           Omni                Financial record   OP_RETURN             2015/08/10  311,605    6,249,883   20
           Openchain           Hash               OP_RETURN             2015/10/21  2,758      115,283     42
           Helperbit           Financial record   OP_RETURN             2015/09/18  33         1,251       38
           Counterparty        Financial record   OP_RETURN             2014/06/16  636,012    22,806,810  36
                                                  P2PKH                 N/A         N/A        N/A         N/A
                                                  Multi-signature       N/A         N/A        N/A         N/A
           Total                                                        2014/06/16  1,430,071  37,815,092  26
Notary     Factom              Merkle root        OP_RETURN             2014/04/11  105,188    4,207,262   40
           Stampery            Merkle root, Hash  OP_RETURN             2015/03/09  74,887     2,648,102   35
           Proof of Existence  Hash               OP_RETURN             2014/04/21  5,464      218,513     40
           Blocksign           Hash               OP_RETURN             2014/08/04  1,477      55,676      38
           CryptoCopyright     Hash               OP_RETURN             2014/08/02  46         1,840       40
           Stampd              Hash               OP_RETURN             2015/01/03  562        22,427      40
           BitProof            Hash               OP_RETURN             2015/02/25  770        30,800      40
           ProveBit            Hash               OP_RETURN             2015/04/05  57         2,280       40
           Remembr             Hash               OP_RETURN             2015/08/25  28         1,128       40
           OriginalMy          Hash               OP_RETURN             2015/07/12  126        4,788       38
           LaPreuve            Hash               OP_RETURN             2014/12/07  68         2,663       39
           Nicosia             Hash of hashes     OP_RETURN             2014/09/12  24         840         35
           SmartBit            Merkle root        OP_RETURN             2015/11/24  8,472      304,992     36
           Notary              Hash               OP_RETURN             2017/04/11  21         798         38
           Originstamp         Hash of hashes     (Commit metadata)     2013/12/18  905        0           0
           Btproof             Hash               (Commit metadata)     N/A         N/A        N/A         N/A
           BitcoinTimestamp    Hash               Value, Multi-in/out   N/A         N/A        N/A         N/A
           Blocknotary         Merkle root        OP_RETURN             N/A         N/A        N/A         N/A
           Tangible            Hash               OP_RETURN             N/A         N/A        N/A         N/A
           Chainpoint          Merkle root        OP_RETURN             N/A         N/A        N/A         N/A
           Diploma             Hash               OP_RETURN             N/A         N/A        N/A         N/A
           Apertus             Hash               P2PKH                 N/A         N/A        N/A         N/A
           Chronobit           Hash               N/A                   N/A         N/A        N/A         N/A
           Seclytics           Hash               OP_RETURN             N/A         N/A        N/A         N/A
           Total                                                        2013/12/18  198,095    7,502,109   38
DRM        Monegraph           Copyright          OP_RETURN             2015/06/28  67,286     2,464,282   37
           Blockai             Copyright          OP_RETURN             2015/01/09  670        38,327      57
           Ascribe             Copyright          OP_RETURN             2014/12/19  48,450     1,000,561   21
           Verisart            Merkle root        N/A                   N/A         N/A        N/A         N/A
           Total                                                        2014/12/19  116,406    3,503,170   30
Message    Eternity Wall       Text               OP_RETURN             2015/06/24  4,129      177,916     43
           Cryptograffiti      Text               P2PKH, Multi-in/out   N/A         N/A        N/A         N/A
           BIT-COMM            Text               P2PKH, Multi-in/out   N/A         N/A        N/A         N/A
           Stone               Text, File         P2PKH, Multi-in/out   N/A         N/A        N/A         N/A
           Key.run             Magnet link        OP_RETURN             N/A         N/A        N/A         N/A
           BitAlias            Secret, Hash       OP_RETURN                         0          0           0
           Total                                                        2015/06/24  4,129      177,916     43
Subchain   Keybase             Merkle root        OP_RETURN             N/A         N/A        N/A         N/A
           Uniquebits          PGP signed hash    P2PKH, P2SH           N/A         N/A        N/A         N/A
           Blockstore          Key-Value          OP_RETURN             2014/12/10  209,422    6,068,584   29
           Catena [53]         Text               OP_RETURN, Tx chains  N/A         N/A        N/A         N/A
           Total                                                        2014/12/10  209,422    6,068,584   29
Empty      Total                                  OP_RETURN             2014/03/20  296,396    0           0
TOTAL                                                                   2009/01/03  2,254,519  55,066,871  24
Table 5: Statistics on Bitcoin-based protocols (sizes are in bytes).
6 Discussion
In this section we discuss the impact of metadata on the
Bitcoin blockchain. We start by describing the historical
evolution of metadata, highlighting how the adoption
of the embedding techniques has varied over the years.
We then evaluate the memory and storage consumption
due to metadata, and we discuss the phenomenon of
transaction peaks.
6.1 Historical perspective
The first piece of metadata was embedded in the genesis
block by Satoshi Nakamoto, through the Coinbase
technique; then, since October 2011, this technique has
been used regularly by miners. In the first 3 years of
Bitcoin, the most used technique for embedding data was
P2PKH. Later, many protocols (e.g., Counterparty) migrated
from the P2PKH to the OP_RETURN technique,
and we rarely find protocols still using P2PKH.
P2PKH is now used for embedding large files with the
support of the Multi-in/out, Multi-signature, and Tx
chains techniques. Despite its similarity with P2PKH,
the P2PK technique is less used, since P2PK scripts are
considered obsolete [10]. The Input Sequence and the
Value techniques are not widely adopted either, probably
because of the limited space they offer with respect to
other techniques. The P2SH technique is not widely
used either, although there are some proposals for adopting it
for Counterparty⁹.

Fig. 2: Metadata transactions by protocol type. (Pie chart: Financial 63.4%, Empty 13.1%, Subchain 9.3%, Notary 8.8%, DRM 5.2%, Message 1%.)

Fig. 3: Temporal evolution of metadata (transactions before April 2015 are negligible). (Line chart: number of transactions per week, in units of 10^4, from 04.2015 to 06.2017, with one series each for Financial, Notary, DRM, Message and Subchain.)
Although OP_RETURN has been part of the scripting
language since the first releases of Bitcoin, originally it
was considered non-standard, so transactions containing
this opcode were not reliably mined. OP_RETURN became
standard with Bitcoin Core 0.9.0 [8], but still the
release notes state that: “This change is not an endorsement
of storing data in the blockchain. The OP_RETURN
change creates a provably-prunable output, to avoid data
storage schemes [...] that were storing arbitrary data
such as images as forever-unspendable TX outputs, bloating
bitcoin’s UTXO database”. The limit for storing
data with OP_RETURN was originally planned to be 80
bytes, but the first official client supporting the opcode,
i.e. release 0.9.0, allowed only 40 bytes. This
sparked a long debate [4,5,13,15]. From release
0.10.0 [6] nodes could choose whether or not to accept
OP_RETURN transactions, and set a maximum for their
size. The maximum size was then set to 80 bytes by
release 0.12.0 [7]. From Table 5 we see that the majority
of the applications built on top of Bitcoin embed metadata
through the OP_RETURN technique; this is consistent
with the data in Table 2, from which we see that 63%
of the metadata items in the blockchain have been embedded
with the OP_RETURN technique (which is the most
adopted one since March 2014). In the last period of
our experiments, 40,000 new OP_RETURN transactions
are published each week. Overall, OP_RETURN transactions
amount to 1.18% of the total number of transactions
(1.37% when considering the portion of the
blockchain from 2014/03/12, when the first OP_RETURN
transaction appeared)¹⁰.

⁹ See counterpartytalk.org/t/cip-proposal-p2sh-data-encoding/2169.
¹⁰ Despite the 5 years of delay, this percentage is quite close
to that for the whole blockchain: this is because the number
of daily transactions has largely increased since July 2014.
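To make the size limit concrete, the following Python sketch builds and parses an OP_RETURN output script. The opcode values (0x6a for OP_RETURN, 0x4c for OP_PUSHDATA1) and the direct-push rule for payloads of up to 75 bytes are standard Bitcoin Script; the function names and the choice of treating the 80-byte limit as a bound on the payload are assumptions of this sketch, which handles a single push only.

```python
OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C
MAX_PAYLOAD = 80  # limit discussed in the text, applied here to the payload

def op_return_script(payload: bytes) -> bytes:
    """Build an OP_RETURN script carrying a single data push."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds the OP_RETURN limit")
    if not payload:
        return bytes([OP_RETURN])  # an Empty transaction: no data attached
    if len(payload) <= 75:
        push = bytes([len(payload)])                # direct push opcode
    else:
        push = bytes([OP_PUSHDATA1, len(payload)])  # one-byte length follows
    return bytes([OP_RETURN]) + push + payload

def parse_op_return(script: bytes):
    """Return the pushed payload, or None if this is not such a script."""
    if not script or script[0] != OP_RETURN:
        return None
    if len(script) == 1:
        return b""
    n = script[1]
    if n <= 75:
        body = script[2:2 + n]
    elif n == OP_PUSHDATA1 and len(script) >= 3:
        n = script[2]
        body = script[3:3 + n]
    else:
        return None
    return body if len(body) == n else None
```

A 40-byte payload therefore costs 42 script bytes, and an 80-byte payload 83, since the longer push needs the extra OP_PUSHDATA1 byte.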
6.2 UTXO bloating
As remarked in Section 3, the embedding techniques
P2PK, P2PKH, and P2SH (which are often used to
embed media files) produce unspendable outputs. In
this way they contribute to the “UTXO bloating” effect,
which deteriorates the performance of Bitcoin nodes.
In the UTXO set we have counted 68K unspendable
outputs which are used to embed chunks of metadata.
Of all the transactions which embed metadata,
only 1.49% contribute to the UTXO bloating effect.
The other embedding techniques, OP_RETURN among
them, do not bloat the UTXO set (even though they
still affect the total size of the blockchain). This, together
with the possibility of embedding up to 80 bytes
of metadata, is perhaps the reason for the popularity
of the OP_RETURN technique. Indeed, we see from Table 5
that OP_RETURN is used by the large majority of
Bitcoin-based protocols. Note that the other techniques
which avoid the UTXO bloating effect are not suitable
to be used by protocols, either because they have a low
bandwidth (Coinbase), or because they do not allow
embedding enough bytes (from Table 5 we see that, on
average, protocols need to embed 24 bytes of metadata).
6.3 Space consumption
A debated topic in the Bitcoin community is whether
it is acceptable or not to save arbitrary data in the
blockchain. From Table 5 we can see that the net size
of metadata is 99MB. In the same period of observation,
the size of the whole blockchain is 125GB, so the size
of metadata amounts to 0.077% of the total size of
transactions.
For the most widespread embedding method,
OP_RETURN, Figure 4a shows the average length of the
metadata for each week. Generally, the average length
of metadata is less than 40 bytes, despite the extension
to 80 bytes introduced on 2015/07/12. The downward peaks
in the same period are related to the Empty transactions,
discussed later on in Section 6.4. Figure 4b represents
the number of OP_RETURN transactions with a given data
length: this chart also confirms that only a small number
of transactions use more than half of the available
space. Note that the discussed peak also appears in this
chart, at the value 0. From the last
column of Table 5 we see that even the protocol which
embeds the largest number of bytes (Blockai, with 57
bytes on average) requires much less than the 80 bytes
available with OP_RETURN. Several Notary protocols
take 40 bytes on average: 16 bytes for their identifiers,
and the remaining bytes for the hash they save. Generally,
Notary protocols carry longer metadata than the
other protocols.
We now estimate the overall size of OP_RETURN transactions
(including both the metadata and the other
parts of the transaction). The size of an Empty transaction
with one input and one output is 156 bytes.
From Table 2 we see that OP_RETURN transactions carry
26 bytes of metadata, on average. We then approximate
the average size of an OP_RETURN transaction as 182
bytes. Multiplying by the number of OP_RETURN transactions,
we obtain an approximation of their space consumption
as 503MB.
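The estimate can be reproduced with a few lines of Python, using only the figures quoted above and in Table 2 (the variable names are illustrative). The result matches the reported value when one megabyte is taken as 2^20 bytes.

```python
EMPTY_TX_SIZE = 156            # bytes, one input and one output
AVG_METADATA = 26              # bytes, average OP_RETURN payload (Table 2)
OP_RETURN_ITEMS = 2_903_186    # number of OP_RETURN items (Table 2)

avg_tx_size = EMPTY_TX_SIZE + AVG_METADATA   # 182 bytes
total_bytes = avg_tx_size * OP_RETURN_ITEMS  # 528,379,852 bytes
total_mib = total_bytes / 2**20              # about 503.9
```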
6.4 Transaction peaks
Figure 5 represents peaks of OP_RETURN transactions
from 2014/03 (date of the first OP_RETURN transaction)
to 2017/08. For each week, it shows (i) the number
of Empty transactions, (ii) the number of OP_RETURN
transactions which are not attributed to any protocol
(among those in our collection), and (iii) the total number
of OP_RETURN transactions. In the graph we note
several peaks, which we explain as follows:
1. 100K transactions from 2015/07/08 to 2015/08/05.
This peak is mainly composed of two different peaks
of Empty transactions: the July peak (37K transactions
from 2015/07/08 to 2015/07/10) and the
August peak (29K transactions from 2015/08/01
to 2015/08/03). Both peaks coincided
with stress tests and spam campaigns [32]¹¹.
2. 300K transactions from 2015/09/09 to 2015/09/23.
This second peak is the highest and longest-lasting
one. As before, it is mainly caused by Empty transactions
(223K), although here we also observe a
component of Unclassified and Blockstore transactions
(35K each). The work [32] also detects a spike
in this period, precisely around 2015/09/13,
when an anonymous group performed a stress test
on the network with a money drop. This involves a
public release of private keys, with the aim of causing
a big race and a consequently large number of
double-spend transactions. More specifically, people used
the private keys to transfer to themselves the bitcoins
redeemable with these keys; since many people
tried to perform these transfers simultaneously, the
network was flooded with many transactions trying
to double-spend the same outputs. The confirmed
transactions caused a peak, which happened simultaneously
with the peak of OP_RETURN transactions we measured.
3. 50K transactions from 2016/03/02 to 2016/03/09.
This last peak is given by the sum of two different
peaks: Unclassified (18K) and Stampery (23K)
transactions. The part of the peak caused by Stampery
can be explained as follows. Being a notarization
protocol, Stampery receives document hashes
off-chain, and subsequently it embeds these hashes
in transactions. Since Stampery has only a few transactions
before 2016/03/02 (probably used for testing),
we conjecture that the peak coincides with
its bootstrap, when the protocol publishes on the
blockchain all the transactions related to the hashes
accumulated off-chain. The other part of the peak
could be due to the bootstrap of other protocols.

¹¹ We conjecture that Empty transactions are caused by
these events. To verify this conjecture we would need to compare
the transaction identifiers of our Empty transactions
with the identifiers of [32], which are not publicly available.

Fig. 4: Usage and size of OP_RETURN transactions. (a) Size of metadata over time (average number of bytes per time interval, from 03.2014 to 06.2017). (b) Number of transactions by size of metadata (number of transactions, in units of 10^5, for each number of bytes from 0 to 80).

Fig. 5: Transactions (and transactions peaks) over time. (Number of transactions per week, in units of 10^5, from 03.2014 to 06.2017, with series Empty, Unclassified and All.)
Besides the peaks of OP RETURN transactions, we can
also observe other peaks: for instance, for a duration of
100 blocks starting from 2015/05/22, Bitcoin was tar-
geted by a stress test [2], during which the network was
flooded with a large number of transactions. However,
the usage of OP_RETURN transactions in this period does
not seem to deviate from their normal usage.
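The peaks discussed above could be flagged automatically in a series of weekly transaction counts. The following Python sketch shows one simple way to do so. The threshold rule (a multiple of the median), the function name find_peaks, and the sample counts are assumptions made for illustration; they are not the method or the data used in this study.

```python
from statistics import median


def find_peaks(weekly_counts, factor=3.0):
    """Return the (week, count) pairs whose count exceeds factor * median.

    weekly_counts: list of (week_label, number_of_transactions) pairs.
    factor: hypothetical threshold multiplier, chosen for illustration.
    """
    baseline = median(count for _, count in weekly_counts)
    return [(week, count) for week, count in weekly_counts
            if count > factor * baseline]


# Made-up weekly counts, for illustration only (not data from the study).
sample = [
    ("2016/02/10", 9_000),
    ("2016/02/17", 10_000),
    ("2016/02/24", 11_000),
    ("2016/03/02", 50_000),
    ("2016/03/09", 12_000),
]

peaks = find_peaks(sample)
print(peaks)
```

Using the median as a baseline keeps a single large week from raising the threshold, which a mean would do; other baselines (e.g. a moving average) would be equally plausible.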
7 Related works
There is a growing literature on the analysis of the Bit-
coin blockchain [32,45,47,49,50], and also some on-
line services which perform statistics on Bitcoin meta-
data [3,17,20,24]. Below, we group the related works
into three categories.
The first category includes online services related
to Bitcoin metadata. The website opreturn.org shows
some statistics about OP_RETURN transactions, organised
by protocol, and statistics about their usage in a
certain time frame. The website smartbit.com recognises
some OP_RETURN protocols and shows statistics on
them. Finally, the website kaiko.com sells data about
OP_RETURN transactions.
The second category contains the works on embedding
techniques. To the best of our knowledge, besides
our work, this category includes only [26,46], which
have been developed concurrently and independently
from ours. Despite the common goals, the works [26,46]
differ from ours in several aspects: (i) the “Tx chains”
methods and the techniques for committing metadata
are described only in our work; (ii) only our work and
[46] extract and quantify the embedded metadata; (iii)
the P2SH techniques are detailed in [26]. Further differ-
ences between our work and [46] are discussed below.
The third category includes the works which analyse
the types of metadata, such as those in Section 4. Also
in this case, the work [46] is the closest to ours: the
main difference between the two works is that, while
[46] is focussed on discussing the benefits and risks re-
lated to metadata (e.g. privacy violations, illegal con-
tents), we develop a protocol-wise analysis, measuring
how much (and when) metadata is embedded by each
protocol, and studying which use cases they support.
Further, we recognize a few types of metadata (hash,
financial records, and copyright) which are not dealt
with by [46].
8 Conclusions
Although Bitcoin does not explicitly support embedding
metadata into transactions, over the years users
have devised various techniques to reach this goal. After
illustrating and comparing these techniques, we have
extracted all the metadata embedded up to 2017/08/10
(first 480,000 blocks), measuring the data stored by
each technique. By processing the bytes extracted from
transaction metadata, we have often managed to reconstruct
the original content. Overall, we have reconstructed
69MB of documents, out of the 99MB embedded in total.
We have classified these documents, finding that the
majority of them are records produced by financial
protocols. We have also reconstructed 120 files of
various kinds (among which 108 images), for a total
size of 4MB.
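As an illustration of the extraction step for the most common technique, the following Python sketch reads the data pushed after the OP_RETURN opcode in an output script. It is a simplified sketch, not the tool used in this study: the function name extract_op_return_payload is hypothetical, the script is assumed to be given as a hex string, and only the standard data-push opcodes are handled.

```python
from typing import Optional

OP_RETURN = 0x6A
# Opcodes whose following 1, 2 or 4 bytes (little-endian) give the push size.
PUSHDATA_WIDTH = {0x4C: 1, 0x4D: 2, 0x4E: 4}


def extract_op_return_payload(script_hex: str) -> Optional[bytes]:
    """Return the bytes pushed after OP_RETURN, or None if the script
    does not start with OP_RETURN (hypothetical helper, for illustration)."""
    script = bytes.fromhex(script_hex)
    if not script or script[0] != OP_RETURN:
        return None
    payload = b""
    i = 1
    while i < len(script):
        opcode = script[i]
        i += 1
        if 0x01 <= opcode <= 0x4B:
            size = opcode  # direct push: the opcode itself is the length
        elif opcode in PUSHDATA_WIDTH:
            width = PUSHDATA_WIDTH[opcode]
            size = int.from_bytes(script[i:i + width], "little")
            i += width
        else:
            break  # stop at the first opcode that is not a data push
        payload += script[i:i + size]
        i += size
    return payload


# 0x6a = OP_RETURN, 0x0b = push 11 bytes, followed by ASCII "hello world".
print(extract_op_return_payload("6a0b68656c6c6f20776f726c64"))
```

Truncated scripts are tolerated rather than rejected: slicing past the end simply yields fewer bytes. A stricter parser might instead report such scripts as malformed.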
We have discovered 45 protocols which embed meta-
data into the blockchain for developing various appli-
cations. We have identified which types of metadata
they produce, and which embedding techniques they
use. Usually, each protocol produces one type of metadata
(depending on the protocol type), using one embedding
technique (most often, OP_RETURN). Overall,
53.7MB of metadata are produced by the protocols in
our collection. The majority of protocols are for docu-
ment notarization, but 70% of elements are produced
by financial protocols.
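A per-protocol tally of this kind can be sketched in Python as follows, under the assumption that each protocol can be recognised by a fixed byte prefix at the start of its metadata. The prefixes and protocol names below are hypothetical placeholders, not the identifiers of any real protocol; the labels Empty and Unclassified mirror the categories used in Fig. 5.

```python
from collections import Counter

# Hypothetical identifiers, for illustration only; real protocols
# use their own markers, which are not reproduced here.
PREFIXES = {
    b"PROTO_A": "ProtocolA",
    b"PB": "ProtocolB",
}


def classify(payload: bytes) -> str:
    """Label a metadata payload by its (assumed) protocol prefix."""
    if not payload:
        return "Empty"
    for prefix, name in PREFIXES.items():
        if payload.startswith(prefix):
            return name
    return "Unclassified"


def transactions_per_protocol(payloads):
    """Count how many payloads fall under each label."""
    return Counter(classify(p) for p in payloads)


# Made-up payloads, for illustration only.
payloads = [b"PROTO_A\x01", b"PBxyz", b"", b"hello"]
print(transactions_per_protocol(payloads))
```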
Finally, we have discussed the impact of embedding
metadata in the blockchain, considering various
aspects, such as the space consumption, the UTXO
bloating effect, and the transaction peaks.
Although the official Bitcoin documentation discour-
ages the use of the blockchain to store arbitrary data,
the trend seems to be a growth in the number of ap-
plications that embed their metadata in Bitcoin trans-
actions. We conjecture that the perceived sense of se-
curity and persistence of the Bitcoin blockchain is the
main motivation to avoid using cheaper and more efficient
storage. If this trend is confirmed, the specific
needs of these applications could affect the future
evolution of the Bitcoin protocol.
Acknowledgements We thank Nicola Atzei for the insight-
ful discussion on a preliminary version of this paper. This
work is partially supported by Aut. Reg. of Sardinia projects
“Sardcoin” and “Smart collaborative engineering”, and by
COST Action IC1406 cHiPSET.
References
1. Bitcoin scalability, https://en.bitcoin.it/wiki/Scalability_FAQ.
Last accessed 2018/01/01
2. Bitcoin network survives surprise stress test,
http://www.coindesk.com/bitcoin-network-survives-stress-test/.
Last accessed 2018/01/01
3. Bitcoin OP_RETURN wiki page, https://en.bitcoin.it/wiki/OP_RETURN.
Last accessed 2018/01/01
4. Bitcoin pull request 5075, https://github.com/
bitcoin/bitcoin/pull/5075. Last accessed 2018/01/01
5. Bitcoin pull request 5286, https://github.com/
bitcoin/bitcoin/pull/5286. Last accessed 2018/01/01
6. Bitcoin release 0.10.0, https://bitcoin.org/en/
release/v0.10.0. Last accessed 2018/01/01
7. Bitcoin release 0.12.0, https://bitcoin.org/en/
release/v0.12.0. Last accessed 2018/01/01
8. Bitcoin release 0.9.0, https://bitcoin.org/en/release/
v0.9.0. Last accessed 2018/01/01
9. Bitcoin script interpreter, https:
//github.com/bitcoin/bitcoin/blob/
fcf646c9b08e7f846d6c99314f937ace50809d7a/src/
script/interpreter.cpp. Last accessed 2018/01/01
10. Bitcoin wiki script, https://en.bitcoin.it/wiki/
Script. Last accessed 2018/01/01
11. Bitcoin wiki transaction, https://en.bitcoin.it/wiki/
Transaction. Last accessed 2018/01/01
12. Colu website, https://www.colu.com/. Last accessed
2018/01/01
13. Counterparty open letter and plea to the Bitcoin core
development team,
http://counterparty.io/news/an-open-letter-and-plea-to-the-bitcoin-core-development-team/.
Last accessed 2018/01/01
14. Counterparty website, http://counterparty.io/. Last
accessed 2018/01/01
15. Developers battle over bitcoin block chain,
http://www.coindesk.com/developers-battle-bitcoin-block-chain/.
Last accessed 2018/01/01
16. Factom website, https://www.factom.com/. Last ac-
cessed 2018/01/01
17. Kaiko data store, https://www.kaiko.com/. Last ac-
cessed 2018/01/01
18. Omni website, http://www.omnilayer.org/. Last ac-
cessed 2018/01/01
19. Open assets website, https://github.com/OpenAssets/.
Last accessed 2018/01/01
20. opreturn.org, http://opreturn.org/. Last accessed
2018/01/01
21. Proof of existence website, https://proofofexistence.
com/. Last accessed 2018/01/01
22. Scalability debate ever end,
https://www.cryptocoinsnews.com/will-bitcoin-scalability-debate-ever-end/.
Last accessed 2018/01/01
23. Scaling debate in Reddit,
http://www.coindesk.com/viabtc-ceo-sparks-bitcoin-scaling-debate-reddit-ama/.
Last accessed 2018/01/01
24. Smartbit OP_RETURN statistics, https://www.smartbit.com.au/op-returns.
Last accessed 2018/01/01
25. Stampery blockchain timestamping architecture,
https://s3.amazonaws.com/stampery-cdn/docs/Stampery-BTA-v6-whitepaper.pdf.
Last accessed 2018/01/01
26. Data insertion in Bitcoin’s blockchain (2017),
http://digitalcommons.augustana.edu/cgi/
viewcontent.cgi?article=1000&context=cscfaculty.
Last accessed 2018/01/01
27. Andresen, G.: Block v2, height in coinbase, BIP
034, https://github.com/bitcoin/bips/blob/master/
bip-0034.mediawiki. Last accessed 2018/01/01
28. Antonopoulos, A.M.: Mastering Bitcoin: unlocking digi-
tal cryptocurrencies. O’Reilly Media, Inc. (2014)
29. Atzei, N., Bartoletti, M., Cimoli, T., Lande, S., Zunino,
R.: SoK: unraveling Bitcoin smart contracts. In: Princi-
ples of Security and Trust (POST). LNCS, vol. 10804,
pp. 217–242. Springer (2018)
30. Atzei, N., Bartoletti, M., Lande, S., Zunino, R.: A formal
model of Bitcoin transactions. In: Financial Cryptogra-
phy and Data Security (2018)
31. Badertscher, C., Maurer, U., Tschudi, D., Zikas, V.: Bit-
coin as a transaction ledger: A composable treatment.
In: CRYPTO. LNCS, vol. 10401, pp. 324–356. Springer
(2017)
32. Baqer, K., Huang, D.Y., McCoy, D., Weaver, N.: Stress-
ing out: Bitcoin “stress testing”. In: Financial Cryptog-
raphy Workshops. LNCS, vol. 9604, pp. 3–18. Springer
(2016)
33. Bartoletti, M., Bracciali, A., Lande, S., Pompianu, L.: A
general framework for blockchain analytics. In: Proc. 1st
Workshop on Scalable and Resilient Infrastructures for
Distributed Ledgers (SERIAL@Middleware). pp. 7:1–7:6.
ACM (2017), https://github.com/bitbart/blockapi
34. Bartoletti, M., Pompianu, L.: An analysis of Bitcoin
OP_RETURN metadata. In: Financial Cryptography
Workshops. LNCS, vol. 10323, pp. 218–230. Springer
(2017)
35. Bartoletti, M., Pompianu, L., Bellomy, B.: Bitcoin meta-
data (2018), https://doi.org/10.7910/DVN/MOLW81
36. Bartoletti, M., Zunino, R.: BitML: a calculus for Bitcoin
smart contracts. In: ACM CCS (2018)
37. Bellomy, B.: Binary grep,
https://github.com/spooktheducks/binary-grep
38. Bellomy, B.: Local blockchain parser,
https://github.com/spooktheducks/local-blockchain-parser
39. Bistarelli, S., Mercanti, I., Santini, F.: An analysis of non-
standard Bitcoin transactions. In: Crypto Valley Confer-
ence on Blockchain Technology (2018)
40. Bonneau, J., Miller, A., Clark, J., Narayanan, A., Kroll,
J.A., Felten, E.W.: SoK: Research perspectives and chal-
lenges for Bitcoin and cryptocurrencies. In: IEEE Symp.
on Security and Privacy. pp. 104–121 (2015)
41. devttys0: binwalk, https://github.com/devttys0/
binwalk
42. Garay, J.A., Kiayias, A., Leonardos, N.: The Bitcoin
backbone protocol: Analysis and applications. In: EURO-
CRYPT. LNCS, vol. 9057, pp. 281–310. Springer (2015)
43. Gerhardt, I., Hanke, T.: Homomorphic payment ad-
dresses and the pay-to-contract protocol. arXiv preprint
arXiv:1212.3257 (2012)
44. Kosba, A.E., Miller, A., Shi, E., Wen, Z., Papamanthou,
C.: Hawk: The blockchain model of cryptography and
privacy-preserving smart contracts. In: IEEE Symp. on
Security and Privacy. pp. 839–858 (2016)
45. Lischke, M., Fabian, B.: Analyzing the Bitcoin network:
The first four years. Future Internet 8(1), 7 (2016)
46. Matzutt, R., Hiller, J., Henze, M., Ziegeldorf, J.H.,
Müllmann, D., Hohlfeld, O., Wehrle, K.: A quantitative
analysis of the impact of arbitrary blockchain content on
Bitcoin. In: Financial Cryptography and Data Security
(2018)
47. Möser, M., Böhme, R.: Trends, tips, tolls: A longitudinal
study of Bitcoin transaction fees. In: Financial Cryptog-
raphy and Data Security. LNCS, vol. 8976, pp. 19–33.
Springer (2015)
48. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash sys-
tem. https://bitcoin.org/bitcoin.pdf (2008)
49. Reid, F., Harrigan, M.: An analysis of anonymity in the
Bitcoin system. In: Security and privacy in social net-
works, pp. 197–223. Springer (2013)
50. Ron, D., Shamir, A.: Quantitative analysis of the full
Bitcoin transaction graph. In: Financial Cryptography
and Data Security. LNCS, vol. 7859, pp. 6–24. Springer
(2013)
51. Stevens, M., Bursztein, E., Karpman, P., Albertini,
A., Markov, Y.: The first collision for full SHA-1. In:
CRYPTO. LNCS, vol. 10401, pp. 570–596. Springer
(2017)
52. Todd, P.: Delayed TXO commitments,
https://petertodd.org/2016/delayed-txo-commitments.
Last accessed 2018/01/01
53. Tomescu, A., Devadas, S.: Catena: Efficient non-
equivocation via Bitcoin. In: IEEE Symp. on Security
and Privacy. pp. 393–409 (2017)