Data Provenance in Vehicle Data Chains
BMW Technology Office Israel Ltd
Dr. Carsten Stoecker, Founder and CEO, Spherity GmbH
Dr. Juan Caballero, Research, Spherity GmbH / Decentralized Identity Foundation
Abstract—With almost every new vehicle being connected, the
importance of vehicle data is growing rapidly. Many mobility
applications rely on the fusion of data coming from heterogeneous
data sources, like vehicle and "smart-city" data or process data
generated by systems out of their control. This external data
determines much about the behaviour of the relying applications:
it impacts the reliability, security and overall quality of the
application’s input data and ultimately of the application itself.
Hence, knowledge about the provenance of that data is a critical
component in any data-driven system. The secure traceability of
the data handling along the entire processing chain, which passes
through various distinct systems, is critical for the detection
and avoidance of misuse and manipulation. In this paper, we
introduce a mechanism for establishing secure data provenance
in real time, demonstrating an exemplary use-case based on a
machine learning model that detects dangerous driving situations.
We show with our approach based on W3C decentralized identity
standards that data provenance in closed data systems can be
effectively achieved using technical standards designed for an
open data approach.
I. INTRODUCTION
Driven by technological innovation and organic ecosystem growth, mobility value chains are significantly changing from monolithic and closed systems to distributed, open ones. Data flows are increasingly defined dynamically and stretch across multiple organizational boundaries and even legal jurisdictions with diverse rules and regulations. The trustworthiness and accuracy of output data (such as that of in-car sensors) generated along distributed digital mobility value chains is of increasing importance for safety and reliability; this importance can only grow as Machine Learning (ML) systems become more central to mobility systems.
The growing use of data-producing and/or Internet of Things
(IoT) devices across every industry – including logistics,
manufacturing, and mobility – is ushering in an era of data
abundance. The trend towards processing open data in the mobility ecosystem in particular makes urgent the question of how to ensure the validity of the source data, as well as the quality and accuracy of the output data. Furthermore, retroactively demoting or forgetting data from sources proven to be unreliable remains an elusive capability.
Mobility systems have a very low tolerance for fraud and abuse, as fraudulent data in safety-critical features can have immediate real-world impact. Upcoming regulation around automotive cyber-security underlines this.
VTC'2021-Spring, April 25-28, 2021, Helsinki, Finland.
Besides these security considerations, ensuring the privacy of individuals when processing their data is also crucial.
Understanding the data ﬂow and providing proof that it is
treated carefully and ethically is essential. Another important
aspect is the quality assurance of the data: if customers have
the option to forward on data to third party applications, how
can they forward along metadata and trust ratings necessary for
safety and benchmarking in those new contexts? We believe that the required quality can only be achieved through full transparency and a deeper understanding of the data and the way it is processed, so that flaws in the data pipeline can be detected and mitigated. Businesses must know the origin and risks of data from different sources before using them, and they need highly automated and verifiable instruments for assessing the provenance of data vital for ML applications.
This paper is organized as follows: the remainder of section I further details the requirements for data provenance in an automotive context. Section II discusses related work across the automotive, ML, and data-processing spheres. We introduce our approach in section III and the details and results of our implementation in section IV. Finally, we provide our conclusion in section V.
Fig. 1. Architecture overview: data ﬂow from the Data Producer to the Data
Consumer and the provenance ﬂow in the opposite direction.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective works, for resale or redistribution to
servers or lists, or reuse of any copyrighted component of this work in other works.
Data provenance in automotive
An effective data provenance solution has to be applicable to
multiple layers of a typical data processing ﬂow in automotive
contexts. A typical high-level architecture of such a ﬂow is
shown in Figure 1. The data processing ﬂow starts with a
Data Producer, typically an Electronic Control Unit (ECU),
which transports a signal over a system bus. The signal is received and "edge-processed" within the vehicle by a Data Collector. The first pre-processing of the signal (e.g., adding meta-information, performing pre-calculations) is done here, before the signal is transported to a Digital Representation, which defines the incoming data of the physical vehicle through a digital model and offers it to applications as a digital shadow or even as a digital twin of the vehicle.
From there, we assume an optional data fusion with External Data Sources (e.g., weather, traffic information, etc.). This data is then consumed by Data Processors, which in turn create new data of interest for an external Data Consumer. Data Processors perform at least one data transformation; in the case of ML, at least two transformations take place: pre-processing and applying the actual model.
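The flow just described can be sketched end to end. The following fragment is purely illustrative: every function, field, and value is our own invention for exposition, not part of any vehicle platform API.

```python
# Sketch of the Figure 1 data flow: Producer -> Collector ->
# Digital Representation -> (fusion) -> Processor -> Consumer.

def data_producer():
    # e.g., an ECU emitting a raw signal frame onto the system bus
    return {"signal": "lateral_acceleration", "value": 0.42}

def data_collector(frame):
    # in-vehicle edge processing: attach meta-information, pre-calculate
    return dict(frame, unit="m/s^2", collected=True)

def digital_representation(frame):
    # map the frame into the vehicle's digital model (shadow / twin)
    return {"vehicle_model_signal": frame}

def fuse_external(shadow, external):
    # optional fusion with external sources (weather, traffic, ...)
    return dict(shadow, external=external)

def data_processor(fused):
    # at least one transformation; for ML: pre-processing + inference
    return {"label": "dangerous_left_turn", "source": fused}

frame = data_collector(data_producer())
shadow = digital_representation(frame)
fused = fuse_external(shadow, {"weather": "rain"})
result = data_processor(fused)   # handed to the Data Consumer
```

Each stage is a candidate point for attaching provenance metadata, which is what the mechanism proposed below does.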
In this paper, we propose and detail one mechanism for
establishing data provenance in real-time along a data chain
that sources driving event data and processes it into an
ML label. When the data provenance of a given dangerous-
driving machine-learning label is known, a scoring model
can be applied to it that calculates the risks of consuming
this data label for system control or responsible decision-making. This example was chosen for clarity and simplicity; the approach is nevertheless applicable much more widely, to any digital data chain. We feel that, with time, machine-learning labels will
beneﬁt from some degree of rating and scoring to be used
safely, legally, and/or responsibly, balancing privacy and se-
curity requirements appropriately. We support standardization
on decentralized identity primitives that make these kinds of
rating and scoring systems more portable and interoperable,
particularly in architectures where cryptographic agility can
be incorporated to maximize forward compatibility.
II. RELATED WORK
At least since ML went mainstream, the provenance issues of big data have been almost universally acknowledged, although it is less a solvable problem than a definitional impasse. How can one standardize the capture of data from improvised sources, which will be structured post facto by the ML process? Some would say that the writing has been on the wall since at least 2009, when the authors of "Provenance: a Future History" pointed out that the problem would undermine ML research until, in retrospect, it would look like a glaring problem overlooked all along. Our intention in this section
is to position our methods in a wide ﬁeld of complementary
approaches which all support and further the broader aim of
improving the monitoring, refinement, and reputation capabilities of ML pipelines.
Decentralized identity and decentralized PKI offer scaffolding and anchor points for accountable and traceable provenance. At the same time, a significant amount of standardization of data exchange, and the integration of flexible and reflexive data capture, is crucial. The complexity of the semantic and data-capture work needed to take full advantage of that scaffolding is not to be understated. In fact, pride of place is given in the aforementioned 2009 provenance manifesto to semantics and expressive data, which proved influential on the development of many big-data-oriented initiatives within the W3C and the semantic-web community. The system built on top of ML and PROV schemata elaborated by Souza et al. in 2019 is a good example of how such cross-silo semantic capture can be tracked and accounted for in an ML context.
How to quantify and benchmark risk scoring is never a simple matter in ML, and some recent work has addressed the quantification of provenance scoring in contexts analogous to those described here. Barclay et al. (2019), for example, specifically address shifting ethical standards and transparency vis-à-vis regulatory and ethical scrutiny. It is particularly important to recognize that provenance must always be linked to dynamic rather than static valuations and rubrics, as regulatory and liability frameworks will likely take decades to stabilize internationally.
Other related work seeks to incorporate risk scoring throughout the training process, expanding the scope of the preceding efforts. Adjusting the ML training methodology to be more reflexive on the basis of such scoring has been a major focus of efforts at Amazon's ML design division, and will presumably be a feature of production-grade ML in the future, if not a core feature of off-the-shelf product offerings for ML training and the lifecycle management thereof.
The contribution of this paper is an architecture proposal that achieves data transparency and data provenance by enriching the meta-information of the data itself, right where the creation or transformation of the data takes place. We focus in this paper on the automotive sector,
but we believe that this contribution can have an impact on
achieving reliable risk scoring of ML algorithms in general.
III. PROPOSED METHODS
In order to realize data provenance in an automotive data processing chain, the methods we propose in this paper consist of two interdependent elements:
1) Identiﬁcation of entities which create data or perform
data transformations in the data processing chain.
2) Introduction of encrypted data structures for representing
Distributed Automotive Data (DAD).
To render the data provenance meaningful, it is crucial to know how the data has been transformed at each step from the Data Producer to the Data Consumer. In particular, it has to be transparent which exact digital entity or algorithm carried out each transformation on the data, and when. The transformation can take place in one or many centralized or decentralized mobility systems, so the identification of all the entities is greatly simplified if the identities are maximally portable and not administered within their respective closed systems. Therefore, we propose to adopt the decentralized identifier (DID) standard as an open, interoperable addressing scheme and to establish mechanisms for resolving DIDs across multiple centralized and/or decentralized mobility systems. Even if these identifier resolution schemes are not used for discovery or communications within a given system, they are invaluable for reconstructing or tracing data trails that cross many systems,
whether in real-time or forensically. Figure 2 shows which
identities in the data processing ﬂow introduced in Figure 1
get provisioned with DIDs in our architecture, and thus get
cryptographic data-signing capabilities.
Fig. 2. Architecture with signing identities marked. Data provenance ﬂow
on the left with scoring on top, realized by the provenance of the data the
algorithm relies on.
DIDs were originally designed to function as identiﬁers
for individual people, but can readily be extended to any
entity or resource. They are derived from public/private key
pairs, registered in an immutable registry for discovery pur-
poses. Spherity and other companies pioneering the decentral-
ized identity technology sector use innovative cryptographic
solutions for secure key management such as private key
sharding (multi-party computation), on-device/secure-enclave
biometrics, and HSMs to make signatures and data trails more
secure and non-repudiable. The field is fast-moving, and significant progress is being made in expanding the options for high-security use-cases and in building forward-secure cryptographic agility into systems to accommodate these new capabilities.
Fig. 3. Data structure as chained transformation. Each DAD includes the DID of the transforming entity and points to the previous DAD(s), which were used during the transformation.
Each domain or namespace for DIDs corresponds to a method of encoding and decoding, making DIDs resolvable, like domain names, relative to a method-specific but interoperable resolution infrastructure. For this project, Spherity registered its DIDs on the Ethereum blockchain via the standard W3C DID method ethr. In this method, the public key of any valid Ethereum keypair (i.e., an "ethereum address") can be used as the identifier string within the namespace defined by the ethr prefix. Thus, our DIDs take the form did:ethr:0x<ethereum-address>.
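Constructing such an identifier can be sketched as follows; the address below is a made-up placeholder, not one of the DIDs registered for this project, and a real implementation would derive the address from the keypair's public key.

```python
import re

# Hypothetical 20-byte Ethereum address, hex-encoded. In did:ethr the
# identifier string is an address derived from the entity's public key.
ETH_ADDRESS = "0x" + "ab" * 20

def to_did_ethr(address: str) -> str:
    # Validate the address shape, then prepend the ethr method prefix.
    if not re.fullmatch(r"0x[0-9a-fA-F]{40}", address):
        raise ValueError("not a valid Ethereum address")
    return f"did:ethr:{address}"

did = to_did_ethr(ETH_ADDRESS)
```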
As can be seen in the 'iss' and 'payload'['inputDataDids'] parameters of the sample data label (Figure 5), all identities are expressed as DID references. Essentially, each transformation appends a new link to a chain of linked and signed versions: each data point can be updated, and each updated data point is both signed by the transformer and linked back to its previous state. Figure 3 shows the data chain that results from applying this to each transformation from the Data Producer to the Data Consumer. It enables the system to trace back how and where the data was transformed, which algorithm was responsible for each transformation, and how the incoming data at each transformation step influenced the outcome. In this way, data provenance is ensured throughout the entire data processing chain: the outcome of the algorithm can be traced back to the Data Producer, ensuring greater confidence in its outputs.
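The chaining just described can be sketched in a few lines. This is an illustrative model only: HMAC-SHA256 stands in for the asymmetric signatures used in practice, an in-memory dictionary stands in for resolvable storage, and the DIDs and field names (borrowing the 'iss'/'inputDataDids' convention from the sample label) are invented.

```python
import hashlib
import hmac
import json

KEYS = {  # one signing key per DID (hypothetical)
    "did:ethr:0xproducer": b"producer-key",
    "did:ethr:0xprocessor": b"processor-key",
}

REGISTRY = {}  # DAD id -> DAD; stand-in for resolvable storage

def make_dad(iss, payload, input_dads=()):
    # Each DAD records its issuer's DID and points to the DAD(s)
    # it was transformed from, then gets signed by the transformer.
    body = {"iss": iss, "payload": payload,
            "inputDataDids": [d["id"] for d in input_dads]}
    serialized = json.dumps(body, sort_keys=True).encode()
    body["id"] = hashlib.sha256(serialized).hexdigest()
    body["sig"] = hmac.new(KEYS[iss], serialized, hashlib.sha256).hexdigest()
    REGISTRY[body["id"]] = body
    return body

def trace(dad):
    # Walk the chain of links back to the Data Producer(s).
    chain = [dad]
    for parent_id in dad["inputDataDids"]:
        chain += trace(REGISTRY[parent_id])
    return chain

raw = make_dad("did:ethr:0xproducer", {"accel": 0.42})
label = make_dad("did:ethr:0xprocessor",
                 {"label": "dangerous"}, input_dads=[raw])
provenance = trace(label)  # label DAD first, then its source DAD
```

Verifying the chain then amounts to re-checking each link's signature against the issuer's key material, resolved via its DID.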
IV. RESULTS
In this paper we have presented the implementation of a
veriﬁable data chain for a supervised learning scenario with
an RNN algorithm detecting dangerous driving scenarios as
shown in . The referenced scenario detects situations of
dangerous driving on an incoming stream of vehicle data and
classifies the maneuver (e.g., left turn, right turn, acceleration, etc.). The algorithm predicts a result for every timestamp, based on the previous ten frames, or data points. The input data consists of categorical signals (e.g., gear, brakes pressed) and continuous signals (e.g., position, lateral and longitudinal acceleration, etc.).
For the solution presented here, we used a cloud environment and historic dangerous-driving event data sets that were also used to train the RNN model. As shown in Figure 4, we used the historical data to simulate a live vehicle data stream.
Fig. 4. Overview of each step taken in the prototype implementation. The application is based on a data stream of historic data. The data is processed and consumed by an algorithm. The proposed provenance flow is applied.
The historical data set contains, for each point in time, an array of data points, which are the relevant features for the RNN model. Each array is sent to a component responsible for data handling and processing, which creates a DAD for every data point. Then, a feature vector of ten entries is prepared for the RNN model. The outcome of the RNN model - the classification of the situation as dangerous, the type of maneuver, and the confidence - is then stored as another DAD, which refers to the DIDs of the entries that are included in the DAD output of the final result (see Figure 5). In order to show how the application orders a stream of source events over time, we also present the incoming data on a map, linking backwards to the source DIDs and forwards to the resulting DADs (see Figure 6).
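The windowing and provenance-referencing step can be sketched as follows. The dummy threshold classifier and the DAD identifiers are invented stand-ins for the RNN model and the real DADs; only the shape of the flow follows the prototype.

```python
from collections import deque

WINDOW = 10  # the model consumes the previous ten frames

def classify(window):
    # Stand-in for the RNN: flag the window if any frame exceeds
    # a (made-up) acceleration threshold.
    dangerous = any(f["accel"] > 0.8 for f in window)
    return {"dangerous": dangerous, "maneuver": "left_turn",
            "confidence": 0.9}

def process_stream(frames):
    buf, outputs = deque(maxlen=WINDOW), []
    for i, frame in enumerate(frames):
        # each incoming frame gets its own DAD (here: just an id)
        frame = dict(frame, dad_id=f"dad-{i}")
        buf.append(frame)
        if len(buf) == WINDOW:
            result = classify(list(buf))
            # the output DAD refers back to its ten input DADs
            result["inputDataDids"] = [f["dad_id"] for f in buf]
            outputs.append(result)
    return outputs

frames = [{"accel": 0.1 * i} for i in range(12)]
labels = process_stream(frames)  # windows 0-9, 1-10, 2-11
```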
This way, the cryptographic data structure provides instru-
ments for end-to-end veriﬁability that enabled us to prove the
integrity of the data chain, identify all the entities involved in
the creation of a speciﬁc machine learning label, and request,
in turn, life-cycle credentials from these entities to feed a
scoring model for the respective machine learning label.
The end-to-end veriﬁability of entity attributes and quality
data about the entities involved in cyber-physical value chains
would allow us to build algorithms that accurately score
machine learning output data. Any consumer of these output
data could then assess their trustworthiness prior to processing
them in the consumer’s application.
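One way such a consumer-side assessment could look is sketched below. The lifecycle attributes and weights are entirely invented for illustration; a real scoring model would be calibrated against actual verifiable credentials from the chain's entities.

```python
# Hypothetical weights over lifecycle attributes of chain entities.
WEIGHTS = {"identity_verified": 0.4,
           "certified_firmware": 0.3,
           "audited_algorithm": 0.3}

def entity_score(attrs):
    # sum the weights of the attributes an entity can credential
    return sum(w for k, w in WEIGHTS.items() if attrs.get(k))

def label_trust_score(chain_entities):
    # a chain is only as trustworthy as its weakest entity
    return min(entity_score(e) for e in chain_entities)

chain = [
    {"identity_verified": True, "certified_firmware": True},  # ECU
    {"identity_verified": True, "audited_algorithm": True},   # processor
]
score = label_trust_score(chain)  # ~0.7 under these toy weights
```

A consumer could then threshold this score before admitting the label into a safety-relevant decision path.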
In a second iteration of the project, we would focus on using
the proposed methods in a real-world scenario, e.g. using live
data streams from a ﬂeet of real vehicles integrated with this
validated data-chain infrastructure.
V. CONCLUSION AND OUTLOOK
The growing complexity of data chains and the increasing number of actors processing data go beyond what can be expected of manual quality assurance and maintenance
Fig. 5. Output VC with provenance pointing back to DIDs of simulated telemetry devices (encrypted data and cryptographic signature cropped out).
Fig. 6. Web interface of the application presenting the results
processes. A transparent and reliable evaluation of the risk
and quality of data inference methodologies is essential, and
this requires scaffolding for such accounting. Data provenance
about the entities involved in a data processing chain and
the resulting machine learning labels (using DIDs, VCs, and
DLT to ensure uniform metadata) provides a foundation for
reliable risk scoring. Harmonized and standardized data (and, more importantly, metadata) are the key to AI explainability, whether managed in traditional top-down ways, by new forms of reputation, or by new forms of actuarial accounting and trustworthiness ratings. In any of these cases, verifiable credentials about identity subjects - i.e., vehicles, pre-processing and ML algorithms - could be consumed by algorithms at the heart of scoring models that both assess risk and refine the labeling process and its outputs.
Overall, we believe many different costs can be reduced significantly: those incurred by poor data quality, those resulting from poorly-understood data flows from which inferences are drawn by the customer or the vehicle context, and the risk of data manipulation in any data system, whether open or closed. New economic opportunities internal to the data marketplace will open up, we believe, as the minimum and average levels of data quality in marketplaces rise.
As shown in Alvarez-Coello et al., it is beneficial for the industry to move towards a data-centric architecture, which can drive major gains in the stability and reliability of data-processing flows. This stability and reliability are necessary to maintain safety and innovation, and our solution introduced here contributes directly to these ends.
This approach also has significant indirect benefits for the quality-assurance and legal aspects of these systems.
The kinds of discovery and forensic audits required by both
routine regulatory compliance and dispute resolution could
be executed in a much more efﬁcient way once entire data
processing pipelines become veriﬁable to any auditor with the
right consents or credentials. This also fosters innovation and
business-process agility, as individual actors (even non-human
ones!) would be better able to assess the risks of relying on
data sets, data sources, and algorithms dynamically.
We have shown how such data provenance can be applied
to data streams in an automotive context. The scenario was
based on historical data and simulated a typical live data chain.
Next steps would include applying the solution to a situation in which a real vehicle serves as the actual Data Producer, extending the concept from the restricted environment shown in this work to a real-world application with all the different layers of the data chain shown in Figure 1.
P. Yadav, S. Hassan, A. Ojo, and E. Curry, "The role of open data in driving sustainable mobility in nine smart cities," Jun. 2017.
 “Driving Positive Outcomes through Open Data Solu-
tions for Mobility,” Dell, Lero, Forum For the Future,
Open DataSoft, City of Palo Alto, Tech. Rep., 02
2018. [Online]. Available: https://www.dell.com/learn/pa/en/pacorp1/
 N. Gruschka, V. Mavroeidis, K. Vishi, and M. Jensen, “Privacy
Issues and Data Protection in Big Data: A Case Study Analysis
under GDPR,” arXiv:1811.08531 [cs], Nov. 2018, arXiv: 1811.08531.
[Online]. Available: http://arxiv.org/abs/1811.08531
B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and C. Patz, "Challenges in applying the ISO 26262 for driver assistance systems," Tagung Fahrerassistenz, p. 23, 2012.
P. Koopman and M. Wagner, "Challenges in Autonomous Vehicle Testing and Validation," SAE International Journal of Transportation Safety, vol. 4, no. 1, pp. 15–24, Apr. 2016.
O. Burkacky, J. Deichmann, B. Klein, K. Pototzky, and G. Scherf, "Cybersecurity in automotive," McKinsey, Mar. 2020. [Online]. Available: https://www.gsaglobal.org/wp-content/uploads/2020/03/Cybersecurity-in-automotive-Mastering-the-challenge.pdf
R. Souza et al., "Provenance data in the machine learning lifecycle in computational science and engineering," p. 10, Oct. 2019.
 W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn,
“Digital Twin in manufacturing: A categorical literature review and
classiﬁcation,” IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022,
2018. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/
 S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, “Data Prepro-
cessing for Supervised Leaning,” vol. 1, no. 12, p. 6, 2007.
J. Cheney, S. Chong, N. Foster, M. Seltzer, and S. Vansummeren, "Provenance: A future history," Oct. 2009, pp. 957–964.
I. Barclay, A. D. Preece, I. J. Taylor, and D. C. Verma, "Quantifying transparency of machine learning systems through analysis of contributions," CoRR, vol. abs/1907.03483, 2019.
R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, R. Cerqueira, and M. A. S. Netto, "Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering," arXiv:1910.04223 [cs], Oct. 2019. [Online]. Available: http://arxiv.org/abs/1910.04223
H. Miao, A. Li, L. S. Davis, and A. Deshpande, "Towards Unified Data and Lifecycle Management for Deep Learning," in 2017 IEEE 33rd International Conference on Data Engineering (ICDE). San Diego, CA, USA: IEEE, Apr. 2017, pp. 571–582.
S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert, "Automatically tracking metadata and provenance of machine learning experiments," 2017. [Online]. Available: http://learningsys.org/nips17/
 “Decentralized identiﬁers (dids) v1.0,” W3C Working Draft 22 June
2020. [Online]. Available: https://www.w3.org/TR/did-core/
C. Allen, A. Brock, V. Buterin, J. Callas, D. Dorje, C. Lundkvist, P. Kravchenko, J. Nelson, D. Reed, M. Sabadello, G. Slepak, N. Thorp, and H. T. Wood, "Decentralized public key infrastructure," Dec. 2015. [Online]. Available: https://github.com/WebOfTrustInfo/rwot1-sf/blob/master/final-documents/satisfying-real-world-use-cases.pdf
ConsenSys, "DID method ethr specification v3.0," 2020. [Online]. Available: https://github.com/decentralized-identity/ethr-did-resolver/
 D. Alvarez-Coello, B. Klotz, D. Wilms, S. Fejji, J. M. Gomez, and
R. Troncy, “Modeling dangerous driving events based on in-vehicle data
using Random Forest and Recurrent Neural Network,” in 2019 IEEE
Intelligent Vehicles Symposium (IV). Paris, France: IEEE, Jun. 2019,
pp. 165–170. [Online]. Available: https://ieeexplore.ieee.org/document/
 D. Alvarez-Coello, D. Wilms, A. Bekan, and J. Marx Gomez, “Towards
a Data-Centric Architecture in the Automotive Industry,” in Interna-
tional Conference on ENTERprise Information Systems (CENTERIS).
Algarve, Portugal: Elsevier, Oct. 2020, accepted.