Data Provenance in Vehicle Data Chains
Daniel Wilms
Research Engineer,
BMW Technology Office Israel Ltd
Dr. Carsten Stoecker
Founder and CEO,
Spherity GmbH
Dr. Juan Caballero
Research, Spherity GmbH,
Decentralized Identity Fdn
Abstract—With almost every new vehicle being connected, the
importance of vehicle data is growing rapidly. Many mobility
applications rely on the fusion of data coming from heterogeneous
data sources, like vehicle and "smart-city" data or process data
generated by systems out of their control. This external data
determines much about the behaviour of the relying applications:
it impacts the reliability, security and overall quality of the
application’s input data and ultimately of the application itself.
Hence, knowledge about the provenance of that data is a critical
component in any data-driven system. The secure traceability of
the data handling along the entire processing chain, which passes
through various distinct systems, is critical for the detection
and avoidance of misuse and manipulation. In this paper, we
introduce a mechanism for establishing secure data provenance
in real time, demonstrating an exemplary use-case based on a
machine learning model that detects dangerous driving situations.
We show with our approach based on W3C decentralized identity
standards that data provenance in closed data systems can be
effectively achieved using technical standards designed for an
open data approach.
Driven by technological innovation and organic ecosys-
tem growth, mobility value chains are significantly chang-
ing from monolithic and closed systems to distributed, open
ones [1] [2]. Data flows are increasingly defined dynamically
and stretch across multiple organizational boundaries and even
legal jurisdictions with diverse rules and regulations [3]. The
trustworthiness and accuracy of output data (such as in-car sensor readings) generated along distributed digital mobility value
chains is of increasing importance for safety and reliability;
this importance can only increase as Machine Learning (ML)
systems grow more central to mobility systems [4].
The growing use of data-producing and/or Internet of Things
(IoT) devices across every industry – including logistics,
manufacturing, and mobility – is ushering in an era of data
abundance. The trend towards processing open data in the mobility ecosystem in particular makes urgent the question of how to ensure the validity of the source data, as well as the quality and accuracy of the output data.
Furthermore, retroactively demoting or forgetting data from
sources proven to be unreliable remains an elusive capability.
Mobility systems have a very low tolerance for fraud and
abuse, as the impact of fraudulent data in safety-critical
features can have immediate real-world impact [5]. Upcoming regulation around automotive cyber-security underlines this importance [6].
VTC'2021-Spring, April 25-28, 2021, Helsinki, Finland. XXX-2020 IEEE.
Besides these security considerations, ensuring the privacy of individuals when processing their data is also crucial.
Understanding the data flow and providing proof that it is
treated carefully and ethically is essential. Another important
aspect is the quality assurance of the data: if customers have
the option to forward on data to third party applications, how
can they forward along metadata and trust ratings necessary for
safety and benchmarking in those new contexts? We believe that the required quality can only be achieved through full transparency and a deeper understanding of the data and the way it is processed, enabling flaws in the data pipeline to be detected and mitigated. Businesses must know the origin and risks
of data from different sources before using them. Businesses
need highly automated and verifiable instruments for assessing
the provenance of data vital for ML applications [7].
This paper is organized as follows: the remainder of sec-
tion I will further detail the requirements for data provenance
in an automotive context. section II will discuss related work
across automotive, ML, and data-processing spheres. We introduce our approach in section III and present the details and results of our implementation in section IV. Finally, we provide our
conclusion in section V.
Fig. 1. Architecture overview: data flow from the Data Producer to the Data
Consumer and the provenance flow in the opposite direction.
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained
for all other uses, in any current or future media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective works, for resale or redistribution to
servers or lists, or reuse of any copyrighted component of this work in other works.
Data provenance in automotive
An effective data provenance solution has to be applicable to
multiple layers of a typical data processing flow in automotive
contexts. A typical high-level architecture of such a flow is
shown in Figure 1. The data processing flow starts with a
Data Producer, typically an Electronic Control Unit (ECU),
which transports a signal over a system bus. The signal is
received and "edge-processed" within the vehicle by a Data
Collector. The first pre-processing of the signal (e.g., adding
meta information, doing some pre-calculation) will be done
here, before it will be transported to a Digital representation,
which defines the incoming data of the physical vehicle
through a digital model, and offers it to applications as a
digital shadow or even as a digital twin of the vehicle [8].
From there, we assume an optional data fusion with External
Data Sources (e.g. weather, traffic information, etc.). This data
is then consumed by Data Processors, which in turn create new data of interest for an external Data Consumer. Data Processors perform at least one data transformation. In the case of ML, at least two data transformations have to be done: pre-processing and applying the actual model [9].
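The processing flow just described can be sketched as a chain of stage functions. This is a minimal illustration only; all names, signal values, and signatures below are our own assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    name: str
    value: float
    meta: dict = field(default_factory=dict)

def data_producer() -> Signal:
    # An ECU emits a raw signal onto the system bus.
    return Signal("lat_accel", 0.42)

def data_collector(sig: Signal) -> Signal:
    # First pre-processing at the edge: attach meta information.
    sig.meta["unit"] = "m/s^2"
    return sig

def digital_representation(sig: Signal) -> dict:
    # The digital shadow defines the incoming data through a digital model.
    return {"signal": sig.name, "value": sig.value, "meta": sig.meta}

def data_processor(record: dict, external: dict) -> dict:
    # Optional fusion with external sources, then at least one transformation.
    return {**record, "weather": external.get("weather"), "label": "normal"}

# The Data Consumer receives the final, transformed record:
record = data_processor(
    digital_representation(data_collector(data_producer())),
    {"weather": "rain"},
)
```

Each function boundary in this sketch is a point where, in the architecture proposed later, a provenance record would be attached.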
In this paper, we propose and detail one mechanism for
establishing data provenance in real-time along a data chain
that sources driving event data and processes it into an
ML label. When the data provenance of a given dangerous-
driving machine-learning label is known, a scoring model
can be applied to it that calculates the risks of consuming
this data label for system control or responsible decision-
making. This example, chosen for clarity and simplicity, is
nevertheless applicable much more widely to any digital data
chain. We feel that, with time, machine-learning labels will
benefit from some degree of rating and scoring to be used
safely, legally, and/or responsibly, balancing privacy and se-
curity requirements appropriately. We support standardization
on decentralized identity primitives that make these kinds of
rating and scoring systems more portable and interoperable,
particularly in architectures where cryptographic agility can
be incorporated to maximize forward compatibility.
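The scoring model mentioned above can be illustrated with a toy example. The factors and weights here are entirely invented for illustration; a production model would be calibrated against real risk data:

```python
def provenance_risk_score(chain_verified: bool,
                          known_issuer_fraction: float,
                          label_age_s: float) -> float:
    """Toy risk score in [0, 1] for consuming an ML data label.

    Combines three provenance factors: whether the signed data chain
    verifies end to end, the fraction of issuers whose DIDs resolve to
    known entities, and the staleness of the label.
    """
    score = 0.0
    if not chain_verified:
        score += 0.6                              # a broken chain dominates
    score += 0.3 * (1.0 - known_issuer_fraction)  # unknown issuers add risk
    score += 0.1 * min(label_age_s / 3600.0, 1.0) # staleness, capped at 1 hour
    return round(score, 3)
```

A consumer could then refuse any label whose score exceeds a policy threshold before feeding it into system control or decision-making.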
At least since ML started to go mainstream, the provenance
issues of big data have been almost universally acknowledged,
although it is less of a solvable problem than a definitional
impasse. How can one standardize the capture of data from
improvised sources, which will be structured post-facto by the
ML process? Some would say that the writing has been on the
wall since at least 2009, when the authors of "Provenance: A Future History" pointed out that the problem would undermine
ML research until, in retrospect, it would look like a glaring
problem overlooked all along [10]. Our intention in this section
is to position our methods in a wide field of complementary
approaches which all support and further the broader aim of
improving the monitoring, refinement, and reputation capabilities of ML pipelines.
Decentralized identity and decentralized PKI offer scaffolding and anchor points for accountable and traceable provenance. At the same time, a significant amount of standardization of the data exchange, together with the integration of flexible and reflexive data capture, is crucial. The complexity of the semantic
and data-capture work needed to take full advantage of that
scaffolding is not to be understated. In fact, pride of place
is given in the aforementioned 2009 provenance manifesto to
semantics and expressive data, which proved influential on the
development of many big-data-oriented initiatives within the
W3C and the semantic-web community. The system built on
top of ML and PROV schemata elaborated on by Souza et al. in 2019 is a good example of how such cross-silo semantic capture could be tracked and accounted for in an ML context.
How to quantify and benchmark risk scoring is never a
simple matter in ML, and some recent work has tried to
address the quantification of provenance scoring specific to
contexts analogous to those described. Barclay et al. (2019),
for example, specifically addresses shifting ethical standards
and transparency vis-a-vis regulatory and ethical scrutiny. It
is particularly important to recognize that provenance must
always be linked to dynamic rather than static valuations and
rubrics, as regulatory and liability frameworks will likely take
decades to stabilize internationally [11].
Other related work seeks to incorporate risk scoring
throughout the training process [12] [13], the scope of which
is being expanded by the preceding efforts. Adjusting the
ML training methodology to be more reflexive throughout on
the basis of such scoring has been a major focus of efforts
at Amazon’s ML design division, and presumably will be a
feature of production-grade ML in the future, if not a core
feature of off-the-shelf product offerings for ML training and
for the lifecycle management thereof [14].
The contribution of this paper is to make an architecture pro-
posal which achieves data transparency and data provenance
by enriching the meta information of the data itself. This is
done right where the creation or transformation of the data
takes place. We focus in this paper on the automotive sector,
but we believe that this contribution can have an impact on
achieving reliable risk scoring of ML algorithms in general.
In order to realize data provenance in an automotive data processing chain, the method we are proposing in this paper consists of two interdependent elements:
1) Identification of entities which create data or perform
data transformations in the data processing chain.
2) Introduction of encrypted data structures for representing
Distributed Automotive Data (DAD).
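A DAD can be pictured as a record of the following shape. The field names are illustrative assumptions on our part; the paper's concrete encoding is the signed structure shown later in Figure 5:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DAD:
    """Distributed Automotive Data record (illustrative sketch only)."""
    issuer_did: str          # DID of the entity performing the transformation
    payload: dict            # (possibly encrypted) data after the transformation
    input_dads: List[str] = field(default_factory=list)  # links to previous DAD(s)
    signature: str = ""      # issuer's signature over payload and links

# A source DAD from a (made-up) producer DID, with no predecessors:
dad = DAD("did:ethr:0x" + "00" * 20, {"lat_acc": 0.4})
```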
In order to render the data provenance meaningful, it is crucial to know in which way the data has
been transformed at each step from the Data Producer to
the Data Consumer. In particular, it has to be transparent
which exact digital entity or algorithm has carried out each
transformation on the data and when. The transformation can
take place in one or many centralized or decentralized mobility
systems, so the identification of all the entities is greatly
simplified if the identities are maximally portable and not
administered within their respective closed systems. Therefore,
we propose to adopt the decentralized identifier (DID) standard
as an open, interoperable addressing scheme and to establish
mechanisms for resolving DIDs across multiple centralized
and/or decentralized mobility systems [15]. Even if these
identifier resolution schemes are not used for discovery or
communications within a given system, they are invaluable for
reconstructing or tracing data trails that cross many systems,
whether in real-time or forensically. Figure 2 shows which
identities in the data processing flow introduced in Figure 1
get provisioned with DIDs in our architecture, and thus get
cryptographic data-signing capabilities.
Fig. 2. Architecture with signing identities marked. Data provenance flow
on the left with scoring on top, realized by the provenance of the data the
algorithm relies on.
DIDs were originally designed to function as identifiers
for individual people, but can readily be extended to any
entity or resource. They are derived from public/private key
pairs, registered in an immutable registry for discovery pur-
poses. Spherity and other companies pioneering the decentral-
ized identity technology sector use innovative cryptographic
solutions for secure key management such as private key
sharding (multi-party computation), on-device/secure-enclave
biometrics, and HSMs to make signatures and data trails more
secure and non-repudiable [16]. The field is fast-moving, and significant progress is being made in expanding the options for high-security use-cases and in building forward-secure cryptographic agility into systems to accommodate these new capabilities.
Each domain or namespace for DIDs corresponds to a
method of encoding and decoding, making DIDs resolvable
like domain names relative to a method-specific but
interoperable resolution infrastructure [15]. For this project,
Spherity registered its DIDs on the Ethereum Blockchain via
the standard W3C DID method ethr [17]. In this method, the address of any valid Ethereum keypair (i.e., an "ethereum address", derived from the public key) can be used as the identifier string within the namespace defined by the ethr prefix. Thus, our DIDs take the form did:ethr:0x<ethereum-address>.
Fig. 3. Data structure as chained transformation. Each DAD includes the DID of the transforming entity and points to the previous DAD(s), which were used during the transformation.
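Constructing such an identifier is then a matter of string formatting over the Ethereum address (the address below is a made-up placeholder, not one used in the project):

```python
def to_did_ethr(eth_address: str) -> str:
    # The ethr method uses a 0x-prefixed 20-byte hex Ethereum address
    # as the method-specific identifier string.
    if not (eth_address.startswith("0x") and len(eth_address) == 42):
        raise ValueError("expected a 0x-prefixed 20-byte hex address")
    return f"did:ethr:{eth_address}"

did = to_did_ethr("0x" + "ab" * 20)
```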
As can be seen in the 'iss' and 'payload'['inputDataDids'] parameters of the sample data label (Figure 5), all identities are expressed as DID references. Essentially, each transformation appends a new link to a chain of linked and signed versions: each data point can be updated, and each updated data point is both signed by the transformer and linked back to its previous state. Applying this to each transformation from the Data Producer to the Data Consumer yields the data chain shown in Figure 3, which enables the system to trace back how and where the data has been transformed, which algorithm was responsible for each transformation, and how the incoming data at each step influenced the outcome. In this way, data provenance is ensured throughout the entire data processing chain: the outcome of the algorithm can be traced back to the Data Producer, ensuring greater confidence in the algorithm's results.
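The chained structure can be sketched as follows. For brevity the sketch signs with symmetric HMAC keys; the system described in the paper uses asymmetric keypairs bound to DIDs, so every DID, key, and field name here is an invented assumption:

```python
import hashlib
import hmac
import json

# Hypothetical symmetric keys standing in for DID-bound asymmetric keypairs.
KEYS = {
    "did:ethr:0x" + "aa" * 20: b"producer-key",
    "did:ethr:0x" + "bb" * 20: b"processor-key",
}

def make_dad(issuer_did, payload, prev_ids):
    # Each DAD records the transformer's DID and points to the DAD(s)
    # used during the transformation; the signature covers both.
    body = json.dumps({"iss": issuer_did, "payload": payload,
                       "prev": prev_ids}, sort_keys=True)
    sig = hmac.new(KEYS[issuer_did], body.encode(), hashlib.sha256).hexdigest()
    return {"id": hashlib.sha256(body.encode()).hexdigest()[:16],
            "iss": issuer_did, "payload": payload,
            "prev": prev_ids, "sig": sig}

def verify_chain(dad, store):
    # Walk the chain backwards from a label to the original producers,
    # checking every signature and every link on the way.
    body = json.dumps({"iss": dad["iss"], "payload": dad["payload"],
                       "prev": dad["prev"]}, sort_keys=True)
    expected = hmac.new(KEYS[dad["iss"]], body.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(dad["sig"], expected):
        return False
    return all(verify_chain(store[p], store) for p in dad["prev"])

source = make_dad("did:ethr:0x" + "aa" * 20, {"lat_acc": 0.4}, [])
label = make_dad("did:ethr:0x" + "bb" * 20,
                 {"maneuver": "left_turn", "dangerous": True}, [source["id"]])
store = {source["id"]: source}
```

Any tampering with an upstream record invalidates its signature and, with it, the verification of every downstream label that links to it.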
In this paper we have presented the implementation of a
verifiable data chain for a supervised learning scenario with
an RNN algorithm detecting dangerous driving scenarios as
shown in [18]. The referenced scenario detects situations of
dangerous driving on an incoming stream of vehicle data and
classifies the maneuver (e.g. left turn, right turn, acceleration,
etc.). The algorithm predicts for every timestamp the result,
based on the previous ten frames or data points. The input data consists of categorical signals (e.g., gear, brakes pressed) and continuous signals (e.g., position, lateral and longitudinal acceleration, etc.).
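The ten-frame input described above can be assembled with a simple sliding window. The frame fields below are invented placeholders for the real feature set:

```python
def sliding_windows(stream, size=10):
    # Yield one feature window per timestamp once `size` frames have
    # been seen; each window holds the previous `size` frames.
    buf = []
    for frame in stream:
        buf.append(frame)
        if len(buf) == size:
            yield list(buf)
            buf.pop(0)

frames = [{"t": t, "lat_acc": 0.1 * t, "gear": 3} for t in range(12)]
windows = list(sliding_windows(frames))  # windows start at t = 0, 1, 2
```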
For our solution presented here, we used a cloud environ-
ment and historic dangerous driving event data sets that were
used to train the RNN model. As shown in Figure 4, we used
Fig. 4. Overview of each step taken in the prototype implementation. The
application is based on a data-stream of historic data. The data is processed
and consumed by an algorithm. The proposed provenance flow is applied.
the historical data to simulate a live vehicle data stream. The
historical data set contains, for each point in time, an array of
data points, which are the relevant features for the RNN model.
Each array is sent to a component responsible for the data handling and processing, which creates a DAD for every data point.
Then, a feature vector of 10 entries is prepared for the RNN
model. The outcome of the RNN Model - the classification
of the situation as dangerous, the type of maneuver and the
confidence - is then stored as another DAD. It refers to the
DIDs of the entries, which are included in the DAD output
of the final result (See Figure 5). In order to show how the
application orders a stream of source events over time, we also
present the incoming data on a map, linking backwards to the
source DIDs and forward to the resulting DADs (See Figure 6).
This way, the cryptographic data structure provides instru-
ments for end-to-end verifiability that enabled us to prove the
integrity of the data chain, identify all the entities involved in
the creation of a specific machine learning label, and request,
in turn, life-cycle credentials from these entities to feed a
scoring model for the respective machine learning label.
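Decoded, the payload of such an output label has roughly the following shape; every DID and value below is an invented placeholder mirroring the 'iss' and 'inputDataDids' fields discussed above, not the actual content of Figure 5:

```python
import json

label = {
    "iss": "did:ethr:0x" + "11" * 20,        # DID of the RNN data processor
    "payload": {
        "classification": "dangerous",
        "maneuver": "left_turn",
        "confidence": 0.93,
        "inputDataDids": [                    # DIDs of the source telemetry
            "did:ethr:0x" + "22" * 20,
            "did:ethr:0x" + "33" * 20,
        ],
    },
}
encoded = json.dumps(label, indent=2)         # what would get signed and stored
```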
The end-to-end verifiability of entity attributes and quality
data about the entities involved in cyber-physical value chains
would allow us to build algorithms that accurately score
machine learning output data. Any consumer of these output
data could then assess their trustworthiness prior to processing
them in the consumer’s application.
In a second iteration of the project, we would focus on using
the proposed methods in a real-world scenario, e.g. using live
data streams from a fleet of real vehicles integrated with this
validated data-chain infrastructure.
Fig. 5. Output VC with provenance pointing back to DIDs of simulated telemetry devices (encrypted data and cryptographic signature cropped out for clarity)
Fig. 6. Web interface of the application presenting the results
The growing complexity of data chains and the increasing number of actors processing their data go beyond what can be expected of manual quality assurance and maintenance processes. A transparent and reliable evaluation of the risk
and quality of data inference methodologies is essential, and
this requires scaffolding for such accounting. Data provenance
about the entities involved in a data processing chain and
the resulting machine learning labels (using DIDs, VCs, and
DLT to ensure uniform metadata) provides a foundation for
reliable risk scoring. Harmonized and standardized data (and
more importantly, metadata) is the key to AI explainability,
whether managed in traditional top-down ways, by new forms of reputation, or by new forms of actuarial accounting and trustworthiness ratings. In any of these cases, verifiable credentials
about identity subjects - i.e., vehicles, pre-processing and ML
algorithms - could be consumed by algorithms at the heart of
scoring models that both assess risk and refine the labeling
process and its outputs.
Overall, we believe many different costs can be reduced significantly: those incurred by poor data quality, those resulting from poorly-understood data flows from which inferences are drawn by the customer or the vehicle context, and those stemming from the risk of data manipulation in any data system, whether open or closed.
New economic opportunities internal to the data marketplace
will open up, we believe, as the minimum and average level
of data quality in marketplaces rises.
As shown in Alvarez-Coello et al. [19], it is beneficial for
the industry to move towards a data-centric architecture, which
can drive major gains in the stability and reliability of data-
processing flows. This stability and reliability is necessary to
maintain safety and innovation, and our solution introduced
here contributes directly to these ends.
This approach also has significant indirect benefits for the quality-assurance and legal aspects of these systems.
The kinds of discovery and forensic audits required by both
routine regulatory compliance and dispute resolution could
be executed in a much more efficient way once entire data
processing pipelines become verifiable to any auditor with the
right consents or credentials. This also fosters innovation and
business-process agility, as individual actors (even non-human
ones!) would be better able to assess the risks of relying on
data sets, data sources, and algorithms dynamically.
We have shown how such data provenance can be applied
to data streams in an automotive context. The scenario was
based on historical data and simulated a typical live data chain.
Next steps would include applying the solution to a situation
in which a real vehicle serves as the actual Data Producer,
extending the concept from the restricted environment shown
in this work to a real world application with all different layers
of the data chain shown in Figure 1.
References
[1] P. Yadav, S. Hassan, A. Ojo, and E. Curry, “The role of open data in driving sustainable mobility in nine smart cities,” Jun. 2017.
[2] “Driving Positive Outcomes through Open Data Solutions for Mobility,” Dell, Lero, Forum For the Future, OpenDataSoft, City of Palo Alto, Tech. Rep., Feb. 2018.
[3] N. Gruschka, V. Mavroeidis, K. Vishi, and M. Jensen, “Privacy Issues and Data Protection in Big Data: A Case Study Analysis under GDPR,” arXiv:1811.08531 [cs], Nov. 2018.
[4] B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and C. Patz, “Challenges in applying the ISO 26262 for driver assistance systems,” Tagung Fahrerassistenz, p. 23, 2012.
[5] P. Koopman and M. Wagner, “Challenges in Autonomous Vehicle Testing and Validation,” SAE International Journal of Transportation Safety, vol. 4, no. 1, pp. 15–24, Apr. 2016.
[6] O. Burkacky, J. Deichmann, B. Klein, K. Pototzky, and G. Scherf, “Cybersecurity in automotive: Mastering the challenge,” McKinsey, Mar. 2020.
[7] R. Souza et al., “Provenance data in the machine learning lifecycle in computational science and engineering,” p. 10, Oct. 2019.
[8] W. Kritzinger, M. Karner, G. Traar, J. Henjes, and W. Sihn, “Digital Twin in manufacturing: A categorical literature review and classification,” IFAC-PapersOnLine, vol. 51, no. 11, pp. 1016–1022, 2018.
[9] S. B. Kotsiantis, D. Kanellopoulos, and P. E. Pintelas, “Data Preprocessing for Supervised Learning,” vol. 1, no. 12, p. 6, 2007.
[10] J. Cheney, S. Chong, N. Foster, M. Seltzer, and S. Vansummeren, “Provenance: A future history,” Oct. 2009, pp. 957–964.
[11] I. Barclay, A. D. Preece, I. J. Taylor, and D. C. Verma, “Quantifying transparency of machine learning systems through analysis of contributions,” CoRR, vol. abs/1907.03483, 2019.
[12] R. Souza, L. Azevedo, V. Lourenço, E. Soares, R. Thiago, R. Brandão, D. Civitarese, E. V. Brazil, M. Moreno, P. Valduriez, M. Mattoso, R. Cerqueira, and M. A. S. Netto, “Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering,” arXiv:1910.04223 [cs], Oct. 2019.
[13] H. Miao, A. Li, L. S. Davis, and A. Deshpande, “Towards Unified Data and Lifecycle Management for Deep Learning,” in 2017 IEEE 33rd International Conference on Data Engineering (ICDE). San Diego, CA, USA: IEEE, Apr. 2017, pp. 571–582.
[14] S. Schelter, J.-H. Böse, J. Kirschnick, T. Klein, and S. Seufert, “Automatically tracking metadata and provenance of machine learning experiments,” 2017.
[15] “Decentralized Identifiers (DIDs) v1.0,” W3C Working Draft, Jun. 22, 2020.
[16] C. Allen, A. Brock, V. Buterin, J. Callas, D. Dorje, C. Lundkvist, P. Kravchenko, J. Nelson, D. Reed, M. Sabadello, G. Slepak, N. Thorp, and H. T. Wood, “Decentralized public key infrastructure,” Dec. 2015.
[17] ConsenSys, “DID method ethr specification v3.0,” 2020.
[18] D. Alvarez-Coello, B. Klotz, D. Wilms, S. Fejji, J. M. Gomez, and R. Troncy, “Modeling dangerous driving events based on in-vehicle data using Random Forest and Recurrent Neural Network,” in 2019 IEEE Intelligent Vehicles Symposium (IV). Paris, France: IEEE, Jun. 2019, pp. 165–170.
[19] D. Alvarez-Coello, D. Wilms, A. Bekan, and J. Marx Gomez, “Towards a Data-Centric Architecture in the Automotive Industry,” in International Conference on ENTERprise Information Systems (CENTERIS). Algarve, Portugal: Elsevier, Oct. 2020, accepted.
The Digital Twin (DT) is commonly known as a key enabler for the digital transformation, however, in literature is no common understanding concerning this term. It is used slightly different over the disparate disciplines. The aim of this paper is to provide a categorical literature review of the DT in manufacturing and to classify existing publication according to their level of integration of the DT. Therefore, it is distinct between Digital Model (DM), Digital Shadow (DS) and Digital Twin. The results are showing, that literature concerning the highest development stage, the DT, is scarce, whilst there is more literature about DM and DS.