A Cyberinfrastructure for Big Data Transportation Engineering

Abstract

Big data-driven transportation engineering has the potential to improve utilization of road infrastructure, decrease traffic fatalities, improve fuel consumption, and decrease construction worker injuries, among others. Despite these benefits, research on big data-driven transportation engineering is difficult today due to the computational expertise required to get started. This work proposes BoaT, a transportation-specific programming language, and its big data infrastructure, which is aimed at decreasing this barrier to entry. Our evaluation, which uses over two dozen research questions from six categories, shows that research is easier to realize as a BoaT computer program, is an order of magnitude faster when this program is run, and exhibits a 12–14× decrease in storage requirements.
Journal of Big Data Analytics in Transportation (2019) 1:83–94
A Cyberinfrastructure forBig Data Transportation Engineering
MdJohirulIslam1 · AnujSharma1· HrideshRajan1
Received: 27 July 2018 / Revised: 4 April 2019 / Accepted: 9 April 2019 / Published online: 9 May 2019
© Springer Nature Singapore Pte Ltd. 2019
Keywords Big data · Domain-specific language · Cyberinfrastructure
The potential and challenges of leveraging big data in transportation have long been recognized (Adu-Gyamfi et al. 2017; Barai 2003; Chakraborty et al. 2017; Chen and Zhang 2014; Fan et al. 2014; Huang et al. 2016; Jagadish et al. 2014; Kitchin 2014; Laney 2001; Liu et al. 2016; Lv et al. 2015; Seedah et al. 2015; Wang et al. 2017; Zhang et al. 2011). For example, researchers have shown that big data-driven transportation engineering can help reduce congestion and fatalities and make building transportation applications easier (Barai 2003; Huang et al. 2016; Zhang et al. 2011). The availability of open transportation data that are accessible, e.g. on the web under a permissive license, has the potential to further accelerate the impact of big data-driven transportation engineering.
Despite this incredible potential, harnessing big data in transportation for research remains difficult. To utilize big data, expertise is needed along each of the five steps of a typical data pipeline, namely data acquisition; information extraction and cleaning; data integration, aggregation, and representation; modeling and analysis; and interpretation (Jagadish et al. 2014). The first three steps are further complicated by the heterogeneity of data from multiple sources (Seedah et al. 2015), e.g. speed sensors, weather stations, and the national highway authority. A scientist must understand the peculiarities of the data sources to develop a data acquisition mechanism, clean data coming from multiple sources, and integrate data from multiple sources. Modeling and analysis are complicated by the volume of the data. For example, a dataset of speed measurements from a commercial provider for Iowa for a single day can run to multiple GBs, exceeding the limits of a single machine. Analyses that aim to compute trends over multiple years require storing, and computing over, tens of TBs of just speed sensor data.
A possible solution could be to use big data technologies like Hadoop and Apache Spark running over a distributed cluster. Using a distributed cluster with an adequate number of nodes, problems related to storage and computation time can be addressed. But these big data technologies are not easy to use. Getting started requires technical expertise to set up the infrastructure, efficient design of the data schema, a data acquisition strategy for multiple sources, a high level of programming skill, adequate knowledge of distributed computing models, and considerable proficiency in writing distributed computer programs, which is significantly different from writing a sequential computer program in Matlab, C, or Java. The analysis of big data in transportation
* Md Johirul Islam
Anuj Sharma
Hridesh Rajan
1 Iowa State University, Ames, IA, USA
... in transportation systems, the authors of work [20] propose an applicable, scalable, efficient algorithm for data storage, which is paramount since, according to the authors, speed measurements of the vehicles of only one carrier in a single day can generate a dataset of multiple gigabytes. As shown previously, the ITS concept is currently very well regarded by the global scientific community as a preliminary concept that could solve a problem that affects all countries, especially the large cities: the transportation systems [21,22]. ...
The big data concept has been gaining strength over the last few years. With the rise and spread of social media and the ease of access to information through applications, all kinds of service providers need to collect and analyze data to improve the quality of their services and products. In this regard, the relevance and coverage of this niche of study are notable. It is no coincidence that governments, supported by companies and startups, are investing in platforms to collect and analyze data, aiming at greater efficiency in the services provided to citizens. Considering these aspects, this work contextualizes the Big Data and ITS (Intelligent Transportation System) concepts by gathering recently published articles, from 2017 to 2021, considering a survey and case studies to demonstrate the importance of those themes today. Within the scope of big data applied to ITS, this study proposes a database for public transportation in the city of Campinas (Brazil), enabling its improvement according to the population's demands. Finally, this study tries to present clearly and objectively the methodology employed with the maximum number of characteristics, applying statistical analyses (box-and-whisker diagrams and Pearson correlation), highlighting the limitations, and expanding the studied concepts to describe the application of an Advanced Traveler Information System (ATIS), a branch of ITS, in a real situation. Therefore, besides the survey of the applied concepts, this work develops a specific case study, highlighting the identified deficiencies and proposing solutions. Future work is also outlined to expand this study and improve the accuracy of the achieved results.
... With the help of the probe vehicles, real-time traffic information can be accessed efficiently. In contrast to the sensors, the study has detected that other sources of data, including crash data, demographic data, weather reporting systems, and geometric characteristics, are widely used not only in traffic operation but also in safety management [8]. In the present era, where mobile devices are at the top of the market trend, social media data has also become a promising data source, as people across the world are likely to upload any news to a social media platform as soon as it happens. ...
... Its characteristics also reflect the basic theory and method of construction project management, including the whole life cycle management and comprehensive management of construction project [5]. The whole life cycle management of construction project is the control and management of planning and design stage, bidding construction stage, operation and maintenance stage, etc. [6]. As a new computer technology, BIM has the advantages of high efficiency, economy, energy saving and environmental protection in bridge construction, which can realize the transmission and sharing of BIM computer information model in the whole life cycle of the project [7]. ...
With the rapid integration of big data, cloud computing, BIM and other new technologies into construction enterprises, the degree of information of bridge construction is gradually deepened. This process is usually very complex, there are many influencing factors, and the construction progress is difficult to control. This paper summarizes the fiber composite materials which have potential application in Bridge Engineering in recent years. This paper first discusses the development prospect of carbon fiber, and then briefly introduces the mechanical properties of carbon fiber composite laminates. Referring to relevant literature, this paper introduces BIM Technology in detail. Based on the current situation and application advantages of BIM technology, it fully understands the practical application of BIM Technology in bridge engineering design, construction stage and later operation and maintenance. Their tensile strength, elastic modulus, longitudinal and transverse thermal expansion are introduced in detail. The results show that BIM Technology has been widely used in bridge engineering projects. In 2015, the Growth rate of BIM technology usage in construction industry country a was 1.536.
... There is limited study on the application of data visualization and analytics with large data sets of highway accidents [7]. This paper aims to fill this technological gap by developing an accident data analysis system that is an advanced tool to facilitate visualization and analytics for accident data. ...
Thailand has been ranked as one of the most dangerous countries in terms of death from road accidents, representing ineffective road safety policies. The crucial mission of the Thai government is to provide safety and reduce accidents for road users on the highway system. This paper aims to explore the potential of using Business Intelligence (BI) in accident analysis. The availability of open accident data provides an opportunity for the BI, which can provide an advanced platform for conducting data visualization and analytics in both spatial and temporal dimensions in order to illustrate when and where the accidents occur. The accident data and provincial data were combined by using the Talend Data Integration tool. The combined data was then loaded into a MySQL database for data visualization using Tableau. The dashboard was designed and created by using Tableau as an analytical visualization tool to provide insights into highway accidents. This system is advised to be adopted by the Thai government, which can be used for data visualization and analytics to provide a mechanism to formulate strategy options and formulate appropriate contingency plans to improve the accident situation.
... To this end, we utilized BoaG to address these challenges at scale. BoaG belongs to the family of a domain-specific language and shared infrastructure, called Boa, that has been applied to address challenges in mining software repositories [9], genomics data [10], and big data transportation [11]. Boa can process and query terabytes of raw data and uses a backend based on map-reduce to effectively distribute computational analyses and querying tasks. ...
Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web-interface of the BoaG infrastructure can be accessed here: Please use user = boag and password = boag to login. Source code and other documentation are also provided as a GitHub repository:
... For relatively large datasets (5 GB or more), it is challenging, if not impossible, to achieve real-time visual updates with conventional visual analytic platforms. Recent developments aimed at handling big transportation data leverage high-performance computing clusters in the back end for all the heavy-lifting computations including data ingestion, aggregation, integration and reduction (Badu-Marfo et al. 2019; Islam and Sharma 2019). The filtered, aggregated and lightweight data are subsequently pushed to the front end for visual exploration. ...
Transportation agencies rely on a variety of data sources for condition monitoring of their assets and making critical decisions such as infrastructure investments and project prioritization. Recent exponential increase in the volumes of these datasets has been causing significant information overload problems for data analysts; data curation process has increasingly become time consuming as legacy CPU-based systems are reaching their limits for processing and visualizing relevant trends in these massive datasets. There is a need for new tools that can consume these new datasets and provide analytics at rates resonant with the speed of human thought. The current paper proposes a new framework that allows for both multidimensional visualization and analytics to be carried seamlessly on large transportation datasets. The framework stores data in a massively parallel database and leverages the immense computational power available in graphical processing units (GPUs) to carry out data analytics and rendering on the fly via a Structured Query Language which interacts with the underlying GPU database. A front-end is designed for near-instant rendering of queried results on simple charts and maps to enable decision makers to drill down insights quickly. The framework is used to develop applications for analyzing big transportation datasets with over 100 million rows. Performance benchmarking experiments conducted showed that the methodology developed is able to provide real-time visual updates for big data in less than 100 ms. The performance of the developed framework was also compared with CPU-based visual analytics platforms such as Tableau and D3.
... Boag, on the other hand, is such a tool but is currently only implemented for mining very large software repositories like GitHub and Sourceforge. It recently has been applied to address potentials and challenges of Big Data in transportation [21]. ...
Background: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired by existing languages for data-intensive computing and can easily integrate data from biological data repositories. Results: As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement over existing solutions like Python and MongoDB, by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in required storage of the raw data and a significant speedup in querying large datasets due to the automated parallelization and distribution provided by the Hadoop infrastructure during computations. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, gives researchers greater ability to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.
We propose PASS, an O(n) algorithm for data reduction that is specifically aimed at preserving the semantics of time series data visualization in the form of line chart. Visualization of large trend line data is a challenge and current sampling approaches do produce reduction but result in loss of semantics and anomalous behavior. We have evaluated PASS using 7 large and well-vetted datasets (Taxi, Temperature, DEBS challenge 2012-2014 dataset, New York Stock Exchange data, and Integrated Surface Data) and found that it has several benefits when compared to existing state-of-the-art time series data reduction techniques. First, it can preserve the semantics of the trend. Second, the visualization quality using the reduced data from PASS is very close to the original visualization. Third, the anomalous behavior is preserved and can be well observed from the visualizations created using the reduced data. We have conducted two user surveys collecting 3000+ users’ responses for visual preference as well as perceptual effectiveness and found that the users prefer PASS over other techniques for different datasets. We also compare PASS using visualization metrics where it outperforms other techniques in 5 out of the 7 datasets.
Data mining is the extraction of implicit, previously unknown and potentially useful information from data. In recent time, data mining studies have been carried out in many engineering disciplines. In this paper the background of data mining and tools is introduced. Further applications of data mining to transportation engineering problems are reviewed. The application of data mining for typical example of ‘Vehicle Crash Study’ is demonstrated using commercially available data mining tool. The paper highlights the potential of data mining tool application in transportation engineering sector. First Published Online: 19 Dec 2011
Early detection of incidents is one of the key steps to reduce incident-related congestion. With the increasing usage of GPS-based navigation, promising data-scalable crowdsourced probe data is now available which can provide near-real-time traffic speed information. This study utilizes such extensive historical datasets (approximately 500 GB) to gain useful insights on the normal traffic pattern of each segment. The insights come in the form of speed thresholds for different times of the day and days of the week for each segment. Thereafter, anomalous traffic behaviours are classified as incidents. The dynamic thresholds developed for each segment simplify the calibration steps that are often required when applying a model to a different dataset. Also, in this study, two alternatives to the traditional Standard Normal Deviate (SND) based incident detection algorithm are tested. The proposed algorithms can handle the masking effect of the SND method, where outliers inflate the mean and standard deviation values and result in lower threshold values and, in turn, a lower detection rate. The high detection rate (94-97%) obtained by these algorithms compared to the SND method (83%) shows the efficacy of the models. Although higher false alarm rates (FAR) are observed for these models, their values (4 false alarms/day) are well below the acceptable FAR (10 false alarms/day) reported in previous literature.
Since vehicle crashes in urban areas may potentially cause higher societal costs than those in rural areas, it is critical to understand the contributing factors of urban crashes, especially congestion. This paper analyzes the impacts of segment characteristics, traffic-related information and weather information on monthly crash frequency based on a case study in Iowa, U.S. A random parameter negative binomial (RPNB) model was employed. Considering that the same factor may impact crash frequency differently on segments with different congestion levels, the heterogeneity in random parameter means was introduced and discreetly examined. Data from 77 directional segments and 24 months (2013-2014) were used in this study. The empirical results show that segment length and maximum snow depth have fixed impacts while number of lanes, shoulder width and trailers percentage have random impacts on crash frequency. In addition, heterogeneous behaviors of the random factors were identified between segments with different congestion levels. For example, the model results indicate that the increase of left shoulder width tends to decrease crash frequency more under congested conditions than under uncongested conditions.
Transportation, as a means for moving goods and people between different locations, is a vital element of modern society. In this paper, we discuss how big data technology infrastructure fits into the current development of China, and provide suggestions for improvement. We discuss the current situation of China's transportation system, and outline relevant big data technologies that are being used in the transportation domain. Finally we point out opportunities for improvement of China's transportation system, through standardisation, integration of big data analytics in a national framework, and point to the future of transportation in China and beyond.
This paper presents a framework for evaluating the reliability of probe-sourced traffic speed data for detection of congestion and assessment of roadway performance. The methodology outlined uses pattern recognition to quantify accurately the similarities and dissimilarities of probe-sourced and benchmarked local sensor data. First, a pattern recognition algorithm called empirical mode decomposition was used to define short-, medium-, and long-term trends for the probe-sourced and infrastructure-mounted local sensor data sets. The reliability of the probe data was then estimated on the basis of the similarity or synchrony between corresponding trends. The synchrony between long-term trends was used as a measure of accuracy for general performance assessment, whereas short-and medium-term trends were used for testing the accuracy of congestion detection with probe-sourced data. By using 1 month of high-resolution speed data, the authors were able to use probe data to detect, on average, 74% and 63% of the short-term events (events lasting for at most 30 min) and 95% and 68% of the medium-term events (events lasting between 1 and 3 h) on freeways and nonfreeways, respectively. Significant latencies do, however, exist between the data sets. On nonfreeways, the benchmarked data detected events, on average, 12 min earlier than the probe data. On freeways, the latency between the data sets was reduced to 8 min. The resulting framework can serve as a guide for state departments of transportation when they outsource collection of traffic data to probe-based services or supplement their data with data from such services.
Using cloud computing capacities to provide and support ubiquitous connectivity and real-time applications and services for smart cities' needs is now a trend. This paper presents a roadmap for big data relying on cloud computing to make urban traffic and transportation smarter through mining and pattern visualization, with a literature review and case studies. Although most of these technologies are already commercialized, whether or not to adopt the cloud is still a problem for organizations because of top issues like security and privacy. However, a simple structured framework might help prevent some typical traps. For e-government and politics, open and transparent data that enhance decision-making are possible when not only enterprises carry out business intelligence but also the general public is empowered with smart devices.