Content uploaded by Geoffrey Charles Fox

Author content

All content in this area was uploaded by Geoffrey Charles Fox on Nov 19, 2020

Content may be subject to copyright.

DRAFT Deep Learning for Spatial Time Series

Geoffrey Fox, Indiana University 17 November 2020

Abstract

We show that one can study several sets of sequences or time-series in terms of an underlying

evolution operator which can be learned with a deep learning network. We use the language of

geospatial time series as this is a common application type but the series can be any sequence

and the sequences can be in any collection (bag) - not just Euclidean space-time -- as we just

need sequences labeled in some way and having properties consequent of this label (position in

abstract space). This problem has been successfully tackled by deep learning in many ways

and in many fields. The most advanced work is probably in Natural Language processing and

transportation (ride-hailing). The second case with traffic and the number of people needing

rides is a geospatial problem with significant constraints from spatial locality. As in many

problems, the data here is typically space-time-stamped events but these can be converted into

spatial time series by binning in space and time.

Comparing deep learning for such time series with coupled ordinary differential equations used

to describe multi-particle systems, motivates the introduction of an evolution operator that

describes the time dependence of complex systems. With an appropriate training process, we

interpret deep learning applied to spatial time series as a particular approach to finding the time

evolution operator for the complex system giving rise to the spatial time series. Whimsically we

view this training process as determining hidden variables that represent the theory (as in

Newton’s laws) of the complex system.

We formulate this problem in general and present an open-source package FFFFWNPF as a

Jupyter notebook for training and inference using either recurrent neural networks or a variant of

the transformer (multi-headed attention) approach. This assumes an outside data engineering

step that can prepare data for ingest into FFFFWNPF.

We present the approach and a comparison of transformer and LSTM networks for time series

of COVID infection and fatality data from 314 cities as well as hydrology from 671 locations

The paper concludes with a discussion of future applications including earthquake science,

logistics (job scheduling), and epidemiology as well as other important neural networks --

graphs, convolutional, convLSTM. We expect to use this technology with MLPerf datasets. We

intend to understand how complex systems of different types (different membership linkages)

are described by different types of deep learning operators. Geometric structure in space and

multi-scale behavior in both time and space will be important. We anticipate that the current

forecasting formulation is easily extended to sequence to sequence problems.

1. Introduction

There is increasing recognition of the importance of deep learning in data-driven discovery

across a broad range of applications. In this paper, we study time series where the MLPerf [1],

[2] working group led by Cisco technical leader Xinyuan Huang has recently highlighted many

areas and available datasets [3]. Logistics, network intelligence, manufacturing, smart city, and

ride-hailing [4] (transportation) are major Industry areas having important time series while

medical data is often of this form. We note that similar technical approaches (recurrent neural

nets and Transformers) are often used for both time series and “sequence to sequence

mapping” as seen in the major voice and translation areas separately studied at MLPerf. We

focus here on the analysis of time-dependent data for forecasting where approach can be

illustrated by the examples below. There is a class presentation covering this material [5], [6],

which extends an earlier technical report [7].

We compare deep learning for such time series with coupled ordinary differential equations

used to describe multi-particle systems, and this motivates the introduction of an evolution

operator that describes the time dependence of any complex systems. With an appropriate

training process, we interpret deep learning applied to spatial time series as a particular

approach to finding the time evolution operator for the complex system giving rise to the spatial

time series. Whimsically we view this training process as determining hidden variables that

represent the theory (as in Newton’s laws) of the complex system. Below we analyze three

examples focussing on what we term a spatial bag. These are problems consisting of a

collection of points (thought of members in a space) which have the same dynamics but with

different parameters and different initial conditions. The points in the bag have time series and

static data which can be considered as specifying parameters allowing different evolution

operators for each point. In the two methods presented here the static and dynamic data are

treated in identical fashion with static data viewed as a time series with time-independent

values. This is a common approach in the literature but I did not see it articulated definitively

anywhere. We use it as it fits our view of parameterized operators which is not obtained from

alternate methods that treat static data outside the LSTM or Transformer.

We formulate this spatial bag problem in general and present an open-source package

FFFFWNPF as a Jupyter notebook for training and inference using either recurrent neural

networks or modified transformers using multi-headed attention at decoder stage and an LSTM

for the encoder. This assumes an outside data engineering step [8], [9] that can prepare data for

ingest into FFFFWNPF. The software is open source and available [10] but it needs more

attention to clean API’s, further testing and documentation. The notebook is modified from that

prepared by Google [11] for the original transformer [12], so you can easily see the nature of

changes made.

We motivate our approach in section 2 considering a study of Newton's laws for molecular

dynamics and COVID-19 infection and fatality data from 314 cities (space points) using a well

established LSTM approach [13]. In section 3, we describe the spatial bag and the Transformer

model for it. Section 4 looks at LSTM and Transformer models for COVID-19 and a large

hydrology dataset [14]–[18]. In this study the work is motivated by realistic applications but only

uses them to study the technology. In later works we will apply these ideas to answer science

questions. The final section has conclusions

2. Motivating Examples

2.1 Deep Learning as a Particle Dynamics Integrator

Fig. 1. The average error in position updates for 16 particles interacting with an LJ potential, The left

figure is standard MD with error increasing for ∆t as 10, 40, or 100 times robust choice (0.001). On the

right is the LSTM network with modest error up to t = 10

6

even for ∆t = 4000 times the robust MD choice.

Molecular dynamics simulations rely on numerical integrators to solve Newton's equations of

motion. Using a sufficiently small time step to avoid discretization errors, these integrators

generate a trajectory of particle positions as solutions to the equations of motions. In [19]–[21],

the IU team introduces an integrator based on recurrent neural networks that is trained on

trajectories generated using a traditional Verlet integrator and learns to propagate the dynamics

of particles with timestep up to 4000 times larger compared to the Verlet timestep. As shown in

fig. 1 (right) the error does not increase as one evolves the system for the surrogate while the

standard integration in fig. 1 (left) has unacceptable errors even for time steps of just 10 times

that used in an accurate simulation. The surrogate demonstrates a significant net speedup over

Verlet of up to 32000 for few-particle (1 - 16) 3D systems and over a variety of force fields

including the Lennard-Jones (LJ) potential.

We often think of the laws of physics described by operators that evolve the system given

sufficient initial conditions and in this language, we have shown how to represent Newton’s law

operator by a recurrent network. We expect that the time dependence of many complex

systems: Covid pandemics, Southern California earthquakes, traffic flow, security events can be

described by deep learning operators that both capture the dynamics and allow predictions. In

the covid example below for example one can learn an operator that depends on the

demographics and social distancing approach for a given region.

2.2 Deep Learning to describe Covid Daily Data

Fig 2: Deep Learning fits to Covid case and death data from Feb. 1 to May 25, 2020, with predictions 2

weeks out and showing a weekly structure. The data is the square-root of counts for individual counties

normalized between 0 and 1.

There are extensive collections of daily data for the number of Covid reported cases and

deaths. These can be described by epidemiological models plus empirical fits [22] but as

proposed above and illustrated in fig. 2, we developed a deep learning model [23] that learned a

Covid daily evolution operator from (initially) 110 separate time series of curated (by the

University of Pittsburgh) data for different US cities. The time series were 100 days long and the

model was a 2 layer LSTM recurrent network similar to that used to describe the evolution of

molecular dynamics above. Additional features were learning from the demographics (fixed data

for each city) as well as time-dependent data and by predicting ahead for two weeks with each

series as shown in the figure. This capability is important in any application with multiple time

scales. For example, in earthquake forecasting multiscale in time effects are critical and one

might want to combine a general forecast for the next time step (days to months) with the

probability of the big one happening in the next 10 years. For 37 of the 110 cities reliable

empirical (not deep learning) fits are available to the case and death data up to April 15, 2020

[22]. A single deep learning time evolution operator can describe these 37 separate datasets

and smooth fitted data leads to very accurate deep learning descriptions shown in fig. 3. For

both figs. 2 and 3, the data is divided into windows of size 5, 9, or 13, and cases and deaths

were simultaneously trained together with demographic data. The later work in section 4

increases the number of locations to 314 and links with time-dependent mobility and social

distancing data[24].

Fig 3: Deep Learning Fits empirical Covid data descriptions with 37 separate results shown as summed

over cities. The cases and death were learned together in time series for different locations

3. General Formulation of Deep Learning for Time Series

3.1 Spatial Bag Problem

We now generalize the above to a spatial bag of time series shown in fig. 4, where we have a

set of time series where the spatial distances (e.g. locality) between points is not important;

rather they are differentiated by values of properties which can either be static (such as

percentages of population with high blood pressure for Covid example above or minimum

annual temperature for hydrology catchment studies) or dynamic (such as a local social

distancing measure). In later papers we will discuss problems which combine the features of

spatial bags and distance locality where convolutional networks (especially convLSTM) are

clearly useful.

Fig 4: Illustration of a spatial bag with associated time sequences

. t labels time and x location, The

Seq2Seq and forecast modes are illustrated and the structure of input data and predictions.

3.2 Using Transformers for Spatial Bags

The deep learning methods use approaches that were largely originally developed for natural

language processing.

Recurrent Neural Networks

RNN explicitly focus on

sequences and pass them

through a common network with

pretty subtle features. RNN are

designed to gather a history

which allows the time

dependence to be remembered

Attention-based methods [25]

are more modern and perhaps

somewhat easier to understand

as attention is a simple idea.

NLP is basically a classification

problem (look up tokens in a

context sensitive dictionary)

whereas (science) tends to be

numerical and so it is not

immediately obvious how to use

attention in technical time series

and we describe one possible approach in this paper. There have been a few studies of

transformer architectures for numerical time series such as [26]–[32] but there is not a large

literature.

Attention [12], [33] means that you “learn” structure from other related data and look for patterns

using a simple “dot-product” mechanism discussed later matching structure of different

sequences; there are other approaches to match patterns which is a good topic for future work.

Here we use a simple attention mechanism in an initial decoder but use a recurrent net LSTM

for the encoder as shown in fig. 5b). Such mixtures have been investigated and compared [34],

[35]. We compare the two architectures shown in a) and b) of fig. 5; a pure LSTM used in sec. 2

and a hybrid transformer

Scaled Dot-Product Attention and the Vectors Q K V

The basic item for LSTM and

Transformer is the same; a space

point with a time sequence with each

time in the sequence having a set of

static and dynamic values. In an

LSTM the sequence is “hidden” and

you have to unroll the recurrent

network to see it. However in

transformer the different time values

in a sequence are treated directly

and so each item contains W terms

(W is size of time sequence), Each

term is embedded in an input layer

and then mapped by 3 different

layers into vectors Q (query) K (key)

V . As shown in fig. 6, one matches

terms i and j by calculating Q(i)KT(j) and ranking with a soft-max step. This multiplies the

characteristic vector V(j) of this pattern and the total attention A(i) for item i, is calculated as a

weighted sum over values V(j). There are several different attention “heads” (networks

generating Q K V) in each step and the whole process is repeated in multiple encoder layers.

The result of the encoder step is considered separately for each item (each time in a time

sequence at a given location) and the embedded input of this layer is combined with the

attention as input to the LSTM decode step.

Choosing group of items over which Attention Calculated

In natural language processing, you look for patterns among neighbouring sentences but for

science time series you can have larger regions as spatial bags have no locality. This leads to

many choices as to the space over which attention is calculated as one can’t realistically

consider all items simultaneously. Suppose we have Nloc locations; each with Nseq sequences of

length W. Then the space to be searched has size Nloc. Nseq . W which is too large. In COVID-19

example Nloc= 314, Nseq ~ 200 and W up to 13. In the hydrology example in sec. 5, Nloc= 671,

Nseq ~ 7000 and W up to 270. The next subsection describes the 3 search strategies we have

looked at in FFFFWNPF: and depicted in fig. 7. There is

● Temporal search: points in sequence for fixed location

● Spatial search: locations for a fixed position in sequence

● Full search: complete location-sequence space

One will need to sample items randomly as only a small fraction of space is looked at in one

attention step whatever method used. Note that in all cases we used a batch size of 1 as the

attention space was effectively the batch. Actually in the LSTM stage, the different locations in

the attention space were considered separately and the attention search space became the

batch. In the work reported here the attention space (and batch size) was set to Nloc but this is

not required and would not work in some cases with large values of Nloc. Even in examples

considered here, the search space can get so large that one needs to address memory size

issues and we describe this in sec. 3.3. Note the encode step of the transformer is many matrix

multiplications and gets excellent GPU performance and typically LSTM decoder would be a

significant part of the compute time and so the addition of attention is not a major performance

hurdle.

Fig. 7: Three search strategies discussed in paper. N=N

A

is the number of items considered in

attention search- each labelled by a location and starting time - and each a window of length W.

W=5 in diagram. N

A

= N

B

=

Nloc here.

An epoch contains Nloc.Nseq sequences each of length W and consisting of static and dynamic

variables characteristic of location. For an LSTM, you would have a batch which consists of NB ~

Nloc sequences. The number of batches in an epoch is approximately Nseq. For a transformer the

batch is currently unwrapped as discussed above and used as the attention space. Memory use

or compute issues could require a different strategy separately considering batch and attention

space.

Note the attention space choice implies different initial shuffling to form batches and also

different prediction stage approaches. For spatial and temporal searches one can keep all

locations at a particular time value together in both forming batches and in calculating

prediction. For the full search the complete set of Nloc.Nseq sequences must be shuffled. In

practice we combined the spatial and temporal search and accumulated the results of these two

attention searches.

3.3 Comments on FFFFWNPF Implementation

Sequence Memory Use: Suppose that one has Nloc locations, Nseq sequences and time windows

of length W. This requires memory of order NlocNseqW which for hydrology example exceeds

CPU memory for W 10. However the sequences are only needed at the time the sequence is

being used in a batch and so we use “Virtual Windows” where the system is set up with just time

values with predictions calculated for sequences that end at each time value. Then one forms

the sequences inside the deep learning loop dynamically for each batch. This requires care to

be efficient and needs custom Tensorflow training but works well with little overhead. Probably

this approach should always be used for time series.

Attention Search Memory Use: The full search requires matrices of size (NAW)2 and this can

be too large to fit in the GPU memory. As W is chosen for a given scenario, if this limit is seen,

one essentially needs to reduce the number of locations (NA) and this is done (hidden from user)

by breaking matrix into blocks inside the function calculating the scaled dot product attention.

These blocks require tensor slicing which is a little tricky as sliced tensors cannot be assigned in

current tensorflow but appending blocks to a list and using concat achieves this goal. This

approach was used for larger values of W for models that used the mechanism of fig. 7c). In

future work, we can of course look at parallelism to address both memory use and performance.

For W=120 for example, each epoch takes over an hour to calculate on Colab.

Deep Learning Approach: All these results used Tensorflow and Jupyter notebooks but they

can surely use PyTorch or other frameworks. The simplest LSTM can be done with the basic

sequential Keras model with a succession of layers. The Transformer has a more complex

structure that needs Tensorflow’s custom training. We designed a custom monitor that

checkpointed (using Tensorflow’s standard mechanism) weights and monitored the variation of

loss with increasing epoch number. Large increases in loss were rolled back during the training

and at the end of the training, the best checkpointed result was used. The number of rollbacks

increased at end of run when improvement was anyway very small

Real-Time Issues: Often time-series are observed in real-time and need to be responded to

with low latency requiring inference to execute at the edge. This was not relevant for examples

considered here with daily data. However these methods need to be reviewed for real-time

inference scenarios when the large search space that can be used by the transformer can

significantly increase prediction time latency.

Structure of workflow: We can divide data processing for the type of problem considered here

into five stages: pre-notebook; specialized notebook input; generic data pre-processing; training;

visualization. We give details of five steps below

1) The pre-notebook data engineering performs actions specific to a particular domain as,

for example, in forming binned time series from recorded earthquake events.

2) This stage prepares data for the notebook which reads the data (typically csv files) into

a set of numpy arrays. This includes dynamic and static properties as well as metadata

such property and location names.

3) In the important third stage, the notebook performs generic preparation tasks common to

all datasets. This includes normalizing static and dynamic data by taking powers (square

root, cube root, log to reduce standard deviation/mean) and linear transformations so

values lie between 0 and 1. Also one must generate sequences (or prepare the virtual

sequences described above under sequence memory use) and generate predictions

associated with each sequence final time value. The predictions include futures which for

COVID-19 use cases was for two weeks ahead. These cannot be found for sequences

whose endpoint is within two weeks of the final data point; those points are set to NaN.

In general the method accommodates any missing data for predictions but not for the

input sequences where all data must be present. The generic input processing includes

adding of positional space and time encoding for both LSTM and Transformer. We use

simple linear indices for both space and time supplemented by periodic time encoding

which is weekly for the COVID-19 data but annual for hydrology. This periodic structure

corresponds to series such as (cosθ, sinθ) where θ runs from 0 to 2π over the period

length. Note this seems extravagant but it is not and performance of the network is not

significantly impacted by adding either these encoding or by the extra future variables.

4) The next stage is a custom Tensorflow training where we must of course design the

network from a collection of class instances. It is also set up with a monitor that controls

the backup and restores the weights if the optimizer sends the fit into left field and the

loss is significantly increased. We must set the parameters to control fit such as the

search space for the transformer and the usual hyperparameters including number and

size of layers, dropout, epoch, batch, and validation set, The training code must

generate the sequences every batch if virtual windows are used. The long Tensorflow

training runs ( up to over an hour per epoch) sometimes fail for Colab system glitches

and jobs are often set up to restart from the backups. Further the weight architecture is

independent of window size W, so large window runs W 60 can be initialized with

results of faster runs with smaller window size -- typically choosing a stage which was

not fully converged but had a loss that was perhaps 30% above the final value. We use

a simple custom loss function which certainly needs to recognize any missing prediction

data designated by NaN in value. This is easy to test on and as loss function is additive

over predictions it is sufficient to just skip over such points in calculations in the custom

loss function.

5) The last notebook activity is the many post fit visualization and analysis steps where

care is needed to generate accurate efficient predictions.

4. Application to Hydrology Problems

This work was motivated by an NCAR summer school including a video lecture [36] describing

the use of recurrent neural networks in hydrology. There is also a good review [37] of deep

learning in Hydrology with 129 papers. Many of the 129 papers are based on the CAMELS

dataset from NCAR [15] where we focus on one of the most sophisticated analyses by Kratzert

[16], [18]. A summary of this is contained in the resource [38] produced for a summer

undergraduate research project [39]. The CAMELS data has 671 easily used locations

(catchments) where 6 observables (rainfall, runoff, etc.) are defined on each location for 20

years. As well as this dynamic data there are 27 static variables for each location. The goal is to

build a single model describing these 671 locations. This can be posed in many ways with

different input and output choices. It can also be formulated as a sequence to sequence model

or as a sequence to forecast model (as Covid and Kratzert did). The loss function can be MSE

or similar or more subtly the Nash–Sutcliffe efficiency NSE normalized by the measured

standard deviation.

In this paper, we are not aiming at science but rather a technology evaluation of LSTM and the

hybrid transformer. So we asked how well these models described the full dataset rather than

using half the dataset to predict the other half as was done in [16], [18]. We looked at the 671

locations and 7031 daily data and windows W from 5 to 120. We did not see any advantages of

large windows for the CAMELS dataset and present results for W=25. Interestingly the

hydrology analysis preferred the spatial and temporal search strategies, (a) and b) in fig. 7,

while COVID-19 analysis hybrid transformer analysis preferred the full search, c) in fig. 7. In

both cases, the hybrid transformer obtained a final loss MSE that was 20% lower than the

LSTM. Further, we found 2 Encoder layers gave good answers with quick convergence and

either 4 (usual choice) or 8 heads per layer were sound. We explored different choices of the

network embedding for Q K V and input data but a simple single layer with SELU activation for

each embedding worked well. The internal representation of the space time points used 128

bits. However this hyperparameter search was not thorough.

The 27 static variables p_mean, pet_mean, aridity, p_seasonality, frac_snow_daily,

high_prec_freq, high_prec_dur, low_prec_freq, low_prec_dur, elev_mean, slope_mean,

area_gages2, forest_frac, lai_max, lai_diff, gvf_max, gvf_diff, soil_depth_pelletier,

soil_depth_statsgo, soil_porosity, soil_conductivity, max_water_content, sand_frac, silt_frac,

clay_frac, carb_rocks_frac, geol_permeability, were used with details given in [40]. The 6

dynamic data with daily values are given in the table below

Table: CAMELS data recorded daily

Dynamic attribute

description

unit

1

prcp(mm/day)

daily cumulative precipitation

mm/day

2

srad(W/m2)

average short-wave radiation

W/m2

3

tmax(C)

daily maximum air temperature

C

4

tmin(C)

daily minimum air temperature

C

5

vp(Pa)

vapor pressure

Pa

6

QObs(mm/d)

Discharge (cubic meters per day/ basin area)

mm/day

All this data was scaled between 0 and 1 while QObs and prcp had the cube root taken before

this scaling to give greater representation quality by increasing scaled standard deviation.

We investigated pure LSTM and hybrid transformer representations of the hydrology data with a

variety of choices for window size W and the different attention mechanisms described in

section 3.2. We did not see significant dependence on W for values ≧ 21 and typical results are

shown in fig. 8 for W = 25. There are a lot of fluctuations in the rainfall (prcp observable) making

figure 8 a) left) hard to interpret and it is shown with an expanded scale in fig. 8 d). As

mentioned already, we obtained the best representations using the stratified search strategies

across spatial or time directions, a) and b) in fig.7.

Fig. 8(a) Results of the hybrid transformer for the first two daily time series prcp and

srad,predicting one day in the future. This had 4 attention heads and 2 encoder layers and used

search strategies a) and b). The left plot is expanded in fig. 8(d). The data in figures 8 a) to d)

are the sum over locations of normalized data with cube root taken for prcp and QObs.

Fig. 8(b) Results of the hybrid transformer for the middle two daily time series tmax and

tmin,predicting one day in the future. This had 4 attention heads and 2 encoder layers and used

search strategies a) and b)

Fig. 8(a) Results of the hybrid transformer for the final two daily time series vp and QObs,

predicting one day in the future. This had 4 attention heads and 2 encoder layers and used

search strategies a) and b). Note a few measurements of QObs are missing and so summed

data is not available for some days in the right plot. The fit does all 671 locations separately and

so the region where the sum is not plotted is in fact well covered in the fit.

Fig. 8(d) Expansion of the plot in fig. 8(a) left with x-axis expanded a factor of 6. We only record

a third of the days as these illustrate what is going on quite well. The other 4 plots for remaining

days are similar

We also applied the hybrid transformer model to the COVID-19 data introduced in sec. 2.

Typical results are shown in figure 9 that not only gives the summed totals as in figs.2, 3, 8 but

also the particular description of data from New York City. Unlike the hydrology data, the full

attention search fig. 7 c) gave the best description.

Fig. 9(a) Results of the hybrid transformer for the COVID-19 infections (left) and fatalities (right)

for a larger sample than that shown in fig. 2. The network had 4 attention heads and 2 encoder

layers and used search strategy c) and a window size of 9. We show variation in predictions

across different samplings of the attention search space. The figure shows sum over

counties/cities of normalized fitted data which is square root of individual counts

Fig. 9(b) Results of the hybrid transformer for the COVID-19 infections (left) and fatalities (right)

in New York City for the fit presented in fig. 9 a). This figure shows true counts

5.

Conclusions

Above we have given examples of recurrent networks of the time evolution operator for complex

systems and we are extending this to other areas. We see the mix of networks used above as a

base approach applicable to many problems. Some examples need additional features:

earthquakes (with fault lines) and transportation (road systems) need some variant of graph

networks while mixtures of convolutional and recurrent networks (such as convLSTM) are used

in weather and again earthquakes where the time series features can consist of images. Here

we identify a new problem class -- spatial geometry where unlike spatial bag models, the spatial

locality of the points in the problem is important

We intend to study deep learning based time evolution operators for different complex systems

and identify patterns as to which type of network describes which problem classes and the

amount of data needed to get good results. Hopefully, we will also make research advances in

the best networks to use; this is already seen in the move from recurrent networks to

transformer and reformer architectures but this was largely motivated by sequence to sequence

mapping and not by time series. We suggest more research in multiple or hierarchical time

scales as this is needed in many applications.

We intend to build a benchmark set of time series datasets and reference implementations as

playing the same role for time series that ImageNet ILSRVC and AlexNet played for images.

The different implementations establish best practice or get chosen for different application

areas to either suggest an architecture or an initial network by transfer learning. Interesting

complex systems that we can quickly look at, include virtual tissues [41], [42] and

epidemiology[43] for Covid related applications. Such evolution operators are also seen[3] in

finance, networking, security, monitoring of complex systems from Tokamaks[44] to operating

systems, and environmental science.

MLPerf benchmarks aim to quantitatively study the highest performance hardware and software

systems. However, they also serve as examples of best practice and can help advance a field

by documenting best practices. We intend to combine the open datasets and clean reference

implementations available in MLPerf with documentation and tutorials which will allow MLPerf

benchmarks to encourage the broad community to study these examples and use the ideas in

other applications as well improving on our base reference implementations [45]. This work will

be performed in the MLPerf Science Data which was just set up and is led by Fox and Hey

(Chief Data Scientist at the Rutherford Appleton Laboratory). We will build multiple time series of

the “MLPerf tutorial style” starting with initial projects from Indiana and identifying other

examples from either the working group compilation [3] or identified in the meetings of MLPerf

working groups. We will also look at areas including anomaly/failure detections, device metrics

analysis, troubleshooting, and many IoT related data streams; examples have been compiled in

cybersecurity and industrial operation categories by MLPerf [3].

Acknowledgements

This work is partially supported by the National Science Foundation (NSF) through awards

CIF21 DIBBS 1443054, nanoBIO 1720625, CINES 1835598 and Global Pervasive

Computational Epidemiology 1918626. I thank Lijiang Guo, Gregor von Laszewski,

Saumyadipta Pyne, Xinyuan Huang, Bo Feng, Russell Hofmann, JCS Kadupitiya, and Vikram

Jadhao for great discussions and Niranda Perera for preparing the Hydrology dataset. I am

grateful to Cisco University Research Program Fund grant 2020-220491 for supporting this

research.

Software

A recent FFFFWNPF Google Colab notebook can be found at

https://colab.research.google.com/drive/1dUwrxlUq8G8HHTzImJXmJ9qiUFabdIKf?usp=sharing.

This defines hyperparameters and networks and is open source. The name FFFFWNPF

remembers the name of the optimization package that I developed over 50 years ago and used

in physics data analysis such as [46]. FFFF stands for Fee-fi-fo-fum [47] and WNPF is just Well

Nigh Perfect Fit. The original FFFWNPF suffered the fate of computer cards and retired

electronic stores past their glory days [48] at LBNL. This fate was entirely my fault as I was

thinking about other things; FFFWNPF was very grateful for example for the extensive work

done on the CDC 6600 and 7600’s at LBNL.

References

[1] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P.

Bailis, V. Bittorf, D. Brooks, D. Chen, D. Dutta, U. Gupta, K. Hazelwood, et al.

, “MLPerf Training

Benchmark,” in Proceedings of Machine Learning and Systems 2020

, 2020, pp. 336–349.

[2] “MLPERF benchmark suite for measuring performance of ML software frameworks, ML hardware

accelerators, and ML cloud platforms.” [Online]. Available: https://mlperf.org/. [Accessed:

08-Feb-2019]

[3] X. Huang, G. C. Fox, S. Serebryakov, A. Mohan, P. Morkisz, and D. Dutta, “Benchmarking Deep

Learning for Time Series: Challenges and Directions,” in 2019 IEEE International Conference on Big

Data (Big Data)

, 2019, pp. 5679–5682 [Online]. Available:

http://dx.doi.org/10.1109/BigData47090.2019.9005496

[4] Yan Liu, “Artificial Intelligence for Smart Transportation Video.” [Online]. Available:

https://slideslive.com/38917699/artificial-intelligence-for-smart-transportation. [Accessed:

08-Aug-2019]

[5] Geoffrey Fox, “Class Presentation: Studying Time Series with Deep Learning.” [Online]. Available:

https://docs.google.com/presentation/d/1ddAsHq-uikjPnomnSE21E0R0y7xBIrLGvaP_b6CM0dw/edit

?usp=sharing. [Accessed: 13-Nov-2020]

[6] Geoffrey Fox, “Video of Studying Time Series for Deep Learning.” [Online]. Available:

https://youtu.be/teXeAdX6_Cg. [Accessed: 13 November, 2020]

[7] Geoffrey Fox, “Deep Learning Based Time Evolution.” [Online]. Available:

http://dsc.soic.indiana.edu/publications/Summary-DeepLearningBasedTimeEvolution.pdf. [Accessed:

08-Jun-2020]

[8] C. Widanage, N. Perera, V. Abeykoon, S. Kamburugamuve, T. A. Kanewala, H. Maithree, P.

Wickramasinghe, A. Uyar, G. Gunduz, and G. Fox, “High Performance Data Engineering

Everywhere,” arXiv [cs.DC]

, 19-Jul-2020 [Online]. Available: http://arxiv.org/abs/2007.09589

[9] Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila

Kanewala, Hasara Maithree, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey Fox, “Data

Engineering for HPC with Python,” in SC20 Proceedings

, Atlanta [Online]. Available:

https://www.researchgate.net/profile/Geoffrey_Fox/publication/344211401_Data_Engineering_for_H

PC_with_Python/links/5f5c2f3892851c07895fdbce/Data-Engineering-for-HPC-with-Python.pdf

[10] Geoffrey Fox, “FFFFWNPF Time Series Deep Learning Jupyter Notebook.” [Online]. Available:

https://colab.research.google.com/drive/1dUwrxlUq8G8HHTzImJXmJ9qiUFabdIKf?usp=sharing.

[Accessed: 14-Nov-2020]

[11] Copyright 2019 The TensorFlow Authors., “Transformer model for language understanding: Google

Colab Tutorial.” [Online]. Available:

https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transforme

r.ipynb. [Accessed: 13-Nov-2020]

[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I.

Polosukhin, “Attention Is All You Need,” arXiv [cs.CL]

, 12-Jun-2017 [Online]. Available:

http://arxiv.org/abs/1706.03762

[13] Geoffrey C. Fox, Gregor von Laszewski, Fugang Wang, Saumyadipta Pyne, “AICov: An Integrative

Deep Learning Framework for COVID-19 Forecasting with Population Covariates,” Arxiv 2010.03757,

Jul. 2020 [Online]. Available: http://dsc.soic.indiana.edu/publications/paper_covid.pdf,

https://arxiv.org/abs/2010.03757

[14] Kratzert, Frederik, “CAMELS Extended Maurer Forcing Data.” [Online]. Available:

https://www.hydroshare.org/resource/17c896843cf940339c3c3496d0c1c077/. [Accessed:

14-Jul-2020]

[15] N. Addor, A. J. Newman, N. Mizukami, and M. P. Clark, “The CAMELS data set:Catchment attributes

and meteorology for large-sample studies,” Hydrol. Earth Syst. Sci.

, vol. 21, no. 10, p. 21, Oct. 2017

[Online]. Available: https://ueaeprints.uea.ac.uk/id/eprint/65434/. [Accessed: 05-Jul-2020]

[16] Frederik Kratzert, “Catchment-Aware LSTMs for Regional Rainfall-Runoff Modeling GitHub.” [Online].

Available: https://github.com/kratzert/ealstm_regional_modeling. [Accessed: 14-Jul-2020]

[17] A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L.

Brekke, J. R. Arnold, and Others, “Development of a large-sample watershed-scale

hydrometeorological data set for the contiguous USA: data set characteristics and assessment of

regional variability in hydrologic model performance,” Hydrol. Earth Syst. Sci.

, vol. 19, no. 1, p. 209,

2015 [Online]. Available:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.668.2612&rep=rep1&type=pdf

[18] F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing, “Towards learning

universal, regional, and local hydrological behaviors via machine learning applied to large-sample

datasets,” Hydrology & Earth System Sciences

, vol. 23, no. 12, 2019 [Online]. Available:

https://arxiv.org/abs/1907.08456

[19] JCS Kadupitiya, Geoffrey C. Fox, Vikram Jadhao, “GitHub repository for Simulating Molecular

Dynamics with Large Timesteps using Recurrent Neural Networks.” [Online]. Available:

https://github.com/softmaterialslab/RNN-MD. [Accessed: 01-May-2020]

[20] J. C. S. Kadupitiya, G. C. Fox, and V. Jadhao, “Simulating Molecular Dynamics with Large Timesteps

using Recurrent Neural Networks,” arXiv [physics.comp-ph]

, 12-Apr-2020 [Online]. Available:

http://arxiv.org/abs/2004.06493

[21] J. C. S. Kadupitiya, G. Fox, and V. Jadhao, “Recurrent Neural Networks Based Integrators for

Molecular Dynamics Simulations,” in APS March Meeting 2020

, 2020 [Online]. Available:

http://meetings.aps.org/Meeting/MAR20/Session/L45.2. [Accessed: 23-Feb-2020]

[22] Robert Marsland and Pankaj Mehta, “Data-driven modeling reveals a universal dynamic underlying

the COVID-19 pandemic under social distancing,” arXiv [q-bio.PE]

, 21-Apr-2020 [Online]. Available:

http://arxiv.org/abs/2004.10666

[23] Luca Magri and Nguyen Anh Khoa Doan, “First-principles Machine Learning for COVID-19

Modeling,” Siam News

, vol. 53, no. 5, Jun. 2020 [Online]. Available:

https://sinews.siam.org/Details-Page/first-principles-machine-learning-for-covid-19-modeling

[24] A. Adiga, L. Wang, A. Sadilek, A. Tendulkar, S. Venkatramanan, A. Vullikanti, G. Aggarwal, A.

Talekar, X. Ben, J. Chen, B. Lewis, S. Swarup, M. Tambe, and M. Marathe, “Interplay of global

multi-scale human mobility, social distancing, government interventions, and COVID-19 dynamics,”

medRxiv - Public and Global Health

, 07-Jun-2020 [Online]. Available:

http://dx.doi.org/10.1101/2020.06.05.20123760

[25] A. Galassi, M. Lippi, and P. Torroni, “Attention in Natural Language Processing,” arXiv [cs.CL]

,

04-Feb-2019 [Online]. Available: http://arxiv.org/abs/1902.02181

[26] D. A. Kaji, J. R. Zech, J. S. Kim, S. K. Cho, N. S. Dangayach, A. B. Costa, and E. K. Oermann, “An

attention based deep learning model of clinical events in the intensive care unit,” PLoS One

, vol. 14,

no. 2, p. e0211057, Feb. 2019 [Online]. Available: http://dx.doi.org/10.1371/journal.pone.0211057

[27] T. Gangopadhyay, S. Y. Tan, Z. Jiang, R. Meng, and S. Sarkar, “Spatiotemporal Attention for

Multivariate Time Series Prediction and Interpretation,” arXiv [cs.LG]

, 11-Aug-2020 [Online].

Available: http://arxiv.org/abs/2008.04882

[28] N. Xu, Y. Shen, and Y. Zhu, “Attention-Based Hierarchical Recurrent Neural Network for Phenotype

Classification,” in Advances in Knowledge Discovery and Data Mining

, 2019, pp. 465–476 [Online].

Available: http://dx.doi.org/10.1007/978-3-030-16148-4_36

[29] R. S. Kodialam, R. Boiarsky, and D. Sontag, “Deep Contextual Clinical Prediction with Reverse

Distillation,” arXiv [cs.LG]

, 10-Jul-2020 [Online]. Available: http://arxiv.org/abs/2007.05611

[30] J. Gao, X. Wang, Y. Wang, Z. Yang, J. Gao, J. Wang, W. Tang, and X. Xie, “CAMP: Co-Attention

Memory Networks for Diagnosis Prediction in Healthcare,” in 2019 IEEE International Conference on

Data Mining (ICDM)

, 2019, pp. 1036–1041 [Online]. Available:

http://dx.doi.org/10.1109/ICDM.2019.00120

[31] R. Sen, H.-F. Yu, and I. Dhillon, “Think Globally, Act Locally: A Deep Neural Network Approach to

High-Dimensional Time Series Forecasting,” arXiv [stat.ML]

, 09-May-2019 [Online]. Available:

http://arxiv.org/abs/1905.03806

[32] H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias, “Attend and diagnose: Clinical time series

analysis using attention models,” in Thirty-second AAAI conference on artificial intelligence

, 2018

[Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/16325

[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. U. Kaiser, and I.

Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems

,

2017, vol. 30, pp. 5998–6008 [Online]. Available:

https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

[34] A. Zeyer, P. Bahar, K. Irie, R. Schlüter, and H. Ney, “A Comparison of Transformer and LSTM

Encoder Decoder Models for ASR,” in 2019 IEEE Automatic Speech Recognition and Understanding

Workshop (ASRU)

, 2019, pp. 8–15 [Online]. Available:

http://dx.doi.org/10.1109/ASRU46091.2019.9004025

[35] Z. Zeng, V. T. Pham, H. Xu, Y. Khassanov, E. S. Chng, C. Ni, and B. Ma, “Leveraging Text Data

Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning,” arXiv [eess.AS]

,

21-May-2020 [Online]. Available: http://arxiv.org/abs/2005.10407

[36] Chaopeng Shen, Penn State University, “D2 2020 AI4ESS Summer School: Recurrent Neural

Networks and LSTMs.” [Online]. Available: https://www.youtube.com/watch?v=vz11tUgoDZc.

[Accessed: 01-Jul-2020]

[37] M. A. Sit, B. Z. Demiray, Z. Xiang, G. Ewing, Y. Sermet, and I. Demir, “A Comprehensive Review of

Deep Learning Applications in Hydrology and Water Resources,” 2020 [Online]. Available:

https://eartharxiv.org/xs36g/

[38] Fugang Wang, “LATEST Short demo-LSTM-Tutorial.ipynb Hydrology on CAMELS Notebook.”

[Online]. Available:

https://colab.research.google.com/drive/1MePoMs1mNmsiPMxUi2RxExgVQKCzhM17?usp=sharing.

[Accessed: August 1,. 2020]

[39] Geoffrey Fox, “Hydrology AIHEC REU Page for summer research 2020.” [Online]. Available:

https://docs.google.com/document/d/1gC7jyhfGjYihZhn8AF-4qhquJlQYUd97sqXrr7FPZfs/edit?usp=s

haring. [Accessed: 01-Jul-2020]

[40] NCAR, “CAMELS: CATCHMENT ATTRIBUTES AND METEOROLOGY FOR LARGE-SAMPLE

STUDIES - DATASET DOWNLOADS.” [Online]. Available:

https://ral.ucar.edu/solutions/products/camels. [Accessed: 17-Nov-2020]

[41] T. J. Sego, J. O. Aponte-Serrano, J. F. Gianlupi, S. Heaps, K. Breithaupt, L. Brusch, J. M. Osborne,

E. M. Quardokus, and J. A. Glazier, “A Modular Framework for Multiscale Spatial Modeling of Viral

Infection and Immune Response in Epithelial Tissue,” BioRxiv

, 2020 [Online]. Available:

https://www.biorxiv.org/content/10.1101/2020.04.27.064139v2.abstract

[42] Yafei Wang, Gary An, Andrew Becker, Chase Cockrell, Nicholson Collier, Morgan Craig, Courtney L.

Davis, James Faeder, Ashlee N. Ford Versypt, Juliano F. Gianlupi, James A. Glazier, Randy Heiland,

Thomas Hillen, Mohammad Aminul Islam, Adrianne Jenner, Bing Liu, Penelope A Morel, Aarthi

Narayanan, Jonathan Ozik, Padmini Rangamani, Jason Edward Shoemaker, Amber M. Smith, Paul

Macklin, “Rapid community-driven development of a SARS-CoV-2 tissue simulator,” BioRxiv

, 2020

[Online]. Available: https://www.biorxiv.org/content/10.1101/2020.04.02.019075v2.abstract

[43] D. Machi, P. Bhattacharya, S. Hoops, J. Chen, H. Mortveit, S. Venkatramanan, B. Lewis, M. Wilson,

A. Fadikar, T. Maiden, C. L. Barrett, and M. V. Marathe, “Scalable Epidemiological Workflows to

Support COVID-19 Planning and Response,” May 2020.

[44] J. Kates-Harbeck, A. Svyatkovskiy, and W. Tang, “Predicting disruptive instabilities in controlled

fusion plasmas through deep learning,” Nature

, vol. 568, no. 7753, pp. 526–531, Apr. 2019 [Online].

Available: https://doi.org/10.1038/s41586-019-1116-4

[45] G. Fox, “SciDatBench AI for Science Benchmark Activity working with MLPerf.” [Online]. Available:

https://github.com/DSC-SPIDAL/SciDatBench/. [Accessed: 01-Aug-2020]

[46] Geoffrey Fox, “Veni, Vidi, Vici Regge Theory; Comments Nucl.Part.Phys. 3 (1969) 190-197. (journal

disappeared).” [Online]. Available:

https://www.dsc.soic.indiana.edu/sites/default/files/Veni%2C%20Vidi%2C%20Vici%20Regge%20The

ory.pdf. [Accessed: 17-Nov-2020]

[47] “Wikipedia Fee-fi-fo-fum.” [Online]. Available: https://en.wikipedia.org/wiki/Fee-fi-fo-fum. [Accessed:

17-Nov-2020]