All content in this area was uploaded by Tim O'Shea on Mar 20, 2017

Recurrent Neural Radio Anomaly Detection

Timothy J. O’Shea

Bradley Department of Electrical

and Computer Engineering

Virginia Tech, Arlington, VA

Email: oshea@vt.edu

T. Charles Clancy

Bradley Department of Electrical

and Computer Engineering

Virginia Tech, Arlington, VA

Email: tcc@vt.edu

Robert W. McGwier

Bradley Department of Electrical

and Computer Engineering

Virginia Tech, Arlington, VA

Email: rwmcgwi@vt.edu

Abstract—We introduce a powerful recurrent neural network

based method for novelty detection to the application of detecting

radio anomalies. This approach holds promise in signiﬁcantly

increasing the ability of naive anomaly detection to detect small

anomalies in highly complex multi-user radio bands.

We demonstrate the efﬁcacy of this approach on a number of

common real over the air radio communications bands of interest

and quantify detection performance in terms of probability of

detection and false alarm rates across a range of interference to

band power ratios and compare to baseline methods.

I. INTRODUCTION

Anomaly detection is an important time-series function

which is widely used in network security monitoring, medical

sensor monitoring, ﬁnancial change modeling, and any number

of other applications. There is a signiﬁcant body of existing

work on the subject in theory and applications [4] [13]; we

focus here primarily on reconstruction-based novelty detection.

In radio, applications of anomaly detection have been

discussed [7] but not widely used outside of a few small niche

applications.

Radio anomaly detection has been leveraged somewhat in

wireless sensor networks such as in [11] [8] [3] [9], but most

of these applications focus on detecting changes in sensor data

(temperature, pressure, etc), or expert features rather than on

anomalies occurring in the high rate raw physical layer radio

signal itself. We are unaware of any currently widely used or

investigated applications of naive anomaly detection within the

raw unmodiﬁed radio physical layer rather than on an expert

feature such as a detection statistic. The focus on raw RF rather

than a specialized statistic is based on the hope that such a

technique may be able to generalize well to numerous types

of signals, rather than relying on speciﬁc analysis to help a

single scenario.

There are a number of driving motivations for such a

capability; for instance for spectrum access enforcement by

regulatory bodies in which the appearance of new radio users

of any kind of emission on unauthorized bands or with

unauthorized equipment or techniques presents an enforcement

event which should be rapidly addressed. In commercial and

defense communications applications radio anomaly detection

also presents the opportunity to rapidly recognize interfering

emitters, malfunctioning equipment, or malicious users within

their licensed bands and take action. Each of these applications currently requires expensive, high-maintenance expert systems which often perform a series of steps (energy detection, localization, classification, comparison to baseline databases, and alerting), need large amounts of specialization and tuning to the band and signals of interest, and take significant computational horsepower and implementation expense to deploy.

By shifting these systems to more generally applicable naive

methods using neural networks which can be highly optimized

on concurrent architectures such as graphics processing units

(GPUs) in a very general way, we offer the potential to make

such systems much easier to realize, adapt to new domains,

and run leveraging economies of scale on computing power

and model primitive optimization.

II. APPROACH

Recently, the use of recurrent neural networks to form

a predictive reconstruction as part of a novelty detector has

been proposed [17] and demonstrated to function quite well

on several time series datasets including electrocardiograms,

physical telemetry signals, and power consumption metrics.

In this work a long short-term memory (LSTM) [2] based recurrent network is trained as a time series predictor on a training set, and is then used to compute an expected error distribution against the real signal, which is well characterized for non-anomalous behavior. The sequence learning capacity of the LSTM has in some cases been shown to exceed that of hidden Markov models [10] and Kalman-based linear predictors, because it can take into account a much more complex nonlinear representation of state (it does not make the Markov assumption), short- and long-term transition dependencies, and more complex nonlinear output mapping dependencies than either of these prior models. An overview of this system-level model is shown in figure 1.

In this case we consider a sampled radio data time series $X = \{x_1, x_2, \ldots, x_M\}$, our training support, where each point is a complex baseband sample in $\mathbb{R}^2$. We train a predictor $f$ with learned parameters $\theta_{NN}$ such that for any start offset $k$, input sample count $N$, and prediction length $\ell$, we form the sequence regression problem shown in equation 1.

$$\{\hat{x}_{k+N}, \ldots, \hat{x}_{k+N+\ell-1}\} = f(\{x_k, \ldots, x_{k+N-1}\}, \theta_{NN}) \qquad (1)$$

This function f can take the form of any number of

different prediction methods, but in this case we evaluate a

recurrent neural network (RNN) based predictor leveraging

LSTM layers.
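The sequence regression problem above can be set up by slicing a recording into overlapping input/target windows. The following is a minimal sketch, assuming NumPy, a hypothetical helper name `make_windows`, and the 32-input/4-output sizes used later in the paper; storing each complex sample as an (I, Q) pair of reals is one possible reading of "a complex baseband sample in R^2", not necessarily the authors' data layout.

```python
import numpy as np

def make_windows(x, n_in=32, n_out=4):
    """Slice a complex series into (input, target) pairs for sequence regression.

    Each complex sample is stored as a real pair (I, Q).
    """
    x = np.asarray(x, dtype=np.complex64)
    iq = np.stack([x.real, x.imag], axis=-1)           # shape (M, 2)
    n_windows = len(iq) - n_in - n_out + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than one input/output window")
    # Row k of idx holds the sample indices k .. k + n_in + n_out - 1.
    idx = np.arange(n_in + n_out)[None, :] + np.arange(n_windows)[:, None]
    windows = iq[idx]                                   # (n_windows, n_in + n_out, 2)
    return windows[:, :n_in], windows[:, n_in:]

# Usage on a synthetic stand-in for a recording (not real RF data).
rng = np.random.default_rng(0)
x = rng.standard_normal(100) + 1j * rng.standard_normal(100)
inputs, targets = make_windows(x)
print(inputs.shape, targets.shape)  # (65, 32, 2) (65, 4, 2)
```

Any predictor $f$ (Kalman, dense, convolutional, or LSTM) can then be fit to map `inputs` to `targets`.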

arXiv:1611.00301v1 [cs.LG] 1 Nov 2016

Fig. 1. Neural Radio Anomaly Detection System Model

A predictor error vector can then be obtained on the training set by computing the difference $e_i = x_i - \hat{x}_i$ for each predicted value given the known (non-causal) actual values of $x$ over $[k+N, k+N+\ell-1]$. This error vector is used to estimate the distribution $\{e_0, \ldots, e_{\ell-1}\} \sim P_e(e^{(\ell)}, \theta_E)$ through some form of parametric or non-parametric density estimation. In this case, we model it using a parametric multivariate Gaussian distribution.
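As an illustration of the parametric error model, the sketch below fits a multivariate Gaussian to a set of prediction error vectors and evaluates its log-density. It assumes NumPy, hypothetical function names, and that the $\ell$ complex errors of one prediction are flattened into $2\ell$ real values; the small diagonal loading term is also an assumption added for numerical stability.

```python
import numpy as np

def fit_gaussian(errors):
    """Estimate mean and covariance of error vectors, one row per prediction."""
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False)
    # Small diagonal loading keeps the covariance invertible (an assumption).
    cov = cov + 1e-9 * np.eye(errors.shape[1])
    return mean, cov

def gaussian_logpdf(errors, mean, cov):
    """Natural-log density of each row of `errors` under N(mean, cov)."""
    errors = np.atleast_2d(np.asarray(errors, dtype=float))
    d = errors.shape[1]
    diff = errors - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Usage: synthetic errors standing in for 4 complex residuals as 8 reals.
rng = np.random.default_rng(1)
train_err = rng.standard_normal((2000, 8))
mean, cov = fit_gaussian(train_err)
typical = gaussian_logpdf(train_err[:1], mean, cov)[0]
outlier = gaussian_logpdf(train_err[:1] + 10.0, mean, cov)[0]
```

An error vector far from the training residuals (`outlier`) receives a much lower log-density than a typical one, which is the quantity a detector can threshold.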

After fitting both the predictive model parameters $\theta_{NN}$ and the error distribution parameters $\theta_E$ on the dataset, we can now use this model to perform novelty detection by observing regions within the signal $x(n)$ where predictor error deviates significantly from the expected error distribution $D_e$, $p(e^{(\ell)}, \theta_E)$. That is to say, we wish to compute $p(x_i \sim D_e)$.

This can be done by thresholding the log-likelihood of $e_i$ in $D_e$ at some level to make a decision, or by combining $V$ sequential samples of the likelihood and thresholding the aggregate statistic. The expression we use looks roughly like that given below in equation 2.

$$\tau \underset{H_0}{\overset{H_1}{\gtrless}} 10 \log_{10} \prod_{v=0}^{V} p(e^{(\ell)}_v) \qquad (2)$$

Here $H_1$ is the hypothesis that the current values of $x_i, \ldots, x_{i+N}$ for each of the $V$ observations are drawn from a distribution matching that of our training set, representing "normal" behavior, while $H_0$ is the hypothesis that the current values of $x_i, \ldots, x_{i+N}$ are not drawn from the distribution of the training set, representing an anomaly or novelty being present. Our threshold $\tau$ can be fit using an $F_\beta$ or typical constant false alarm rate (CFAR) sort of analysis [1] if a decent dataset of "anomalous" behavior is known, or a false alarm rate on "normal" behavior can be used as the metric.
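The aggregate statistic and a false-alarm-based threshold might be computed roughly as follows. This is a sketch assuming NumPy and hypothetical function names: the product of likelihoods is evaluated as a sum of natural-log likelihoods over a sliding run of `v` observations, and $\tau$ is taken as an empirical quantile of the statistic on anomaly-free data. Flagging a window when its score falls below $\tau$ (low likelihood under the fitted error distribution) is the decision direction assumed here.

```python
import numpy as np

def aggregate_score(log_likelihoods, v):
    """10*log10 of the product of likelihoods over sliding runs of `v` observations.

    Works on natural-log likelihoods, so the product becomes a sum and
    cannot underflow.
    """
    ll = np.asarray(log_likelihoods, dtype=float)
    window_sums = np.convolve(ll, np.ones(v), mode="valid")
    return 10.0 * window_sums / np.log(10.0)

def threshold_for_false_alarm_rate(normal_scores, pfa):
    """Pick tau so that a fraction `pfa` of anomaly-free scores fall below it."""
    return np.quantile(np.asarray(normal_scores, dtype=float), pfa)

# Usage: ten observations, each with likelihood 0.1, combined four at a time.
scores = aggregate_score(np.log(np.full(10, 0.1)), v=4)    # every entry is -40 dB
tau = threshold_for_false_alarm_rate(np.arange(100.0), pfa=0.05)
is_anomalous = scores < tau
```

Choosing `pfa` directly trades detection sensitivity against the rate of alarms on normal traffic, which is the CFAR-style analysis the text refers to.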

III. WIDE-BAND RADIO COMMUNICATIONS TIME-SERIES

Time-series in the radio domain are quite complex models,

especially when considering wideband aggregates of numerous

channels on separate frequencies, each with their own tem-

poral and spectral channel access scheme, and each carrying

whitened randomized data-bits which typically do not repeat

aside from reference tones such as preamble, low-entropy-

headers and pilots. A time sequence prediction model for such

an aggregate signal must then be able to account for short-time

expected symbol transitions and pulse shaping of each carrier

and its channel variations as well as the symbols forming

higher level trafﬁc and application sequences representing

behavior of users. At both layers we must be able to model the

sum of all users and emitters combined into a single shared

medium on one or more channels. Differing levels of predictor

complexity and predictive capacity deﬁne how many of these

effects of ’normal’ behavior are modeled, deﬁning the scale,

complexity, or distance from the norm of the anomaly which

can be detected using that model.

Fig. 2. Spectrogram plots of excerpts from radio example sequences

In ﬁgure 2 we show spectrogram plots demonstrating a

variety of radio time-series complexities of real world signals

which we further consider in this paper. Here we see examples

ranging from the most static, continuously modulated analog

FM broadcast carriers, through relatively well structured on

a macro-scale but extremely complexly coded and changing

on a micro-scale, cellular band carriers of both GSM and

LTE with its rapidly changing resource block (RB) allocations in OFDMA time-frequency slots, and finally to the most chaotic ISM band environment, comprising CSMA/CA WiFi/IEEE 802.11 bursts occurring at random times and frequency-hopped Bluetooth/IEEE 802.15.1 bursts occurring at random times and frequencies, among other emitters.

Each of these bands presents different complexities which

a temporal sequence prediction model needs to capture to

accurately form a predictive model for signal behavior. We will

consider each of these examples while evaluating our anomaly

detection performance and train using these real recorded

over the air RF datasets including harsh urban channel fading

conditions. Recordings are conducted with an Ettus Research

B200-mini [6] which uses the Analog Devices AD9361 RFIC

front-end and stored to disk for analysis using GNU Radio [5].

IV. PREDICTOR MODELS

Here we describe each model $f(\{x_k, \ldots, x_{k+N-1}\}, \theta_{NN})$ which we use to predict the next samples of the time-series sequence. For fairness of comparison we normalize each predictor to 32 samples of input and 4 predicted output samples. We include several baseline models as well as a number of models patterned after state-of-the-art time series learning neural networks.

A. Kalman Sequence Predictor

We use a 3rd order Unscented Kalman Filter/Predictor

similar to that described in [12]. This is implemented using the

FilterPy module [18] and forms our performance benchmark

for this paper. This implements a traditional Kalman Novelty

Detector as one might do without a learned predictive model.

In this case the adaptive filter is tuned online while running, and the error distribution is characterized from its prediction errors.

B. DNN Sequence Predictor

Fig. 3. DNN Predictor Network Architecture

In our Dense Neural Network (DNN) model (shown in

ﬁgure 3), we train a naive fully-connected network as a neural

network baseline with a high number of free parameters and

heavy dropout allowing it to learn a completely unconstrained

mapping between input and output samples. This will allow

us to compare other specialized/constrained architectures such

as convolutional and recurrent varieties for model ﬁtting ap-

propriateness.

C. Raw LSTM Sequence Predictor

In the LSTM based sequence predictor model (Figure 4),

we implement a 2-layer LSTM followed by 2 fully connected

layers culminating in a linear activation for regression of com-

plex continuous valued sample output values. We regularize

between each layer with dropout of 0.5 and using proper

LSTM weight and activation dropout as described in [14] and

implemented in Keras [15].

D. DCNN1 Sequence Predictor

In the Dilated Convolutional Neural Network 1 (DCNN1)

(Figure 5), we introduce a simple dilated convolution layer on

the front end of a simple fully connected neural network to

allow for learning of convolutional features at a stride of 2.

Fig. 4. LSTM architecture used for evaluating the recurrent neural prediction network

Fig. 5. DCNN1 Predictor Network Architecture

Fig. 6. DCNN2 Predictor Network Architecture

E. DCNN2 Sequence Predictor

We model our Dilated Convolutional Neural Network

2 (DCNN2) architecture on a vastly simpliﬁed version of

Google’s WaveNet architecture [19] which has demonstrated a

strong ability to learn raw time series representations on acous-

tic voice data. Here we use two levels of dilated convolutions

where each is a residual block [16] containing identical layers

with hyperbolic tangent (TanH) and sigmoid activations merged

multiplicatively, followed by a 1x1 convolutional layer for

dimensionality reduction.
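To make the gated residual structure concrete, the following NumPy sketch runs the forward pass of one such block: a dilated causal convolution feeding TanH and sigmoid branches that are multiplied together, a 1x1 convolution back to the input width, and a skip connection. The kernel size, channel counts, dilation values, and function names are illustrative assumptions and are not taken from the paper or from WaveNet's reference configuration.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """Causal dilated 1-D convolution.

    x: (T, C_in) input sequence; w: (K, C_in, C_out) kernel taps.
    Output sample t depends only on x[t], x[t-d], ..., x[t-(K-1)d].
    """
    k, _, c_out = w.shape
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, x.shape[1])), x], axis=0)
    y = np.zeros((x.shape[0], c_out))
    for i in range(k):
        # Tap i looks back (k-1-i)*dilation samples.
        start = i * dilation
        y += xp[start:start + x.shape[0]] @ w[i]
    return y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_residual_block(x, w_filter, w_gate, w_out, dilation):
    """tanh(filter) * sigmoid(gate), a 1x1 convolution, then a skip connection."""
    gated = (np.tanh(dilated_causal_conv(x, w_filter, dilation))
             * sigmoid(dilated_causal_conv(x, w_gate, dilation)))
    return x + gated @ w_out        # w_out: (C_hidden, C_in) acts as a 1x1 conv

# Usage: two stacked blocks on 32 I/Q samples with random (untrained) weights.
rng = np.random.default_rng(2)
x = rng.standard_normal((32, 2))
w_f = 0.1 * rng.standard_normal((2, 2, 8))
w_g = 0.1 * rng.standard_normal((2, 2, 8))
w_o = 0.1 * rng.standard_normal((8, 2))
h = gated_residual_block(x, w_f, w_g, w_o, dilation=1)
h = gated_residual_block(h, w_f, w_g, w_o, dilation=2)   # weights reused for brevity
```

Increasing the dilation in successive blocks widens the span of past samples each output can see without adding parameters.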

V. MODEL OPTIMIZATION

Before evaluating detection performance, we simply attempt to minimize the mean squared error of the predictor function to select our network parameters $\theta_{NN}$ and our architecture and hyper-parameters for the predictor. From initial

experimentation, optimal network architectures seem to vary

slightly from dataset to dataset, but for now we seek to use a

single set of network parameters for all datasets.

A much more extensive hyper-parameter search is really desired here to find the best-suited network structures. We hope to do this more extensively in future work; within the scope of this work we only try a handful of architectures derived from proven architectures in prior work on similar tasks.

VI. PERFORMANCE EVALUATION

To evaluate performance, we introduce a number of differ-

ent classes of synthetic anomalies into each recorded sampled

RF dataset and measure detection and false alarm rates using

various methods of novelty detection. Our anomaly types

considered here span the range of time-frequency support from

an instantaneous wide-band pulse, to a narrow-band tone at a

single frequency, but are each normalized by total power of

the same time support over the window [ts, te). The anomaly

classes considered are:

• Pulsed Complex Sinusoid: expressed as $n(t) = \exp(j 2\pi t F_c / F_s)$ for $t \in [t_s, t_e)$, where $F_c \sim \mathrm{Uniform}(-F_s/2, F_s/2)$.

• Short-time Broadband Bursts (Sinc pulse): expressed as $n(t) = \mathrm{sinc}(2\pi (t - (t_s + t_e)/2) F_c / F_s)$ for $t \in [t_s, t_e)$.

• Brief Periods of Signal Non-Linear Compression: approximated as $n(t) = 13x(t) - 3x^3(t)$ for $t \in [t_s, t_e)$.

• Pulsed QPSK Signals: where $\mathrm{symrate} \sim \mathrm{Uniform}(F_s/250, F_s/2)$, $F_c \sim \mathrm{Uniform}(-(F_s - \mathrm{symrate}/2)/2, (F_s - \mathrm{symrate}/2)/2)$, and a root-raised cosine pulse shaping filter with $\alpha = 0.3$ and $N = 11$ is applied at the baud rate.

• Pulsed Chirp Events: $n(t) = \exp(j 2\pi t F_c / F_s)$ for $t \in [t_s, t_e)$, where $F_c$ varies linearly in time from $F_{c1} \sim \mathrm{Uniform}(-F_s/2, F_s/2)$ to $F_{c2} \sim \mathrm{Uniform}(-F_s/2, F_s/2)$.
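Three of the anomaly classes listed above can be generated with a few lines of NumPy, as sketched below. Function names are hypothetical and frequencies are normalized ($F_c/F_s$, cycles per sample). Two choices here are assumptions rather than the authors' stated method: the sinc is taken as the unnormalized $\sin(u)/u$, and the chirp phase is obtained by integrating the linearly varying frequency instead of evaluating $t F_c(t)$ literally, so that the instantaneous frequency stays between the two endpoints. The compression and QPSK classes are omitted.

```python
import numpy as np

def tone(n, fc_norm):
    """Complex sinusoid; fc_norm = Fc/Fs in cycles per sample."""
    t = np.arange(n)
    return np.exp(2j * np.pi * fc_norm * t)

def chirp(n, f_start, f_stop):
    """Linear chirp whose instantaneous frequency sweeps f_start -> f_stop."""
    freq = np.linspace(f_start, f_stop, n)
    # Integrate frequency to get phase so the sweep is continuous.
    phase = 2.0 * np.pi * np.cumsum(freq)
    return np.exp(1j * phase)

def sinc_pulse(n, fc_norm):
    """Sinc pulse centred on the midpoint of the window [0, n)."""
    t = np.arange(n) - n / 2.0
    # np.sinc(u) is sin(pi*u)/(pi*u), so sin(a)/a with a = 2*pi*t*fc
    # equals np.sinc(2*t*fc).
    return np.sinc(2.0 * t * fc_norm).astype(complex)

def insert_anomaly(x, anomaly, start):
    """Return a copy of x with `anomaly` added over [start, start + len(anomaly))."""
    y = np.array(x, dtype=complex)
    y[start:start + len(anomaly)] += anomaly
    return y

# Usage: draw a random tone frequency and add a 250-sample event to a
# silent 1000-sample stand-in for a recording.
rng = np.random.default_rng(3)
fc = rng.uniform(-0.5, 0.5)
y = insert_anomaly(np.zeros(1000), tone(250, fc), start=100)
```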

Fig. 7. Example synthetic anomalies on the FM broadband dataset, from left

to right: compression, chirp, tone, qpsk, pulse

We characterize each of these anomalies by its interference-

to-band-power ratio (IBR) in dB. We refer to this in some

instances as signal-to-noise ratio (SNR) for convenience but

the signal of interest here is the anomaly and the "noise" in

this case is the power of the non-anomalous band including

all signals therein.
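Scaling a synthetic anomaly to a target IBR amounts to matching its mean power to the band power over the same time support, offset by the desired number of dB. A minimal sketch, assuming NumPy and hypothetical helper names:

```python
import numpy as np

def scale_to_ibr(anomaly, band, ibr_db):
    """Scale `anomaly` so its mean power is `ibr_db` dB relative to `band`.

    `band` is the anomaly-free signal over the same time support [ts, te).
    """
    p_anomaly = np.mean(np.abs(anomaly) ** 2)
    p_band = np.mean(np.abs(band) ** 2)
    gain = np.sqrt(p_band / p_anomaly * 10.0 ** (ibr_db / 10.0))
    return gain * np.asarray(anomaly)

def measure_ibr_db(anomaly, band):
    """Interference-to-band-power ratio in dB."""
    return 10.0 * np.log10(np.mean(np.abs(anomaly) ** 2)
                           / np.mean(np.abs(band) ** 2))

# Usage: a unit-amplitude tone scaled to -5 dB IBR against synthetic band noise.
rng = np.random.default_rng(4)
band = rng.standard_normal(250) + 1j * rng.standard_normal(250)
tone_event = np.exp(2j * np.pi * 0.1 * np.arange(250))
scaled = scale_to_ibr(tone_event, band, ibr_db=-5.0)
```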

Inspecting ﬁgure 7 we can see that performance does vary

based on the anomaly type. For instance, performance on non-

linear compression, chirp detection, tone detection, and QPSK

burst detection, all appear to be quite a bit stronger in the

LLR detection metric than the wideband pulsed noise which

is very short in time and results in an anomaly spike which is

likewise extremely short in time. We include several runs of

bands below in ﬁgures 8, 9, and 10 for visual inspection.

Fig. 8. LSTM Anomaly Detector on FM Band

Fig. 9. LSTM Anomaly Detector on LTE Band

Fig. 10. LSTM Anomaly Detector on GSM Band

Fig. 11. Probability of Detection and False Alarm for sinusoidal tones on FM Broadcast Band

In figure 11 we show the performance of the tone detector across 100,000 samples of FM recording into which 50 random tone events of length 250 samples are inserted. As the IBR approaches -5 dB, we obtain nearly perfect Pd/Pfa performance. In figure 12, we see that for the wideband pulse, which has very large instantaneous peak power but very narrow time support, the IBR does not have nearly as significant an impact on Pd/Pfa performance at these IBR levels. In this case, our probability of detection represents the probability of detecting all anomalies present in the time range, while our probability of false alarm represents the probability of a false alarm being triggered in any 250-sample window.
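One way to turn a per-sample detection statistic into detection and false alarm rates is to score fixed 250-sample windows, as in the sketch below. This is an assumed reading of the definitions above, not the authors' evaluation code: a window counts as anomalous if it overlaps an inserted event, and as flagged if any score inside it falls below the threshold. The function name and arguments are hypothetical.

```python
import numpy as np

def window_rates(scores, truth, tau, window=250):
    """Per-window detection and false-alarm rates.

    scores: per-sample detection statistic (lower = more anomalous).
    truth:  per-sample boolean mask, True where an anomaly was inserted.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    n = (len(scores) // window) * window          # drop any partial window
    flagged = (scores[:n] < tau).reshape(-1, window).any(axis=1)
    anomalous = truth[:n].reshape(-1, window).any(axis=1)
    pd = flagged[anomalous].mean() if anomalous.any() else float("nan")
    pfa = flagged[~anomalous].mean() if (~anomalous).any() else float("nan")
    return pd, pfa

# Usage: four windows, one true event (detected) and one spurious alarm.
scores = np.zeros(1000)
truth = np.zeros(1000, dtype=bool)
truth[250:500] = True
scores[300] = -10.0      # dip inside the anomalous window
scores[900] = -10.0      # dip inside a clean window
pd, pfa = window_rates(scores, truth, tau=-5.0)
```

Sweeping `tau` and recording `(pd, pfa)` at each value yields the kind of curve plotted against IBR in the figures.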

Fig. 12. Probability of Detection and False Alarm for wide-band pulses on

FM Broadcast Band

We can repeat these experiments across a range of interference-to-band power ratios to observe the efficacy in a range of different modulation and multi-access schemes. Results for this are shown below in figure 13. Here we can see that detection is relatively effective for all channel types once we approach 0-5 dB IBR. The most difficult here is the ISM band, where our predictive model is used to seeing bursty CSMA/CA kinds of traffic from WiFi and frequency-hopped Bluetooth bursts across the band. In this case our anomaly detection ability is the most challenged of all the bands.

Fig. 13. Probability of Detection and False Alarm for Tones in each band type

Fig. 14. Probability of Detection and False Alarm for Chirps in each band

type

Repeating this experiment with chirp interference instead

of pulse interference, we show performance in ﬁgure 14. Again

we see excellent performance above 0-5dB in most cases,

although the ISM band continues to be the most difﬁcult.

To summarize these performance behaviors into a more

concise performance number, we ﬁx a constant false alarm rate

for comparison of detection performance. In ﬁgure 15 we show

how detection performance varies across a range of constant

false alarm rates for the LTE band using the LSTM model.

By repeating this for all models on all band-types, we can then pick a constant false alarm rate to compare performance across our different models. Doing this allows us to compare model performance in different types of emitter and channel environments.

Fig. 15. LTE Detector Constant False Alarm Rate for LSTM Model

Fig. 16. Constant False Alarm Rate Comparison of Prediction Models

Looking at these results, we see that in most cases the

neural network based predictors outperform the Kalman based

predictors slightly. In the case of cellular networks, both

GSM and LTE where a much more regular and structured

temporal pattern on each carrier exists, we see slightly larger

improvements in performance, likely due to having better

learned a temporal predictive model suited to this behavior.

VII. CONCLUSION

In this paper we have shown how the neural network

reconstruction-based anomaly detector can be used on several

real wideband over the air radio bands of interest to detect

anomalies occurring within band. The results have shown

that especially in structured radio signal environments where

temporal sequence model prediction performs best, we obtain

our best performance advantage over Kalman novelty detector

methods.

We believe this is an important result that shows viability

of this form of spectrum change monitoring and provides some

starting points for improvements on more traditional methods

for time series change detection. We have evaluated several

neural predictor models and have shown that both the LSTM

model and potentially the DCNN model are viable at low SNR

levels, while for an analog modulation (FM Broadcast), there

was less difference between the performance of the detectors

with these candidate networks.

In future work we hope to perform much more extensive

architecture and hyper-parameter searches, evaluate longer

runs, larger datasets and additional types of anomalies and

mixtures of anomalies. We would like to evaluate hybrid

architectures such as the LSTM with convolutional features on

the front end, including both the use of dilated convolutional

layers and residual units combining a number of the promising

techniques which have largely been evaluated separately here.

In the area of spectrum sensing for communications system failure, interference, security, or monitoring, we hope that this method points to a promising path toward general learned methods that are not specific to any one signal or band, and that can be applied rapidly to a wide range of systems and deployment models without needing specialized expert prior knowledge of the system of interest.

ACKNOWLEDGMENT

The authors would like to thank the Bradley Department

of Electrical and Computer Engineering at the Virginia Poly-

technic Institute and State University, the Hume Center, and

DARPA all for their generous support in this work.

This research was developed with funding from the De-

fense Advanced Research Projects Agency’s (DARPA) MTO

Ofﬁce under grant HR0011-16-1-0002. The views, opinions,

and/or ﬁndings expressed are those of the author and should

not be interpreted as representing the ofﬁcial views or policies

of the Department of Defense or the U.S. Government.

REFERENCES

[1] V. G. Hansen, “Constant false alarm rate processing

in search radars (receiver output noise control),” Radar-

Present and future, pp. 325–332, 1973.

[2] S. Hochreiter and J. Schmidhuber, “Long short-term

memory,” Neural computation, vol. 9, no. 8, pp. 1735–

1780, 1997.

[3] Y. Zhang and W. Lee, “Intrusion detection in wireless

ad-hoc networks,” in Proceedings of the 6th annual

international conference on Mobile computing and net-

working, ACM, 2000, pp. 275–283.

[4] S. Marsland, “Novelty detection in learning systems,”

Neural computing surveys, vol. 3, no. 2, pp. 157–195,

2003.

[5] E. Blossom, “Gnu radio: tools for exploring the radio

frequency spectrum,” Linux journal, vol. 2004, no. 122,

p. 4, 2004.

[6] M. Ettus, “Usrp users and developers guide,” Ettus

Research LLC, 2005.

[7] J. Mitola III, “Cognitive radio architecture,” in Coopera-

tion in Wireless Networks: Principles and Applications,

Springer, 2006, pp. 243–311.

[8] A. Patcha and J.-M. Park, “An overview of anomaly

detection techniques: existing solutions and latest tech-

nological trends,” Computer networks, vol. 51, no. 12,

pp. 3448–3470, 2007.

[9] S. Rajasegarar, C. Leckie, and M. Palaniswami,

“Anomaly detection in wireless sensor networks,” IEEE

Wireless Communications, vol. 15, no. 4, pp. 34–40,

2008.

[10] A. Graves, M. Liwicki, S. Fernández, R. Bertolami,

H. Bunke, and J. Schmidhuber, “A novel connection-

ist system for unconstrained handwriting recognition,”

IEEE transactions on pattern analysis and machine

intelligence, vol. 31, no. 5, pp. 855–868, 2009.

[11] M. Xie, S. Han, B. Tian, and S. Parvin, “Anomaly

detection in wireless sensor networks: a survey,” Journal

of Network and Computer Applications, vol. 34, no. 4,

pp. 1302–1325, 2011.

[12] V. Vittaldev, R. P. Russell, N. Arora, and D. Gaylor,

“Second-order kalman ﬁlters using multi-complex step

derivatives,” American Astronomical Society, vol. 204,

2012.

[13] M. A. Pimentel, D. A. Clifton, L. Clifton, and L.

Tarassenko, “A review of novelty detection,” Signal

Processing, vol. 99, pp. 215–249, 2014.

[14] W. Zaremba, I. Sutskever, and O. Vinyals, “Recur-

rent neural network regularization,” arXiv preprint

arXiv:1409.2329, 2014.

[15] F. Chollet, Keras, https://github.com/fchollet/keras,

2015.

[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep resid-

ual learning for image recognition,” arXiv preprint

arXiv:1512.03385, 2015.

[17] P. Malhotra, L. Vig, G. Shroff, and P. Agarwal, “Long

short term memory networks for anomaly detection in

time series,” in Proceedings, Presses universitaires de

Louvain, 2015, p. 89.

[18] R. R. Labbe. (2016). Kalman and bayesian ﬁlters in

python, [Online]. Available: https://github.com/rlabbe/Kalman-and-Bayesian-Filters-in-Python/ (visited on

10/28/2016).

[19] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O.

Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and

K. Kavukcuoglu, “Wavenet: a generative model for raw

audio,” arXiv preprint arXiv:1609.03499, 2016.