Linear Feature Extractors Based on Mutual Information
Kurt D. Bollacker and Joydeep Ghosh
Department of Electrical and Computer Engineering,
University of Texas, Austin, Texas 78712, USA
kdb@pine.ece.utexas.edu, ghosh@pine.ece.utexas.edu
Abstract

This paper presents and evaluates two linear feature extractors based on mutual information. These feature extractors consider general dependencies between features and class labels, as opposed to well known linear methods such as PCA, which does not consider class labels, and LDA, which uses only simple low order dependencies. As evidenced by several simulations on high dimensional data sets, the proposed techniques provide superior feature extraction and better dimensionality reduction while having similar computational requirements.
1. Introduction

The capabilities of a classifier are ultimately limited by the quality of the features in each input vector. In particular, when the measurement space is high-dimensional but the number of samples is limited, one is faced with the "curse of dimensionality" problem during training [3]. Feature extraction is often used to alleviate this problem. Although linear feature extractors are ultimately less flexible than the more general non-linear extractors, they have some useful properties that can make them highly desirable. Linear projections tend to be structure preserving and have only small, predictable computational demands. Also, under certain conditions linear transforms preserve all useful information in the original feature set.
One of the most commonly used unsupervised linear feature extractors, the Karhunen-Loeve Expansion, performs principal component analysis (PCA) using covariance between original features as an extraction criterion. A leading supervised linear feature extractor is linear discriminant analysis (LDA), a generalization for $c$ classes of Fisher's linear discriminant [7][11], where the extraction criterion used is class mean separation.

In this paper, we present two supervised linear feature extractors which use mutual information as a feature extraction criterion. The performance of these feature extractors is compared empirically with PCA, LDA, and a mutual information feature selector, using four separate classification problems. The results are summarized along with a discussion of computational complexity.
2. Mutual Information Feature Extraction Criteria

One definition of an optimal mutual information feature extractor $f(\cdot)$ is

$$f(\vec{X}) = \max_{f \in \mathcal{F}} I(f(\vec{X}); Y) \qquad (1)$$

where $\vec{X}$ is the input vector, $Y$ is the output vector, and $\mathcal{F}$ is the space of all considered feature extractors. If $\mathcal{F}$ is the space of all linear transforms, then this equation becomes

$$A\vec{X} = \max_{A \in \mathcal{F}} I(A\vec{X}; Y) \qquad (2)$$
If the matrix $A$ is $n \times n$ and simply rotates or flips the coordinate system without scaling, then the linear transform will not destroy information. More exactly, for an $n$-dimensional input vector, if $A$ is chosen to be a real, $n \times n$, non-singular matrix with $\|A\vec{X}\| = \|\vec{X}\|$ for all $\vec{X}$, then it is true that

$$I(\vec{X}; Y) = I(A\vec{X}; Y) \qquad (3)$$
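The invariance in Equation 3 can be checked numerically. The sketch below is an illustration rather than part of the original implementation; it assumes a synthetic two-class, two-dimensional data set and an equal-width histogram estimate of mutual information, and it compares the estimate of $I(\vec{X}; Y)$ before and after a pure rotation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bin_index(v, n_bins):
    # Equal-width bins over the observed range of v; returns indices 0..n_bins-1.
    edges = np.linspace(v.min(), v.max(), n_bins + 1)[1:-1]
    return np.digitize(v, edges)

def mi_xy(X, y, n_bins=10):
    # Histogram estimate (in bits) of I(X; Y) for 2-D continuous X and discrete y.
    b1, b2 = bin_index(X[:, 0], n_bins), bin_index(X[:, 1], n_bins)
    classes, yc = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, n_bins, classes.size))
    np.add.at(joint, (b1, b2, yc), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=2, keepdims=True)        # p(x1, x2)
    py = joint.sum(axis=(0, 1), keepdims=True)   # p(y)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

# Two Gaussian classes in two dimensions (synthetic, for illustration only).
X = np.vstack([rng.normal(0.0, 1.0, (2000, 2)), rng.normal(1.5, 1.0, (2000, 2))])
y = np.repeat([0, 1], 2000)

# A pure rotation satisfies ||A x|| = ||x|| for all x.
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(mi_xy(X, y), mi_xy(X @ A.T, y))  # the two estimates are approximately equal
```

The equality in Equation 3 is exact for the underlying distributions; the histogram estimates agree only approximately because the bins are recomputed on the rotated data.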
A good feature extractor would be one that allows $m$ ($m < n$) of the extracted features to be chosen while minimizing the information loss. One of several standard non-linear optimization techniques [6] could be used to solve directly for an optimal $m \times n$ matrix $A$ from the distributions of $\vec{X}$ and $Y$. However, the density of $p$ points in an $n$-dimensional input space is proportional to $p^{1/n}$, making the sampling density of $A\vec{X}$ in high dimensional space very low. The resulting poor confidence in a numerical approximation of $I(A\vec{X}; Y)$ suggests that an alternate approach must be used.
Comparison of Feature Extraction Criteria: PCA uses covariance as a feature extraction criterion; this relation to correlation measures only the linear dependence among features. LDA uses a simple first-order statistic, the distance between feature vector means, as its criterion. Mutual information measures a general dependence between class labels and the extracted features [5] and can be used to measure this dependence even when the class labels are unordered. Thus, a mutual information measure can be a more powerful feature extraction criterion.
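As a concrete illustration of this difference (not taken from the paper), consider a class label that depends only on the magnitude of a feature. The linear correlation between the feature and the label is essentially zero, while a histogram estimate of the mutual information is clearly positive. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def mi_scalar(z, y, n_bins=25):
    # Histogram estimate (in bits) of I(Z; Y) for a scalar feature z and discrete y.
    edges = np.linspace(z.min(), z.max(), n_bins + 1)[1:-1]
    zb = np.digitize(z, edges)
    classes, yc = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, classes.size))
    np.add.at(joint, (zb, yc), 1.0)
    joint /= joint.sum()
    pz = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pz @ py)[nz])).sum())

x = rng.normal(size=10000)
y = (np.abs(x) > 1.0).astype(int)   # the class depends on |x|, not linearly on x

print(np.corrcoef(x, y)[0, 1])      # near zero: correlation misses the dependence
print(mi_scalar(x, y))              # clearly positive: mutual information captures it
```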
Previous MI-based Feature Selectors: Battiti has developed a mutual information feature selection (MIFS) algorithm using input feature distributions and the class distribution [1]. In this model, features and the class labels are treated as (sampled) random variables. The $i$th best feature $f_i$ selected in the MIFS algorithm is that which satisfies

$$f_i = \max_{X_i} \left[ I(Y; X_i) - \beta \sum_{j=1}^{i-1} I(X_i; X_j) \right]$$

where $Y$ is the class label random variable, $X_i$ is the $i$th input feature random variable, and $\beta$ is a tunable parameter. This criterion greedily selects the set of features with high mutual information with the class labels while trying to minimize the mutual information among chosen features.
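A minimal sketch of this greedy selection rule follows. It is an illustration rather than Battiti's code, and it assumes the relevance values $I(X_i; Y)$ and redundancy values $I(X_i; X_j)$ have already been estimated, for example with a histogram estimator like the ones sketched above.

```python
import numpy as np

def mifs_select(mi_xy, mi_xx, m, beta=0.5):
    """Greedy MIFS selection in the spirit of Battiti [1].

    mi_xy : shape (n,), estimated I(X_i; Y) for each candidate feature
    mi_xx : shape (n, n), estimated I(X_i; X_j) between candidate features
    m     : number of features to select
    beta  : trade-off between relevance and redundancy
    """
    remaining = list(range(len(mi_xy)))
    selected = []
    for _ in range(m):
        # Score each remaining feature: relevance minus weighted redundancy
        # with respect to the features already chosen.
        scores = [mi_xy[i] - beta * sum(mi_xx[i, j] for j in selected)
                  for i in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```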
Also, Sheinvald, Dom and Niblack developed a fea-
ture selection algorithm based on Minimum Descrip-
tion Length, a criterion related to mutual informa-
tion [10]. However, like MIFS, this is only a feature
selector, and thus is less powerful than a general fea-
ture extractor.
3. Mutual Information Feature Extractors

This section presents two feature extractors which are designed to maximize the mutual information of the extracted features with the output. These feature extractors do not require a mutual information calculation over an $n$-dimensional feature vector, and both satisfy the requirements under which Equation 3 holds.
Maximum Mutual Information Projection Feature Extractor (MMIP): This feature extractor attempts to find successive orthogonal normalized projections of the input vector which maximize mutual information with the output distribution. The first such projection $\vec{a}$ is defined by

$$\vec{a} = \max_{\vec{a}} I(\vec{a}^T \vec{X}; Y) \qquad (4)$$

where $\vec{a}^T \vec{X}$ is the extracted feature which has the highest mutual information with the class label $Y$. The mutual information was numerically approximated using the method described by Battiti [1] with 25 equal-sized intervals. The polytope algorithm [4] was used to find an approximately maximal projection. This projection is then removed from the input vectors, and the maximal mutual information projection of the residuals is found again; this is repeated $m - 1$ times. It should be noted that this feature extraction method suffers from the problem of overlapping mutual information contributed by each feature, which becomes worse as more features are extracted.
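The following sketch illustrates this procedure. It is not the authors' implementation: scipy's Nelder-Mead simplex search stands in for the polytope algorithm, the mutual information of each projection is estimated with 25 equal-width intervals, and the data names in the usage comment are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def mi_proj(a, X, y, n_bins=25):
    # Histogram estimate of I(a^T X; Y), with a normalized to unit length.
    a = a / np.linalg.norm(a)
    z = X @ a
    zb = np.digitize(z, np.linspace(z.min(), z.max(), n_bins + 1)[1:-1])
    classes, yc = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, classes.size))
    np.add.at(joint, (zb, yc), 1.0)
    joint /= joint.sum()
    pz, py = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pz @ py)[nz])).sum())

def mmip(X, y, m, n_bins=25, seed=0):
    # Extract m directions, each maximizing I(a^T X; Y) on the current residuals.
    rng = np.random.default_rng(seed)
    Xr = np.asarray(X, dtype=float).copy()
    directions = []
    for _ in range(m):
        res = minimize(lambda a: -mi_proj(a, Xr, y, n_bins),
                       rng.normal(size=Xr.shape[1]),
                       method="Nelder-Mead")        # simplex ("polytope") search
        a = res.x / np.linalg.norm(res.x)
        directions.append(a)
        Xr = Xr - np.outer(Xr @ a, a)               # remove the found projection
    return np.array(directions)

# Usage (hypothetical data): A = mmip(X_train, y_train, m=3); Z = X_train @ A.T
```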
A Separated Mutual Information Feature Extractor (SMIFE): A heuristic similar in form to PCA attempts to extract features which have high mutual information with the output. In PCA, the eigenvectors of large eigenvalues are uncorrelated projections [2] in which variance is high. In place of the covariance values in the matrix, three-variable mutual information values are used. Three-variable mutual information is defined as

$$I(X_i; X_j; Y) = H(X_i, X_j, Y) - H(X_i) - H(X_j) - H(Y) + I(X_i; Y) + I(X_j; Y) + I(X_i; X_j) \qquad (5)$$
where $X_i$ and $X_j$ are features, $Y$ is the class label, and $H(\cdot)$ is the entropy function. The eigenvectors of this mutual information matrix are found and ordered by decreasing eigenvalue. Following an analogy with PCA, the principal components should be directions of high mutual information with the class label and should minimize common mutual information with the class label. As with PCA, Equation 3 holds for this feature extractor. Two versions, SMIFE1 (above) and SMIFE2, were constructed. For SMIFE2, the terms of Equation 5 are rewritten in the form

$$I(X_i; X_j; Y) = I(X_i; Y) + I(X_j; Y) - I(X_i, X_j; Y) \qquad (6)$$

Instead of finding the eigenvectors of a matrix of $I(X_i; X_j; Y)$ values ordered by decreasing eigenvalue, the eigenvectors of a matrix of $I(X_i, X_j; Y)$ values are found and, due to the negative sign, ordered by increasing eigenvalue.
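A minimal sketch of the SMIFE2 variant is given below. It is an illustration under assumptions: the pairwise quantities $I(X_i, X_j; Y)$ are estimated with an equal-width two-dimensional histogram (the paper used 25 intervals), and the data names in the usage comment are hypothetical.

```python
import numpy as np

def joint_mi_with_y(xi, xj, y, n_bins=25):
    # Histogram estimate of I((X_i, X_j); Y), the quantity SMIFE2 places in its matrix.
    def bins(v):
        return np.digitize(v, np.linspace(v.min(), v.max(), n_bins + 1)[1:-1])
    classes, yc = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, n_bins, classes.size))
    np.add.at(joint, (bins(xi), bins(xj), yc), 1.0)
    joint /= joint.sum()
    pij = joint.sum(axis=2, keepdims=True)
    py = joint.sum(axis=(0, 1), keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pij * py)[nz])).sum())

def smife2(X, y, n_bins=25):
    # Build the matrix of I((X_i, X_j); Y) values, then order its eigenvectors
    # by increasing eigenvalue (np.linalg.eigh already returns ascending order).
    n = X.shape[1]
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            M[i, j] = M[j, i] = joint_mi_with_y(X[:, i], X[:, j], y, n_bins)
    _, vecs = np.linalg.eigh(M)
    return vecs.T            # row k is the k-th extracted direction

# Usage (hypothetical data): W = smife2(X_train, y_train); Z = X_train @ W[:m].T
```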
4. Experiments

Methods: Four different classification problem data sets from the UC Irvine database were used for experimentation with the feature extractors (Table 1). The chosen data sets were: letter recognition (LR), segmented image classification (SEG), satellite image classification (SAT), and vehicle silhouette classification (VEH).
Figure 1. Test rate versus number of extracted features for the four data sets (VEH, SAT, LR, SEG), comparing SMIFE1, MMIP, SMIFE2, PCA, LDA, and MIFS with β = 0.0, 0.5, and 1.0. (LDA is not included for the SEG data set because the between-class scatter matrix was almost singular.)
The examples were randomly divided into evenly sized training and test data sets. A fully connected backpropagation multilayer perceptron (MLP) network was trained for each data set, and an attempt was made to refine the architectural and learning parameters to optimize the test performance on the original feature set. Several feature extractors were used on the four different data sets. PCA, LDA, and three variations of Battiti's feature selector (β = 0.0, β = 0.5, β = 1.0), as well as the MMIP, SMIFE1, and SMIFE2 feature extractors, were compared. The LDA considered was Okada and Tomita's [8] extension, which does not share traditional LDA's limit on the number of features extracted. For each feature extractor and data set with n original features, n MLPs were trained with the m best extracted features, where m = 1, ..., n. The classification test rate was the measure of performance used. For three of the data sets, test results from 10 runs of each MLP were averaged to compensate for performance variance, while 50 runs were used for the VEH data set due to its higher variance.

Table 1. The data sets used for feature extractor comparison

Data Set   # Features   # Classes   # Examples
LR         16           26          20000
SEG        18           7           2310
SAT        36           6           6435
VEH        18           4           846

Table 3. The number of features (#f) required for best classification performance and the associated test rates

                    LR              SEG             SAT             VEH
Feature Extractor   #f  Test Rate   #f  Test Rate   #f  Test Rate   #f  Test Rate
MMIP                16  0.7820      18  0.9110       3  0.8036       8  0.5130
SMIFE1              15  0.7940      14  0.9193      34  0.7923      12  0.5793
SMIFE2              15  0.8026      12  0.9182       7  0.8088      12  0.5977
PCA                 16  0.7804      11  0.9124      17  0.7938      17  0.4904
LDA                 16  0.7814      N/A N/A         12  0.8002      12  0.5917
MIFS β=0.0          16  0.7832      16  0.9125      25  0.7979      12  0.4966
MIFS β=0.5          14  0.7310      17  0.9096      29  0.7942      13  0.5111
MIFS β=1.0          15  0.7828      18  0.9108      34  0.7921      13  0.5111
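The evaluation protocol can be sketched as follows. This is an illustration only: the paper does not specify the MLP architectures, so scikit-learn's MLPClassifier with placeholder hyperparameters stands in for the tuned backpropagation networks, and the number of runs would be raised to 50 for the VEH data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def test_rate_curve(Z, y, n_runs=10, seed=0):
    # Z holds the extracted features ordered best-first; for each m the first m
    # columns are used, an MLP is trained on a 50/50 split, and the test
    # classification rate is averaged over n_runs runs.
    n = Z.shape[1]
    rates = np.zeros(n)
    for m in range(1, n + 1):
        scores = []
        for r in range(n_runs):
            Ztr, Zte, ytr, yte = train_test_split(
                Z[:, :m], y, test_size=0.5, random_state=seed + r)
            mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                random_state=seed + r)
            scores.append(mlp.fit(Ztr, ytr).score(Zte, yte))
        rates[m - 1] = np.mean(scores)
    return rates
```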
Table 2. The best feature extractor for each data set using the m best extracted features

 m      LR       SEG       SAT       VEH
 1      MMIP     MIFS1.0   SMIFE2    MMIP
 2      MMIP     MIFS0.0   SMIFE2    LDA
 3      MMIP     MMIP      MMIP      LDA
 4      MMIP     SMIFE2    MMIP      SMIFE2
 5      MMIP     SMIFE1    SMIFE2    LDA
 6      MMIP     SMIFE2    SMIFE2    LDA
 7-8    MMIP     SMIFE2    SMIFE2    SMIFE2
 9      SMIFE2   SMIFE2    SMIFE2    SMIFE2
 10     SMIFE2   SMIFE1    SMIFE2    SMIFE2
 11     SMIFE2   SMIFE2    SMIFE2    LDA
 12-13  SMIFE2   SMIFE2    SMIFE2    SMIFE2
 14     SMIFE2   SMIFE1    SMIFE2    SMIFE2
 15     SMIFE2   SMIFE2    SMIFE2    SMIFE2
 16     SMIFE2   SMIFE1    SMIFE2    SMIFE2
 17     -        SMIFE1    SMIFE2    SMIFE2
 18     -        SMIFE2    SMIFE2    SMIFE2
 19-35  -        -         SMIFE2    -
 36     -        -         MIFS0.5   -
Results: Two measures of test performance were made from the results shown in Figure 1. First, for each quantity of extracted inputs, the best performing feature extractor was listed; this is compiled in Table 2. Second, the maximum performance and the number of features required for that performance were measured for each feature extractor, as shown in Table 3.

Table 4. Computational complexity of each feature extractor (n_s = number of samples, n = original feature dimensionality)

Feature Extractor   Computational Complexity
SMIFE1              O(n_s n^2)
SMIFE2              O(n_s n^2)
MMIP                O(n_s n × avg. # iterations)
PCA                 O(n_s n^2)
LDA                 MAX(O(n^4), O(n_s n^2))
MIFS                O(n_s n^2)
Discussion: Table 2 shows the best performing feature extractor as a function of the number m of best extracted features used. MMIP does best for low numbers of features in the LR data set, while several feature extractors are close in the SAT data set. MIFS does well in the SEG data set and LDA does well in the VEH data set. That MMIP performs best only with a small number of features suggests that, as more features are used, less new information is added for classification because of the information overlap between features. Because the within-class matrix is almost singular for the SEG data set, LDA cannot be used there. However, LDA performs very well on the VEH data set, being the best performer for small numbers of features. When enough extracted features are included, SMIFE2 performs best on all of the data sets except SEG, where SMIFE1 does about as well, and VEH, where LDA is a very close performer. The SMIFEs' performance gives evidence that an information separation occurs which allows for better performance when more extracted features are added, despite the increased dimensionality.

Table 3 shows the best performance achieved by each feature extractor and how many features were needed to reach that performance. In the LR, SEG, and SAT data sets, the best performance is very close for all of the feature extractors. However, in the VEH data set both SMIFEs and LDA enjoy a significant performance advantage over the other extractors, and SMIFE2 outperforms LDA slightly.
Computational Complexity of Feature Extraction: One calculation of covariance for PCA has a complexity of O(n_s), where n_s is the number of samples. PCA requires O(n^2) such calculations plus the eigenvector and eigenvalue calculations for an n × n matrix, typically O(n^3) [9]. Okada and Tomita's LDA requires O(n_s n^2) calculations to generate the scatter matrices and O(n^4) calculations to find the n optimal projections. The numerical calculation of mutual information for the MIFS and MMIP feature extractors has complexity MAX(O(n_i n_c), O(n_s)), where n_c is the number of classes and n_i is the number of intervals, chosen to be a constant 25. Dropping n_i, and given that n_c is almost certainly always much less than n_s, a complexity of O(n_s) is left. MIFS requires O(n^2) such calculations, but the iterative nature of MMIP varies the number of required mutual information calculations; in the four data sets, somewhere between 200 and 700 iterations per extracted feature was typical. Both the calculation of three-variable mutual information for SMIFE1 and of the joint mutual information for SMIFE2 require MAX(O(n_i^2 n_c), O(n_s)), but since n_i is a constant, the complexity is still O(n_s). A summary can be seen in Table 4. MIFS, PCA, and SMIFE all have the same order of complexity, but LDA has potentially the highest, being the maximum of O(n^4) and O(n_s n^2), while MMIP's complexity varies with the data set. However, it should be kept in mind that a better optimization algorithm would likely improve upon the inefficient polytope. Even with polytope, the MMIP algorithm required (as an example) only 2.2 minutes on a CPU capable of 100 SPECfp92 to find all 18 SEG data set features.
5. Conclusions

Mutual information as a feature extraction criterion was used to guide the design of two linear feature extractors. These feature extractors performed better than PCA and Battiti's MIFS feature selector while having the same order of computational complexity. The performance of SMIFE was better than or about equal to LDA while having lower computational complexity. Since PCA is an optimal variance extractor, this provides empirical evidence that mutual information is a better linear feature extraction criterion. LDA has a higher computational complexity and can fail on some data sets. Also, the general feature extractors SMIFE and MMIP performed better than the mutual information based feature selector MIFS. Future work should include a more rigorous argument for the usefulness of the SMIFE feature extractors and a proof of optimality, as well as comparison with other linear feature extractors.
Acknowledgements: This research is supported in part by AFOSR grant F49620-93-1-0307 and ARO contracts DAAH 04-94-G0417 and 04-95-10494. K. Bollacker is an NSF Graduate Fellow and is also supported by the Cockrell Foundation. We also thank Bryan Stiles and Viswanath Ramamurti for their help.
References

[1] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5:537-550, July 1994.
[2] P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, Englewood Cliffs, New Jersey, 1982.
[3] J. H. Friedman. An overview of predictive learning and function approximation. In J. Friedman and W. Stuetzle, editors, From Statistics to Neural Networks, Proc. NATO/ASI Workshop. Springer Verlag, 1994.
[4] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Harcourt Brace and Company, London, 1981.
[5] W. Li. Mutual information functions versus correlation functions. Journal of Statistical Physics, 60:823-837, 5/6 1990.
[6] D. G. Luenberger. Linear and Nonlinear Programming. Addison Wesley, Massachusetts, 1984.
[7] J. Mao and A. Jain. Artificial neural networks for feature extraction and multivariate data projection. IEEE Transactions on Neural Networks, 6:296-317, 2 1995.
[8] T. Okada and S. Tomita. An optimal orthonormal system for discriminant analysis. Pattern Recognition, 18:139-144, 2 1985.
[9] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, editors. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, 1992.
[10] J. Sheinvald, B. Dom, and W. Niblack. A modeling approach to feature selection. In 10th International Conference on Pattern Recognition, pages 535-539, June 1990.
[11] W. Siedlecki, K. Siedlecka, and J. Sklansky. An overview of mapping techniques for exploratory pattern analysis. Pattern Recognition, 21:411-429, 5 1988.