
The large number of spectral variables in most data sets encountered in spectral chemometrics often makes the prediction of a dependent variable difficult. The number of variables can fortunately be reduced, by using either projection techniques or selection methods; the latter allow for the interpretation of the selected variables. Since the optimal approach of testing all possible subsets of variables with the prediction model is intractable, an incremental selection approach using a nonparametric statistic is a good option, as it avoids the computationally intensive use of the model itself. It has two drawbacks, however: the number of groups of variables to test is still huge, and collinearities can make the results unstable. To overcome these limitations, this paper presents a method to select groups of spectral variables. It consists of a forward–backward procedure applied to the coefficients of a B-spline representation of the spectra. The criterion used in the forward–backward procedure is the mutual information, which makes it possible to detect nonlinear dependencies between variables, unlike the commonly used correlation. The spline representation ensures interpretability of the results, as groups of consecutive spectral variables are selected. The experiments conducted on NIR spectra from fescue grass and diesel fuels show that the method provides clearly identified groups of selected variables, making interpretation easy, while keeping a low computational load. The prediction performances obtained using the selected coefficients are higher than those obtained by the same method applied directly to the original variables, and similar to those obtained using traditional models, while using significantly fewer spectral variables.
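The pipeline described above can be sketched in a simplified form: spectra are projected on a B-spline basis by least squares, and the resulting coefficients are then ranked by their mutual information with the dependent variable. This is only an illustrative sketch, not the authors' implementation: it uses a plain histogram MI estimator and univariate ranking instead of the full forward–backward search, and all data are synthetic.

```python
import numpy as np

def bspline_design(x, n_basis, degree=3):
    """Cox-de Boor recursion: design matrix of a clamped B-spline basis on [x.min(), x.max()]."""
    lo, hi = x.min(), x.max()
    inner = np.linspace(lo, hi, n_basis - degree + 1)
    t = np.concatenate([[lo] * degree, inner, [hi] * degree])   # clamped knot vector
    m = len(t) - 1
    B = np.zeros((len(x), m))
    for j in range(m):                        # degree-0 indicator functions
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == hi, m - degree - 1] = 1.0          # close the last interval at the right endpoint
    for d in range(1, degree + 1):            # raise the degree step by step
        Bn = np.zeros((len(x), m - d))
        for j in range(m - d):
            if t[j + d] > t[j]:
                Bn[:, j] += (x - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                Bn[:, j] += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
        B = Bn
    return B                                   # shape (len(x), n_basis)

def hist_mi(a, b, bins=8):
    """Histogram estimate of the mutual information between two scalar samples
    (a simplifying assumption; more refined MI estimators exist)."""
    p, _, _ = np.histogram2d(a, b, bins=bins)
    p /= p.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa * pb)[nz])).sum())

# synthetic demonstration: y depends nonlinearly on a bump near "wavelength" 0.3
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
amp = rng.uniform(0.0, 1.0, 100)                      # hidden concentrations
spectra = (0.5 * np.exp(-((x - 0.7) / 0.2) ** 2)      # fixed background peak
           + amp[:, None] * np.exp(-((x - 0.3) / 0.05) ** 2)
           + 0.01 * rng.standard_normal((100, 200)))
y = amp ** 2                                          # nonlinear dependent variable

B = bspline_design(x, n_basis=12)                     # 200 variables -> 12 coefficients
coefs, *_ = np.linalg.lstsq(B, spectra.T, rcond=None)
mi = np.array([hist_mi(c, y) for c in coefs])         # one MI score per coefficient
ranking = np.argsort(mi)[::-1]                        # coefficients over the 0.3 bump rank first
```

Because each B-spline has compact support, the top-ranked coefficients point directly at a contiguous wavelength band, which is what makes the selection interpretable.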


... There are several possible ways to determine the degree of smoothing and thus the number of data functions (splines). One can use a leave-one-out procedure that considers the approximation error for a wavelength variable left out of the functional fitting procedure [21]. Other possible approaches consist of applying a scree-plot procedure, similar to the way the number of latent variables is selected for PLS, or constructing a B-spline basis from expert knowledge of the underlying absorption peaks. ...

... It should, however, be noted that the B-spline basis functions take advantage of the correlation between neighbouring variables, which is totally ignored in the other projection methods (PCR and PLS). The fact that these basis functions only run over a limited wavelength interval also creates interesting possibilities with respect to variable selection and interpretability [21]. This functional regression procedure will now be illustrated on two examples: the prediction of dry matter content from NIR spectra of hog manure (motivating example) [19] and the prediction of the cetane number of Diesel [22]. ...

... The choice of the number of B-splines based on the cross-validation error might therefore be suboptimal for spectra with high curvature such as the Diesel spectra. In this case variable selection techniques such as a forward-backward procedure [21] or a further dimensionality reduction by performing PCR or PLS on the B-spline regression coefficients [23] might be good options which have to be further investigated. The selection of the degree of smoothness and the amount of dimensionality reduction could then be optimized independently by using a cross-model validation scheme. ...

In spectroscopy the measured spectra are typically plotted as a function of the wavelength (or wavenumber), but analysed with multivariate data analysis techniques (multiple linear regression (MLR), principal components regression (PCR), partial least squares (PLS)) which consider the spectrum as a set of m different variables. From a physical point of view it could be more informative to describe the spectrum as a function rather than as a set of points, thereby taking into account the physical background of the spectrum, which is a sum of absorption peaks for the different chemical components, so that the absorbances at two wavelengths close to each other are highly correlated. In the first part of this contribution, a motivating example for this functional approach is given. The potential of functional data analysis is then discussed in the field of chemometrics and compared to the ubiquitous PLS regression technique using two practical data sets. It is shown that for spectral data, B-splines provide an appealing basis to describe the data accurately. By applying both functional data analysis and PLS to the data sets, the predictive ability of functional data analysis is found to be comparable to that of PLS. Moreover, many chemometric datasets have a specific structure (e.g. replicate measurements on the same object, or objects that are grouped), but the structure is often removed before analysis (e.g. by averaging the replicates). In the second part of this contribution, we suggest a method to adapt traditional analysis of variance (ANOVA) methods to datasets with spectroscopic data. In particular, the possibilities to explore and interpret sources of variation, such as variations in sample and ambient temperature, are examined. Copyright © 2008 John Wiley & Sons, Ltd.
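The functional idea can be illustrated with a minimal numpy sketch (synthetic data, not the paper's hog-manure or Diesel sets): each spectrum is replaced by its coefficients on a small basis of degree-1 B-splines (triangular "hat" functions), and a linear model is fitted on those few coefficients instead of the m original variables.

```python
import numpy as np

def hat_design(x, n_basis):
    """Design matrix of degree-1 B-splines ('hat' functions) on uniform knots."""
    centers = np.linspace(x.min(), x.max(), n_basis)
    h = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs((x[:, None] - centers[None, :]) / h))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)                        # 100 original spectral variables
a = rng.uniform(0.5, 1.5, (200, 2))                   # two hidden component amounts
spectra = (a[:, :1] * np.exp(-((x - 0.25) / 0.1) ** 2)
           + a[:, 1:] * np.exp(-((x - 0.65) / 0.1) ** 2)
           + 0.001 * rng.standard_normal((200, 100)))
y = 2.0 * a[:, 0] + a[:, 1]                           # property to predict

H = hat_design(x, n_basis=10)                         # 100 variables -> 10 coefficients
coefs = np.linalg.lstsq(H, spectra.T, rcond=None)[0].T

# ordinary least squares on the 10 functional coefficients (intercept included)
Z = np.column_stack([np.ones(len(y)), coefs])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
r2 = 1.0 - np.sum((y - Z @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

With these smooth synthetic peaks the ten coefficients carry essentially all the predictive information of the hundred correlated variables; cubic B-splines behave similarly but require the full Cox-de Boor recursion.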

... The tecator data set is well-known in the field of chemometrics [16]. The input consists of a set of continuous spectra discretized on the wavelength interval 950..1050 nm, the dimension of the discretized space being 100. ...

... In Table 13 we have trained models with these inputs using cross-validation. Compared to the results in [16], the results of the LS-SVM are worse. This may be due to the grid search method used in the optimization phase, which may find suboptimal solutions. ...

... This may be due to the grid search method used in the optimization phase, which may find suboptimal solutions. Also, in [16], a spline compression method is used to reduce the dimensionality of the input space prior to input selection. However, especially the result of the MLP reveals that the selected input variables indeed contain relevant information on the fat content, as the prediction accuracy is rather good. ...

The problem of residual variance estimation consists of estimating the best possible generalization error obtainable by any model based on a finite sample of data. Even though it is a natural generalization of linear correlation, residual variance estimation in its general form has attracted relatively little attention in machine learning. In this paper, we examine four different residual variance estimators and analyze their properties both theoretically and experimentally to understand better their applicability in machine learning problems. The theoretical treatment differs from previous work by being based on a general formulation of the problem covering also heteroscedastic noise, in contrast to previous work, which concentrates on homoscedastic and additive noise. In the second part of the paper, we demonstrate practical applications in input and model structure selection. The experimental results show that using residual variance estimators in these tasks gives good results, often with a reduced computational complexity, while the nearest neighbor estimators are simple and easy to implement.
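The simplest member of the nearest-neighbour family referred to above is often called the delta test: the noise variance is estimated from the squared output differences of input-space nearest neighbours, without fitting any model. A small numpy sketch on synthetic data (brute-force distances, so O(N²)):

```python
import numpy as np

def delta_test(X, y):
    """1-NN residual variance estimate: mean((y_i - y_nn(i))^2) / 2 approaches the
    noise variance as the sample grows (smooth regression function assumed)."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(D, np.inf)          # exclude each point from its own neighbourhood
    nn = D.argmin(axis=1)                # index of each point's nearest neighbour
    return 0.5 * np.mean((y - y[nn]) ** 2)

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, (2000, 1))
noise_sd = 0.2
y = np.sin(5.0 * X[:, 0]) + noise_sd * rng.standard_normal(2000)
est = delta_test(X, y)                   # should approach noise_sd**2 = 0.04
```

No regression model was trained, yet `est` bounds the error any model could hope to reach on this data, which is exactly what makes such estimators useful for input selection.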

... The large initial number of features and the high degree of collinearity render the feature selection procedures slow and often unstable [1]. Highly correlated features furthermore make interpretation difficult. ...

... Another approach for grouping features is to describe the spectra in a functional basis whose basis functions are 'local' in the sense that they correspond to well-defined portions of the spectra. Splines have been used successfully for this [1]. However, while the clusters always contain consecutive features, they are all forced to have the same size in [1]. ...

... Splines have been used successfully for this [1]. However, while the clusters always contain consecutive features, they are all forced to have the same size in [1]. Also, the contribution of each original feature to the cluster depends on its position on the wavelength range; while interpretation is possible, it is based on an approximate view of the functional features. ...

Spectral data often have a large number of highly correlated features, making feature selection both necessary and difficult. A methodology combining hierarchical constrained clustering of spectral variables and selection of clusters by mutual information is proposed. The clustering reduces the number of features to be selected by grouping similar and consecutive spectral variables together, which also allows an easy interpretation. The approach is applied to two datasets of spectroscopy data from the food industry.
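A contiguity-constrained bottom-up clustering can be sketched as follows: only adjacent groups of wavelengths may be merged, so every cluster is by construction an interval of the spectrum. The correlation-based merging criterion and the synthetic three-band data below are illustrative assumptions, not the published procedure.

```python
import numpy as np

def constrained_clustering(X, k):
    """Agglomerative clustering of the columns of X into k groups of
    consecutive variables, merging the most correlated adjacent pair first."""
    clusters = [[j] for j in range(X.shape[1])]
    def merge_cost(c1, c2):
        m1, m2 = X[:, c1].mean(axis=1), X[:, c2].mean(axis=1)
        return 1.0 - abs(np.corrcoef(m1, m2)[0, 1])
    while len(clusters) > k:
        costs = [merge_cost(clusters[i], clusters[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(costs))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]   # merge adjacent pair
    return clusters

rng = np.random.default_rng(3)
factors = rng.standard_normal((200, 3))          # one latent factor per spectral band
X = np.repeat(factors, 10, axis=1) + 0.05 * rng.standard_normal((200, 30))
bands = constrained_clustering(X, k=3)           # recovers the three latent bands
```

Each recovered cluster is a contiguous band of 10 variables, so a subsequent MI-based selection operates on a handful of interpretable intervals instead of 30 correlated features.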

... For example, "feature-level sensor fusion" involves the extraction and integration of high level features from raw sensor data [1] and hence feature grouping allows the estimation of the importance of different sensors. Feature grouping is also useful for wavelength selection in spectral chemometric applications [2], [3]. In such applications, grouping of contiguous wavelengths into bands and selection of a small number of bands provide information about the regions of the spectrum which are sensitive to the target being monitored, whereas selection of individual wavelengths could result in the selected wavelengths being spread throughout the spectrum, thus losing interpretability. ...

... The second data set considered is selected from "The software shootout" [3] and is publicly available at http://kerouac.pharm.uky.edu/. It consists of scans and chemistry gathered from fescue grass (Festuca elatior). ...

... PLSR denotes Partial Least Squares Regression which is the most widely used method for spectral chemometric applications [4] but makes use of the entire spectrum. "B-splines+MI+RBFN" denotes the method used in [3] which makes use of mutual-information-based forward selection on the coefficients of a B-spline representation of the spectrum to perform bandwidth selection and then uses a Radial Basis Function Network. ...

In many signal processing applications, grouping of features during model development and the selection of a small number of relevant groups can be useful to improve the interpretability of the learned parameters. While much work based on linear models has been reported to solve this problem, in the last few years, multiple kernel learning has come up as a candidate to solve this problem in nonlinear models. Since all of the multiple kernel learning algorithms to date use convex primal problem formulations, the kernel weights selected by these algorithms are not strictly the sparsest possible solution. The main reason for using a convex primal formulation is that efficient implementations of kernel-based methods invariably rely on solving the dual problem. This work proposes the use of an additional log-based concave penalty term in the primal problem to induce sparsity in terms of groups of parameters. A generalized iterative learning algorithm, which can be used with a linear combination of this concave penalty term with other penalty terms, is given for model parameter estimation in the primal space. It is then shown that a natural extension of the method to nonlinear models using the "kernel trick" results in a new algorithm, called Sparse Multiple Kernel Learning (SMKL), which generalizes group-feature selection to kernel selection. SMKL is capable of exploiting existing efficient single kernel algorithms while providing a sparser solution in terms of the number of kernels used as compared to the existing multiple kernel learning framework. A number of signal processing examples based on the use of mass spectra for cancer detection, hyperspectral imagery for land cover classification, and NIR spectra from wheat, fescue grass, and diesel are given to highlight the ability of SMKL to achieve a very high accuracy with very few kernels.

... They stated that B-Spline estimated MI outperforms all the other known algorithms for gene expression analysed. Rossi et al. [33] stated that B-Spline estimated MI reduces feature selection. It is a good choice as it is non-parametric and model-independent. ...

... Thus, this method is called B-Spline Mutual Information ICA (BMICA). Being an ICA method BMICA will not only decorrelate signals but also reduce higher-order statistical dependencies [33]. The method will overcome (i) estimating joint densities dependent on samples that grow exponentially to provide accurate estimations and (ii) the choice-of-origin problem by smoothing the effect of transition of data points between bins due to shifts in origin. ...

Mutual Information is one of the most natural criteria when developing independent component analysis (ICA). Although utilized to some extent, it has always been difficult to calculate. We present a new algorithm which utilizes a contrast function related to Mutual Information based on B-spline functions. We compared this algorithm with benchmark ICA algorithms such as FastICA, Infomax and JADE and found its performance to compare favourably with theirs.
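The core of such a contrast function is an MI estimate obtained by "soft" binning: each sample is spread over two adjacent bins with piecewise-linear (order-2) B-spline weights, which smooths out the choice-of-origin sensitivity of a hard histogram. A minimal numpy sketch (BMICA-style methods generalize this to higher spline orders):

```python
import numpy as np

def spline_bin_weights(v, n_bins):
    """Order-2 (piecewise linear) B-spline membership of each sample to n_bins bins."""
    u = (v - v.min()) / (v.max() - v.min() + 1e-12) * (n_bins - 1)
    left = np.clip(np.floor(u).astype(int), 0, n_bins - 2)
    frac = u - left
    W = np.zeros((len(v), n_bins))
    W[np.arange(len(v)), left] = 1.0 - frac
    W[np.arange(len(v)), left + 1] = frac        # each row sums to one
    return W

def bspline_mi(a, b, n_bins=8):
    """Mutual information (in nats) from soft joint bin counts."""
    Wa, Wb = spline_bin_weights(a, n_bins), spline_bin_weights(b, n_bins)
    pj = Wa.T @ Wb / len(a)                      # soft joint histogram, sums to 1
    pa, pb = pj.sum(axis=1), pj.sum(axis=0)
    nz = pj > 1e-12
    return float((pj[nz] * np.log(pj[nz] / np.outer(pa, pb)[nz])).sum())

rng = np.random.default_rng(4)
s = rng.uniform(-1.0, 1.0, 5000)
mi_dep = bspline_mi(s, s + 0.05 * rng.standard_normal(5000))   # strongly dependent pair
mi_ind = bspline_mi(s, rng.uniform(-1.0, 1.0, 5000))           # independent pair
```

Because the soft counts vary continuously with the data, small shifts of the samples no longer flip whole counts between bins, which is the practical advantage over a plain histogram.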

... Besides feature selection, there is another set of methods, known as projection methods, that perform the same task but in practice can retain the problems suffered by high-dimensional data, as discussed in Rossi et al. (2007). Typical projection algorithms are Principal Component Analysis (PCA), Sammon's Mapping, Kohonen maps, Linear Discriminant Analysis (LDA), Partial Least Squares (PLS) or Projection Pursuit, amongst others (see Duda et al. (2009)). ...

... There are different alternatives in relevance criteria, such as the Pearson correlation coefficient, mutual information (MI) and the wrapper methodology. Although each method has its advantages and disadvantages, mutual information has proven to be an appropriate measure in several applications such as selection of spectral variables, spectrometric nonlinear modelling and functional data classification, see Verdejo et al. (2009); Rossi et al. (2007). Moreover, as discussed in Cover & Thomas (1991), correlation does not capture nonlinear relations among features, and the wrapper approach presents a high computational load. ...

... In unsupervised context, the best basis is obtained by minimizing the entropy of the features (i.e., of the coordinates of the functions on the basis) in order to enable compression by discarding the less important features. Following [12], [14] proposes a different approach, based on B-splines: a leave-one-out version of Equation (1) is used to select the best B-spline basis. While the orthonormal basis induced by the B-splines does not correspond to compactly supported functions, the dependency between a new feature and the original ones is still localized enough to allow easy interpretation. ...

... For instance, it will never select an interval with only one point, whereas this could be the case for the standard solution. As a consequence, the standard solution will likely produce bases with rather bad leave-one-out performances and tend to select too small a number of segments (see Section 4 for an example of this behavior). As Figure 1 shows, the function approximation problem is interesting because the smoothness of the spectrum varies along the spectral range, and an optimal basis will obviously not consist of functions with supports of equal size. Figure 2 shows an example of the best basis obtained by the proposed approach for k = 16 clusters, while Figure 3 gives the suboptimal solution obtained by a basis with equal-length intervals (as used in [14]). The uniform-length approach is clearly unable to pick up details such as the peak on the right of the spectra. ...

Functional data analysis involves data described by regular functions rather than by a finite number of real-valued variables. While some robust data analysis methods can be applied directly to the very high dimensional vectors obtained from a fine grid sampling of functional data, all methods benefit from a prior simplification of the functions that reduces the redundancy induced by the regularity. In this paper we propose to use a clustering approach that targets variables rather than individuals to design a piecewise constant representation of a set of functions. The contiguity constraint induced by the functional nature of the variables allows a polynomial complexity algorithm to give the optimal solution.
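An optimal piecewise constant representation of this kind can be computed exactly by dynamic programming over segment boundaries, in O(K d²) segment-cost evaluations for d variables and K segments. A compact sketch under the usual within-segment sum-of-squares cost (the exact cost used in the paper may differ):

```python
import numpy as np

def optimal_segments(X, k):
    """Split the d columns of X into k contiguous segments minimising the total
    squared error of a piecewise constant (per row) approximation.
    Returns the start index of each segment."""
    n, d = X.shape
    S1 = np.hstack([np.zeros((n, 1)), np.cumsum(X, axis=1)])       # prefix sums
    S2 = np.hstack([np.zeros((n, 1)), np.cumsum(X ** 2, axis=1)])
    def cost(i, j):                              # columns i..j inclusive
        w = j - i + 1
        s1 = S1[:, j + 1] - S1[:, i]
        s2 = S2[:, j + 1] - S2[:, i]
        return float(np.sum(s2 - s1 ** 2 / w))
    E = np.full((k + 1, d), np.inf)              # E[s, j]: best cost of 0..j in s segments
    back = np.zeros((k + 1, d), dtype=int)
    for j in range(d):
        E[1, j] = cost(0, j)
    for s in range(2, k + 1):
        for j in range(s - 1, d):
            for i in range(s - 2, j):
                v = E[s - 1, i] + cost(i + 1, j)
                if v < E[s, j]:
                    E[s, j], back[s, j] = v, i
    starts, j = [], d - 1                        # backtrack the boundaries
    for s in range(k, 1, -1):
        i = back[s, j]
        starts.append(i + 1)
        j = i
    return [0] + starts[::-1]

rng = np.random.default_rng(5)
levels = np.tile(np.array([0.0, 1.0, 5.0]), (50, 1))    # three flat bands per curve
X = np.repeat(levels, 10, axis=1) + 0.01 * rng.standard_normal((50, 30))
starts = optimal_segments(X, 3)                          # recovers the band boundaries
```

Unlike the greedy agglomerative variant, the dynamic program is guaranteed to return the globally optimal segmentation, which is the point the abstract makes about the contiguity constraint.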

... The underlying rationale is to no longer consider thousands of discrete individual absorbance values to build a predictive model, a spectral fingerprint, but rather a network of patterns that locally describe, or approach, the spectral curvatures at different spectral windows. As quoted previously [13], the ordering of the variables has significance. The analogy with pictures captured at different resolutions makes sense to understand how the B-spline approach could bring different but complementary information [14,10]. ...

MIR spectroscopy is becoming an increasingly important tool for diagnostic purposes, especially through the study of body fluids. Indeed, diseases induce changes in the composition of fluids that modify the MIR spectra. However, such changes can be difficult to capture if the structure of the data is not considered. Our objective was to improve MIR spectra analysis by approximating the spectra with B-splines at different specific resolutions and to combine these spectral representations with a machine learning model to predict hepatic steatosis from the study of serum. The different resolutions make it possible to identify changes in shape over bands of various widths. The multiresolution model helps to improve the hepatic steatosis prediction compared to conventional approaches where the absorbances are considered as unstructured variables. In addition, B-splines provide both localized and compressed information that can be translated into biochemical terms more easily than with other classical approximation methods (wavelets, Fourier transforms).

... Finally, the correlation coefficient may not be appropriate to measure the predictive power of variables when the data distribution is not Gaussian or the model to be constructed is nonlinear. In this case, using the more elaborate mutual information index, 15,16 which evaluates the dependency between two variables on the basis of their joint probability distribution, may be preferable. ...

... In our propositions, each of the aforementioned selection approaches is integrated with the LDA, KNN and PNN classification techniques. Previous studies [30,31] reported better results using nonlinear algorithms when predicting the cetane number, while other authors [32,33] stated that the cetane number and NIR spectra can be successfully modeled assuming a linear relationship. The inclusion of the PNN nonlinear classifier is aimed at addressing possible nonlinear relationships between the cetane number and NIR spectra, as there is no consensus about their nature. ...

In recent years, spectroscopy techniques such as near infrared (NIR) and Fourier transform infrared (FTIR) have been widely adopted as analytical tools in different fields and for several purposes. NIR and FTIR data are typically composed of hundreds or even thousands of highly correlated wavenumbers, a fact that can jeopardize the accuracy of several statistical techniques. In light of that, wavenumber selection emerges as an important step in prediction and classification tasks based on spectroscopy data. This paper proposes a novel framework for wavenumber selection aimed at classifying samples into proper categories, which is applied to two data sets from the petroleum sector. The method relies on two main stages: determination of intervals based on the distance between the average spectra of the classes, and selection of the most suitable intervals through cross-validation. An improvement in the misclassification rate was achieved for a NIR spectra data set of diesel, decreasing that metric from 13.90% to 11.63% after the application of the proposed method while retaining 23.19% of the original wavenumbers. As for the biodiesel FTIR data set, the method yielded a misclassification rate of 1.21% while retaining 4.95% of the original variables; the misclassification rate was 4.71% when all wavenumbers were used. The proposed method also outperformed traditional approaches for wavenumber selection.
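The first stage of such a framework, intervals scored by the distance between class-average spectra, can be sketched as follows on synthetic two-class data; the published method additionally chooses among the top intervals by cross-validation, which is omitted here.

```python
import numpy as np

def rank_intervals(X, y, n_intervals):
    """Split the wavenumber axis into equal intervals and rank them by the mean
    absolute distance between the two class-average spectra."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1).astype(int)
    gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    scores = np.array([gap[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])
    return np.argsort(scores)[::-1], edges

rng = np.random.default_rng(6)
x = np.linspace(0.0, 1.0, 100)
y = np.repeat([0, 1], 150)
base = np.exp(-((x - 0.7) / 0.15) ** 2)                   # absorption band shared by both classes
spectra = base + 0.05 * rng.standard_normal((300, 100))
spectra[y == 1] += 0.3 * np.exp(-((x - 0.35) / 0.03) ** 2)  # class-specific band near 0.35
order, edges = rank_intervals(spectra, y, n_intervals=10)
```

The top-ranked interval is the one containing the class-specific band, while the strong shared band near 0.7 scores low because it does not separate the classes.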

... A common and plausible solution is to pre-select optimal subsets of the spectrum for training. Selection methods based on wavelet coefficient regression and a genetic algorithm [8], mutual information and B-spline compression [9,10], and mutual information and a modified genetic algorithm [11] have been successfully implemented. ...

Near-infrared spectroscopy is a widely adopted technique for characterising biological tissues. The high dimensionality of spectral data, however, presents a major challenge for analysis. Here, we present a second-derivative Beer's law-based technique aimed at projecting spectral data onto a lower dimension feature space characterised by the constituents of the target tissue type. This is intended as a preprocessing step to provide a physically-based, low dimensionality input to predictive models. Testing the proposed technique on an experimental set of 145 bovine cartilage samples before and after enzymatic degradation, produced a clear visual separation between the normal and degraded groups. Reduced proteoglycan and collagen concentrations, and increased water concentrations were predicted by simple linear fitting following degradation (all ). Classification accuracy using the Mahalanobis distance was between these groups.

... Finally, the correlation coefficient may not be appropriate to measure the predictive power of variables when the data distribution is not Gaussian or the model to be constructed is nonlinear. In this case, using the more elaborate mutual information index, 14,15 which evaluates the dependency between two variables on the basis of their joint probability distribution, could be a better option. ...

This chapter addresses the problem of selecting appropriate predictors from an overall set of x variables for linear regression modeling. Criteria for selecting variables with or without the explicit construction of a model are discussed, together with several algorithms that were developed or adapted for use in the variable selection problem. For illustration purposes, two case studies are presented, concerning the selection of wavelengths in a multivariate calibration problem and of physicochemical descriptors in a quantitative structure-activity relationship investigation. Finally, possible drawbacks associated with variable selection are also discussed. Computational routines employing the Matlab 6.5 software are provided.

... By the definition of information entropy and MI, the probability density distribution of the random variables must be estimated approximately before the MI calculation. One kind of probability density estimation method, based on nearest neighbors, is introduced in [16] and has been used to good effect in [17,18] as well. The advantage of this method is that there is no need to estimate the probability density distribution function of any variable. ...

In clinical medicine, multidimensional time series data can be used to find the rules of disease progression by data mining technology, such as classification and prediction. However, in multidimensional time series data mining problems, the excessive data dimension causes inaccuracy of the probability density distribution and increases the computational complexity. Besides, information redundancy and irrelevant features may lead to high computational complexity and over-fitting problems. The combination of these two factors can reduce the classification performance. To reduce computational complexity and to eliminate information redundancy and irrelevant features, we improved upon a multidimensional time series feature selection method to achieve dimension reduction. The improved method selects features through the combination of the Kozachenko–Leonenko (K–L) information entropy estimation method for feature extraction based on mutual information and a feature selection algorithm based on class separability. We performed experiments on an electroencephalogram (EEG) dataset for verification and on a non-small cell lung cancer (NSCLC) clinical dataset for application. The results show that, compared with CLeVer, Corona and AGV, the improved method can effectively reduce the dimensions of multidimensional time series for clinical data.
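The Kozachenko–Leonenko estimator mentioned above computes differential entropy directly from k-nearest-neighbour distances, with no explicit density estimate. A compact sketch using scipy (a k-d tree for the neighbour search and the digamma function):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(X, k=3):
    """Kozachenko-Leonenko differential entropy estimate (in nats) from
    k-th nearest-neighbour distances."""
    X = np.asarray(X, float)
    n, d = X.shape
    # distance to the k-th neighbour (index 0 of the query result is the point itself)
    r = cKDTree(X).query(X, k=k + 1)[0][:, k]
    # log volume of the d-dimensional unit ball
    log_vd = (d / 2.0) * np.log(np.pi) - gammaln(1.0 + d / 2.0)
    return float(digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r)))

rng = np.random.default_rng(7)
h_unif = kl_entropy(rng.uniform(0.0, 1.0, (4000, 1)))   # true differential entropy: 0
```

Because only neighbour distances enter the formula, the estimator avoids the curse-of-dimensionality problems of explicit density estimation that the abstract alludes to.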

... The study of this interaction is called spectroscopy. Energy can be absorbed by matter, and the amount of absorption depends on the type of compound [23]. However, not all the measured variables are related to the compound of interest. ...

This paper proposes a multi-objective genetic algorithm for the problem of variable selection in multivariate calibration. We consider the problem of classifying biodiesel samples to detect adulteration, using a Linear Discriminant Analysis classifier. The goal of the multi-objective algorithm is to reduce the dimensionality of the original set of variables, so that the classification model can be less sensitive, providing a better generalization capacity. In particular, we adopt a version of the Non-dominated Sorting Genetic Algorithm (NSGA-II) and compare it to a mono-objective Genetic Algorithm (GA) in terms of sensitivity in the presence of noise. Results show that the mono-objective algorithm selects 20 variables on average and presents an error rate of 14%. On the other hand, the multi-objective algorithm selects 7 variables and has an error rate of 11%. Consequently, we show that the multi-objective formulation provides classification models with lower sensitivity to instrumental noise when compared to the mono-objective formulation.

... Rossi et al. [46] stated that B-Spline estimated MI reduces feature selection. It is a good choice as it is non-parametric and model-independent. ...

The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. Its estimation using B-splines, however, has not been used before to create an approach for Independent Component Analysis. In this paper we present a B-spline estimator for mutual information to find the independent components in mixed signals. Tested on electroencephalography (EEG) signals, the resulting BMICA (B-Spline Mutual Information Independent Component Analysis) exhibits better performance than the standard Independent Component Analysis algorithms FastICA, JADE, SOBI and EFICA in similar simulations. BMICA was also found to be more reliable than the renowned FastICA.

... Variable selection methods that were originally designed to extract the most pertinent wavelengths from the full spectrum have drawn considerable attention in recent quantitative analyses. Both experimental and theoretical applications have demonstrated that the prediction and interpretation performance of the calibration model can be improved through variable selection [5][6][7][8][9][10][11]. ...

Evaluating a subsampling variable selection method using only its prediction performance is insufficient. To further assess the reliability of subsampling variable selection methods, dummy noise variables of different amplitudes were appended to the original spectral data, and the number of falsely selected variables was recorded. The reliabilities of three subsampling variable selection methods, including Monte Carlo uninformative variable elimination (MC-UVE), competitive adaptive reweighted sampling (CARS), and stability CARS (SCARS), were evaluated using this dummy noise strategy. The evaluation results indicated that both CARS and SCARS produced more parsimonious variable sets, but the reliabilities of their final variable sets were weaker than those of MC-UVE. On the other hand, only marginal improvement in the prediction performance was obtained using MC-UVE. Further experiments showed that removing white noise-like variables beforehand improves the reliability of the variables extracted by CARS and SCARS. Copyright © 2014 John Wiley & Sons, Ltd.
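The stability index at the heart of MC-UVE can be sketched in a few lines: regression coefficients are collected over many random subsamples, and each variable is scored by mean/std of its coefficient; uninformative variables score near zero. Plain least squares stands in here for the PLS models actually used, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(8)

def mc_uve_stability(X, y, n_runs=100, frac=0.7):
    """Stability of each variable's regression coefficient over random subsamples:
    mean(b_j) / std(b_j), large in magnitude for consistently useful variables."""
    n, d = X.shape
    B = np.empty((n_runs, d))
    for r in range(n_runs):
        idx = rng.choice(n, int(frac * n), replace=False)
        B[r] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return B.mean(axis=0) / (B.std(axis=0) + 1e-12)

X = rng.standard_normal((500, 20))                 # only the first 3 variables matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.5 * rng.standard_normal(500)
stab = mc_uve_stability(X, y)
informative = np.argsort(-np.abs(stab))[:3]        # variables ranked by |stability|
```

Appending known dummy noise columns, as the abstract describes, gives an empirical null distribution for this score: any real variable scoring within the dummy range is suspect.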

... In unsupervised context, the best basis is obtained by minimizing the entropy of the features (i.e., of the coordinates of the functions on the basis) in order to enable compression by discarding the less important features. Following [7], [9] proposes a different approach, based on B-splines: a leave-one-out version of Equation (1) is used to select the best B-spline basis. While the orthonormal basis induced by the B-splines does not correspond to compactly supported functions, the dependency between a new feature and the original ones is still localized enough to allow easy interpretation. ...

Functional data analysis involves data described by regular functions rather than by a finite number of real valued variables. While some robust data analysis methods can be applied directly to the very high dimensional vectors obtained via a fine grid sampling of functional data, all the methods benefit from a prior simplification of the functions that reduces the redundancy induced by the regularity. In this paper we propose to use a variable clustering approach to design a piecewise constant representation of a set of functions. The contiguity constraint induced by the functional nature of the variables leads to an optimal algorithm.

... In these conditions, the most natural solution consists in reducing the data dimensionality. Different methods exist in the literature, which can be distinguished as either feature selection methods or projection techniques, e.g., mutual-information-based selection, forward-backward procedures, B-splines and genetic algorithms [4][5][6][7][8]. ...

In this paper, we propose a two-stage regression approach based on the residual correction concept. Its underlying idea is to correct any given regressor by analyzing and modeling its residual errors in the input space. We report and discuss the results of experiments conducted on three different datasets in infrared spectroscopy, designed to test the proposed approach by: 1) varying the regression method used to approximate the chemical parameter of interest, considering partial least squares regression (PLSR), support vector machines (SVM) and radial basis function neural network (RBF) methods; and 2) adopting or not adopting a feature selection strategy to reduce the dimension of the space in which the regression task is performed. A comparative study with another approach which exploits estimation errors differently, namely adaptive boosting for regression (AdaBoost.R), is also included. The obtained results point out that the residual-based correction approach (RBC) can improve the accuracy of the estimation process. Not all the improvements are statistically significant but, at the same time, no case of accuracy decrease has been observed.
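The residual-correction idea can be sketched with two simple stand-in stages: an OLS first-stage regressor whose residuals are then modelled with a k-NN regressor in the input space; the final prediction is the sum of the two. The paper's actual stages are PLSR/SVM/RBF models, and the data below are synthetic.

```python
import numpy as np

def knn_regress(X_train, t_train, X_test, k=5):
    """Plain k-nearest-neighbour regression (brute-force distances)."""
    D = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(D, axis=1)[:, :k]
    return t_train[idx].mean(axis=1)

rng = np.random.default_rng(9)
X = rng.uniform(0.0, 3.0, (600, 1))
y = 2.0 * X[:, 0] + np.sin(4.0 * X[:, 0]) + 0.05 * rng.standard_normal(600)
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

# stage 1: linear regression (with intercept); it misses the sinusoidal part
A = np.column_stack([np.ones(len(Xtr)), Xtr])
beta = np.linalg.lstsq(A, ytr, rcond=None)[0]
stage1_tr = A @ beta
stage1_te = np.column_stack([np.ones(len(Xte)), Xte]) @ beta

# stage 2: model the stage-1 residuals in the input space, then correct
resid_pred = knn_regress(Xtr, ytr - stage1_tr, Xte)
rbc_te = stage1_te + resid_pred

mse_lin = np.mean((yte - stage1_te) ** 2)
mse_rbc = np.mean((yte - rbc_te) ** 2)
```

The correction recovers the structure the first stage left in its residuals, which is exactly the mechanism by which RBC can only help when such structure exists.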

... Filtering can therefore be considered as a preprocessing step in which functional data are consistently transformed into vector data [33]. As we are operating now in a finite-dimensional space, it is possible to work with the coefficients instead of working on the approximating functions. ...

In chemometrics, spectral data are typically represented by vectors of features in spite of the fact that they are usually plotted as functions of e.g. wavelengths and concentrations. In the representation, this functional information is thereby not reflected. Consequently, some characteristics of the data that can be essential for discrimination between samples of different classes or any other analysis are ignored. Examples are the continuity between measured points and the shape of curves. In the Functional Data Analysis (FDA) approach, the functional characteristics of spectra are taken into account by approximating the data by real valued functions, e.g. splines. Another solution is the Dissimilarity Representation (DR), in which classifiers are trained in a space built by dissimilarities with training examples or prototypes of each class. Functional information may be incorporated in the definition of the dissimilarity measure. In this paper we compare the feature-based representation of chemical spectral data with three other representations: FDA, DR defined on raw data and DR defined on FDA descriptions. We analyze the classification results of these four representations for five data sets of different types, by using different classifiers. We demonstrate the importance of reflecting the functional characteristics of chemical spectral data in their representation, and we show when the presented approaches are more suitable. Copyright © 2011 John Wiley & Sons, Ltd.
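A minimal sketch of the dissimilarity representation idea: each spectrum is encoded by its dissimilarities to a set of prototypes, and the dissimilarity measure can carry functional information, here by comparing first differences (curve shapes) rather than raw values, which makes classification insensitive to random baseline offsets. The data, the derivative-based measure and the nearest-prototype rule are all illustrative choices, not the paper's exact setup.

```python
import numpy as np

def shape_dissim(A, B):
    """Euclidean dissimilarity between first differences: compares curve shapes
    while ignoring constant baseline shifts."""
    dA, dB = np.diff(A, axis=1), np.diff(B, axis=1)
    return np.sqrt(((dA[:, None, :] - dB[None, :, :]) ** 2).sum(axis=2))

rng = np.random.default_rng(10)
x = np.linspace(0.0, 1.0, 80)
labels = np.repeat([0, 1], 100)
peaks = np.where(labels[:, None] == 0,
                 np.exp(-((x - 0.3) / 0.08) ** 2),     # class 0: peak at 0.3
                 np.exp(-((x - 0.55) / 0.08) ** 2))    # class 1: peak at 0.55
# large random baseline offset per spectrum, uninformative for the class
spectra = peaks + rng.uniform(0.0, 5.0, (200, 1)) + 0.02 * rng.standard_normal((200, 80))

# prototypes: a few training curves per class; classify the remaining curves
# by the nearest prototype in the shape-dissimilarity space
train = np.r_[0:5, 100:105]
D = shape_dissim(np.delete(spectra, train, axis=0), spectra[train])
pred = (D.argmin(axis=1) >= 5).astype(int)
truth = np.delete(labels, train)
acc = (pred == truth).mean()
```

A raw Euclidean distance would be dominated by the irrelevant baseline; building the representation on a functional dissimilarity restores near-perfect separation, which is the paper's central point.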

... Moreover the high degree of co-linearity between wavelengths renders the modeling of the spectra slow and unstable. The dimension reduction is achieved by projecting the spectra on a B-Spline basis as in [6]. Indeed, the spectra can be viewed as the discretization of a continuous function. ...

For the third consecutive year, and following the success of the chemometric contests organized within previous congresses, another data set was proposed by the organising committee of the 'Chimiométrie 2006' meeting (http://www.chimiometrie.org/) held in Paris, France (30th November and 1st December). As in the first contest organized in 2004, this data set was selected to test the participants' ability to use regression methods on NIR data. The data set consists of three different properties characterizing soils from the Walloon region in Belgium. This year, unlike in previous contests, the data were not modified by the authors. Only three participants decided to tackle the proposed data and presented their own approaches during the conference. As last year, this paper summarizes the approaches presented during the meeting by the participants and the authors.

... When we are given a set of curves, a unique segmentation can be found in order to represent all the curves on a common piecewise constant basis (see [11] for an optimal solution). This was used as a preprocessing step in e.g. [10, 5]. We propose in this paper to merge the two approaches: we build a k-means-like clustering of a set of functions in which each prototype is given by a piecewise constant function. ...

We propose in this paper an exploratory analysis algorithm for functional data. The method partitions a set of functions into K clusters and represents each cluster by a piecewise constant prototype. The total number of segments in the prototypes, P, is chosen by the user and optimally distributed into the clusters via two dynamic programming algorithms.
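The optimal distribution of segments rests on a classic dynamic program. The single-curve building block (fitting one curve with P constant segments of minimal squared error) can be sketched as follows; the names are illustrative, not the authors' code:

```python
import numpy as np

def best_segmentation(y, P):
    """Optimal piecewise-constant fit of a curve with P segments,
    minimising total squared error, via dynamic programming."""
    y = np.asarray(y, float)
    n = len(y)
    s1 = np.concatenate([[0.0], np.cumsum(y)])       # prefix sums of y
    s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])  # prefix sums of y^2

    def sse(i, j):
        # Squared error of fitting the mean on y[i:j]
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    dp = np.full((P + 1, n + 1), np.inf)
    cut = np.zeros((P + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for p in range(1, P + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                c = dp[p - 1, i] + sse(i, j)
                if c < dp[p, j]:
                    dp[p, j], cut[p, j] = c, i
    bounds, j = [n], n                # recover segment boundaries
    for p in range(P, 0, -1):
        j = cut[p, j]
        bounds.append(j)
    return dp[P, n], bounds[::-1]
```

Prefix sums make each segment cost O(1), so the whole program runs in O(P n^2); the paper's contribution is distributing P across K cluster prototypes with a second, similar program.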

... ANN is a sophisticated nonlinear computational tool capable of modeling complex functions (10). In order to improve the performance of chemometric models, wavelength selection techniques such as principal component analysis (11,12), mutual information (13,14), and genetic algorithms (GA; 15) are used to remove uninformative data. GA is a feature selection method that tackles the optimization problem using a fitness function. ...

A method has been established for simultaneous determination of sodium sulfate, sodium carbonate, and sodium tripolyphosphate in detergent washing powder samples based on attenuated total reflectance Fourier transform IR spectrometry in the mid-IR spectral region (800-1550 cm(-1)). Genetic algorithm (GA) wavelength selection followed by feed forward back-propagation artificial neural network (BP-ANN) was the chemometric approach. Root mean square error of prediction for BP-ANN and GA-BP-ANN was 0.0051 and 0.0048, respectively. The proposed method is simple, with no tedious pretreatment step, for simultaneous determination of the above-mentioned components in commercial washing powder samples.

... In 2007, Rossi et al. [102] presented a fast selection of NIR spectral variables with B-spline compression. This implemented a forward-backward procedure applied to the coefficients of a B-spline representation of the spectra. ...

Near-infrared (NIR) spectroscopy has increasingly been adopted as an analytical tool in various fields, such as the petrochemical, pharmaceutical, environmental, clinical, agricultural, food and biomedical sectors during the past 15 years. A NIR spectrum of a sample is typically measured by modern scanning instruments at hundreds of equally spaced wavelengths. The large number of spectral variables in most data sets encountered in NIR spectral chemometrics often renders the prediction of a dependent variable unreliable. Recently, considerable effort has been directed towards developing and evaluating different procedures that objectively identify variables which contribute useful information and/or eliminate variables containing mostly noise. This review focuses on the variable selection methods in NIR spectroscopy. Selection methods include some classical approaches, such as manual approach (knowledge based selection), "Univariate" and "Sequential" selection methods; sophisticated methods such as successive projections algorithm (SPA) and uninformative variable elimination (UVE), elaborate search-based strategies such as simulated annealing (SA), artificial neural networks (ANN) and genetic algorithms (GAs) and interval base algorithms such as interval partial least squares (iPLS), windows PLS and iterative PLS. Wavelength selection with B-spline, Kalman filtering, Fisher's weights and Bayesian are also mentioned. Finally, the websites of some variable selection software and toolboxes for non-commercial use are given.

... Those methods are less sensitive to overfitting and lead to an easy interpretation of the results, but they are generally quite slow, although some works [6] aim at improving the computational time of variable selection methods. Functional Data Analysis (FDA) is an extension of traditional multivariate analysis that is specifically oriented to deal with observations of functional nature [7]. ...

Quantitative analyses involving instrumental signals, such as chromatograms, NIR, and MIR spectra have been successfully applied nowadays for the solution of important chemical tasks. Multivariate calibration is very useful for such purposes and the commonly used methods in chemometrics consider each sample spectrum as a sequence of discrete data points. An alternative way to analyze spectral data is to consider each sample as a function, in which a functional data is obtained. Concerning regression, some linear and nonparametric regression methods have been generalized to functional data. This paper proposes the use of the recently introduced method, support vector regression for functional data (FDA-SVR) for the solution of linear and nonlinear multivariate calibration problems. Three different spectral datasets were analyzed and a comparative study was carried out to test its performance with respect to some traditional calibration methods used in chemometrics such as PLS, SVR and LS-SVR. The satisfactory results obtained with FDA-SVR suggest that it can be an effective and promising tool for multivariate calibration tasks.

... Marx and Eilers [14] proposed for the first time the projection onto some number of the equally spaced B-spline bases. The use of B-splines as the tool for selection of spectral variables was published recently by Rossi et al. [15]. ...

The BFR (Basis Function Regression) is an interesting alternative to common techniques (such as PCR or PLS) in chemometrics. It is based on projecting the spectral information onto a number of equally spaced spline bases, then obtaining integrals of the resulting curves. Existing references show that in certain cases it can reduce RMSEP values by almost half. As this technique is neither popular in chemometrics nor applied in pharmaceutical analysis, it is desirable to present its theoretical considerations and implementation (with example MATLAB/Octave code). As an illustrative example we present a chemometric model for content recognition of a tablet (12 possible compounds in binary or ternary combinations) from the UV spectrum of its methanolic extract. The BFR technique gave the lowest prediction error, and the estimators obtained have more substantive meaning than in the case of PCR, PLS and the other techniques used for comparison. In our opinion this technique should be considered in any chemometric approach, as it often shows better performance.
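The core of the projection idea — represent each spectrum by its coefficients on an equally spaced B-spline basis — can be sketched in pure NumPy. This is a hedged illustration, not the cited MATLAB/Octave implementation; the clamped knot placement and cubic degree are assumptions:

```python
import numpy as np

def bspline_basis(x, t, degree):
    """Cox-de Boor recursion: design matrix (len(x) x n_basis) of B-splines
    of the given degree on the knot vector t."""
    x, t = np.asarray(x, float), np.asarray(t, float)
    B = np.zeros((len(x), len(t) - 1))
    for i in range(len(t) - 1):              # degree-0: span indicators
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    last = int(np.nonzero(t[:-1] < t[1:])[0][-1])
    B[x == t[-1], last] = 1.0                # close the last span
    for k in range(1, degree + 1):
        Bk = np.zeros((len(x), len(t) - k - 1))
        for i in range(len(t) - k - 1):
            d1, d2 = t[i + k] - t[i], t[i + k + 1] - t[i + 1]
            left = (x - t[i]) / d1 * B[:, i] if d1 > 0 else 0.0
            right = (t[i + k + 1] - x) / d2 * B[:, i + 1] if d2 > 0 else 0.0
            Bk[:, i] = left + right
        B = Bk
    return B

def bfr_coefficients(wavelengths, spectrum, n_interior, degree=3):
    """Least-squares projection of one spectrum onto a clamped,
    equally spaced B-spline basis; returns (coefficients, fitted curve)."""
    lo, hi = wavelengths[0], wavelengths[-1]
    interior = np.linspace(lo, hi, n_interior + 2)[1:-1]
    t = np.concatenate([[lo] * (degree + 1), interior, [hi] * (degree + 1)])
    B = bspline_basis(wavelengths, t, degree)
    coefs, *_ = np.linalg.lstsq(B, spectrum, rcond=None)
    return coefs, B @ coefs
```

A spectrum sampled at hundreds of wavelengths is thereby reduced to degree + 1 + n_interior coefficients, on which any regression model can then be trained.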

For modeling of multivariate time series, input variable selection is a key problem. Feature selection aims to select a relevant subset that reduces the dimensionality of the problem without significant loss of information. This paper presents the estimation of mutual information and its application to the feature selection problem. Mutual information is one of the most common strategies borrowed from information theory for feature selection. However, the calculation of the probability density function (PDF) required by the definition of mutual information is difficult, especially for high-dimensional variables. A k-nearest neighbor (k-NN) based estimator is therefore widely used to estimate the mutual information between two variables directly from the data set. Nevertheless, this estimator depends on a smoothing parameter, and there is no theoretically grounded method to choose it. This paper proposes to solve two problems: one is to employ resampling methods to help the mutual information estimator improve feature selection, and the other is to apply these methods to a wind power prediction problem.
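The density-based definition that the k-NN estimator avoids can be illustrated with the naive plug-in approach: bin the data, estimate the joint PDF from a 2-D histogram, and apply the MI formula directly. A minimal sketch follows; the bin count is an arbitrary smoothing parameter, which is exactly the kind of sensitivity discussed above:

```python
import numpy as np

def binned_mi(x, y, bins=10):
    """Plug-in mutual information from a 2-D histogram (in nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                      # joint PDF estimate
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

For independent variables this estimate stays near zero (with a small positive bias), while a deterministic relation drives it toward log(bins); both behaviours are easy to verify on synthetic data.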

Over the past 30 years, near-infrared (NIR) spectroscopy combined with chemometric methods has proved to be one of the most efficient and advanced tools for quantitative and qualitative analysis of food and agricultural products. Although NIR instrumentation produces large volumes of data, it often, as we describe, requires careful and sophisticated processing in order to extract information. Food and agricultural products have their own specific compositions, which give characteristic NIR spectra that can be considered a "fingerprint." Chemometric methods have been found to be very useful for extracting information from NIR spectra, and there is great interest in using NIR technology for measurements of different analytes. The chemometric methods, especially the variable selection methods, are highlighted in this chapter. Finally, two applications of NIR spectroscopy combined with variable selection methods are introduced.

In this paper, we describe the use of various methods of one-dimensional spectral compression by variable selection as well as principal component analysis (PCA) for compressing multi-dimensional sets of spectral data. We have examined methods of variable selection such as wavelength spacing, spectral derivatives, and spectral integration error. After variable selection, reduced transmission spectra must be decompressed for use. Here we examine various methods of interpolation, e.g., linear, cubic spline and piecewise cubic Hermite interpolating polynomial (PCHIP) to recover the spectra prior to estimating at-sensor radiance. Finally, we compressed multi-dimensional sets of spectral transmittance data from moderate resolution atmospheric transmission (MODTRAN) data using PCA. PCA seeks to find a set of basis spectra (vectors) that model the variance of a data matrix in a linear additive sense. Although MODTRAN data are intricate and are used in nonlinear modeling, their base spectra can be reasonably modeled using PCA yielding excellent results in terms of spectral reconstruction and estimation of at-sensor radiance. The major finding of this work is that PCA can be implemented to compress MODTRAN data with great effect, reducing file size, access time and computational burden while producing high-quality transmission spectra for a given set of input conditions.
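The PCA compression step described above can be sketched with a thin SVD of the mean-centred data. This is a generic illustration of the approach, not the MODTRAN-specific pipeline:

```python
import numpy as np

def pca_compress(X, n_components):
    """Find basis spectra via SVD of the mean-centred data and
    project each row (spectrum) onto them."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:n_components]        # rows: basis spectra
    scores = (X - mean) @ basis.T    # compressed representation
    return mean, basis, scores

def pca_reconstruct(mean, basis, scores):
    # Decompress: linear additive model plus the mean spectrum
    return mean + scores @ basis
```

Storage drops from one value per wavelength to one score per retained component (plus the shared mean and basis spectra), and reconstruction is a single matrix product.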

A novel method for rapid, accurate and nondestructive determination of trimethoprim in a complex matrix is presented. Near-infrared spectroscopy coupled with multivariate calibration (partial least-squares and artificial neural networks) was applied in the experiment. A variable selection process based on a modified genetic algorithm with a fixed number of selected variables was performed, which can reduce the training time and enhance the predictive ability when coupled with an artificial neural network model.

A new method named diverse variables-consensus partial least squares (DV-CPLS) is proposed, based on a consensus (ensemble) strategy combined with the uninformative variable elimination (UVE) technique. In this approach, UVE-PLS is used to construct member models with different numbers of variables (wavelengths), instead of altering the training subset as in conventional consensus methods, and the predictions of the member models are then combined by a new weighted averaging scheme to give ensemble results. DV-CPLS is applied to building a quantitative model between diesel near-infrared (NIR) spectra and cetane number (CN), and the results show good prediction capability in terms of accuracy and robustness. When DV-CPLS was further combined with the wavelet transform (WT) method, a more parsimonious model was obtained. The proposed method improves the performance of conventional PLS linear modeling in the determination of diesel CN by NIR spectra. It is hoped that it will support further investigations of consensus modeling and variable selection techniques, as well as applications in NIR and other spectral analyses of sophisticated systems.

Feature selection is an important preprocessing task for many machine learning and pattern recognition applications, including regression and classification. Missing data are encountered in many real-world problems and have to be considered in practice. This paper addresses the problem of feature selection in prediction problems where some occurrences of features are missing. To this end, the well-known mutual information criterion is used. More precisely, it is shown how a recently introduced nearest neighbors based mutual information estimator can be extended to handle missing data. This estimator has the advantage over traditional ones that it does not directly estimate any probability density function. Consequently, the mutual information may be reliably estimated even when the dimension of the space increases. Results on artificial as well as real-world datasets indicate that the method is able to select important features without the need for any imputation algorithm, under the assumption of missing completely at random data. Moreover, experiments show that selecting the features before imputing the data generally increases the precision of the prediction models, in particular when the proportion of missing data is high.

The classification of animal feed ingredients has become a challenging computational task since the food crisis that arose in the European Union after the outbreak of bovine spongiform encephalopathy (BSE). The most interesting alternative to replace visual observation under classical microscopy is based on the use of near infrared reflectance microscopy (NIRM). This technique collects spectral information from a set of microscopic particles of animal feeds. These spectra can be classified using maximum margin classifiers with good results. However, it is difficult to interpret the models in terms of the contribution of features. To gain insight into the interpretability of such classifications, we propose a method that learns accurate classifiers defined on a small set of narrow intervals of wavelengths. The proposed method is a greedy bipartite procedure that may be successfully compared with other state-of-the-art feature selectors and can be scaled up efficiently to deal with other classification tasks of higher dimensionality.

Consensus modeling based on an improved Boosting algorithm (Boosting-PLS, BPLS), combined with wavelength (variable) selection by the MC-UVE (Monte Carlo-Uninformative Variable Elimination) method, is applied to the determination of the cetane number (CN) of diesel. MC-UVE is first used to select characteristic variables from near-infrared (NIR) spectra of diesel based on the principles of MC simulation and UVE; the selected variables, instead of the full spectra, are then used for BPLS modeling. The proposed MC-UVE-BPLS algorithm improves the performance of conventional linear PLS modeling in terms of accuracy and robustness, and is more efficient and parsimonious, requiring only a small number of useful variables to relate CN to diesel NIR spectra. Simultaneously, the prediction results of MC-UVE-BPLS compared with those of MC-UVE-PLS, BPLS and CPLS (consensus modeling based on Bagging) show that MC-UVE-BPLS is superior to the other models, which also verifies the efficiency of MC-UVE and the improved BPLS. The proposed MC-UVE-BPLS method thus provides a new approach for the determination of diesel CN by NIR spectra.

Feature selection is an important preprocessing step for many high-dimensional regression problems. One of the most common strategies is to select a relevant feature subset based on the mutual information criterion. However, no connection has been established yet between the use of mutual information and a regression error criterion in the machine learning literature. This is an important gap, since minimising such a criterion is eventually the objective one is interested in. This paper demonstrates that under some reasonable assumptions, features selected with the mutual information criterion are the ones minimising the mean squared error and the mean absolute error. Conversely, it is also shown that the mutual information criterion can fail to select optimal features in some situations, which we characterise. The theoretical developments presented in this work are expected to lead in practice to a critical and efficient use of the mutual information for feature selection.

In this paper, a novel chemometric method was developed for rapid, accurate, and quantitative analysis of cefalexin. The experiment was carried out using near-infrared spectrometry coupled with multivariate calibration (partial least squares and artificial neural networks). Wavelength selection through a modified genetic algorithm with a fixed number of selected variables enhances the predictive ability of the artificial neural network model.

Synchronous 2D correlation spectroscopy is proposed, for the first time, to select informative spectral intervals in PLS calibration. The proposed method can extract the spectral intervals related to the analyte. Its application to the NIR/PLS determination of quercetin in extract of Ginkgo biloba leaves showed that the method finds an optimized region that improves the performance of the corresponding PLS model, yielding a lower root mean square error of prediction (RMSEP) than results obtained using the whole spectra or interval PLS.

In this paper, a novel chemometric method was developed for rapid, accurate, and quantitative analysis of cefalexin in samples. The experiments were carried out using short near-infrared spectroscopy coupled with artificial neural networks. To enhance the predictive ability of the artificial neural network model, a modified genetic algorithm was used to select a fixed number of wavelengths.

Global optimisation and search problems are abundant in science and engineering, including spectroscopy and its applications. Therefore, it is hardly surprising that general optimisation and search methods such as genetic algorithms (GAs) have also found applications in the area of near infrared (NIR) spectroscopy. A brief introduction to genetic algorithms, their objectives and applications in NIR spectroscopy, as well as in chemometrics, is given. The most popular application for GAs in NIR spectroscopy is wavelength or, more generally speaking, variable selection. GAs are both frequently used and convenient in multi-criteria optimisation; for example, selection of pre-processing methods, wavelength inclusion, and selection of latent variables can be optimised simultaneously. Wavelet transform has recently been applied to pre-processing of NIR data. In particular, hybrid methods of wavelets and genetic algorithms have in a number of research papers been applied to pre-processing, wavelength selection and regression with good success. In all calibrations and, in particular, when optimising, it is essential to validate the model and to avoid over-fitting. GAs have a large potential when addressing these two major problems and we believe that many future applications will emerge. To conclude, optimisation gives good opportunities to simultaneously develop an accurate calibration model and to regulate model complexity and prediction ability within a considered validation framework.

In this paper, we propose a new approach for the construction of a hybrid color-texture space by using mutual information. Feature extraction is done by the co-occurrence matrix with SVM (support vectors machine) as a classifier. We apply our approach to the VisTex database and to the classification of a SPOT HRV (XS) image representing two forest areas in the region of Rabat in Morocco. We compare the result of classification obtained in this hybrid space with the one in the RGB color space.

Feature selection for spectral data can be highly beneficial both to improve the predictive ability of the model and to greatly enhance its interpretation. This paper presents an efficient approach based on regularized orthogonal forward selection. The selection procedure is a direct optimization of model generalization capability, sequentially minimizing the leave-one-out (LOO) test error. Moreover, a regularization method is incorporated in order to further enforce model sparsity and generalization capability. The introduced algorithm is computationally very efficient, yet obtains a good feature subset that ensures model generalization and interpretability. Comparisons with some existing state-of-the-art feature selection methods on several real data sets show that our algorithm performs well with respect to computational efficiency and prediction accuracy.
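Sequential LOO minimisation is cheap for linear models because the leave-one-out residuals have the closed form e_i / (1 - h_ii), where h_ii are the diagonal entries of the hat matrix. A minimal sketch with ridge regularisation follows; it is illustrative only, since the paper's algorithm additionally orthogonalises the candidate features:

```python
import numpy as np

def loo_press(X, y, lam=1e-6):
    """Closed-form leave-one-out squared error (PRESS) for ridge regression."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - H @ y
    return float(np.sum((resid / (1.0 - np.diag(H))) ** 2))

def forward_loo_selection(X, y, n_select, lam=1e-6):
    """Greedily add the feature whose inclusion yields the lowest LOO error."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        _, best = min((loo_press(X[:, selected + [j]], y, lam), j)
                      for j in remaining)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Each candidate is scored by an exact n-fold cross-validation at the cost of a single fit, which is what makes direct LOO optimisation tractable here.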

A PLS-bootstrap-VIP approach is proposed as a simple wavelength selection method, yet having the ability to identify relevant spectral intervals. This approach is particularly attractive for wavelength selection within hyperspectral images due to its simplicity and relatively low computational cost compared to more sophisticated interval search methods. The method was tested on four visible-NIR spectral imaging datasets taken from the polymer, oil and pulp and paper industries. The results were compared with those obtained using PLS regression coefficients as well as with two more sophisticated methods involving several metrics or search for wavelength intervals. It is shown that a small number of well defined relevant spectral intervals are identified with the proposed approach, providing easy spectral interpretation in agreement with more complex interval search methods. Before final use, fine adjustments to the VIP threshold may be tested to verify whether predictive power can be improved.

We propose in this paper an exploratory analysis algorithm for functional data. The method partitions a set of functions into K clusters and represents each cluster by a simple prototype (e.g., piecewise constant). The total number of segments in the prototypes, P, is chosen by the user and optimally distributed among the clusters via two dynamic programming algorithms. The practical relevance of the method is shown on two real world datasets.

Chemical process installations are exposed to aggressive chemicals and conditions leading to corrosion. The damage from corrosion can lead to an unexpected plant shutdown and to the exposure of people and the environment to chemicals. Due to changes within and on the surface of materials subjected to corrosion, energy is released in the form of acoustic waves. This acoustic activity can be captured and used for corrosion monitoring in chemical process installations. Wavelet packet coefficients extracted from the acoustic activity have been considered to determine whether corrosion occurs, and to identify the type of corrosion process, at least for the most important corrosion processes in the chemical process industry. Feature subset selection is then applied to these wavelet coefficients to achieve a much higher accuracy in the identification of different corrosion processes than when no feature subset selection is applied to the acoustic waves. However, due to the statistical dependencies that potentially exist between the wavelet coefficients, the latter should not be selected independently from each other. Local discriminant basis selection algorithms do not take the statistical dependencies between wavelet coefficients into account. In this paper, we have used several mutual information-based approaches that take these dependencies into account and compared them to the wavelet-specific local discriminant basis selection algorithm. Furthermore, a hybrid filter–wrapper genetic algorithm, which uses a relevance–redundancy approach as a local search procedure, was designed. The highest classification accuracies are obtained with the hybrid filter–wrapper genetic algorithm, for all classifiers used in this paper. Furthermore, the proposed algorithm easily outperformed one of the most commonly used classifiers in chemometrics: partial least squares discriminant analysis (PLS-DA). 
A naïve Bayes classifier that uses the features selected by the hybrid filter–wrapper genetic algorithm was able to identify the absence of corrosion, uniform corrosion, pitting and stress corrosion cracking, with an accuracy of up to 87.20%.

A common problem found in statistics, signal processing, data analysis and image processing research is the estimation of mutual information, which tends to be difficult. The aim of this survey is threefold: an introduction for those new to the field, an overview for those working in the field and a reference for those searching for literature on different estimation methods. In this paper, comparison studies on mutual information estimation are considered. The paper starts with a description of entropy and mutual information and closes with a discussion of the performance of different estimation methods and some future challenges.

Prediction problems from spectra are frequently encountered in chemometrics. In addition to accurate predictions, it is often necessary to extract information about which wavelengths in the spectra contribute effectively to the quality of the prediction. This implies selecting wavelengths (or wavelength intervals), a problem associated with variable selection. In this paper, it is shown how this problem may be tackled in the specific case of smooth (for example infrared) spectra. The functional character of the spectra (their smoothness) is taken into account through a functional variable projection procedure. Contrary to standard approaches, the projection is performed on a basis that is driven by the spectra themselves, in order to best fit their characteristics. The methodology is illustrated by two examples of functional projection, using Independent Component Analysis and functional variable clustering, respectively. The performances on two standard infrared spectra benchmarks are illustrated.

In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach and show a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and Naive-Bayes.

We present two classes of improved estimators for mutual information M(X,Y), from samples of random points distributed according to some joint probability density mu(x,y). In contrast to conventional estimators based on binnings, they are based on entropy estimates from k -nearest neighbor distances. This means that they are data efficient (with k=1 we resolve structures down to the smallest possible scales), adaptive (the resolution is higher where data are more numerous), and have minimal bias. Indeed, the bias of the underlying entropy estimates is mainly due to nonuniformity of the density at the smallest resolved scale, giving typically systematic errors which scale as functions of k/N for N points. Numerically, we find that both families become exact for independent distributions, i.e. the estimator M(X,Y) vanishes (up to statistical fluctuations) if mu(x,y)=mu(x)mu(y). This holds for all tested marginal distributions and for all dimensions of x and y. In addition, we give estimators for redundancies between more than two random variables. We compare our algorithms in detail with existing algorithms. Finally, we demonstrate the usefulness of our estimators for assessing the actual independence of components obtained from independent component analysis (ICA), for improving ICA, and for estimating the reliability of blind source separation.
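A compact version of the first estimator in this family can be written with brute-force distance matrices. This is a sketch for scalar variables only; since digamma is evaluated here solely at positive integers, a harmonic-number formula suffices and no special-function library is needed:

```python
import numpy as np

def psi_int(n):
    # Digamma at a positive integer: psi(n) = -gamma + H_{n-1}
    return -0.5772156649015329 + sum(1.0 / i for i in range(1, int(n)))

def ksg_mi(x, y, k=3):
    """Kraskov-Stoegbauer-Grassberger MI estimator (algorithm 1, nats)
    for two scalar samples, using max-norm k-nearest-neighbour distances."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)                  # Chebyshev distance in (x, y)
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]      # distance to the k-th neighbour
    # Neighbours strictly inside eps in each marginal (excluding the point)
    nx = np.sum(dx < eps[:, None], axis=1) - 1
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    return (psi_int(k) + psi_int(n)
            - np.mean([psi_int(v + 1) for v in nx])
            - np.mean([psi_int(v + 1) for v in ny]))
```

The O(n^2) distance matrices keep the sketch short; production implementations use k-d trees to reach the scaling reported in the paper.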

This paper investigates the application of the mutual information criterion to evaluate a set of candidate features and to select an informative subset to be used as input data for a neural network classifier. Because the mutual information measures arbitrary dependencies between random variables, it is suitable for assessing the "information content" of features in complex classification tasks, where methods based on linear relations (like the correlation) are prone to mistakes. The fact that the mutual information is independent of the coordinates chosen permits a robust estimation. Nonetheless, the use of the mutual information for tasks characterized by high input dimensionality requires suitable approximations because of the prohibitive demands on computation and samples. An algorithm is proposed that is based on a "greedy" selection of the features and that takes into account both the mutual information with respect to the output class and with respect to the already-selected features. Finally, the results of a series of experiments are discussed.

This book is based on the author's experience with calculations involving polynomial splines. It presents those parts of the theory which are especially useful in calculations and stresses the representation of splines as linear combinations of B-splines. After two chapters summarizing polynomial approximation, a rigorous discussion of elementary spline theory is given involving linear, cubic and parabolic splines. The computational handling of piecewise polynomial functions (of one variable) of arbitrary order is the subject of chapters VII and VIII, while chapters IX, X, and XI are devoted to B-splines. The distances from splines with fixed and with variable knots is discussed in chapter XII. The remaining five chapters concern specific approximation methods, interpolation, smoothing and least-squares approximation, the solution of an ordinary differential equation by collocation, curve fitting, and surface fitting. The present text version differs from the original in several respects. The book is now typeset (in plain TeX), the Fortran programs now make use of Fortran 77 features. The figures have been redrawn with the aid of Matlab, various errors have been corrected, and many more formal statements have been provided with proofs. Further, all formal statements and equations have been numbered by the same numbering system, to make it easier to find any particular item. A major change has occured in Chapters IX-XI where the B-spline theory is now developed directly from the recurrence relations without recourse to divided differences. This has brought in knot insertion as a powerful tool for providing simple proofs concerning the shape-preserving properties of the B-spline series.

The 1064-nm excited Fourier transform (FT) Raman spectra have been measured in situ for various foods in order to investigate the potential of near-infrared (NIR) FT-Raman spectroscopy in food analysis. It is demonstrated here that NIR FT-Raman spectroscopy is a very powerful technique for (1) detecting selectively the trace components in foodstuffs, (2) estimating the degree of unsaturation of fatty acids included in foods, (3) investigating the structure of food components, and (4) monitoring changes in the quality of foods. Carotenoids included in foods give two intense bands near 1530 and 1160 cm⁻¹ via the pre-resonance Raman effect in the NIR FT-Raman spectra, and therefore the NIR FT-Raman technique can be employed to detect them nondestructively. Foods consisting largely of lipids, such as oils, tallow, and butter, show bands near 1658 and 1443 cm⁻¹ due to C=C stretching modes of cis unsaturated fatty acid parts and CH2 scissoring modes of saturated fatty acid parts, respectively. It has been found that there is a linear correlation for various kinds of lipid-containing foods between the iodine value (number) and the intensity ratio of the two bands at 1658 and 1443 cm⁻¹ (I1658/I1443), indicating that the ratio can be used as a practical indicator for estimating the unsaturation level of a wide range of lipid-containing foods. A comparison of the Raman spectra of raw and boiled egg white shows that the amide I band shifts from 1666 to 1677 cm⁻¹ and the intensity of the amide III band at 1275 cm⁻¹ decreases upon boiling. These observations indicate that most of the α-helix structure changes into unordered structure in the proteins constituting egg white upon boiling. The NIR FT-Raman spectrum of old-leaf (about one year old) Japanese tea has been compared with that of its new leaf. The intensity ratio of the two bands at 1529 and 1446 cm⁻¹ (I1529/I1446), assignable to carotenoids and proteins, respectively, is considerably smaller in the former than in the latter, indicating that the ratio is useful for monitoring changes in the quality of Japanese tea.

The use of NIR diffuse reflectance spectrometry and partial least squares (PLS) multivariate calibration for determining the finishing oil content in acrylic fibres is described. Various PLS calibration models for predicting the finishing oil content of the samples from the NIR spectral data were constructed by considering all sources of sample variability and using the wavelength region where the finishing oil absorbed significantly. The best model was selected in terms of performance (lowest prediction error) and number of PLS factors. A mathematical treatment (standard normal variate) was applied to the NIR spectra to reduce the effect of fibre linear density on the PLS calibration matrix and hence provide a simpler and more robust model. The method was used for quality control analysis at a production line.
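
The standard normal variate (SNV) pretreatment mentioned above has a simple, standard definition: each spectrum is centred and scaled by its own mean and standard deviation, which suppresses multiplicative scatter effects such as the fibre linear density influence. A minimal sketch:

```python
import numpy as np

def snv(spectra):
    """Row-wise SNV transform of a (samples x wavelengths) matrix."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std
```

After SNV, two spectra differing only by an additive offset and a multiplicative factor become identical, which is why it simplifies the subsequent PLS calibration.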

The B-spline zero (B0) compression method has been investigated for three different raw spectrometric data sets, namely NIR reflectance, FT-IR reflectance and UV-VIS transmittance data. The data were modelled with partial least squares (PLS) regression, and we have investigated the compression ratio and model dimensionality versus the ability to preserve the information of the original data as a function of knot sequence guide type, i.e. mean, standard deviation (SD) and relative standard deviation (RSD) of the variables. Generally, we conclude that the investigated data can be compressed to about 20% of the original number of variables without loss of prediction ability or change in model dimensionality (number of PCs) in comparison with the original data.
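
A zeroth-degree B-spline is a piecewise-constant function, so B0 compression amounts to replacing each group of adjacent variables with a single value per knot interval. The sketch below assumes a uniform knot sequence and uses the interval mean; the data-driven knot guides (mean, SD, RSD of the variables) discussed in the abstract are not reproduced here.

```python
import numpy as np

def b0_compress(X, n_knots):
    """Compress a (samples x variables) matrix X to n_knots
    piecewise-constant coefficients per sample (interval means)."""
    X = np.asarray(X, dtype=float)
    # Uniform knot sequence over the variable axis.
    edges = np.linspace(0, X.shape[1], n_knots + 1).astype(int)
    return np.column_stack([X[:, a:b].mean(axis=1)
                            for a, b in zip(edges[:-1], edges[1:])])
```

Compressing 100 variables to 20 coefficients reproduces the roughly 20% compression ratio reported above, after which PLS is run on the coefficient matrix instead of the raw spectra.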

Three-mode PCA is computationally very demanding. It requires a large amount of storage space and many floating point operations (FLOPS). By using three-mode B-spline compression of three-mode data arrays, the original data array can be replaced by a smaller coefficient array. Three-mode principal component analysis (PCA) is then performed on the much smaller coefficient array instead of on the original array. For the compression approach to be efficient, the three-mode data array is assumed to be well approximated by smooth functions. The smoothness affects the dimensions of the coefficient array. It is always possible to approximate the data to any precision, but the reward in reduced computation time and storage is lost when the dimensions of the coefficient array approach the dimensions of the original array.

A general framework for manipulating spectra as functions in traditional multivariate methods such as PCA and PLS is described. The functional representation is very convenient for compression, ensuring smoothness and continuity. There are two fundamentally different types of representation: (a) by functions and (b) by function coefficients. The use of coefficients is the more practical way of carrying out the analysis.

For efficient handling of very large data arrays, pretreatment by compression is mandatory. In the present paper B-spline methods are described as good candidates for such data array compression. The mathematical relation between the maximum entropy method for compression of data tables and the B-spline of zeroth degree is described together with the generalization of B-spline compression to nth-order data array tables in matrix and tensor algebra.
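
In matrix terms, the coefficient representation discussed in the last two abstracts is a least-squares factorization: with a basis matrix B (variables x n_basis), the coefficients C satisfy X ~ C Bᵀ, and any multivariate method can then be run on the much smaller C. The sketch below is illustrative only; a Gaussian bump basis stands in for the B-splines so the example stays self-contained, and the sizes and bump width are arbitrary choices.

```python
import numpy as np

def compress(X, B):
    """Least-squares coefficients C such that X ~ C @ B.T."""
    C, *_ = np.linalg.lstsq(B, X.T, rcond=None)
    return C.T

# Illustrative smooth "spectra": 200 variables, 12 basis functions.
n_vars, n_basis = 200, 12
grid = np.linspace(0.0, 1.0, n_vars)
centres = np.linspace(0.0, 1.0, n_basis)
B = np.exp(-0.5 * ((grid[:, None] - centres[None, :]) / 0.1) ** 2)

X = np.vstack([np.sin(2 * np.pi * grid), np.cos(2 * np.pi * grid)])
C = compress(X, B)          # 2 x 12 coefficient array
X_hat = C @ B.T             # reconstruction from the coefficients
```

Because the data are smooth, the 12 coefficients per spectrum retain nearly all of the information carried by the 200 original variables, which is exactly the situation in which the compression pays off.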

In order to improve the storage and CPU time in the numerical analysis of large two-dimensional (hyphenated, second-order) infrared spectra, a data-preprocessing technique (compression) is presented which is based on B-splines. B-splines have been chosen as the compression method since they are well-suited to model smooth curves. There are two primary goals of compression: a reduction of file size and a reduction of computation when analyzing the compressed representation. The compressed representation of the spectra is used as a substitute for the original representation. For the particular example used here, approximately 0.16 bit per data element was required for the compressed representation, in contrast with 16 bits per data element in the uncompressed representation. The compressed representation was further analysed using principal component analysis and compared with a similar analysis on the original data set. The results show that the principal component model of the compressed representation is directly comparable with the principal component model of the original data.


Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.
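
The two quantities named above have short closed forms, sketched here for concreteness: the entropy H of a discrete distribution (the compression limit) and, as the textbook example of a capacity, C = 1 - H(p) for a binary symmetric channel that flips each bit with probability p.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

def bsc_capacity(p):
    """Capacity in bits per use of a binary symmetric channel
    with crossover probability p."""
    return 1.0 - entropy([p, 1.0 - p])
```

A uniform distribution over four symbols has entropy 2 bits, a noiseless binary channel (p = 0) carries 1 bit per use, and a channel that flips fairly (p = 0.5) carries nothing.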

A method for the analytical control of different pharmaceutical production steps involving various types of sample (blended products, cores and coated tablets) is proposed. The measurements are made by using a near infrared (NIR) diffuse reflectance spectrophotometer furnished with a fibre-optic module that enables expeditious, flexible analyses with no sample manipulation. Calibration for the active principle in the pharmaceutical is done by applying partial least-squares regression to first-derivative spectra over the wavelength range 1100–2200 nm. Various spectral pretreatments for minimizing spectral scattering were tested. All samples studied were analysed by using a single calibration including spectra of laboratory samples in order to expand the concentration range spanned by the production samples, as well as samples obtained in different steps of the production process in order to include its variability sources (variations in origin of raw materials, particle size distribution, compactness, granulation, etc.). Production samples for inclusion in the calibration set were chosen by principal component analysis. The partial least squares model based on first-derivative spectra provided a relative standard error of prediction lower than 1% for production samples.

Spectrophotometric data often comprise a great number of numerical components or variables that can be used in calibration models. When a large number of such variables are incorporated into a particular model, many difficulties arise, and it is often necessary to reduce the number of spectral variables. This paper proposes an incremental (Forward–Backward) procedure, initiated using an entropy-based criterion (mutual information), to choose the first variable. The advantages of the method are discussed; results in quantitative chemical analysis by spectrophotometry show the improvements obtained with respect to traditional and nonlinear calibration models.

Functional data analysis (FDA) is an extension of traditional data analysis to functional data, for example spectra, temporal series, spatio-temporal images, gesture recognition data, etc. Functional data are rarely known in practice; usually a regular or irregular sampling is known. For this reason, some processing is needed in order to benefit from the smooth character of functional data in the analysis methods. This paper shows how to extend the radial-basis function networks (RBFN) and multi-layer perceptron (MLP) models to functional data inputs, in particular when the latter are known through lists of input–output pairs. Various possibilities for functional processing are discussed, including the projection on smooth bases, functional principal component analysis, functional centering and reduction, and the use of differential operators. It is shown how to incorporate these functional processing into the RBFN and MLP models. The functional approach is illustrated on a benchmark of spectrometric data analysis.

Most statistical analyses involve one or more observations taken on each of a number of individuals in a sample, with the aim of making inferences about the general population from which the sample is drawn. In an increasing number of fields, these observations are curves or images. Curves and images are examples of functions, since an observed intensity is available at each point on a line segment, a portion of a plane, or a volume. For this reason, we call observed curves and images ‘functional data,’ and statistical methods for analyzing such data are described by the term ‘functional data analysis.’ It is the smoothness of the processes generating functional data that differentiates this type of data from more classical multivariate observations. This smoothness means that we can work with the information in the derivatives of functions or images. This article includes several illustrative examples.

Several recent machine learning publications demonstrate the utility of using feature selection algorithms in supervised learning tasks. Among these, sequential feature selection algorithms are receiving attention. The most frequently studied variants of these algorithms are forward and backward sequential selection. Many studies on supervised learning with sequential feature selection report applications of these algorithms, but do not consider variants of them that might be more appropriate for some performance tasks. This paper reports positive empirical results on such variants, and argues for their serious consideration in similar learning tasks.

Data from spectrophotometers form vectors of a large number of exploitable variables. Building quantitative models using these variables most often requires using a smaller set of variables than the initial one. Indeed, too large a number of input variables to a model results in too large a number of parameters, leading to overfitting and poor generalization abilities. In this paper, we suggest the use of the mutual information measure to select variables from the initial set. The mutual information measures the information content in input variables with respect to the model output, without making any assumption on the model that will be used; it is thus suitable for nonlinear modelling. In addition, it leads to the selection of variables among the initial set, and not to linear or nonlinear combinations of them. Without decreasing model performance compared to variable projection methods, it therefore allows greater interpretability of the results.

Radial basis functions for multivariable interpolation: a review

- Powell

Powell, M., 1987. Radial basis functions for multivariable interpolation: a review. In: Mason, J.C., Cox, M.G. (Eds.), Algorithms for Approximation. Clarendon Press, New York, NY, USA, pp. 143-167.

A comparative evaluation of sequential feature selection algorithms

- Aha

Aha, D. W., Bankert, R. L., 1996. A comparative evaluation of sequential feature selection algorithms. In: Fisher, D., Lenz, H.-J. (Eds.), Learning from Data: AI and Statistics V. Springer-Verlag, Ch. 4, pp. 199-206.

Compression of first-order spectral data using the B-spline zero compression method

- Olsson