Deep chemometrics: Validation and transfer of a global deep near-infrared fruit model to use it on a new portable instrument

Abstract

Recently, a large near-infrared spectroscopy data set for mango fruit quality assessment was made available online. Based on that data, a deep learning (DL) model outperformed all major chemometrics and machine learning approaches. However, in earlier studies, the model validation was limited to the test set from the same data set, which was measured with the same instrument on samples from a similar origin. From a DL perspective, once a model is trained it is expected to generalise well when applied to a new batch of data. Hence, this study aims to validate the generalisability of the earlier developed DL model for dry matter (DM) prediction in mango on a different test set measured in a local laboratory setting with a different instrument. First, the performance of the old DL model is presented. Next, a new DL model was crafted to cover the seasonal variability related to the fruit harvest season. Finally, a DL model transfer method was applied to use the model on a new instrument. The direct application of the old DL model led to a higher error than the partial least squares (PLS) model. However, the performance of the DL model improved drastically when it was tuned to cover the seasonal variability. The updated DL model performed best compared with implementing a new PLS model or updating the existing PLS model. A final root-mean-square error of prediction (RMSEP) of 0.518% was reached. This result supports the view that, when large data sets are available, DL modelling can outperform chemometrics approaches.

KEYWORDS
artificial intelligence, calibration transfer, deep learning, fruit-chemistry, spectroscopy
RESEARCH ARTICLE
Puneet Mishra¹ | Dário Passos²
¹ Wageningen Food and Biobased Research, Wageningen University and Research, Wageningen, The Netherlands
² CEOT | Physics Department, Universidade do Algarve, Faro, Portugal
Correspondence
Puneet Mishra, Wageningen Food and Biobased Research, Wageningen University and Research, Bornse Weilanden 9, P.O. Box 17, 6700AA, Wageningen, The Netherlands.
Email: puneet.mishra@wur.nl
Received: 11 April 2021 | Revised: 28 June 2021 | Accepted: 6 July 2021
DOI: 10.1002/cem.3367 (https://doi.org/10.1002/cem.3367)
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2021 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.
Journal of Chemometrics. 2021;e3367. wileyonlinelibrary.com/journal/cem

1 | INTRODUCTION
Rapid prediction of fruit quality is of wide interest for assessing the ripeness and ready-to-eat stage of fresh fruit.[1–4] Such rapid prediction supports decision making at several stages of the fresh fruit supply chain, such as choosing the best date for harvest and monitoring fruit during controlled ripening or storage.[5,6] Furthermore, non-destructive prediction of fruit quality reduces experimental time and costs and minimises food losses along the supply chain.[5,6] A popular non-destructive technique for rapid analysis of fresh fruit traits is portable Vis-NIR spectroscopy.[6–8] Vis-NIR spectroscopy exploits the interaction of Vis-NIR electromagnetic radiation with the fruit and its correlation with several chemical and physical properties of the fruit.[7,8] In particular, the visible (Vis) part of the spectrum mainly explains the fruit skin pigments, which in some cases are correlated with the fruit ripeness stage, while the near-infrared (NIR) part captures the signal related to the chemical components[9,10] of the pulp, such as moisture and sugars, which are a good proxy for fruit maturity.[6,11,12] In some works, Vis-NIR spectroscopy has also been used to explain physical properties such as firmness, which is a good indicator of the ripeness level of fruit such as avocados.[13]
Motivated by the reliable performance of portable Vis-NIR spectroscopy in estimating several physicochemical traits in fruit, several user-friendly, low-cost commercial portable spectrometers have appeared on the market.[5,6] The portability of spectrometers ranges from pocket spectrometers connected to a mobile phone[14] to handheld spectrometers with embedded computing systems.[5,11,12] Although the hardware is market-ready, the main challenge related to the distributed calibration of the spectrometers persists.[15,16] This is because Vis-NIR spectrometers are highly dependent on the pre-calibration of the sensors before their deployment in practice.[7] At present, in most practical cases, users must carry out their own calibrations for the application of interest by measuring a few sets of new samples.[5–7] These calibrations also fail when they meet samples with unmodelled variability.[17–19] For fresh fruit, model failure is commonly due to high biological variability,[20,21] which can be related to different cultivars, the season of harvest, storage conditions and ripening stages of the fruit.[5,6,20,21] Hence, a natural solution to this calibration failure problem is to measure a wide range of samples from different cultivars, seasons of harvest, storage conditions and ripening stages to calibrate global models.[5,6] These global models are expected to work well in the face of the unseen variability that is expected when predicting a new sample.[6] Measuring samples that cover a wide variability is the first step towards global models; the second step is to model the data properly so that robust models learn not only the relationship between spectra and the property of interest but also complex patterns such as the causes of seasonal, cultivar and ripeness-related variability.[22,23] Several authors have tried linear, non-linear, global and local modelling approaches to achieve robust models and have achieved satisfactory performance.[6,22,23]
A key feature of portable spectroscopy is its wide-scale usage compared with laboratory-based scientific spectrometers.[14,15,24] Such wide-scale usage is an advantage because it can generate enormous, information-rich data sets. Large data sets covering a wide variety of samples allow the development of robust models that would otherwise be limited to the data generated by a single user. In fruit spectroscopy, such data sets are gaining popularity. For example, a mango data set with 11,691 spectra and reference dry matter (DM) measurements was recently made publicly available.[25] The data set was explored in several studies with traditional chemometric and machine learning approaches, and the best root-mean-square error of prediction (RMSEP) was reported as 0.84%.[22,23] Taking advantage of the size of this data set, a deep learning (DL) model was recently proposed.[26] That DL model outperformed all the earlier linear, non-linear, local and global modelling approaches and reached an RMSEP of 0.79% on the same test set reported in earlier studies. The performance of the DL model was further improved by removing the remaining outliers, and an RMSEP of 0.76% was reached.
A DL model that is trained on a large data set and has the best predictive performance is expected to be used in practice. For example, users who have bought a new portable spectrometer and are interested in predicting DM in mango will be highly interested in using the existing DL model covering a wide variability of samples, instead of recalibrating the sensors from scratch. In earlier work on DL modelling of the mango data set,[26] the main aim was to optimise and reach a robust DL model that improved on the RMSEP previously reported for the same data set. In that regard, to have a fair comparison, the DL model in the earlier study[26] was tested on the predefined test set available in the mango data set, that is, data corresponding to the 2018 harvest season.[22,23] However, in a practical scenario, a new user will be interested in using the DL model on a different instrument from the one used to measure the mango data set. Hence, it is currently unclear whether the DL model developed in Mishra and Passos[26] can be used directly on a new instrument or whether it requires some form of model adaptation or recalibration first. Furthermore, since the DL model was trained on samples from the 2015, 2016 and 2017 harvests, it is important to explore whether the model performs well when applied to samples from a new harvest season measured on a different spectrometer. To the best of our knowledge, based on our literature search and our experience with the development of the primary DL model,[26] this work is the first to validate the mango DL model[26] for use on a new instrument and on mango samples from a new harvest season, that is, the year 2020.
The study has three main aims:
1. To provide independent validation of the earlier DL model[26] for mango DM prediction on a new test set measured in a local laboratory setting with a new spectral instrument.
2. To retrain/optimise the model with respect to seasonal variability to make it robust to seasonal differences.
3. To update the DL model with transfer learning (TL)[27] to compensate for instrument differences, since the measurements in this study were performed with a different spectrometer. The TL approach can be regarded as the calibration transfer approach for DL models, just as calibration transfer between spectrometers is required for standard chemometric calibrations.[28]
2 | MATERIALS AND METHODS
2.1 | Data sets
Two data sets were used for modelling and independent testing. The modelling data set was the freely available mango data set, and the independent testing data set was generated experimentally for this study. More details on the data sets are as follows.
2.1.1 | Open-access mango data set
The first data set has spectral and reference DM measurements on mango from four harvest seasons: 2015, 2016, 2017 and 2018.[25] Data were acquired with a portable F750 instrument (Felix Instruments, Camas, USA). Drying for DM measurements was performed with an oven dryer. Detailed information on the data set can be found in previous works,[22,23] and the data can be accessed at Anderson et al.[25] At the online data source,[25] there are 11,691 spectra in total, which are pre-partitioned into a calibration set (10,243 spectra from the years 2015, 2016 and 2017) and an independent test set (1448 spectra from the year 2018). The spectral range was reduced to 684–990 nm (103 variables) to keep the analysis comparable to earlier work.[26]
The original data have several outliers in the calibration set; hence, this study used the more stringently outlier-cleaned version of the data suggested in Mishra and Passos.[26] Furthermore, as suggested in that study, the data were augmented with spectral pre-processing prior to any analysis.[26] The final data set has 9914 samples for model training and validation and 1448 samples for an independent test of the model. After the data augmentation suggested in Mishra and Passos,[26] the total number of variables was 618, obtained by stacking five pre-processed data blocks onto the original absorbance data block (6 × 103 variables). The five pre-processing methods were the 1st derivative and 2nd derivative computed with a Savitzky–Golay[29] filter with a window size of 13 and a polynomial order of 2, standard normal variate (SNV),[30] SNV + 1st derivative and SNV + 2nd derivative.
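The block stacking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes SciPy's savgol_filter for the derivatives, applies SNV to each spectrum (row) using the population standard deviation, and stacks the blocks in the order in which the text lists them. The function names (snv, sg_derivative, augment) and the random demo matrix are hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def sg_derivative(spectra, order, window=13, polyorder=2):
    """Savitzky-Golay derivative of the given order along the wavelength axis."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=order, axis=1)


def augment(absorbance):
    """Stack the raw absorbance block with five pre-processed blocks, column-wise."""
    snv_block = snv(absorbance)
    blocks = [
        absorbance,                   # original absorbance
        sg_derivative(absorbance, 1),  # 1st derivative
        sg_derivative(absorbance, 2),  # 2nd derivative
        snv_block,                    # SNV
        sg_derivative(snv_block, 1),  # SNV + 1st derivative
        sg_derivative(snv_block, 2),  # SNV + 2nd derivative
    ]
    return np.hstack(blocks)


# Synthetic stand-in for 4 spectra with 103 variables (684-990 nm); not real data.
rng = np.random.default_rng(0)
demo = rng.random((4, 103))
augmented = augment(demo)
print(augmented.shape)  # (4, 618)
```

Each block keeps the 103 original variables, so six blocks give the 618 variables quoted in the text.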
2.1.2 | New mango data set
To perform an independent validation of the DL model proposed in Mishra and Passos,[26] and to adapt it for future use on a new instrument, new mango samples were measured with a new portable spectrometer. The mango samples were from a new harvest, that is, the year 2020, and originated in Brazil. The mangoes were of two cultivars, 'Keitt' and 'Kent'. A total of 540 mangoes (270 'Keitt' and 270 'Kent') were bought from a local vendor in The Netherlands. The spectrometer was an F750 spectrometer (Felix Instruments, Camas, USA), that is, a similar instrument to, but not the same unit as, the one used to generate the open-access mango data set. Since most fruit were hard and green at the time of purchase, the mangoes were stored at different temperatures, that is, 12°C, 17°C and 23°C at 85% relative humidity (RH), to create a wide variation in ripeness level. Low-temperature storage hinders fruit ripening, while high-temperature storage accelerates it. 90 'Kent' and 90 'Keitt' mangoes were stored at each temperature level. Mangoes were stored for 1 week in the fruit ripening facilities and, before the experiment day, were stored for 1 day under the same temperature condition (23°C at 85% RH) to reduce any effect of temperature on the spectral measurements. On the day of the experiment, the spectral measurements were performed first, and afterwards a part of the mango fruit was collected from the spot of the spectral measurement. The peel of that part was removed, the fresh weight of the fruit's flesh was measured using an electronic balance, and the flesh was dried in a hot-air oven. After drying, the dried fruit was weighed, and the DM was estimated and expressed in %. In total, 540 spectral and DM reference measurements were generated. The spectra were first trimmed to match the spectral range (684–990 nm, 103 variables in total) used for the development of the DL model. Outlier removal was then performed using the T² and Q statistics plots obtained from the partial least squares (PLS) decomposition of the spectra with the DM measurements. The outlier removal was performed interactively by plotting the T² and Q statistics and dropping any observed outlying samples. Next, the spectral data were augmented as suggested in Mishra and Passos,[26] that is, by pre-processing the same data with five different pre-processing approaches. After outlier removal and data augmentation, 510 samples remained, each having 618 variables. All data analysis was performed with the augmented data, both because it is needed for applying the old DL model[26] and because the model performed better on the augmented data than on the raw data.[26]
2.2 | Data modelling
In this study, there were three main directions for data modelling. The first was the direct application of the DL and PLS models[26] developed on the open-access mango data set[25] to the new mango data set measured in this study. This part aimed to study whether the previously developed DL and PLS models[26] generalise well to data from a new harvest measured with a different instrument. The second was the development of new DL and PLS models by combining all the data of the open-access mango data set and testing them on the new mango data set. A key point to note is that in the earlier study, the DL model was based on data from the 2015, 2016 and 2017 seasons, and the data from the 2018 season were used as the independent test set. In this study, the second part aimed to exploit the full potential of the open-access mango data set; hence, modelling was performed by combining data from all four seasons, that is, 2015, 2016, 2017 and 2018. The third part was the adaptation and update of the DL model to compensate for the differences between instruments. This part was necessary because the primary DL model was built on the open-access mango data, for which the spectral measurements were performed with a similar but not the same spectrometer. Hence, it was a classic case of calibration transfer, but without standards, as the two spectrometers belong to different scientific research groups based on different continents. All models were implemented using the Python (3.6) language and the TensorFlow 2.1 framework.
2.2.1 | Direct application of old models on a new independent test set
In Mishra and Passos,[26] DL and PLS models based on the open-access mango data set were proposed. The models were built on the calibration set of the open-access mango data set. Previously, the models were tested only on the test set of the open-access mango data, which corresponds to mangoes from the 2018 harvest. In this study, to assess their generalisability, the DL and PLS models were tested on the newly measured independent test set. The DL model was applied by defining the model architecture (1D CNN) and loading the pre-saved weights corresponding to the optimal DL model presented in Mishra and Passos.[26] The DL model was applied using the model.predict() function from TensorFlow/Keras. Applying the PLS model consisted of multiplying the regression coefficients with the spectra from the new independent test set; this was done using the pls.predict() function from the scikit-learn library (https://scikit-learn.org/stable/). Prior to model application, data from the independent test set were scaled with respect to the calibration set of the open-access mango data set. The scaling was performed by estimating the SNV[30] for each variable of the independent test set with respect to the calibration set used for DL model training. In SNV,[30] the mean response is first subtracted and the result is then divided by the standard deviation. To scale the test data with respect to the calibration data set, the mean and standard deviation of the calibration set were used.
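As described, this scaling amounts to a column-wise standardisation of the new data using statistics estimated on the calibration set only. A minimal NumPy sketch follows; the helper names (fit_scaler, apply_scaler) and the random matrices are hypothetical stand-ins, and the use of the population standard deviation is an assumption.

```python
import numpy as np


def fit_scaler(calibration):
    """Per-variable mean and standard deviation estimated on the calibration set."""
    return calibration.mean(axis=0), calibration.std(axis=0)


def apply_scaler(data, mean, std):
    """Scale any data set with the calibration-set statistics."""
    return (data - mean) / std


# Synthetic stand-ins for the augmented calibration and new test matrices.
rng = np.random.default_rng(1)
X_cal = rng.normal(loc=2.0, scale=0.5, size=(200, 618))
X_new = rng.normal(loc=2.2, scale=0.5, size=(30, 618))

cal_mean, cal_std = fit_scaler(X_cal)
X_cal_scaled = apply_scaler(X_cal, cal_mean, cal_std)
X_new_scaled = apply_scaler(X_new, cal_mean, cal_std)  # no statistics from X_new
```

Because the new data are never used to estimate the mean or standard deviation, any offset between instruments or seasons remains visible in X_new_scaled instead of being centred away.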
2.2.2 | New deep learning model development
In the DL model presented in Mishra and Passos,[26] the modelling was limited to the calibration set, as the data from the 2018 harvest were used as the test set. However, since this study has a new independent test set, a new DL model can be built to learn the variability related to seasonal differences. This can be achieved by using the calibration set of the open-access mango data for model training and the data from the 2018 season as the validation/tuning set. Modelling in such a way should improve the learning of the variability associated with seasonal differences. Finally, the new DL model can be tested on the new independent test set measured in this study for a new season and with a new instrument.
To develop the new DL model, the same DL architecture as used in Mishra and Passos[26] and first proposed in Cui and Fearn[31] was implemented. The architecture is a five-layer convolutional neural network. The first layer of the network is a reshape layer that transforms the spectra into a tensor; the tensor then passes through a 1D convolutional layer with a single filter, and the convolved signal is passed through three dense layers with 36, 18 and 12 neurons, respectively, and a final linear output layer with a single neuron. Three key hyperparameters, that is, the width of the convolution filter (kernel width), the strength of the L2 regularisation (β) and the batch size, were optimised using a grid search approach. The grid search was used as presented in Mishra and Passos,[26] as it allows the hyperparameters to be optimised sequentially and helps in understanding the effect of the different hyperparameters explored for model tuning. See Table 1.
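The architecture described above might be written in Keras roughly as follows. This is a sketch only: the text specifies the layer sequence, the single convolution filter, the dense layer sizes and the L2 regularisation, whereas the ELU activations, the flatten step, the default stride and padding, the placement of the regulariser on every weight layer, and the Adam/MSE compile settings are assumptions made here for illustration. The default kernel width (15) and β (0.02) are the values reported later in the paper for Model 6, and build_model is a hypothetical name.

```python
import tensorflow as tf


def build_model(n_variables=618, kernel_width=15, beta=0.02):
    """1D CNN sketch: reshape -> Conv1D (1 filter) -> dense 36/18/12 -> linear output."""
    reg = tf.keras.regularizers.l2(beta)  # L2 weight penalty with strength beta
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_variables,)),
        # Turn each spectrum into a (variables, channels) tensor for the convolution.
        tf.keras.layers.Reshape((n_variables, 1)),
        tf.keras.layers.Conv1D(filters=1, kernel_size=kernel_width,
                               activation="elu", kernel_regularizer=reg),
        tf.keras.layers.Flatten(),  # assumed step between convolution and dense layers
        tf.keras.layers.Dense(36, activation="elu", kernel_regularizer=reg),
        tf.keras.layers.Dense(18, activation="elu", kernel_regularizer=reg),
        tf.keras.layers.Dense(12, activation="elu", kernel_regularizer=reg),
        tf.keras.layers.Dense(1, activation="linear"),  # single DM prediction
    ])


model = build_model()
model.compile(optimizer="adam", loss="mse")  # assumed training configuration
```

Exposing the kernel width and β as function arguments keeps the builder reusable inside a grid search, where the batch size would be passed to model.fit instead.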
Based on the possible combinations, a total of 210 models were generated. To find the optimal model, an exploratory method used in the DL area was applied. The hyperparameter with the highest influence on the solution space is β (see supporting information for details). Therefore, the best β was chosen as the value that shows the smallest overfitting (the smallest difference between validation and calibration RMSE) while keeping a low validation RMSE. Models less prone to overfitting tend to generalise better. Once β was found, the combinations of kernel and batch sizes were explored by analysing the calibration and validation RMSE maps. The strategy was to look for common minima (or common broader basins of attraction) in both maps, find which models belong to these intersection areas and discriminate between them by choosing the one(s) least affected by overfitting (the lowest difference between the RMSE of the calibration and validation sets). The final model was then tested on the new independent test set. As a baseline, a new PLS model was also developed by combining the calibration and validation sets of the open-access mango data. For the optimisation of the PLS latent variables (LVs), a 10-fold cross-validation procedure was used. The elbow point of the error plot was used for selecting the number of LVs.
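The size of the grid and the overfitting criterion can be illustrated with a short Python sketch. The hyperparameter values are those listed in Table 1; the helper least_overfit and the RMSE pairs in candidates are hypothetical and purely illustrative, and the single-number rule is a simplification of the visual, map-based selection described in the text.

```python
from itertools import product

# Hyperparameter intervals from Table 1.
batch_sizes = [32, 64, 128, 256, 512]
kernel_widths = [5, 10, 15, 20, 25, 30]
betas = [0.001, 0.003, 0.008, 0.01, 0.015, 0.02, 0.03]

# Every combination defines one candidate model.
grid = [
    {"batch_size": b, "kernel_width": k, "beta": r}
    for b, k, r in product(batch_sizes, kernel_widths, betas)
]
print(len(grid))  # 210


def least_overfit(results):
    """Return the key whose validation-minus-calibration RMSE gap is smallest."""
    return min(results, key=lambda name: results[name][1] - results[name][0])


# Illustrative (calibration RMSE, validation RMSE) pairs; not values from the paper.
candidates = {"a": (0.60, 0.75), "b": (0.65, 0.72), "c": (0.55, 0.80)}
print(least_overfit(candidates))  # b
```

In the study itself the gap was weighed against a low validation RMSE, so a rule like this would be only one ingredient of the final choice.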
2.2.3 | Standard-free deep calibration transfer
The main difference between the open-access mango data set and the new test set measured in this study was the instrument used for the spectral measurements. Usually, to use a spectral model with data from a different instrument, a calibration transfer step is needed before model application. In the case of DL modelling, calibration transfer can be performed by fine-tuning the weights of the old DL model using some new samples measured on the new instrument.[26] A recent article explains more broadly the TL approach used in this study for calibration transfer.[27] Calibration transfer with TL is a standard-free procedure for transferring DL models between instruments; it does not require any measurements from the primary instrument and is therefore well suited to this study, as the primary instrument used to generate the open-access mango data was not available. To perform TL, the data generated in this study were divided 60%/40% into a fine-tune set and an external test set. The fine-tune set (60%) was used for TL, and the updated model was then tested on the external test set (40%). As a baseline, a PLS model was recalibrated by combining the open-access mango data set and the fine-tune set. Furthermore, a PLS model based solely on the fine-tune data was also set up to see whether updating the old model provided any benefit compared with developing a new PLS model with new data measured on the new instrument. All model performances were assessed using the RMSE; the RMSEP was used to assess the performance of all models.
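The 60%/40% partition and the RMSE metric can be sketched as below. The text does not state how samples were assigned to the two sets, so the random permutation and the seed are assumptions; split_fine_tune_test and rmse are hypothetical helper names, and the three DM values in the last line are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def split_fine_tune_test(n_samples, fine_tune_fraction=0.6, seed=0):
    """Randomly split sample indices into fine-tune and external test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_fine = int(round(fine_tune_fraction * n_samples))
    return order[:n_fine], order[n_fine:]


# 510 samples remained in the new mango data set after outlier removal.
fine_idx, test_idx = split_fine_tune_test(510)
print(len(fine_idx), len(test_idx))  # 306 204

# Illustrative DM values (%), not measurements from the study.
print(rmse([14.0, 15.0, 16.0], [14.5, 15.0, 15.5]))
```

The fine-tune indices would select the samples used to update the DL weights, and the RMSEP would be the rmse of the updated model's predictions on the held-out indices.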
3 | RESULTS AND DISCUSSION
A summary of the DM distributions of the different data sets is presented in Figure 1. Figure 1A shows the DM distribution of the calibration set of the open-access mango data set used for the development of the primary DL model presented in Mishra and Passos.[26] Figure 1B shows the DM distribution of the test set of the open-access mango data set as used for model validation in Mishra and Passos.[26] Figure 1C shows the DM distribution of the independent test set measured in this study. Most of the samples in the independent test set have low DM (Figure 1C) compared with the DM range of the open-access mango data (Figure 1A,B). In earlier work, the DL model was tested on the test set of the open-access mango data set, which included mangoes with, on average, high DM (Figure 1B) compared with the DM distribution of the calibration set (Figure 1A). Due to such high DM, the DL model gave a systematically higher RMSE on the test set of the open-access mango data set. In this study, the newly measured samples have, on average, low DM (Figure 1C) compared with the DM distribution of the calibration set (Figure 1A), but they were within a DM range similar to that of the calibration set.

TABLE 1 Hyperparameter intervals used in the grid search optimisation
Hyperparameter           Interval
Batch size               32, 64, 128, 256, 512
Filter size              5, 10, 15, 20, 25, 30
L2 regularisation (β)    0.001, 0.003, 0.008, 0.01, 0.015, 0.02, 0.03
3.1 | Performance of old deep learning and PLS models
The performances of the direct application of the DL and PLS models presented in Mishra and Passos[26] are shown in Figure 2. First, the performance of the DL (Figure 2A) and PLS (Figure 2B) models is presented for the test set of the open-access mango data set. These results are the same as reported in Mishra and Passos[26] and are presented here as a baseline for comparison with the testing performed on the new independent test set measured in this study. The DL model performance decreased on the independent test set compared with the earlier results reported for the test set of the open-access mango data set: the RMSE increased from 0.787% to 1.053%. The PLS model, which previously performed worse, showed a lower RMSE of 0.754% on the independent test set. This showed that although the PLS model did not perform the best in the earlier study, it generalised better to a completely new test set measured on a different instrument. Furthermore, the inferior performance of the DL model shows that the DL model developed in Mishra and Passos[26] was more sensitive to the instrument change than the PLS model presented in Mishra and Passos.[26] In this study, we considered two hypotheses for the inferior performance of the DL model on the new test set measured on a new instrument. The first hypothesis was that the new measurements were performed on a different instrument from the one used for the measurements on which the primary DL model was developed. The instrument change plays a significant role in NIR spectroscopy and requires a pre-adjustment for the instrumental differences. The second hypothesis was that in the earlier study,[26] a random data partition was used for DL model training. A random data partition can be inefficient for learning seasonal variability, and a better choice could be to retrain the model using data from a complete year as the validation set. Both limitations were dealt with in this study, and the results were as follows.
3.2 | New deep learning model
Based on the open-access mango data set, a new DL model was trained with the calibration set, using the test set for model tuning. Please note that the test set mentioned here is the test set of the open-access mango data set and not the independent test set measured in this study, which was used for the final evaluation of the models. A key difference between the new model and the old model presented in Mishra and Passos[26] was the use of the old test set for model tuning. Since the old test set of the open-access mango data set was from a new season of harvest, that is, the year 2018, using it for model tuning can allow the model to better capture seasonal variability. It was hypothesised that a model tuned to learn seasonal variability could generalise well to data from a new season, that is, the 2020 harvest data measured in this study.

FIGURE 1 Distribution of dry matter in the different data sets. Open-access mango data set: (A) calibration set and (B) test set. (C) Mango data set measured in this study
To reach the new DL model, three different hyperparameters were optimised sequentially, as explained in Section 2.2.2. Figure 3 shows the evolution of the mean RMSE of the tuning set and of the difference between the mean RMSE of the calibration and tuning sets for different values of β. A value of β = 0.02 (dashed vertical line in Figure 3) was selected as optimal because it provides a good compromise between low overfitting (the difference between the mean RMSE of the calibration and tuning sets) and a low tuning set RMSE compared with other values. The premise for this choice was that less overfitting leads to a better generalisation capacity of the model.
Once the optimal β = 0.02 was found, the best kernel and batch sizes were explored. Figure 4 shows the RMSE maps with respect to kernel and batch sizes for the calibration (Figure 4A) and tuning (Figure 4B) sets and for the difference between the two (Figure 4C). The black contour lines on the first two maps show the contour level corresponding to 1/3 of the scale, which was used to empirically define basins of attraction. Since there was no clear one-to-one match between minima in the calibration and tuning maps, we used this criterion to broadly define the basins of attraction, or areas of the hyperparameter space where the model performance tends to be more consistent. By overlapping/intersecting these maps (see Supplementary Materials for details), we found 8 points/models that were common to both areas (white dots in the figure) and picked the 3 (in red) with the lowest tuning-cal RMSE. For the chosen models (2, 3 and 6), the hyperparameters and performance on the original and new test sets are summarised in Table 2.
FIGURE 2 Model performances. (A) DL model developed in Mishra and Passos[26] and tested on the test set of the open-access mango data set. (B) PLS model developed in Mishra and Passos[26] and tested on the test set of the open-access mango data set. (C) DL model developed for the open-access mango data[26] and tested on the new test set from the 2020 mango harvest. (D) PLS model developed for the open-access mango data[26] and tested on the new test set from the 2020 mango harvest
For Model 6, the performance was drastically improved by remodelling using the data from the year 2018 as the tuning data set. A reason could be that this model was able to learn information related to seasonal variability. However, it was unclear what exact information led to such an improvement in model performance, because NIR is a non-specific technique in which most of the information is captured as highly overlapping spectral responses of the underlying physicochemical processes. The RMSE decreased from 1.053% to 0.619% and was also better than that of the recalibrated PLS model, that is, RMSE = 0.68% (Figure 5A). The difference between the RMSEPs of the DL and PLS models may not be highly significant considering the high uncertainty associated with NIR measurements; however, in practical use, users usually prefer models with lower errors. A key point to note is that, up to this stage, the reported model accuracies were obtained without any calibration transfer/model adaptation and were based solely on the open-access mango data measured on a different instrument from the one used in this study. In the next section, the effect of a deep calibration transfer on the model performance is presented. From this point onward, because of its best performance, Model 6 was used in the remainder of the study. The hyperparameters of Model 6 were kernel size = 15, batch size = 256 and β = 0.02.

FIGURE 3 Criteria for selecting the strength of the weights regularisation. In black, tuning-cal root-mean-square error (RMSE) and, in red, tuning RMSE as a function of the regularisation parameter β. The vertical blue dashed line represents the point where the best compromise between the tuning-cal RMSE and the tuning RMSE was found

FIGURE 4 The grid search root-mean-square error (RMSE) plots for exploring the effect of batch and filter size. (A) Calibration set, (B) tuning set and (C) RMSE calibration minus RMSE tuning set to judge overfitting. The black contours in (A) and (B) indicate 1/3 of the scale and are used to define basins of attraction in hyperparameter space. From the identified models in these areas, the three models with the lowest tuning-cal RMSE (marked as red dots) were identified

TABLE 2 The three models found during the hyperparameter optimisation and their performance on the independent test set
Model number    β       Batch size    Kernel size    Test RMSE (%)
2               0.02    64            25             1.053
3               0.02    32            15             0.814
6               0.02    256           15             0.619
3.3 | Deep calibration transfer and PLS recalibration
The open-access mango data set was measured on a different instrument from the one used in this study.
Hence, following common chemometric practice, a calibration transfer was needed to remove the instrumental
differences before model application. Since the primary instrument was not available, traditional standard-based
calibration transfer approaches could not be used. In this scenario, one possibility is to develop a new
calibration with data from the new instrument. The performance of a new PLS calibration (5 LVs) developed using
solely the data of the new instrument is shown in Figure 6C. Additionally, for comparison purposes, another PLS
model (7 LVs) was developed by combining the open-access mango data with some data from the new instrument. The
performance of the updated PLS model is shown in Figure 6D. Finally, the new DL model based on the open-access mango data
was also updated with some data measured on the new instrument using the TL approach. The TL used in this study
consisted of initialising the first five model layers with pre-trained values (of Model 6) and training these layers using
60% of the new test set (fine-tune set). The final layer of the model was trained from scratch (i.e., He_Normal weight
initialisation). This procedure allows a model that has already learned some important patterns in the original data
to evolve to encompass the variability of the new data set, in this study the variability related to the new season and
the new instrument. The final model evaluation was performed by applying the retrained model to the remaining 40% of the new
test set.
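The bookkeeping behind this procedure can be sketched without any deep learning framework. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it covers only the 60/40 split and the weight initialisation (no training is performed), the layer shapes, the sample count of 500 and all function names are hypothetical, and He-normal is taken as a zero-mean Gaussian with standard deviation sqrt(2 / fan_in), whereas deep learning libraries may use a truncated variant.

```python
import math
import random


def split_fine_tune(n_samples, fine_tune_fraction=0.6, seed=0):
    """Shuffle sample indices and split them into fine-tune and evaluation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_fine = round(fine_tune_fraction * n_samples)
    return indices[:n_fine], indices[n_fine:]


def he_normal(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix with standard deviation sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def init_for_transfer(pretrained_layers, rng):
    """Copy every layer except the last; re-initialise the last one from scratch."""
    transferred = [[row[:] for row in layer] for layer in pretrained_layers[:-1]]
    last = pretrained_layers[-1]
    transferred.append(he_normal(len(last), len(last[0]), rng))
    return transferred


# Hypothetical six-layer model: five 2x2 pre-trained layers and a 2x1 output layer.
pretrained = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(5)] + [[[0.5], [0.5]]]
layers = init_for_transfer(pretrained, random.Random(1))

# Hypothetical data set of 500 samples: 60% for fine-tuning, 40% for evaluation.
fine_tune_idx, eval_idx = split_fine_tune(500)
print(len(fine_tune_idx), len(eval_idx))
```

In an actual fine-tuning run, all six layers would then be trained on the fine-tune subset, and the error would be reported only on the held-out evaluation subset.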
Updating the model with TL reached the lowest RMSE = 0.518% (Table 3). The new PLS model developed
solely on the data from the new instrument showed the second lowest RMSE = 0.598%, while the recalibrated PLS
model showed the highest RMSE = 0.663% (Table 3). Such a superior performance of the DL model suggests that updating the
DL model was needed to compensate for the instrument-specific variability and should be performed in practice when
a DL model must be used on a new instrument. In the current study, we did not have access to the primary
Felix F750 equipment used for acquisition of the global mango fruit data set and were therefore not able to show a clear
difference between the responses of the two Felix F750 instruments. However, the improvement reached after the TL suggests
that the TL had to compensate for instrumental differences. Such differences between NIR spectrometers exist
for several reasons, such as the sensitivity of the detector, the light source and the many different electric and optic
components used in spectrometers.²⁸
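For reference, the error figures compared above follow the usual root-mean-square definition, and the relative gain from model updating can be computed directly from Table 3. The sketch below is illustrative only: the function names are hypothetical and the dry matter values in the first call are made up, while the four RMSEP values are those listed in Table 3.

```python
import math


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(before, after):
    """Fraction by which the error decreased after model updating."""
    return (before - after) / before


toy = rmsep([15.0, 17.0], [16.0, 16.0])       # made-up DM values (%)
dl_gain = relative_reduction(0.619, 0.518)    # DL model, Table 3
pls_gain = relative_reduction(0.754, 0.663)   # PLS model, Table 3
print(f"{toy:.3f} {dl_gain:.1%} {pls_gain:.1%}")
```

With the Table 3 values, the updating step corresponds to an error reduction of roughly 16% for the DL model and roughly 12% for the PLS model.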
FIGURE 5 (A) Performance of the PLS model recalibrated by combining data from the calibration and test sets of the mango data set and tested on the new season harvest. (B) Performance of the new DL model (Model 6, see Table 2) developed by treating the test set of the mango data set as the model validation set and tested on the new season harvest
FIGURE 6 (A) Evolution of cross-validation MSE for the development of the new PLS model based on data from the new harvest; five latent variables (LVs) were selected. (B) Evolution of cross-validation MSE for the development of the new PLS model based on combining the mango data set and some data from the new harvest; seven LVs were selected. (C) PLS model made from new harvest data and tested on new harvest data. (D) PLS model combining open-access mango data with some data from the new harvest and tested on the test set from the new harvest. (E) Fine-tuned DL model on the test set from the new harvest
TABLE 3 A summary of the performance of DL and PLS models after updating with some new data measured on the new instrument
                           DL model (RMSEP DM %)   PLS model (RMSEP DM %)
Before transfer learning   0.619                   0.754
After transfer learning    0.518                   0.663
FIGURE 7 A summary of the posterior analysis of the effect of sample size on the RMSEP for transfer learning modelling
In the earlier part of this study, the performance of TL was shown using a fine-tune set of 60% of the data measured on
the new instrument, that is, roughly 300 spectral and reference DM measurements. However, in
practice, the user may want to measure as few samples as possible to make TL a profitable and time-saving
task. Therefore, a posterior analysis of the effect of the sample size on the performance of the TL
model on the independent test set measured on the new instrument was performed. Figure 7 shows that,
although nearly 300 spectral measurements were used for the TL modelling in this work, the reduction in the
RMSEP of the TL model stabilised at nearly 150 samples, suggesting that TL can be performed with a smaller number
of samples.
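One simple way to read such a sample-size curve is to look for the smallest fine-tune set after which adding more samples no longer lowers the RMSEP by more than a chosen tolerance. The sketch below illustrates that rule under stated assumptions: the stopping rule, the tolerance and the function name are hypothetical, and the curve values are invented for illustration rather than read from Figure 7.

```python
def stabilisation_point(sizes, rmsep, tol=0.01):
    """Smallest sample size whose RMSEP is within `tol` of every RMSEP
    obtained with more samples (assumed stopping rule; sizes ascending)."""
    if not sizes or len(sizes) != len(rmsep):
        raise ValueError("sizes and rmsep must be non-empty and of equal length")
    for i, size in enumerate(sizes):
        # The last size always qualifies because no larger size remains.
        if all(rmsep[i] - later <= tol for later in rmsep[i + 1:]):
            return size


# Hypothetical learning curve (illustrative numbers, not read from Figure 7).
sizes = [25, 50, 100, 150, 200, 250, 300]
rmsep = [0.620, 0.600, 0.570, 0.530, 0.525, 0.523, 0.521]
print(stabilisation_point(sizes, rmsep))
```

The tolerance should be chosen with the uncertainty of the reference DM measurements in mind, since differences smaller than that uncertainty are unlikely to be meaningful.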
4 | CONCLUSIONS
This study performed an independent validation of a global DL mango NIR model²⁶ with a local experiment on mangoes
of a new season, measured with a different instrument. The primary results showed that a direct application of the
DL model to the new-season, new-instrument data performed worse than its performance reported
previously on the test set of the open-access mango data set. In contrast, the PLS model performed better than the DL
model when used on data measured with the new instrument. This inferior performance suggests that the
DL model was much more sensitive to the seasonal/instrument change than traditional latent-variable-based
approaches such as PLS. Two main causes were hypothesised for the inferior performance of the DL model. The
first was that the instrument used in this study was different from the one used to measure the data on which
the primary DL model was developed. The second was that the primary model²⁶ was not tuned to cover the seasonal
variability. In this study, to cover the seasonal variability, a new DL model was developed using the open-access
mango data set but with the data from the 2018 season as the tuning set. Later, the new DL model was transferred
to the new instrument in a standard-free manner using transfer learning. The DL model reached the lowest
RMSEP = 0.518%, which was better than making a new PLS model or recalibrating the old PLS model. Although the
difference in the RMSEPs of the DL and PLS models may not be of high scientific significance considering the high
uncertainty associated with NIR measurements, in practical use the user usually prefers models with
lower errors. DL modelling can become a practical tool for NIR modelling of fresh fruit properties. For
example, the DL model developed in this study can be used on any similar NIR instrument with a basic step of transfer
learning to adjust the model to the local features of the new instrument. DL modelling can increase the scalability and
wider usage of NIR models; in particular, it can support portable spectroscopy, where recalibrating each instrument from
scratch can be impractical. Future studies will include validating and updating the model in future harvest seasons.
PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/cem.3367.
DATA AVAILABILITY STATEMENT
Made available on genuine request.
ORCID
Puneet Mishra https://orcid.org/0000-0001-8895-798X
Dário Passos https://orcid.org/0000-0002-5345-5119
REFERENCES
1. Monago-Maraña O, Domínguez-Manzano J, Muñoz de la Peña A, Durán-Merás I. Second-order calibration in combination with
fluorescence fibre-optic data modelling as a novel approach for monitoring the maturation stage of plums. Chemom Intel Lab Syst.
2020;199:103980.
2. Liu C, Yang SX, Li X, Xu L, Deng L. Noise level penalizing robust Gaussian process regression for NIR spectroscopy quantitative
analysis. Chemom Intel Lab Syst. 2020;201:104014.
3. Jia B, Wang W, Ni X, et al. Essential processing methods of hyperspectral images of agricultural and food products. Chemom Intel Lab
Syst. 2020;198:103936.
4. Ye D, Sun L, Tan W, Che W, Yang M. Detecting and classifying minor bruised potato based on hyperspectral imaging. Chemom Intel
Lab Syst. 2018;177:129-139.
5. Walsh KB, McGlone VA, Han DH. The uses of near infra-red spectroscopy in postharvest decision support: a review. Postharvest Biol
Technol. 2020;163:111139.
6. Walsh KB, Blasco J, Zude-Sasse M, Sun X. Visible-NIR point spectroscopy in postharvest fruit and vegetable assessment: the science
behind three decades of commercial use. Postharvest Biol Technol. 2020;168:111246.
7. Saeys W, Do Trong NN, Van Beers R, Nicolai BM. Multivariate calibration of spectroscopic sensors for postharvest quality evaluation: a
review. Postharvest Biol Technol. 2019;158:110981.
8. Nicolai BM, Beullens K, Bobelyn E, et al. Nondestructive measurement of fruit and vegetable quality by means of NIR spectroscopy: a
review. Postharvest Biol Technol. 2007;46(2):99-118.
9. Mishra P, Lohumi S, Ahmad Khan H, Nordon A. Close-range hyperspectral imaging of whole plants for digital phenotyping: recent
applications and illumination correction approaches. Comput Electron Agric. 2020;178:105780.
10. Mishra P, Asaari MSM, Herrero-Langreo A, Lohumi S, Diezma B, Scheunders P. Close range hyperspectral imaging of plants: a review.
Biosyst Eng. 2017;164:49-67.
11. Mishra P, Woltering E, Brouwer B, Hogeveen-van Echtelt E. Improving moisture and soluble solids content prediction in pear fruit using
near-infrared spectroscopy with variable selection and model updating approach. Postharvest Biol Technol. 2021;171:111348.
12. Mishra P, Marini F, Brouwer B, et al. Sequential fusion of information from two portable spectrometers for improved prediction of
moisture and soluble solids content in pear fruit. Talanta. 2021;223:121733.
13. Mazhar M, Joyce D, Lisle A, Collins R, Hofman P. Comparison of firmness meters for measuring 'Hass' avocado fruit firmness.
Leuven, Belgium: International Society for Horticultural Science (ISHS); 2016:163-170.
14. Li M, Qian ZQ, Shi BW, Medlicott J, East A. Evaluating the performance of a consumer scale SCiO (TM) molecular sensor to predict
quality of horticultural products. Postharvest Biol Technol. 2018;145:183-192.
15. Crocombe RA. Portable Spectroscopy. Appl Spectrosc. 2018;72(12):1701-1751.
16. Porep JU, Kammerer DR, Carle R. On-line application of near infrared (NIR) spectroscopy in food production. Trends Food Sci Technol.
2015;46(2):211-230.
17. Mishra P, Roger JM, Rutledge DN, Woltering E. Two standard-free approaches to correct for external influences on near-infrared spectra
to make models widely applicable. Postharvest Biol Technol. 2020;170:111326.
18. Mishra P, Roger JM, Marini F, Biancolillo A, Rutledge DN. FRUITNIR-GUI: a graphical user interface for correcting external influences
in multi-batch near infrared experiments related to fruit quality prediction. Postharvest Biol Technol. 2020;175:111414.
19. Mishra P, Nikzad-Langerodi R. Partial least square regression versus domain invariant partial least square regression with application to
near-infrared spectroscopy of fresh fruit. Infrared Phys Technol. 2020;111:103547.
20. Zheng W, Bai Y, Luo H, Li Y, Yang X, Zhang B. Self-adaptive models for predicting soluble solid content of blueberries with biological
variability by using near-infrared spectroscopy and chemometrics. Postharvest Biol Technol. 2020;169:111286.
21. Bobelyn E, Serban A-S, Nicu M, Lammertyn J, Nicolai BM, Saeys W. Postharvest quality of apple predicted by NIR-spectroscopy: study
of the effect of biological variability on spectra and model performance. Postharvest Biol Technol. 2010;55(3):133-143.
22. Anderson NT, Walsh KB, Flynn JR, Walsh JP. Achieving robustness across season, location and cultivar for a NIRS model for intact
mango fruit dry matter content. II. Local PLS and nonlinear models. Postharvest Biol Technol. 2021;171:111358.
23. Anderson NT, Walsh KB, Subedi PP, Hayes CH. Achieving robustness across season, location and cultivar for a NIRS model for intact
mango fruit dry matter content. Postharvest Biol Technol. 2020;168:111202.
24. Pasquini C. Near infrared spectroscopy: a mature analytical technique with new perspectives - a review. Anal Chim Acta. 2018;1026:
8-36.
25. Anderson N, Walsh K, Subedi P. Mango DMC and spectra, Anderson et al. Mendeley Data. 2020.
26. Mishra P, Passos D. A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared
spectroscopy models for dry matter prediction in mango fruit. Chemom Intel Lab Syst. 2021;212:104287.
27. Mishra P, Passos D. Realizing transfer learning for updating deep learning models of spectral data to be used in a new scenario.
Chemom Intel Lab Syst. 2021:104283.
28. Mishra P, Nikzad-Langerodi R, Marini F, et al. Are standard sample measurements still needed to transfer multivariate calibration
models between near-infrared spectrometers? The answer is not always. TrAC Trends Anal Chem. 2021;143:116331.
29. Savitzky A, Golay MJE. Smoothing and Differentiation of Data by Simplified Least Squares Procedures. Anal Chem. 1964;36(8):
1627-1639.
30. Barnes RJ, Dhanoa MS, Lister SJ. Standard Normal Variate Transformation and De-Trending of Near-Infrared Diffuse Reflectance
Spectra. Appl Spectrosc. 1989;43(5):772-777.
31. Cui C, Fearn T. Modern practical convolutional neural networks for multivariate regression: applications to NIR calibration. Chemom
Intel Lab Syst. 2018;182:9-20.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
How to cite this article: Mishra P, Passos D. Deep chemometrics: Validation and transfer of a global deep near-infrared
fruit model to use it on a new portable instrument. Journal of Chemometrics. 2021;e3367. https://doi.org/10.1002/cem.3367
12 of 12 MISHRA AND PASSOS
... Development and validation of precalibrated models stored in cloud services should be mentioned, where the robustness (i.e., universality) of the models is a critical factor; however, this development remains mostly the proprietary intellectual property of instrument vendors. On the other hand, recent attention in NIR spectroscopy has been increasingly directed toward the application of deep neural networks (i.e., deep learning or "deep chemometrics") for prediction and classification purposes [74][75][76]. However, to date, only scarce literature has appeared that studied the potential benefits of applying deep-learning methods to reinforce the analytical framework of miniaturized NIR sensors [75]. ...
... On the other hand, recent attention in NIR spectroscopy has been increasingly directed toward the application of deep neural networks (i.e., deep learning or "deep chemometrics") for prediction and classification purposes [74][75][76]. However, to date, only scarce literature has appeared that studied the potential benefits of applying deep-learning methods to reinforce the analytical framework of miniaturized NIR sensors [75]. However, it seems plausible that deep networks could provide benefits in processing challenging data sets; e.g., those of complex samples measured by ultraminiaturized NIR spectrometers. ...
Article
Full-text available
The ongoing miniaturization of spectrometers creates a perfect synergy with the common advantages of near-infrared (NIR) spectroscopy, which together provide particularly significant benefits in the field of food analysis. The combination of portability and direct onsite application with high throughput and a noninvasive way of analysis is a decisive advantage in the food industry, which features a diverse production and supply chain. A miniaturized NIR analytical framework is readily applicable to combat various food safety risks, where compromised quality may result from an accidental or intentional (i.e., food fraud) origin. In this review, the characteristics of miniaturized NIR sensors are discussed in comparison to benchtop laboratory spectrometers regarding their performance, applicability, and optimization of methodology. Miniaturized NIR spectrometers remarkably increase the flexibility of analysis; however, various factors affect the performance of these devices in different analytical scenarios. Currently, it is a focused research direction to perform systematic evaluation studies of the accuracy and reliability of various miniaturized spectrometers that are based on different technologies; e.g., Fourier transform (FT)-NIR, micro-optoelectro-mechanical system (MOEMS)-based Hadamard mask, or linear variable filter (LVF) coupled with an array detector, among others. Progressing technology has been accompanied by innovative data-analysis methods integrated into the package of a micro-NIR analytical framework to improve its accuracy, reliability, and applicability. Advanced calibration methods (e.g., artificial neural networks (ANN) and nonlinear regression) directly improve the performance of miniaturized instruments in challenging analyses, and balance the accuracy of these instruments toward laboratory spectrometers. 
The quantum-mechanical simulation of NIR spectra reveals the wavenumber regions where the best-correlated spectral information resides and unveils the interactions of the target analyte with the surrounding matrix, ultimately enhancing the information gathered from the NIR spectra. A data-fusion framework offers a combination of spectral information from sensors that operate in different wavelength regions and enables parallelization of spectral pretreatments. This set of methods enables the intelligent design of future NIR analyses using miniaturized instruments, which is critically important for samples with a complex matrix typical of food raw material and shelf products.
... Beyond these two classical algorithms, basically every other machine learning (ML) algorithm can be, and has been, used to model spectral data. A more modern modelling alternative is provided by deep learning (DL) algorithms that, for many tasks, have recently outperformed other machine learning and chemometric approaches in terms of model performances [19][20][21][22][23][24][25][26][27]. Recent works in this field show that DL has the potential to cut the need for pre-processing optimization and variable selection [28,29], both key steps in any Vis-NIR spectral analysis pipeline [30][31][32]. ...
... fully connected, FC) layers. This model is similar to the usual CNNs used in computer vision for image processing, is "1D" because the input is spectra (vectors) and was chosen for its simplicity and its proven predictive power when applied to several spectroscopic and chemometric problems [19][20][21][22][23][26][27][28]. It is important to highlight that the level of complexity of a DL model should be adapted to the task at hand. ...
Article
Full-text available
Deep spectral modelling for regression and classification is gaining popularity in the chemometrics domain. A major topic in the deep learning (DL) modelling of spectral data is the choice and optimization of the deep neural network architecture suitable for the specific task of spectral modelling. Although there are several recent research articles already available in the chemometric domain showing advanced approaches to deep spectral modelling, currently, there is a lack of hands-on tutorial articles in this space that supply the non-expert user with practical tools to learn and implement advanced DL optimization methodologies aimed at spectral data. Hence, this tutorial article aims at reducing the gap between the non-expert user of DL in the chemometric community and the implementation of DL models for daily usage. This tutorial supplies a quick introduction to the state-of-the-art deep spectral modelling and related DL concepts and presents a set of methodologies aimed at DL hyperparameters’ optimization. To this end, this tutorial shows two practical examples on how to implement and optimize two DL models for spectral regression and classification tasks. The models are implemented in python and Tensorflow and the complete code is supplied in the form of two complementary notebooks.
... Please note that the mango data set is based on mango samples measured in Australia. The local mango data set measured with a similar but different instrument was the same as used in Ref. [43] and consisted of spectral and moisture measurements performed on 501 mango fruit of 'Keitt' and 'Kent' cultivar performed at Wageningen University & Research, The Netherlands. The instrument model used for the acquisition of the open-access and local data set was a portable fruit spectrometer (Felix F-750, Camas, WA, USA). ...
... After drying, the dried fruit weight was measured, and the MC was estimated and expressed in %. The spectral range used for modelling was the same (684-990 nm) as recommended in earlier publications [40,42,43] on the open-access mango data set. Please note that the original free access mango data set has more than 10k measurements. ...
Article
Full-text available
Multivariate spectral signals are highly correlated. Often, variable selection techniques are deployed, aiming at model optimization, identification of key variables to explore the underlying physicochemical system or development of a cheap multi-spectral system based on key variables. However, many times the selected variables do not supply a good estimate of properties when tested on a new setting such as new measurements performed on a different spectrometer, different physical or chemical state of the samples and difference in the environmental factors around the experiment. Often the model based on variables selected in the first domain (specific conditions/instrument) does not generalize on the new domain (specific conditions/instrument). To deal with it, in the present work a new method to variable selection called domain invariant covariate selection (di-CovSel) is proposed. The method selects the most informative variables which are invariant to the differences in the instruments, physical or chemical state of the samples and the differences in the environmental factors around the experiment. The method is inspired by domain invariant partial least-square (di-PLS) and the covariate selection (CovSel). The potential of the method is demonstrated on four real cases related to the calibration of near-infrared (NIR) spectroscopy on agri-food materials. The results show that in all the cases, the domain invariant features selected by the di-CovSel have low prediction error compared to the standard variable selection with the CovSel approach when the models are tested on a new data domain. In summary, domain invariant features selected across domains support the development of calibration models with good generalization and supply a better understanding of the system by bypassing the external factors originating from differences in the instruments, physical or chemical states of the samples and the differences in the environmental factors around the experiment. 
Note that one key feature of the proposed method is that the most important variables which generalize well across domains can be identified without requiring reference measurements in the target domain.
... In some earlier studies deep learning based approaches such as stacked autoencoders were used (Yu et al., 2018b,a), however, those studies mainly used deep learning to perform predictive modelling to replace PLS based analysis. However, in some recent studies, it has been shown that for spectral data modelling PLS approach is more generalised compared to the current state-of-the-art deep learning approaches such as one-dimensional convolutional neural networks (Mishra and Passos, 2021) and stacked autoencoders (Mishra et al., 2021c). This was also the motivation that this study only used deep learning for object detection and not for predictive modelling. ...
Article
Full-text available
A novel case of combining deep learning and chemometrics for spectral image processing is presented. The case involved the application of deep transfer learning for detecting and locating the fruit centroid to extract pixels for spectral model development and application. The selected fruit case involved a non-symmetrical fruit pear where the interesting area for spectral model application is not the centroid of the whole fruit unlike fruit such as apples but the centroid of the belly part of the pear fruit. Hence, the task of object detection is replaced with the task of symmetrical region (fruit belly) detection on the pear fruit such that the spectral model can be applied in the centroid pixels of the symmetrical region. For spectral modelling, the latent variables based regression technique called partial least-square (PLS) regression was used. For spectral modelling, PLS was preferred over deep learning as there was a low number of samples points to train a deep spectral model. The deep transfer learning allowed 100 % correct detection of the pear fruit belly part with the intersection over union score of 0.82. Furthermore, the RMSEP = 0.77 % was attained with the PLS modelling to predict dry matter. The presented approach can support the wide application of spectral imaging for fresh fruit analysis, particularly when imaging is performed simultaneously on multiple objects and the objects are non-symmetrical in shape.
... For four out of five data sets, the data has already been used in earlier publications by different authors around the world. For example, the Mango data set (cultivar Keitt and Kent) was the same as the validation set used in Ref. [47], the Pear data set (cultivar Conference) was the same as used in Ref. [48], the Apple data set (cultivar Cripps Pink, Fuji, Gala, Golden Delicious and Honeycrisp) was sourced from Ref. [49] and the Olive data set was sourced from Ref. [50]. The Avocado (cultivar Hass) data set was a new data set measured in this study and involved measurement of NIR spectra followed by dry matter measurements using the oven drying method of the avocado flesh. ...
Article
Full-text available
Ensemble pre-processing is emerging as a potential tool to avoid the tiring pre-processing selection and optimization task in near-infrared (NIR) spectral modelling. Furthermore, differently pre-processed data may carry complementary information, hence, ensemble pre-processing may represent the best suited modelling option to extract all the useful information from differently pre-processed data. Recently, multi-block techniques such as sequential (SPORT) and parallel (PORTO) orthogonalized partial least squares regression were proposed to extract complementary information present in differently pre-processed data. Although such multi-block techniques allowed efficient modelling of differently pre-processed data blocks, depending on the approach, challenges related to choosing block order, parameter tuning, block scaling and optimization time requirements still must be dealt with. To cope with such issues, the present study proposes the use of a recently developed faster, block order independent and scale independent, multi-block data modelling technique called response-oriented sequential alternation (ROSA) to process the multi-block data generated by differently pre-processing the same NIR data. This new method is called PROSAC, i.e., pre-processing ensembles with ROSA calibration. The potential of the approach is demonstrated on five real NIR spectral datasets. Furthermore, as baselines for comparison, partial least squares regression was done on individually pre-processed data sets, and using two multi-block pre-processing fusion approaches, i.e., SPORT and PORTO. The ensemble pre-processing with ROSA achieved either better performance compared to the baseline methods or achieved comparable performance without the need to worry about the pre-processing order, the scaling of data after pre-processing and optimization time requirements. PROSAC can be considered as a general tool for the ensemble pre-processing for NIR data modelling.
Article
Absorption cross-sections provide a basis for many gas sensing applications. Therefore, any error in molecular cross-sections caused by varying environmental conditions propagates to spectroscopic applications. Original molecular cross-sections in varying environmental conditions can only be simulated for some molecules, whereas for most multi-atom molecules, one must rely on high-precision measurements at certain environmental configurations. In this study, a deep learning system trained with simulated absorption cross-sections for predicting cross-sections at a different pressure configuration is presented. The system’s capability to transfer to measured, multi-atom cross-sections is demonstrated. Thus, it provides an alternative to (pseudo-) line lists whenever the required information for simulation is unavailable. The predictive performance of the system was evaluated on validation data via simulation, and its transfer learning capabilities were demonstrated on actual measurement chlorine nitrate data. From the comparison between the system and line lists, the system shows slightly worse performance than pseudo-line lists but its predictive quality is still deemed acceptable with less than 5% relative integral change with a highly localized error around the peak center. This opens a promising way for further research to use deep learning to simulate the effect of varying environmental conditions on absorption cross-sections.
Article
Full-text available
Deep learning (DL) being popularly used in computer vision applications is still in its early stage in chemometric domain for spectral image processing. Often the challenge is that there are too few samples from analytical laboratory experiments to preform DL. In this study, we present a novel combination of DL and chemometrics to process spectral images even with as few as < 100 spectral images. We divided the image processing part such as object detection and recognition as the DL task and prediction of chemical property as the chemometric task based on latent space modelling. For image processing tasks of object detection and recognition, transfer learning was performed on the pretrained YOLOv4 object detection network weights to adapt the model to work well on spectral images captured in laboratory settings. Once the object is identified with DL, a background query is performed for the pre-built chemometric models to select the model for predicting the properties for specific object. The obtained results showed good potential of using DL and chemometric approaches in conjunction to reap the best of both scientific domains. This approach is of high interest to whoever involved in spectral imaging and dealing with object detection and physicochemical properties prediction of the samples with chemometric approaches.
Article
Calibration transfer (CT) refers to the set of chemometric techniques used to transfer (near-infrared) calibration models between spectrometers. The requirement of traditional CT methods to measure calibration standard samples has been a challenge, as such measurements are difficult in real-world applications, e.g. when the instruments are located far apart or chemically stable standard samples are not available. In recent years, major developments have taken place in the domain of CT; hence, this work provides a concise but critical review of the main recent chemometric techniques available to perform CT. In particular, this work explains some newer concepts for standard-free CT, where standard samples are not required to achieve the transfer. We conclude that CT approaches that do not rely on standard sample measurements hold promise to help make calibration models sharable between similar analytical devices and to increase the applicability of CT to real-world problems in the analytical sciences.
Article
This study provides an innovative approach to improve deep learning (DL) models for spectral data processing with the use of chemometrics knowledge. The technique proposes pre-filtering outliers using the Hotelling’s T² and Q statistics obtained with partial least-squares (PLS) analysis, and augmenting the spectral data in the variable domain, to improve the predictive performance of DL models built on spectral data. The data augmentation is carried out by stacking the same data pre-processed with several pre-processing techniques such as standard normal variate, 1st derivatives, 2nd derivatives and their combinations. The performance of the approach is demonstrated on a real near-infrared (NIR) data set related to dry matter (DM) prediction in mango fruit. The data set consisted of a total of 11,961 spectra and reference DM measurements. The results showed that removing the outliers and augmenting the spectral data improved the predictive performance of DL models. Furthermore, this approach attained the lowest root mean squared error of prediction (RMSEP) reported on the mango data set, i.e., 0.79 % compared to the best previously known RMSEP of 0.84 %. Further, by removing outliers from the test set, the RMSEP decreased to 0.75 %. Several chemometrics approaches can complement DL models and should be widely explored in conjunction.
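The variable-domain augmentation described in this abstract can be illustrated with a minimal sketch. It assumes NumPy, synthetic spectra, and a plain finite-difference derivative as a stand-in for whatever derivative filter the authors used; the function names and the particular set of stacked blocks are hypothetical choices for illustration, not the published pipeline.

```python
import numpy as np

def snv(X):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def derivative(X, order=1):
    """Finite-difference derivative along the wavelength axis.

    Assumption: a simple stand-in for a smoothing derivative filter.
    """
    D = X
    for _ in range(order):
        D = np.gradient(D, axis=1)
    return D

def augment_variable_domain(X):
    """Stack several pre-processed copies of the same spectra side by side."""
    blocks = [
        X,                        # raw spectra
        snv(X),                   # scatter-corrected
        derivative(X, 1),         # 1st derivative
        derivative(X, 2),         # 2nd derivative
        derivative(snv(X), 1),    # SNV followed by 1st derivative
        derivative(snv(X), 2),    # SNV followed by 2nd derivative
    ]
    return np.hstack(blocks)

rng = np.random.default_rng(0)
# 5 synthetic "spectra" with 50 wavelength channels each (illustrative only)
X = rng.random((5, 50)) + np.linspace(0.0, 1.0, 50)
X_aug = augment_variable_domain(X)
print(X.shape, "->", X_aug.shape)  # (5, 50) -> (5, 300)
```

The number of samples is unchanged; only the number of variables per sample grows, which is what distinguishes this from augmentation in the sample domain.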
Article
This study presents the concept of transfer learning (TL) to the chemometrics community for updating DL models related to spectral data, particularly when a pre-trained DL model needs to be used in a scenario with unseen variability. This is the typical situation where classical chemometrics models require some form of re-calibration or update. In TL, the network architecture and weights from the pre-trained model are complemented by adding extra fully connected (FC) layers when dealing with the new data. Such extra FC layers are expected to learn the variability of the new scenario and adjust the output of the main architecture. Three approaches to TL were compared. In the first, the weights from the initial model were kept fixed and only the newly added FC layers could be trained. In the second, the weights from the initial model could be retrained alongside the new FC layers. In the third, the weights from the initial model could be retrained with no extra FC layers added. TL was demonstrated using two real cases related to near-infrared spectroscopy, i.e., mango fruit analysis and melamine production monitoring. In the mango case, the model needs to be updated to cover new seasonal variability for dry matter prediction, while in the melamine case, the model needs to be updated for a change in the recipe of the production material. The results showed that the proposed TL approaches successfully updated the DL models to new scenarios for both the mango and melamine cases. TL performed better when the weights from the old model could be retrained. Furthermore, TL outperformed three recent benchmark approaches to model updating. TL has the potential to make DL models widely usable, sharable, and scalable.
Article
Near infrared (NIR) spectroscopy is widely used for non-destructive prediction of fruit traits. Common traits such as dry matter (DM) and soluble solids content (SSC) can be predicted with reliable accuracy. However, the main problem with NIR spectroscopy is that a model developed on one batch may not perform well when tested on other batches. The reasons are the physical, chemical and environmental differences between the experiments performed in different batches. To deal with these issues, approaches such as variable selection, dynamic orthogonal projection (DOP) and transfer component analysis (TCA) can be used. Although these techniques are known, it is rarely possible for a new user or non-specialist to implement them in practical situations. To overcome this limitation, for the first time, a graphical user interface-based toolbox (FRUITNIR-GUI) for basic chemometric data processing (regression and variable selection) is developed and presented. The GUI allows model adaptation and maintenance in the context of multi-batch NIR spectroscopic experiments related to fruit. Furthermore, a case study demonstrating its effectiveness in correcting for seasonality when predicting DM in apples is presented. The toolbox provides a push-button approach to build chemometric models of varying complexity for the characterization of fruit quality. Moreover, approaches such as variable selection and batch correction with DOP and TCA can improve model performance on new batches. FRUITNIR-GUI can be freely downloaded at https://github.com/puneetmishra2/FRUITNIR and run using the password “welovenirs” (without quotation marks).
Article
Calibration models required for near-infrared (NIR) spectroscopy-based analysis of fresh fruit frequently fail to extrapolate adequately to conditions not encountered during initial data acquisition. Such different conditions can be due to physical, chemical or environmental effects and might be encountered, for instance, when measurements are carried out on a new instrument, at different sensor operating temperatures, or if the model is applied to samples harvested under different seasonal conditions. To cope with such changes efficiently, this study investigates the application of domain-invariant partial least squares (di-PLS) regression to obtain calibration models that maintain their performance when used under a new condition. In particular, di-PLS allows unsupervised adaptation of a calibration model to a new condition, i.e. without the need for reference measurements (e.g. dry matter contents) for the samples analyzed under the new condition. The potential of di-PLS for compensating for instrument, season and sensor temperature changes is demonstrated on four different use cases in the realm of NIR-based fruit quality assessment. The results showed that di-PLS regression outperformed standard PLS regression when tested on data affected by the aforementioned factors. The prediction R² increased by up to 67 %, with a 46 % and 80 % decrease in RMSEP and prediction bias, respectively. The main limitation of di-PLS is that, to operate efficiently, it requires the distribution of the response variables to be similar in the data from the different conditions.
Article
Near-infrared (NIR) spectroscopy allows rapid estimation of quality traits in fresh fruit. Several portable spectrometers are available on the market as a low-cost solution to perform NIR spectroscopy. However, portable spectrometers, being lower in cost than their benchtop counterparts, do not cover the complete NIR spectral range. Portable sensors often use either a silicon-based visible and NIR detector covering 400-1000 nm, or an InGaAs-based short wave infrared (SWIR) detector covering 900-1700 nm. However, these two spectral regions carry complementary information, since the 400-1000 nm interval captures the color and 3rd overtones of most functional group vibrations, while the 1st and the 2nd overtones of the same transitions fall in the 1000-1700 nm range. To exploit such complementarity, sequential data fusion strategies were used to fuse the data from two portable spectrometers, i.e., Felix F750 (∼400-1000 nm) and the DLP NIR Scan Nano (∼900-1700 nm). In particular, two different sequential fusion approaches were used, namely sequential orthogonalized partial least squares (SO-PLS) regression and sequential orthogonalized covariate selection (SO-CovSel). SO-PLS improved the prediction of moisture content (MC) and soluble solids content (SSC) in pear fruit, leading to an accuracy that was not obtainable with models built on either of the two spectral data sets individually. SO-CovSel, in turn, was used to select the key wavelengths from both spectral ranges that were most correlated with the quality parameters of pear fruit. Sequential fusion of the data from the two portable spectrometers led to improved model prediction (higher R² and lower RMSEP) of MC and SSC in pear fruit: compared to the models built with the DLP NIR Scan Nano alone (the worst individual block), SO-PLS showed an increase in R²p of up to 56% and a corresponding 47% decrease in RMSEP. 
Differences were less pronounced compared with the use of Felix data alone, but the R²p still increased by 2.5% and the RMSEP was reduced by 6.5 %. Sequential data fusion is not limited to NIR data and can be considered a general tool for integrating information from multiple sensors.
Article
To obtain robust near-infrared (NIR) spectroscopy calibration models, variable selection and model updating with recalibration were used for predicting quality parameters in pear fruit. For variable selection, interval partial least-squares regression and covariate selection approaches were used and compared. Model updating with recalibration was performed by incorporating a few new samples into the calibration set of existing batch data. The interaction of variable selection and model updating was also explored. The results showed that, with variable selection, the model performance when tested on a new independent batch of fruit was greatly improved. Further, model updating with only a few new samples resulted in a reduction of the bias when tested on the new batch. In the case of MC prediction, variable selection reduced the bias from 1.31 % to 0.19 % and the RMSEP from 1.44 % to 0.58 %, compared to standard partial least-squares regression (PLS2R). In the case of SSC prediction, variable selection reduced the bias from -0.62 % to 0.07 % and the RMSEP from 0.90 % to 0.63 %, compared to standard PLS2R. With a combination of variable selection and model updating, the bias and RMSEP were further reduced. The interval-based method performed better than the filter-based method. As few as 10 samples from the new batch already led to a significant improvement in model performance. In the case of MC, the spectral regions of 749-759 nm and 879-939 nm were identified as the most important. In the case of SSC, 709-759 nm and 789-999 nm were found to be the important spectral regions. Robust models built on selected variables, combined with a model updating strategy, can help make NIR spectroscopy a preferred choice for non-destructive assessment of quality features of fresh fruit.
Article
Digital plant phenotyping is emerging as a key research domain at the interface of information technology and plant science. Digital phenotyping aims to deploy high-end non-destructive sensing techniques and information technology infrastructures to automate the extraction of both structural and physiological traits from plants under phenotyping experiments. One of the promising sensor technologies for plant phenotyping is hyperspectral imaging (HSI). The main benefit of utilising HSI compared to other imaging techniques is the possibility to extract structural and physiological information on plants simultaneously. The use of HSI for analysis of parts of plants, e.g. plucked leaves, has already been demonstrated. However, there are several significant challenges associated with the use of HSI for extraction of information from a whole plant, and hence this is an active area of research. These challenges are related to data processing after image acquisition. Hyperspectral data acquired from a plant suffer from variations in illumination owing to light scattering, shadowing of plant parts, multiple scattering and a complex combination of scattering and shadowing. The extent of these effects depends on the type of plant and its complex geometry. A range of approaches has been introduced to deal with these effects; however, no definitive approach is available yet. In this article, we provide a comprehensive review of recent studies of close-range HSI of whole plants. Several studies have used HSI for plant analysis but were limited to imaging of leaves, which is considerably more straightforward than imaging of the whole plant, and thus do not relate to digital phenotyping. In this article, we discuss and compare the approaches used to deal with the effects of variation in illumination, which are an issue for imaging of whole plants. Furthermore, future possibilities to deal with these effects are also highlighted.
Article
In near-infrared (NIR) spectroscopy of fresh fruit, external influences due to differences in physical, chemical and environmental conditions often lead to model failure. Correction methods usually require that standard samples be measured under all the different conditions, after which remodeling is performed. However, in the real world, it is often difficult to measure standard samples. To deal with this, two different approaches to correct for external influences without standard sample measurements, i.e., dynamic orthogonalization projection (DOP) and domain adaption (DA), are presented and, for the first time, applied to NIR spectroscopy of fresh fruit. Four different case studies, chosen based on their importance and their frequency of occurrence in the NIR spectroscopy domain, were used for the demonstration. The first case was an adaptation to maintain the predictive performance of a model when used on spectra from a second, similar instrument. The second case was the correction of temperature effects due to sensor heating. The third and fourth cases concerned maintaining model performance for multi-season fruit quality prediction models for mangoes and for apples. In all of the cases, the aim was to solve the challenges without resorting to new measurements of standards. The results showed that, in all cases, both DOP and DA improved model performance. Up to a 31% increase in R²p, and 98% and 66% reductions in prediction bias and root mean squared error (RMSE) of prediction, respectively, were noted. The main benefit of the DOP and DA techniques in NIR spectroscopy is the limited need for standard measurements, providing general-purpose tools to complement NIR spectroscopy and make the models scalable, transferable, and reusable.
Article
A range of modelling techniques was used in the estimation of dry matter content of intact mango fruit from short wave near infrared spectra, collected using an interactance geometry, with models developed on a data set collected across three seasons (n = 10,243) and tested on that of a fourth season (n = 1,448). Model types included Artificial Neural Network (ANN), Gaussian Process Regression (GPR), Local Optimized by Variance Regression (LOVR), Local Partial Least Squares Regression (LPLS), Local PLS Scores (LPLS-S) and Memory Based Learner (MBL), with manual tuning of parameters undertaken. Additionally, two commercially available cloud-based chemometric packages for automated model development were trialled. All of these models gave a better result than use of a global PLS model. The best result (lowest RMSEP) was achieved with an ensemble of ANN, GPR and LPLS-S, with the best individual model result achieved by LOVR, with RMSEP of 0.839 % and 0.881 %, respectively, compared to the global PLS result of 1.014 %. The best precision was achieved with the LPLS model, with a SEP of 0.846 %, compared to the global PLS result of 1.012 %. LOVR was more than twice as fast as a generalized latent variable selection method, LPLS-S-cv, in prediction of the independent validation set (58.7 × 10⁻³ s compared to 163 × 10⁻³ s). The ANN model was satisfactory in all categories (prediction speed, model build speed, and prediction statistics) and insensitive to tuning, e.g., 33 of the 70 parameter combinations were within 0.05 units of RMSEP of the minimum combination. However, the ANN learning rate was low. For applications that require 'real-time' prediction, such as fruit packlines, use of ANN and GPR models is recommended. For non-cloud-based handheld NIR devices lacking the computational power to perform local modelling, ANN is recommended, and LOVR or a model ensemble is recommended for cloud-based implementations. 
The automated cloud-based systems performed well (RMSEP of 0.850 % and 0.963 % for Hone Create and DataRobot, ensemble models, respectively), without human intervention for the choosing and tuning of models.
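Because this abstract reports both RMSEP and SEP, a short sketch of how the two statistics relate may help. It assumes the common chemometric definitions (RMSEP over all n prediction errors, bias as the mean error, SEP as the bias-corrected standard deviation with an n − 1 denominator); the function name and the example values are hypothetical and are not taken from the study.

```python
import math

def prediction_stats(y_true, y_pred):
    """Return (RMSEP, bias, SEP) for paired reference and predicted values."""
    n = len(y_true)
    errors = [p - t for p, t in zip(y_pred, y_true)]
    bias = sum(errors) / n                                   # mean prediction error
    rmsep = math.sqrt(sum(e * e for e in errors) / n)        # total error
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))  # bias-corrected
    return rmsep, bias, sep

# Hypothetical dry matter values in %, for illustration only
y_true = [14.0, 15.5, 16.2, 17.8, 18.1]
y_pred = [14.5, 15.9, 16.5, 18.6, 18.4]

rmsep, bias, sep = prediction_stats(y_true, y_pred)
n = len(y_true)

# Under these definitions: RMSEP^2 = bias^2 + (n - 1) / n * SEP^2
identity_gap = rmsep ** 2 - (bias ** 2 + (n - 1) / n * sep ** 2)
print(f"RMSEP={rmsep:.3f}  bias={bias:.3f}  SEP={sep:.3f}  gap={identity_gap:.1e}")
```

Under these definitions, RMSEP and SEP nearly coincide when the bias is small and the test set is large, which is consistent with the global PLS figures quoted above (1.014 % and 1.012 %).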