Authentication and quality control of complex samples, such as natural products, present significant challenges in analytical chemistry. Traditionally, authentication and quality control are performed through targeted approaches (profiling), in which several quality marker compounds are determined for each natural product. In contrast, natural product metabolomic studies can use the whole signal recorded by the analytical instrumentation. This signal represents the fingerprint of the analysed natural product and contains information on its whole metabolome. Fingerprints can be recorded either with detection systems alone, such as spectroscopy, mass spectrometry (MS) or nuclear magnetic resonance (NMR), or with hyphenated analytical techniques, such as gas chromatography coupled to MS or a flame ionisation detector (FID), or liquid chromatography coupled to MS or UV-Vis detection. Owing to the complexity and high dimensionality of the recorded signals, multivariate statistical techniques are used to reveal patterns in them. While multivariate approaches applied to hyphenated chromatographic systems allow data mining and identification of the metabolites that contribute to sample discrimination and quality, multivariate approaches applied to detection systems alone offer other advantages, such as fast quality control and authentication of natural products with high sample throughput. In addition, the use of spectroscopic techniques in metabolomic studies allows the development of miniaturised and customised devices for in-field and on-site analysis.
This thesis explores different analytical techniques, multivariate approaches and machine learning algorithms with the aim of simplifying natural product metabolomic studies and increasing their prediction accuracy in the characterisation, authentication and quality control of natural products. The following areas were investigated across four experimental chapters:
1. Development and comparison of different data reduction procedures on gas chromatography hyphenated with electron impact mass spectrometry (GC-EI-MS) data to increase the performance of multivariate statistical approaches in essential oil (EO) authentication, quality control and biological activity prediction.
2. Application of the Random Forests machine learning algorithm to the detection of adulterated and lower-quality natural products, based on analyses performed with GC-EI-MS and handheld Raman spectroscopy.
3. Characterisation, quality control and authentication of “extremely complex samples”, such as natural product blends and high-value perfumes, by determining the quality of the samples used in their creation.
4. Development of a partial least squares (PLS) model based on spectra recorded using an LED-based spectrophotometer for monitoring primary and secondary products in the Furacell™ process.
GC-EI-MS is one of the most common techniques for the analysis of volatile natural products. A single sample analysed by GC-EI-MS produces a three-way data array, as a function of time, m/z and intensity. Analysing multiple samples creates four-way arrays. Most multivariate statistical tools cannot handle four-way data arrays, so further data reduction is required. The first experimental chapter of this thesis examines and compares three GC-EI-MS data reduction procedures applied to natural product authentication, quality control and prediction of biological activity. The first and most commonly applied strategy sums all of the m/z fragments in a single mass scan and plots the sums against the time of the scan, creating a total ion current chromatogram (TICC), from which a chemical composition profile is obtained. The second approach averages the summed response of each m/z fragment over the total number of scans, that is, over the whole analysis time, creating a total chromatogram average mass spectrum (TCAMS). In the third approach, the GC-EI-MS three-way data array is divided into time-dependent sub-windows, and the average mass spectrum (AMS) is calculated for each sub-window. Finally, the AMS of all windows are combined into a single data set, creating the segmented average mass spectrum (SAMS).
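The three reduction procedures can be illustrated with a minimal Python sketch. It assumes the raw data of one sample are available as a table of intensities with one row per mass scan and one column per m/z channel, and that the SAMS sub-windows are of equal width in scan index; the function names, the window scheme and the toy values are illustrative assumptions, not the implementation used in the thesis. Plain lists are used so that the sketch has no dependencies.

```python
from typing import List

# Assumed layout: rows = mass scans (time axis), columns = m/z channels.
Matrix = List[List[float]]


def ticc(data: Matrix) -> List[float]:
    """Total ion current chromatogram: sum over all m/z within each scan."""
    return [sum(scan) for scan in data]


def average_mass_spectrum(data: Matrix) -> List[float]:
    """Average intensity of each m/z channel over the given scans."""
    n_scans = len(data)
    return [sum(channel) / n_scans for channel in zip(*data)]


def tcams(data: Matrix) -> List[float]:
    """Total chromatogram average mass spectrum: AMS over the whole run."""
    return average_mass_spectrum(data)


def sams(data: Matrix, n_windows: int) -> List[float]:
    """Segmented average mass spectrum: AMS per time window, concatenated.

    Windows of (near-)equal width in scan index are an assumption here.
    """
    n_scans = len(data)
    if not 1 <= n_windows <= n_scans:
        raise ValueError("n_windows must be between 1 and the number of scans")
    segments: List[float] = []
    for w in range(n_windows):
        start = w * n_scans // n_windows
        stop = (w + 1) * n_scans // n_windows
        segments.extend(average_mass_spectrum(data[start:stop]))
    return segments


# Toy sample: 4 scans x 3 m/z channels (made-up intensities).
toy = [
    [1.0, 0.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]

print(ticc(toy))     # one value per scan
print(tcams(toy))    # one value per m/z channel
print(sams(toy, 2))  # n_windows * n_channels values
```

Each procedure turns the data of one sample into a single vector, so a set of samples becomes an ordinary samples-by-variables matrix that standard multivariate tools can handle. TICC keeps only the time axis and TCAMS only the m/z axis, while SAMS retains coarse time information alongside the mass spectral information.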
In the first experimental chapter, the three GC-EI-MS data reduction strategies were evaluated for the discrimination of ylang-ylang essential oils by distillation time and geographical origin. SAMS outperformed the other two data reduction procedures in principal component analysis (PCA), partial least squares regression (PLS) and PLS discriminant analysis (PLS-DA) for the prediction of ylang-ylang distillation grades and the discrimination of geographical origins. TCAMS and SAMS were also used for fast quantification of the main compounds in ylang-ylang EOs without internal standards. This enabled the quality of ylang-ylang EOs to be evaluated by comparison with the corresponding ISO standard.
In addition, a high-performance thin-layer chromatography (HPTLC) approach was used to determine radical scavenging activity (RSA) and to identify the compounds contributing to it. An increase in distillation time was shown to yield ylang-ylang EOs with higher RSA, owing to a higher content of the sesquiterpenes α-(E,E)-farnesene and germacrene D, which together with eugenol are the main contributors to the RSA. Geographical origin was also shown to have a strong influence on the RSA of ylang-ylang EOs. The recorded HPTLC profiles allowed discrimination of ylang-ylang EOs by geographical origin with PCA and prediction of distillation grade with PLS.
For the prediction of RSA from the three datasets created from GC-EI-MS data, the PLS model built on SAMS showed the lowest relative error of prediction (REP) and mean error of prediction (MEP). Data mining on the three datasets created from the raw GC-EI-MS data allowed identification of the compounds contributing to the RSA. Compared with the GC-based datasets, attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy showed higher accuracy, with lower REP and MEP as well as a lower root mean square error of prediction (RMSEP). It was also demonstrated that a PLS model built on spectra recorded with a smartphone-based handheld Raman spectrometer can be used to determine RSA. This is of great importance because the biological potency of EOs can then be analysed and evaluated directly in the field, without any sample pre-treatment.
The second experimental chapter of this thesis illustrates a procedure for fast quality control of natural products based on the Random Forests machine learning algorithm. The first part explored the application of two GC-EI-MS data reduction procedures, TCAMS and SAMS, followed by Random Forests, to the classification of twenty classes of EOs and the simultaneous detection of lower-quality samples. In this work, SAMS performed better, highlighting all lower-quality EOs through the calculated Random Forests proximity matrix. Random Forests, PLS-DA and PLS applied to spectra recorded on the smartphone-based portable Raman device allowed pure EOs to be discriminated from adulterated ones, with the adulterant quantified using the PLS model. Random Forests also enabled identification of the adulterants.
The third experimental chapter of this thesis examined quality control of blends and mixtures created from several natural products. Application of TCAMS together with multivariate curve resolution-alternating least squares (MCR-ALS) allowed quality control of "extremely complex" samples, such as essential oil blends and perfume mixtures. Singular value decomposition (SVD) applied to TCAMS enabled the number of natural products used in a blend to be determined. The resolved TCAMS, passed to a Random Forests model, allowed identification of the natural products used in the blends. PCA on the resolved TCAMS also allowed determination of the distillation grade and geographical origin of the ylang-ylang essential oils used in high-value perfumes.
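The idea behind using SVD to estimate the number of natural products in a set of blends can be sketched as follows. The sketch assumes that the TCAMS of each blend is a linear combination of the TCAMS of its pure components and uses synthetic, noise-free data, so the number of significant singular values equals the number of components. The function name, the relative tolerance and the simulated spectra are illustrative assumptions; real, noisy data would need a threshold chosen with more care.

```python
import numpy as np


def estimate_n_components(spectra: np.ndarray, rel_tol: float = 1e-8) -> int:
    """Count singular values above rel_tol times the largest one.

    spectra: rows = blend samples, columns = m/z channels (assumed layout).
    """
    singular_values = np.linalg.svd(spectra, compute_uv=False)
    return int(np.sum(singular_values > rel_tol * singular_values[0]))


rng = np.random.default_rng(0)

# Hypothetical TCAMS of 3 pure natural products over 50 m/z channels.
pure_spectra = rng.random((3, 50))

# 10 simulated blends, each a random non-negative mixture of the 3 products.
mixing_fractions = rng.random((10, 3))
blend_spectra = mixing_fractions @ pure_spectra

n_found = estimate_n_components(blend_spectra)
print(n_found)
```

In an MCR-ALS workflow, an estimate of this kind is typically what fixes the number of components to resolve before the alternating least squares step.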