Science topic

Chemometrics - Science topic

Chemometrics is the science of extracting information from chemical systems by data-driven means. It is a highly interfacial discipline, using methods frequently employed in core data-analytic disciplines such as multivariate statistics, applied mathematics, and computer science, in order to address problems in chemistry, biochemistry, medicine, biology and chemical engineering.
Questions related to Chemometrics
  • asked a question related to Chemometrics
Question
2 answers
May I know if there is any website that offers free datasets for gravimetric methods (e.g. Karl Fischer titration) and spectroscopic methods?
It would be great if anyone is willing to share their dataset.
Thanks
Relevant answer
Answer
You can find free datasets related to gravimetric and spectral methods on websites like Eigenvector Research, RamanSPy, Mendeley Data, and Nature, which offer various spectroscopic data for analysis. These resources provide valuable datasets for research and algorithm testing in analytical chemistry.
  • asked a question related to Chemometrics
Question
3 answers
Hi!
I'm doing NIR research and I'm a newbie in chemometrics. To analyse the NIR data I use PLS regression, as I'd like to quantify the components in an essential oil. My question is: how do I determine the optimal number of latent variables (LVs) in PLS regression? Should I try them one by one and choose the number of LVs with the lowest RMSECV?
Thank you
Relevant answer
Answer
Choosing the number of LVs with the lowest RMSECV can lead to an overfitted model, so that alone is not a good criterion.
Before you perform PLSR, I suggest understanding the data first. The following steps may help you analyze your data:
1. Run PCA to look for patterns and to identify outliers.
2. Examine the raw spectra for the most prominent peaks that carry chemical information and can contribute to the target y-reference. This lets you cross-check the regression vector later to avoid an overfitted model, and it may also reveal strange spectra.
3. Consider the number of samples. Is it possible to split the data into a training set and an independent test set? A rule of thumb is 2/3 for training and 1/3 for testing. This is important because it shows whether the model tends to overfit. Cross-validation should be the last resort when the number of samples is limited.
4. If this study can only use cross-validation, then after building the model, examine the evolution of the RMSE. Most models show a trend where RMSE decreases as more LVs are used, but at a certain point the improvement becomes insignificant; stop at the LV just before that point.
5. Cross-check the regression vector to see whether the model includes noise or only the prominent peaks responsible for the target value. If the regression vector contains little noise, that is an indication your model is robust.
  • asked a question related to Chemometrics
Question
2 answers
I have 17 different steel samples, in which I identified a Ti peak at 441.492 nm.
But if you look closely at the photo, while some peaks are at 441.492, others are slightly offset, e.g. 441.490, 441.488, etc.
I am currently performing univariate analysis for detection of the Ti concentration.
What could be the reason for this? Also, wavelength calibration was done prior to taking the measurements.
Can I choose such a peak for univariate analysis?
Also, how can I identify lines that are prone to self-absorption?
I am new to chemometrics and currently in the learning phase.
Thanks and Regards,
Rahul P
Relevant answer
Answer
Has your sample signal been normalized? The differences do not appear to be very pronounced. Additionally, why have you chosen to use Ti 441.492 nm? It is clear that this is not a commonly used spectral line for Ti. There are many strong peaks for Ti that could be used for univariate calibration, such as Ti I 365.35 nm, Ti I 375.29 nm, Ti I 399.86 nm, Ti II 323.45 nm, Ti II 334.94 nm, etc. Why have you not selected these? Are there other interfering peaks? Self-absorption occurs when the Ti concentration is too high, and in such cases, it is advisable not to use easily excited characteristic peaks as they are more prone to self-absorption. You can simply determine whether self-absorption is occurring by checking for peak center dips or splitting. If you need to quantify the degree of self-absorption, it can be calculated using the appropriate formula.
  • asked a question related to Chemometrics
Question
1 answer
Hello Everyone,
I was reading an article about HSI (hyperspectral imaging) and came across figures representing surface scores for each PC. What do these figures represent?
Relevant answer
Answer
These figures show PCA scores projected onto their corresponding pixels.
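As a minimal sketch of how such figures are produced (the cube dimensions below are illustrative): the cube is unfolded so each pixel's spectrum becomes one row, PCA is computed, and each PC's score vector is folded back into an image.

```python
import numpy as np
from sklearn.decomposition import PCA

rows, cols, bands = 20, 30, 100
cube = np.random.default_rng(1).normal(size=(rows, cols, bands))

# Unfold: one row per pixel spectrum.
X = cube.reshape(rows * cols, bands)
scores = PCA(n_components=3).fit_transform(X)

# Fold back: one grayscale "surface score" image per PC.
score_images = scores.reshape(rows, cols, 3)
print(score_images.shape)
```

Plotting each `score_images[:, :, i]` as a heat map gives exactly the per-PC "surface score" figures described in such articles.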
  • asked a question related to Chemometrics
Question
10 answers
Friendship
Relevant answer
Answer
Dear Joseph,
Greetings from SCIENCE4U India. I hope you are doing well and that the war is not affecting your research activities too much.
Let me know if you need any assistance for your research work from my side.
Best regards,
Bhanu.
  • asked a question related to Chemometrics
Question
11 answers
If yes, would you please provide examples from literature.
Relevant answer
Answer
"untargeted analysis" is fine, if the data are valid, but they are not in this case and using them as you suggest is invalid. In any case, now Mohamed Mahmoud has been provided with professional advice that he should not use the raw area % units from his analysis data outside of the one analysis method. To write any more would be repetitive and a waste of Mohamed Mahmoud's time as his question has been answered.
I wish you luck with your classes at school Jokin and thank Mohamed Mahmoud for his patience.
  • asked a question related to Chemometrics
Question
5 answers
Hi PLS Experts,
I am an absolute beginner at using PLS, and I need your help. I have practised using PLS in R as well as in SPSS.
I am interested in using PLS as a predictive model. I have 2 dependent variables (DVs): one is continuous while the other is categorical with 3 levels. However, I am not using them in one model but rather in separate models, as the categorical variable is also an independent variable in model 1 (the model with the continuous DV).
I am confused with the term Y-Variance Explained (For DV) and its effect on model performance.
Does a low percentage of Y-variance explained (across all components) mean poor prediction by the model?
I recently applied PLS on standardized data with 1 DV and 14 Predictors using R (mdatools package). The cumulative x-variance explained was 100%, but Y-variance explained was only 29% in all 14 components (optimal number of components is 3).
I am unable to explain the reason for such poor performance.
A summary of model is attached in the figure. (Predictions are in bottom right part of image).
Thank you for your time :)
Best
Sarang
#PLS
Relevant answer
Answer
I am a bit late to answer your question but let me try to explain to you as simply as I can.
The continuous dependent variable (y) is used to develop a predictive model. As you did, you need a multivariate matrix X of size n x p, where n is the number of samples (rows) and p the number of features (columns). PLS regression then gives you the weights (regression coefficients) that maximize the covariance between X and y. With these regression coefficients, you can predict any future or different dataset.
The categorical variable (y) is used to develop a classification model. Here you need the matrix X as well, but you use different algorithms such as logistic regression, as David Eugene Booth explained, or PLS-DA or LDA. The goal is to obtain the weights or coefficients that minimize the classification error. As in regression, those coefficients let you classify any future or different dataset.
Now to your figure. Since you are dealing with a regression problem and used PLS regression, the first step is to optimize the number of PLS components (latent variables). This can be done with the bottom-left plot, which shows RMSE, an error measure, so lower is better. As the number of components (x-axis) increases, the RMSE decreases for both cv (red line, cross-validation) and cal (blue line, calibration). You, or the software, selected n_comp = 3 based on the other plots, but at component 3 neither curve has reached its lowest error. I would select 7 or 8 components: there both curves have almost reached the minimum and lie close to each other, which means the model generalizes better than at, say, components 5 or 6, where the red curve shows an increase.
The remaining plots are then based on the selected number of components, and component 7 should give you a better predictions plot (bottom-right). The top-left plot can be used to spot outlier samples, and the top-right plot shows the regression coefficients which, as noted above, can be used to predict future datasets.
I hope my explanation makes sense and is useful. Do not hesitate to ask if you have any other questions.
All the best
  • asked a question related to Chemometrics
Question
7 answers
Dear All,
I'm looking for suitable and simple chemometrics software as an effective tool.
What would you recommend, and how can I use it?
Relevant answer
Answer
Something like XLStat can be run directly from Excel. Please have a read here if it helps: https://high-tech-guide.com/article/can-you-do-principal-component-analysis-in-excel
If you want something free, you might look into Python packages, but they may be too advanced for a beginner.
  • asked a question related to Chemometrics
Question
11 answers
Dear All,
I'm looking for suitable and simple chemometrics software -
effective tools for exploring chemical data in analytical chemistry. What do you recommend for beginners?
Relevant answer
Answer
I suggest that the easiest and the cheapest way is using web-version of MetaboAnalyst (https://www.metaboanalyst.ca/MetaboAnalyst/ModuleView.xhtml).
Best regards,
Ivan
  • asked a question related to Chemometrics
Question
8 answers
Hi there!
I work with infrared spectroscopy (NIR & FTIR) in the field of food science/food chemistry. I'm looking for collaborators with experience in chemometrics (particularly PLS-R & PLS-DA, but other discriminant methods such as SVM or neural networks would also be great). In particular, I'm after people who would be interested in helping with data analysis & writing up some papers based on data I have collected.
If you are interested & have such experience, please contact me & I would love to discuss with you.
Joel
Relevant answer
Answer
You can contact me and Dr Silvio David Rodriguez for possible collaborations.
Regards,
  • asked a question related to Chemometrics
Question
10 answers
Imagine you have measured a series of curves, e.g. spectra of a dissolved compound of various concentrations, films with different thicknesses etc. Before you can retrieve the data, someone meddles with it, i.e. multiplies it with an unknown factor (or, alternatively, assume that your empty channel spectrum changes within the series), large enough so that it matters, but small enough that the data still seem to make sense.
Do you know a method that not only indicates that the curves have been altered, but also allows you to retrieve the original, unflawed data?
Relevant answer
Answer
When original data is modified, then information content is lost. The missing information may or may not be critical to the parameter estimates of interest.
Suppose one data set was modified and you have a series of unmodified data sets; now objective prior information is available, and it might be possible to develop a statistical model for the lost information in the modified data set. For example, suppose one data set was modified to remove a constant (such as a DC baseline offset). You could create a statistic to model the DC offset using the unmodified data sets. Then the estimates for the original data are based on objective prior information.
If someone alters the data and you have no idea whatsoever about the changes they made, then there is no hope. Any method that attempts to recreate the original data is subjective.
In your example - "assume that your empty channel spectrum changes within the series, large enough so that it matters, but small enough that the data still seem to make sense" - this could be objective prior information. If the model for the data could be modified to represent the missing data, then Bayesian analysis would account for the missing data (empty channel) using objective prior probability distributions for all the parameters in the model for the data. If the resulting parameter estimates of interest turn out to have very large uncertainties, at least you would know the unmodified data was inherently uninformative.
  • asked a question related to Chemometrics
Question
3 answers
I am planning to use PCA and OPLS-DA for my study in biochemometrics, but I am quite tight on budget. I am not sure how much the SIMCA software costs; although there is a trial version, I am worried whether I will be able to make full use of the free version on my data. Are there alternatives that are cheaper or free but will give quality PCA and OPLS-DA analyses?
Relevant answer
Answer
You can also use free online tools for multivariate statistics: MetaboAnalyst (https://www.metaboanalyst.ca/) or Biostatflow (http://biostatflow.org).
p.
  • asked a question related to Chemometrics
Question
3 answers
I cannot find sources that give a thorough explanation of PCA and of how to assign principal components 1 and 2, including their computation. Say, for example, my study will explore the polyphenol profiles of a certain plant from different geographical areas; I will also test their antioxidant activity and analyze the data using a biochemometric approach. Which variables should be included in the principal components? I will also integrate the results of this PCA to construct my OPLS-DA.
Relevant answer
Answer
Hi Duri,
I organized some information about PCA and PCR on:
The original is in Portuguese but can be translated with Google Translate.
Best Regards,
Markos
  • asked a question related to Chemometrics
Question
2 answers
I wonder if gas sensor responses can be correlated with non-volatile components?
Relevant answer
Answer
I suppose that non-volatile compounds are unable to vaporize or sublime, and therefore cannot be detected by an electronic nose.
  • asked a question related to Chemometrics
Question
3 answers
We are planning to analyze the activity of a group of phytochemicals using chemometric analysis, but we are having difficulty with the various formats accepted by the software/servers, so we are looking for software or a server that accepts data formats such as m/z.
Relevant answer
Answer
You're welcome.
  • asked a question related to Chemometrics
Question
4 answers
Does it make sense to develop compression methods for large matrices used in chemometrics for multivariate calibration?
The main argument of opponents of this method is that “increasing computational power and speed of computers for data processing and unlimited cloud data storage available” do not require compression, since the compression slightly reduces the accuracy in multivariate calibration (cited from personal communication).
Relevant answer
Answer
Other possibilities include decoupling your main matrix problem (after preprocessing it) into some combination of matrices with known structures (e.g., Toeplitz, circulant, band-diagonal) and then solving the subproblems according to your needs. Tensor structures/operations may be optimized for some structures in terms of hardware and networking.
As an engineer, I like to think that the problem and model tend to give some clues regarding simplifications.
The preprocessing stage depends on what you are investigating. Sometimes, outliers have more info than mainstream data.
Finally, the word "compression" has different meanings, levels and varieties. If you change your domain for a given tensor structure, you may get some constraints/compression.
  • asked a question related to Chemometrics
Question
5 answers
This preprint compares the most advanced automated commercial analysis approaches for vibrational spectroscopy including Bruker Lumos II in combination with Purency Microplastics Finder R2021a (FPA-FTIR), Agilent 8700 LDIR (QCL) in combination with Clarity, and WITec alpha300 R in combination with Particle Scout.
It's really worth reading.
Relevant answer
Answer
Thanks for sharing
  • asked a question related to Chemometrics
Question
17 answers
I am taking infrared data and would like a detailed explanation of how to calculate the LOD.
Relevant answer
Answer
To Dr Alejandro C. Olivieri,
we are interested in using the LOD calculation as described in your publication :
Franco Allegrini and Alejandro C. Olivieri (2014)
IUPAC-Consistent Approach to the Limit of detection in partial least squares calibration.
Analytical Chemistry, 86, 7858-7866.
We have some difficulties in applying equations (12) and (13).
1) Does the term var(ycal) refer to the variance of the whole set of y calibration values (that is, the n values associated with the n spectra forming the calibration set) or to an estimate of the measurement error of y?
2) ycal is said to be "centered" in the publication. In that case, the mean of ycal is necessarily equal to 0 and cannot appear as a denominator in equation (10).
3) As a spectrum is a vector, what does var(x) mean? We suppose that it is the mean of the variances measured for each element of the vector x.
Do you have a simple numerical example on which we can compute the LODs ?
Thank you in advance for your valuable help,
Kind regards,
Dr Dominique Bertrand
--
Dominique Bertrand
chemometrics and data processing
0633338680
  • asked a question related to Chemometrics
Question
2 answers
When creating & optimizing mathematical models with multivariate sensor data (i.e. 'X' matrices) to predict properties of interest (i.e. dependent variable or 'Y'), many strategies are recursively employed to reach "suitably relevant" model performance which include ::
>> preprocessing (e.g. scaling, derivatives...)
>> variable selection (e.g. penalties, optimization, distance metrics) with respect to RMSE or objective criteria
>> calibrant sampling (e.g. confidence intervals, clustering, latent space projection, optimization..)
Typically & contextually, for calibrant sampling, a top-down approach is utilized, i.e., from a set of 'N' calibrants, subsets of calibrants may be added or removed depending on the "requirement" or model performance. The assumption here is that a large number of datapoints or calibrants are available to choose from (collected a priori).
Philosophically & technically, how does the bottom-up pathfinding approach for calibrant sampling or "searching for ideal calibrants" in a design space, manifest itself? This is particularly relevant in chemical & biological domains, where experimental sampling is constrained.
E.g., Given smaller set of calibrants, how does one robustly approach the addition of new calibrants in silico to the calibrant-space to make more "suitable" models? (simulated datapoints can then be collected experimentally for addition to calibrant-space post modelling for next iteration of modelling).
:: Flow example ::
N calibrants -> build & compare models -> model iteration 1 -> addition of new calibrants (N+1) -> build & compare models -> model iteration 2 -> so on.... ->acceptable performance ~ acceptable experimental datapoints collectable -> acceptable model performance
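For the top-down case described above (choosing a subset from N available calibrants), the Kennard-Stone algorithm is a common starting point; here is a minimal sketch with synthetic data (function name and sizes are illustrative).

```python
import numpy as np

def kennard_stone(X, k):
    """Select k calibrants spanning the X-space (Kennard-Stone)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from the two mutually farthest samples.
    selected = [int(i) for i in np.unravel_index(np.argmax(dist), dist.shape)]
    while len(selected) < k:
        remaining = [i for i in range(len(X)) if i not in selected]
        # Add the sample farthest from its nearest already-selected sample.
        selected.append(max(remaining, key=lambda i: dist[i, selected].min()))
    return selected

X = np.random.default_rng(2).normal(size=(40, 6))   # 40 candidate calibrants
subset = kennard_stone(X, 10)
print(sorted(subset))
```

For the bottom-up direction asked about, the same distance criterion can be run in reverse over a candidate design space: propose the in-silico point whose distance to the nearest existing calibrant is largest, measure it experimentally, and refit.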
  • asked a question related to Chemometrics
Question
4 answers
Hi Everyone,
I have acquired some plant hyperspectral images (roots, fruit, leaves) from various environmental conditions and now want to explore the data cubes to detect possible differences and plan to study plant physiology and chemometrics in future. The built-in software with the camera (Specim IQ studio) is not serving the purpose.
Any suggestion for easy-to-use and simple interface software or analysis pipeline for such exploration and making classifier models? Preferably open-source but commercial suggestions are also welcomed.
Many thanks in anticipation.
Relevant answer
Answer
Yes, the folder looks like this for a single snap!
I will inbox you as well. Thank you again :)
  • asked a question related to Chemometrics
Question
2 answers
I'm stuck on which constraints to apply, and how, when using MCR on IR spectroscopy data. There are four types of constraint: equality, unimodality, non-negativity, and closure. Please help me proceed.
Relevant answer
Answer
Muhammad Ali Thank you for the article recommendation.
  • asked a question related to Chemometrics
Question
13 answers
There are many tools to build a PLSR model to predict a response variable Y based on a multivariate predictive variable X (reflectance spectra for example). My question is the following: once we have built a PLSR model, is it possible to simulate X for a specific Y? Is it possible to do it with R?
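For a single-component PLS model the inversion asked about can be written in closed form: choose the score that reproduces the target Y, then map it back through the X-loadings. Below is a sketch in plain NumPy with synthetic data; with more components the problem is underdetermined (many X give the same Y), so any such simulation returns one representative X among many. The same algebra is easy to reproduce in R.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 30))                 # 50 spectra x 30 variables
y = X @ rng.normal(size=30) + 0.05 * rng.normal(size=50)

# Fit a one-component PLS model by hand on centered data.
Xc, yc = X - X.mean(axis=0), y - y.mean()
w = Xc.T @ yc
w /= np.linalg.norm(w)                        # weight vector
t = Xc @ w                                    # scores
p = Xc.T @ t / (t @ t)                        # x-loadings
q = yc @ t / (t @ t)                          # y-loading (scalar)

# Simulate a spectrum for a target y: choose the score that reproduces
# y_target, then map it back through the x-loadings.
y_target = 2.0
t_new = (y_target - y.mean()) / q
x_sim = X.mean(axis=0) + t_new * p

# Sanity check: predicting from the simulated spectrum recovers y_target,
# because p @ w == 1 for PLS loadings.
y_back = y.mean() + ((x_sim - X.mean(axis=0)) @ w) * q
print(round(float(y_back), 6))
```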
Relevant answer
  • asked a question related to Chemometrics
Question
5 answers
I am validating a method for the quantification of adulterants in olive oil by infrared spectroscopy using multiple linear regression calibration, and for the validation I intend to use the limits of quantification and detection.
Relevant answer
Answer
Yes. I use frequencies selected through the successive projections algorithm, in which the variables are projected in n-dimensional Euclidean space one by one and excluded when the projection equals zero.
The Beer-Lambert law can be applied because of the linear form of the equation, but it is necessary to select the variables (frequencies) through tests with a minimum and maximum number of variables individually.
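For illustration, the successive-projection selection mentioned above can be sketched as follows; this is my own minimal NumPy rendering of the idea (at each step, pick the frequency least collinear with those already chosen), not the author's code.

```python
import numpy as np

def spa(X, k, start=0):
    """Successive projections algorithm: select k minimally collinear columns."""
    Xp = X.astype(float).copy()
    selected = [start]
    for _ in range(k - 1):
        v = Xp[:, selected[-1]].copy()
        # Project all columns onto the orthogonal complement of v.
        Xp -= np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0            # never reselect a chosen column
        selected.append(int(np.argmax(norms)))
    return selected

spectra = np.random.default_rng(2).normal(size=(30, 80))  # 30 samples x 80 frequencies
picked = spa(spectra, k=8)
print(picked)
```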
  • asked a question related to Chemometrics
Question
2 answers
I have been working on creating a multivariate predictive model (most likely PLS) using FTIR spectra to predict the amount of a component in complex mixtures. In order to calibrate the model I have considered two pathways - creating synthetic mixtures of the component and confounders or using an established technique to give me the true value of the component in real mixtures. Both pathways have big limitations. I am now considering spiking real mixtures with several known (by weight) amount of the component. How can I implement this in a X-Y chemometric model ?
Relevant answer
Answer
Kirk,
I want to use PLS or some other multivariate chemometric technique for its ability to use tons of variables in the predictive model. In the case of FTIR, the integration of one spectral region is generally used in a traditional concentration curve. In the case of PLS, I can use the entire spectrum (or a good portion of it) and select the important LVs.
For you, the variables are probably the components in the mixtures. In my case there are generally 5-10 components in the mixture, and I am interested in quantifying only one of them. The other 4-9 components correlate with the signal of my target across the entire spectrum. The issue with synthetic mixtures is the poor quality of the standard materials (with the exception of my target) and the limitations of the techniques for determining the composition of the standard materials. This difficulty in knowing exactly what is in the mixture is why I cannot simply use real samples.
The standard addition method could be the best way, but I have not figured out how to do it chemometrically.
Does it make sense?
  • asked a question related to Chemometrics
Question
10 answers
I have three dataset of quantitative and qualitative liquid chromatography, GC-MS and DNA barcoding of few samples.
What would be the best way to discriminate and visualize the data?
Relevant answer
Answer
Principal Component Analysis (PCA) is normally able to gather similar data and discriminate between different sets. We successfully did it with coffee aroma analysis using GC-MS.
  • asked a question related to Chemometrics
Question
3 answers
In all the textbooks on chemometrics, PLS is described as a regression technique that requires dependent and independent variables to generate a regression line. However, some chemometric software packages (especially Unscrambler) show a score plot (similar to PCA) when PLS is used. It seems they are using PLS-DA. I am unable to understand what they are doing, as no regression line is generated.
The following figure corresponds to a study on liquor samples from three different geographical regions. We applied PLS (it seems to be PLS-DA, not PLSR) in Unscrambler, and it classified the samples. However, I am unable to understand how to interpret this and whether it is accurate.
Relevant answer
Answer
PLS-DA is a supervised pattern-recognition technique used to classify unknown samples into predefined classes.
In this technique, each sample in the calibration set is first assigned a dummy variable as a reference value: you add a new column and assign each sample to its class. This is an arbitrary number designating whether the sample belongs to a particular group, for example 1 = class A and 2 = class B. A sample is then classified as A if its predicted value falls below 1.5 and as B if it falls above 1.5.
The following paper applied the PLS-DA technique to classify honey according to its botanical origin.
  • asked a question related to Chemometrics
Question
3 answers
I want to join a respectable research group that is concerned with Chemometrics
Relevant answer
Answer
Check out this group in Barcelona; they are doing good work:
  • asked a question related to Chemometrics
Question
14 answers
Having reviewed literature on the use of chemometric approaches in quality assessments of medicines (including herbal medicines), I realised that several approaches are adopted. For example, in preprocessing of the data for further analysis, literature reports of methods like normalisation, peak centering, warping, smoothing among others.
Having in mind that the way you preprocess the data may affect the final outcome of the multivariate analysis, I want to find out if there exist any protocol guiding the adoption of any of these tools. For instance, when analysing chromatographic data from HPLC, you may have to correct baseline, then warp and normalise or something. Also, when dealing with FTIR data, you may have to first correct baseline, normalise and smooth (how do you determine the smoothing points?) among others. Are there specific preprocessing tools for specific datasets (that is from different instruments like FTIR, HPLC, LC-MS, etc) and are there specific procedures for the use of such, so that irrespective of who is conducting such analysis, the outcome may always be reproducible?
Thank you.
Relevant answer
Answer
Good question
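For what it's worth, a typical preprocessing chain for spectral data can be sketched with SciPy. The window length, polynomial order, and the order of the steps are precisely the unstandardized choices the question is about, so treat the values below as placeholders, not a protocol.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(5)
spectra = rng.normal(size=(10, 200)) + np.linspace(0, 3, 200)  # sloped baseline

# 1. Savitzky-Golay smoothing (window length and polynomial order are
#    exactly the "smoothing points" question; they must be tuned per dataset).
smoothed = savgol_filter(spectra, window_length=11, polyorder=2, axis=1)

# 2. First derivative, which also removes an additive baseline.
deriv = savgol_filter(spectra, window_length=11, polyorder=2, deriv=1, axis=1)

# 3. Standard normal variate (SNV): center and scale each spectrum.
snv = (smoothed - smoothed.mean(axis=1, keepdims=True)) \
      / smoothed.std(axis=1, keepdims=True)

print(snv.shape)
```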
  • asked a question related to Chemometrics
Question
4 answers
Having reviewed literature on the use of chemometric approaches in quality assessments of medicines (including herbal medicines), I realised that several approaches are adopted. For example, in preprocessing of the data for further analysis, literature reports of methods like normalisation, peak centering, warping, smoothing among others.
Having in mind that the way you preprocess the data may affect the final outcome of the multivariate analysis, I want to find out if there exist any protocol guiding the adoption of any of these tools. For instance, when analysing chromatographic data from HPLC, you may have to correct baseline, then warp and normalise or something. Also, when dealing with FTIR data, you may have to first correct baseline, normalise and smooth (how do you determine the smoothing points?) among others. Are there specific preprocessing tools for specific datasets (that is from different instruments like FTIR, HPLC, LC-MS, etc) and are there specific procedures for the use of such, so that irrespective of who is conducting such analysis, the outcome may always be reproducible?
Thank you.
Relevant answer
Answer
Yes, it can be.
  • asked a question related to Chemometrics
Question
4 answers
What are the formulas and the peak-fitting profile types needed, e.g. Gaussian, Lorentzian...?
Relevant answer
Answer
For curve fitting, a mixture of Gaussian and Lorentzian profiles is most appropriate. As suggested by Adeyinka Aina, PCA can be useful for comparing two samples. PLSR is usual for continuous changes over multiple samples, but assumes a linear correlation to an external variable (e.g. concentration, temperature). MCR-ALS can be used for more general correlations.
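The Gaussian-Lorentzian mixture mentioned above is commonly written as a pseudo-Voigt profile; here is a minimal fitting sketch with SciPy (the FWHM-based parameterization and all peak parameters are illustrative choices).

```python
import numpy as np
from scipy.optimize import curve_fit

def pseudo_voigt(x, amp, center, width, eta):
    """Weighted mixture of a Gaussian and a Lorentzian of common FWHM (width)."""
    gauss = np.exp(-4 * np.log(2) * (x - center) ** 2 / width ** 2)
    lorentz = 1 / (1 + 4 * (x - center) ** 2 / width ** 2)
    return amp * (eta * lorentz + (1 - eta) * gauss)

x = np.linspace(-10, 10, 400)
y = pseudo_voigt(x, amp=2.0, center=1.0, width=3.0, eta=0.4)
y_noisy = y + 0.01 * np.random.default_rng(6).normal(size=x.size)

popt, _ = curve_fit(pseudo_voigt, x, y_noisy, p0=[1.0, 0.0, 2.0, 0.5])
print(np.round(popt, 2))  # close to the true [2.0, 1.0, 3.0, 0.4]
```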
  • asked a question related to Chemometrics
Question
2 answers
Hello,
I am performing support vector machine regression using the Unscrambler software, but when I try to predict using the SVR predict option, I only get the predicted values (not all the plots I got with PLSR in Unscrambler).
Please suggest how I can obtain the R-squared and the root mean square error of prediction for these results.
Relevant answer
Answer
You need to calculate those manually.
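If it helps, both metrics are straightforward to compute from the exported reference and predicted values; a sketch in plain NumPy (the example numbers are made up).

```python
import numpy as np

def prediction_metrics(y_ref, y_pred):
    """R-squared and RMSEP from reference vs. predicted values."""
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    residuals = y_ref - y_pred
    rmsep = np.sqrt(np.mean(residuals ** 2))
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_ref - y_ref.mean()) ** 2)
    return r2, rmsep

y_ref = [1.0, 2.0, 3.0, 4.0, 5.0]    # reference values of the test set
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1]   # SVR predictions exported from the software
r2, rmsep = prediction_metrics(y_ref, y_pred)
print(round(r2, 4), round(rmsep, 4))
```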
  • asked a question related to Chemometrics
Question
3 answers
There is a need to ensure the extraction yield of the soluble coffee process, so we need a method to analyze the carbohydrate degradation products produced after hydrolysis of polysaccharides at the high temperatures used during extraction. It is important to measure extraction efficiency in-process to fine-tune the process. Which rapid secondary method could be used to measure carbohydrate degradation products in the liquid phase, and how can it be calibrated against primary methods?
Relevant answer
Answer
Dear Giulia,
Near infrared (NIR) and Raman spectroscopy have already demonstrated their potential for in-situ analysis. It is nevertheless difficult to say in advance which one would be best for this specific application; it must therefore be tested.
Best regards,
Ludovic.
  • asked a question related to Chemometrics
Question
4 answers
Dear to whom it may concern,
I would like to ask people who are interested in univariate analysis in metabolomics. I am currently processing my metabolomics data with univariate analysis, namely p-values and FDR-adjusted p-values.
However, as far as I know, the calculation of a p-value for each feature depends on two factors: (a) the distribution of the feature and (b) the variance of the feature between the case and control groups. To be more specific, the first step is to apply a statistical test (I do not know which one) to check whether a given feature is normally distributed in both groups or in only one of them; there are then two scenarios:
1. If the feature is normally distributed in both groups, we proceed with an F-test as a parametric test to check whether the variance of the feature is equal in the two groups. If it is equal, we can use a t-test assuming equal variances; otherwise, a t-test with unequal variances must be used.
2. If not, a non-parametric test is applied to obtain a p-value for the feature. In this case, could you please tell me which tests are considered non-parametric?
I am unsure whether what I describe above is right, because I am a beginner in metabolomics. If the procedure is right, it means every feature has to be processed step by step in this way to obtain its p-value, because the features differ in distribution and variance between the case and control groups.
I hope you can spend a little time correcting my ideas and giving me some suggestions in this promising field.
Thank you so much.
Pham Quynh Khoa.
Relevant answer
Answer
Hello, first of all, what is the sample size in each group? In general, I suggest a non-parametric test if n < 6; for such small sample sizes it is not really defensible to assume parametric conditions, regardless of what the tests say. As a non-parametric alternative to the t-test, I suggest the U-test (Mann-Whitney test). Best regards, Max
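The per-feature decision tree described in the question can be sketched in Python with SciPy. The data below are synthetic placeholders, and Levene's test is substituted for the F-test (an assumption on my part, since it is less sensitive to non-normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
case = rng.normal(5.0, 1.0, size=20)     # placeholder intensities, case group
control = rng.normal(4.2, 1.5, size=20)  # placeholder intensities, control group

# 1. Normality check in each group (Shapiro-Wilk).
normal = (stats.shapiro(case).pvalue > 0.05
          and stats.shapiro(control).pvalue > 0.05)

if normal:
    # 2. Variance check; Levene's test stands in for the F-test here.
    equal_var = stats.levene(case, control).pvalue > 0.05
    # 3. Student's t-test if variances look equal, Welch's t-test otherwise.
    p = stats.ttest_ind(case, control, equal_var=equal_var).pvalue
else:
    # Non-parametric alternative: Mann-Whitney U test.
    p = stats.mannwhitneyu(case, control).pvalue

print(f"p-value for this feature: {p:.4g}")
```

This is then repeated for every feature, and the resulting p-values are FDR-adjusted (e.g. Benjamini-Hochberg) across all features.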
  • asked a question related to Chemometrics
Question
5 answers
I am working on a mixture of three different APIs in my product. Their peaks merge with each other. I would like to use chemometrics to solve this problem. Please suggest an approach.
Relevant answer
Answer
Do you have the chromatograms for each of the individual APIs? Is your detector fixed-wavelength or DAD? If you share the mixture chromatogram data as well as the individual API chromatograms (in a generic format: Excel, tab-delimited, comma-separated values, etc.), I can take a look and give you some feedback. Regards, Luis
  • asked a question related to Chemometrics
Question
10 answers
In our study, PCA was applied to ATR-FTIR data. In the loading plots, PC1 and PC2 show positive and negative correlations in certain wavenumber regions. Please suggest how to interpret these positive and negative correlations and what they signify.
Relevant answer
Answer
I will give you some useful information related to PCA.
Scores: describe the properties of the samples and are usually shown as a map of one PC plotted against another.
Samples with close scores along the same PC are similar (they have close values for the corresponding variables). Conversely, samples for which the scores differ greatly are quite different from each other with respect to those variables. The relative importance of each principal component is expressed in terms of how much variance of the original data it describes.
Loadings: describe the relationships between variables and may be plotted as a line (commonly used in spectral data interpretation) or a map (commonly used in process or sensory data analysis).
For each PC, look for variables with high loadings (i.e. close to +1 or –1); this indicates that the loading is interpretable.
To study variable correlations, one studies the relative location of variables in the loadings space. Variables that lie close together are highly correlated. For instance, if two variables have high loadings along the same PC, it means that their angle is small, which in turn means that the two variables are highly correlated. If both loadings have the same sign, the correlation is positive (when one variable increases, so does the other). If the loadings have opposite signs, the correlation is negative (when one variable increases, the other decreases).
But remember: loadings cannot be interpreted without scores, and vice versa. For that reason the biplot is the best plot for analyzing a PCA.
As for the significance of each score and loading, it depends on the software you used to process your data; in the results you should be able to find the p-values of the scores and/or loadings.
Wish you all the best,
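The sign rule for loadings can be illustrated with a minimal synthetic example (the data are invented for illustration): two variables that rise together load on PC1 with the same sign, while a third variable that moves in the opposite direction loads with the opposite sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
t = rng.normal(size=(100, 1))            # a common underlying factor
noise = lambda: 0.1 * rng.normal(size=(100, 1))
# v1 and v2 rise together; v3 moves in the opposite direction.
X = np.hstack([t + noise(), t + noise(), -t + noise()])

pca = PCA(n_components=2).fit(X)
loadings = pca.components_               # rows = PCs, columns = variables

# On PC1, v1 and v2 carry the same sign (positively correlated),
# while v3 carries the opposite sign (negatively correlated with them).
print(loadings[0])
```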
  • asked a question related to Chemometrics
Question
7 answers
How can I select the best descriptors to build a QSAR modeling?
Relevant answer
Answer
You can use the GA-PLSR MATLAB code by Riccardo Leardi to reduce the descriptor set and then select the best descriptors by stepwise regression in the SPSS software.
GA-PLSR code:
Application of genetic algorithm–PLS for feature selection in spectral data sets
  • asked a question related to Chemometrics
Question
3 answers
What is the development of molecules and drugs?
Relevant answer
Answer
Dear Niazi,
Molecular design will be increasingly inclined to embrace a transcriptomics-based approach centered on RNA-RNA, RNA-DNA, and RNA-protein interaction studies. Methods to predict 2D and 3D structures of RNA are already established, as shown in these references:
Parikesit, A. A. (2018). The Construction of Two and Three Dimensional Molecular Models for the miR-31 and Its Silencer as the Triple Negative Breast Cancer Biomarkers. OnLine Journal of Biological Sciences, 18(4), 424–431. https://doi.org/10.3844/ojbsci.2018.424.431
Parikesit, A. A., Utomo, D. H., & Karimah, N. (2018). Determination of secondary and tertiary structures of cervical cancer lncRNA diagnostic and siRNA therapeutic biomarkers. Indonesian Journal of Biotechnology, 23(1), 1. https://doi.org/10.22146/ijbiotech.28508
In this respect, the design of silencing (si)RNA molecules as drug candidates will gain more importance in the future, as will studies of long non-coding (lnc)RNA biomarkers. Keep in mind that although the transcriptomics-based approach will gain importance, the proteomics-based one will still be utilized by molecular designers.
  • asked a question related to Chemometrics
Question
7 answers
What is the best computer program for the calculation of stability constants of metal comlexes from UV-Vis spectrophotometry titration data?
Relevant answer
Answer
For UV-Vis and many other types of spectroscopic titration data you can use the HypSpec software; check the following website. It is the best I have ever known.
  • asked a question related to Chemometrics
Question
4 answers
I would like to know if we can use a handheld Raman spectrometer and a chemometric technique for analyzing the content of active ingredient in presence of other excipients?
The excipients are coconut oil, beeswax and some flavoring agents.
The spectrometer is based on Spatially offset Raman spectroscopy (SORS) technology.
Can I use it to analyze the active ingredient in the finished product quantitatively?
Also, if I can then which chemometric method is best suited for the analysis?
Relevant answer
Answer
There's a good chance that you could use Raman spectroscopy for this analysis. Because Raman bands are very narrow and characteristic, it is a very good technique for determining an analyte within a mixture. You would need to obtain the Raman spectra of the active ingredient and the excipients in the mixture, and determine a suitable peak or peaks to measure. PLS is a good chemometric technique to use.
Colleagues of mine have done similar work – check their article on ResearchGate:
Noninvasive, Quantitative Analysis of Drug Mixtures in Containers Using Spatially Offset Raman Spectroscopy (SORS) and Multivariate Statistical Analysis
  • asked a question related to Chemometrics
Question
3 answers
I have total dissolved solids data for apples as the reference (y-variable).
I also have near-infrared spectral data as predictors (x-variables).
I have the StatSoft Statistica software for the analysis.
Relevant answer
Answer
Several software packages can be used to build an ANN predictive model from NIR spectra; my suggestions are Unscrambler and IBM SPSS Modeler. Before ANN modeling, PCA should be performed to reduce the many spectral variables to a few PCs (PC1, PC2, ...). After calibrating the ANN you have the values predicted by the model as well as the reference values for each sample; use the RPD index to assess the goodness of the calibrated model.
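For those who prefer scripting over Unscrambler or IBM Modeler, the PCA-then-ANN workflow with an RPD check can be sketched with scikit-learn; all data below are synthetic placeholders for real NIR spectra and TSS reference values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
# Placeholder NIR-like data: 80 apples x 150 wavelengths, TSS reference values.
y = rng.uniform(10, 16, size=80)                       # e.g. degrees Brix
band = np.exp(-((np.linspace(0, 1, 150) - 0.5) / 0.1) ** 2)
X = np.outer(y, band) + 0.05 * rng.normal(size=(80, 150))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: compress spectra to a few PCs.  Step 2: feed the scores to a small ANN.
pca = PCA(n_components=5).fit(X_tr)
ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000,
                   random_state=0).fit(pca.transform(X_tr), y_tr)

pred = ann.predict(pca.transform(X_te))
sep = np.sqrt(np.mean((y_te - pred) ** 2))             # standard error of prediction
rpd = np.std(y_te) / sep                               # RPD > 2 is usually usable
print(f"RPD = {rpd:.2f}")
```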
  • asked a question related to Chemometrics
Question
8 answers
Dear fellows
I'm a biologist and I've been assigned the task of aligning the chromatograms of some plant samples I processed in the past.
I've been doing some research, and all of the methods used for chromatogram alignment either involve dynamic programming or are implemented in software for which I don't possess a license.
So I wonder if any of you can provide me with some guidance as to how to complete this task (taking into account that I'm not familiar with chemometrics).
Thank you a lot for your time and help,
Vanesa Díaz
Relevant answer
Answer
If this is what you mean (see attached images), I can recommend OpenChrom: https://www.openchrom.net/ You can shift the chromatograms left/right by clicking on arrow buttons.
  • asked a question related to Chemometrics
Question
7 answers
How do you prepare graphical abstracts, schematic figures and similar extras for your papers? Is there software with a good vector graphic library of shapes?
Besides Corel, Inkscape...
Relevant answer
Answer
BioRender! - (https://biorender.io/)
It is a free online app that contains a library of pre-made cells, proteins, membrane shapes, organs, lab equipment, etc., that you can drag and drop, so you don't have to spend time drawing each element of the figure yourself. It saves a lot of time when creating schematic figures, and the icons are all created by scientific illustrators, so they're both beautiful and accurate. And it's free for educational use.
  • asked a question related to Chemometrics
Question
5 answers
Using this ( http://physics.nist.gov/PhysRefData/ASD/lines_form.html ) website, I am trying to find the element corresponds to wavelength.  
For example,   
I want to find the element at the wavelength 521.3891 nm, but no element is listed at that exact wavelength on the website. I then tried to find the nearest value that could match the chemical composition table. I discovered that 521.3841 nm corresponds to an Fe I line, but the problem is that Fe is present only in a negligible amount (ppm) in the composition table, whereas Cu is present in abundance and 521.2780 nm corresponds to a Cu I line.
I am totally confused about what to choose: 521.3841 nm as an Fe I line or 521.2780 nm as a Cu I line.
Please let me know the fundamental principle behind assigning an element to a wavelength in spectroscopy, and let me know if any references are available.
I hope my question is clear; if not, please let me know.
Waiting for a healthy discussion.
Thank you very much for your time.
Relevant answer
Answer
The best choice in such a confusing situation is either to check the calibration of the spectrometer or to record spectra from pure copper and iron samples and compare them with your sample. Recording spectra from pure samples and comparing them with your data is the best option.
  • asked a question related to Chemometrics
Question
5 answers
* I collected soil samples during a Hyperion pass over the study area, and chemical analysis was done.
* Lab spectral signatures were not taken.
* How can I correlate the chemical analysis results with the Hyperion data after preprocessing?
Relevant answer
Answer
You need to build a multivariate regression model (PLSR/PCR) between the laboratory-obtained soil nutrient data at the sample points (the dependent variable, Y) and the spectral reflectance of the sampling points obtained from the HYPERION data (the independent variables, X). One usually predicts the nutrient value from the relationship between X and Y and cross-validates the prediction by examining the goodness of fit of a linear regression between the actual and predicted nutrient values.
You can learn about the MATLAB implementation of PLSR/PCR from the following link:
  • asked a question related to Chemometrics
Question
5 answers
Please
I want to know which program to use, and a step-by-step summary of how to get a percentage difference, or a correlation, between two spectra.
PS: I know that is a lot to explain, but I have already searched many sites and found nothing.
Relevant answer
Answer
The first thing you have to do is assign the observed bands to normal modes of the material you measured. To do that, you need the structure of the measured material/compound.
The assignment of normal modes can be done either on the basis of the literature or with the help of quantum-chemical calculations. For calculations you can use the Gaussian software package; other packages are also possible. How to use quantum calculations for assignment is described on the page of the Spectroscopy and Molecular Modeling Group: https://smmg.pl/home-en-gb.
When your bands are assigned to normal modes, you can compare the two spectra: look for shared and differing modes (frequencies are important) and the contributions of the modes (absorbance).
Software for spectroscopy: GRAMS or OPUS (both are expensive). I do not know whether you can find anything free of charge; try the Spectroscopy Ninja web page and negotiate a free version of the software you find there.
That is all I can suggest in general. If I knew what your spectra look like, the measurement conditions, and the structure of your materials (the compounds whose spectra you measured), I could try to make more precise suggestions.
  • asked a question related to Chemometrics
Question
1 answer
The External Parameter Orthogonalisation (EPO) algorithm is used to remove the effect of soil moisture from NIR spectra for the calibration of SOC (soil organic carbon) content. The algorithm is used for pre-processing of soil spectra taken with a spectroradiometer, after which PLS (partial least squares) regression is applied.
Relevant answer
Answer
I have never used EPO, but there is a thesis dissertation by Nanning Cao that may help.
  • asked a question related to Chemometrics
Question
3 answers
When imaging protein solutions (1 µg/ml, 10 µg/ml, 100 µg/ml, 1000 µg/ml; diluted with PBS buffer, pH 7.4) on a gold surface, what are the optimal pretreatment(s) to separate out the effect of the buffer (in my case PBS), which interferes with the protein bands of interest?
10 µl of each concentration was dropped on a clean gold slide and allowed to dry for 24 h under nitrogen purge, followed by FTIR reflectance imaging.
Since the 1000 µg/ml samples are highly concentrated, their signal appears very clear (amide I, amide II, amide III, amide A, amide B); however, at concentrations at or below 100 µg/ml, the protein signature is dominated by the PBS buffer bands.
What kind of univariate or multivariate methods would you apply to
(a) identify protein pixels (remove interference of buffer, slide background if any)
(b) quantify protein pixels (eg. make PLSR model on 0ug,1ug,10ug,100ug,1000ug) and predict concentration level of an unknown dried protein sample?
Relevant answer
Answer
After drying you most probably get a coffee-ring effect, meaning that the thickness of your layers changes strongly from point to point. Therefore, imaging and measuring individual points may in any case not be the correct method, for reasons explained in
The electric field standing wave effect can be corrected as explained in
My feeling is, however, that you should use a different technique, perhaps coating an ATR crystal and seeing whether you can obtain a calibration line, in particular because for near-normal incidence and thin layers you might see nothing at all on your metallic substrate: a metallic surface suppresses electric fields parallel to its surface. Certainly, this still does not solve the problem of overlapping bands.
  • asked a question related to Chemometrics
Question
4 answers
Does anyone know of research or industrial applications of CNNs to HSI data in the food science field? Unlike SVM, KNN, and other "shallow" machine learning algorithms, a CNN can take advantage of the spatial information in HSI data.
Most published papers deal with the extraction of deep spatial features for architecture classification and other domains, but in the literature I do not find applications of this deep learning technique to food/agricultural products!
Thank you !
Relevant answer
Answer
Hi,
I found this paper that maybe related with your research
”"Variety Identification of Single Rice Seed Using Hyperspectral Imaging Combined with Convolutional Neural Network"
There are also several papers that discuss about plant disease recognition using HSI and CNN.
"Explaining hyperspectral imaging based plant disease identification: 3D CNN and saliency maps "
"Classifying Wheat Hyperspectral Pixels of Healthy Heads and Fusarium Head Blight Disease Using a Deep Neural Network in the Wild Field "
Hope it helps your research.
  • asked a question related to Chemometrics
Question
8 answers
Is principal component analysis alone sufficient for chemometric analysis of large data sets?
Relevant answer
Answer
Depending on your samples, you can use supervised or unsupervised learning methods. PCA and clustering techniques can be used for unsupervised learning, whereas discriminant analysis and neural network models can be used for supervised learning.
  • asked a question related to Chemometrics
Question
2 answers
I would be very grateful to anyone who can send me propolis samples of about 2 g along with any available information about the origin and collection date of the material.
I will analyze the sample using GC/MS to form a database for chemometric investigations connecting propolis content with its place of origin.
I am interested in raw propolis and not its ethanol (nor any other) extract.
For those interested, I would be happy to send the results of my analysis.
Relevant answer
Answer
No, thank you.
  • asked a question related to Chemometrics
Question
2 answers
Are you aware of industries that already make use of FTIR and chemometrics as their SOP for microbial testing? If not microbial testing, do you know of other industries that use FTIR and chemometrics as their SOP?
Relevant answer
Dear Sir, concerning your issue about the use of FTIR and chemometrics as an SOP for microbial testing:
IR spectroscopy is an excellent method for biological analyses. It enables the nonperturbative, label-free extraction of biochemical information and images toward diagnosis and the assessment of cell functionality. Although not strictly microscopy in the conventional sense, it allows the construction of images of tissue or cell architecture by passing spectral data through a variety of computational algorithms. Because such images are constructed from fingerprint spectra, the notion is that they can be an objective reflection of the underlying health status of the analyzed sample.
One of the major difficulties in the field has been determining a consensus on spectral pre-processing and data analysis. The manuscript below brings together as coauthors some of the leaders in this field to allow the standardization of methods and procedures, adapting a multistage approach into a methodology that can be applied to a variety of cell biological questions or used within a clinical setting for disease screening or diagnosis. It describes a protocol for collecting IR spectra and images from biological samples (e.g., fixed cytology and tissue sections, live cells or biofluids) that assesses the instrumental options available, appropriate sample preparation, and different sampling modes, as well as important advances in spectral data acquisition.
After acquisition, data processing consists of a sequence of steps including quality control, spectral pre-processing, feature extraction and classification of the supervised or unsupervised type. A typical experiment can be completed and analyzed within hours. Example results are presented on the use of IR spectra combined with multivariate data processing. I think the links below may help you in your analysis:
Thanks
  • asked a question related to Chemometrics
Question
13 answers
I need free software (open source, or available as a cracked version) that is relatively simple (does not require coding) for doing PCA on a medium-sized data set (n = 19).
Sample size: 19
Variables: 5, co-related variables.
I was using The Unscrambler software, but it is not free! Now I am trying SPSS. Is there any better software than this? What about Origin?
Relevant answer
Answer
Check out BioVinci. You just need to drag and drop to run PCA. Quite simple to use and easy to understand. You can go here to see the PCA plot example: https://vinci.bioturing.com/panel/workset/build/principal-component-analysis
  • asked a question related to Chemometrics
Question
9 answers
especially for spectrometric analysis
Relevant answer
Answer
Statistics and Chemometrics for Analytical Chemistry, a textbook by J. C. Miller, James N. Miller, and Jane Miller
  • asked a question related to Chemometrics
Question
5 answers
Dear,
I've transformed my data to avoid problems of non-normality and heteroscedasticity. Then I ran the statistical analysis and the post-hoc test. Now, when reporting my data, which values should I report: the original ones or the transformed ones?
Thanks.
Relevant answer
Answer
Either can be reported, or both. If the transformation is simple and common, it sometimes makes sense to report the transformed values: "mean log10(concentration)". In other cases, you can report the "back-transformed" data, or data "on the original scale." In a plot, sometimes using the original data makes the plot unreadable.
  • asked a question related to Chemometrics
Question
3 answers
Hello everyone,
Recently I have read some articles using the framework of model population analysis (MPA). I understand that MPA works by generating a number of sub-models and then using statistical methods to analyse the information of interest in those sub-models. I am concerned about the following issues:
1. In terms of parameter optimization, what is the difference between MPA and traditional intelligent optimization methods such as genetic algorithms and simulated annealing?
2. In terms of Bayesian statistics, is MPA a method yielding the likelihood or the posterior of the model population?
Relevant answer
Answer
Following
  • asked a question related to Chemometrics
Question
2 answers
I have been using FSCV for assessing fast dopamine fluctuations in the rat brain in vivo. It would be good to analyse the same recordings, or new ones, for dopamine changes over minutes and hours, but changes in baseline currents and other factors confound the picture. I know the generally accepted approach is principal component regression, but it is too complex for me to implement in practice, and I do not need absolute dopamine concentrations, just changes. Are there any other, perhaps simpler, techniques? Or is there free software for principal component regression?
Relevant answer
Answer
Hello,
Please look at the approach developed by Heien (e.g. Chem Commun (Camb). 2015 Feb 11;51(12):2235-2238. doi: 10.1039/c4cc06165a, or google "fast-scan controlled-adsorption voltammetry") and Blaha (Anal. Chem. 2016, 88, 10962-10970. doi: 10.1021/acs.analchem.6b02605). Both methods require modification of Vappl. I have tested FSCAV and obtained positive results even without PCR. Note that, in my opinion, solid pharmacology with compounds selective for DA over its metabolites should be carried out by several labs before these methods can be considered "established".
Regards,
Leonid
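On the free-software point: a minimal principal component regression is only a few lines in Python with scikit-learn. The data below are synthetic placeholders for background-subtracted voltammograms; this is a sketch of the general PCR idea, not of any lab's published calibration routine.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
# Placeholder training set: 40 voltammograms (200 points each) with known
# relative dopamine levels, plus a slowly drifting baseline as interference.
level = rng.uniform(0, 1, size=40)
peak = np.exp(-(np.linspace(-1, 1, 200) / 0.1) ** 2)   # dopamine-like peak shape
drift = np.linspace(0, 1, 200)                          # baseline drift shape
X = (np.outer(level, peak)
     + np.outer(rng.uniform(0, 2, size=40), drift)
     + 0.02 * rng.normal(size=(40, 200)))

# PCR = PCA scores fed into ordinary least squares.
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, level)
pred = pcr.predict(X)
print(np.corrcoef(level, pred)[0, 1])
```

Because PCA captures the drift as a separate component, the regression can track relative changes even under a moving baseline.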
  • asked a question related to Chemometrics
Question
3 answers
I am looking to do a Common Components and Specific Weights Analysis, which can be done in MATLAB with the SAISIR toolbox for example, but I wonder if there is any R function coded for this type of analysis.
Thanks in advance!
Relevant answer
Answer
I finally redid the analysis with the script you gave me and obtained something similar to the previous results from MATLAB! So thank you for encouraging me to try again ;-)
best regards
  • asked a question related to Chemometrics
Question
3 answers
In a chemometric discrimination model using the Mahalanobis distance, the method does not work if the number of samples is less than the number of variables. In a book by R. G. Brereton, it is mentioned that this is because the variance-covariance matrix C would not have an inverse.
Could anyone tell me why the variance-covariance matrix has no inverse when the number of samples is less than the number of variables?
Relevant answer
Answer
Dear Giorgio Luciano, can you suggest a real case study, or some sample data, showing that the variance-covariance matrix becomes singular (and therefore non-invertible) when the number of variables exceeds the number of samples in a Mahalanobis distance calculation?
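A small numerical demonstration may help here: the variance-covariance matrix is always square (p x p), but with n samples its rank is at most n - 1 (one degree of freedom is lost to the mean), so whenever n <= p it is singular and has no ordinary inverse.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_vars = 5, 10                 # fewer samples than variables
X = rng.normal(size=(n_samples, n_vars))

C = np.cov(X, rowvar=False)               # 10 x 10 variance-covariance matrix
rank = np.linalg.matrix_rank(C)
print(C.shape, rank)                      # square, but rank <= n_samples - 1

# Because rank < 10, C has no ordinary inverse; the Moore-Penrose
# pseudo-inverse is the usual workaround for Mahalanobis-type distances.
C_pinv = np.linalg.pinv(C)
```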
  • asked a question related to Chemometrics
Question
4 answers
I'm approaching the development of hyperspectral data analysis, but instead of using multivariate and chemometric methods I would like to use more direct methods for distinguishing NIR spectra.
Starting from a reference spectrum, I would like to implement some analytical method to distinguish this reference spectrum from other acquired spectra. Currently I am using the Pearson correlation coefficient, the standard deviation of the difference between the reference and the acquired spectrum, or the concept of distance. Anyway, I'm looking for other methods and want to verify the speed of the calculation. Thanks in advance.
Relevant answer
Answer
Hi Nicola, I know that multivariate analysis is the best approach for spectral identification, but for my purpose I have to compare two single spectra and quantify how close or how different they are in order to distinguish them.
My main idea is to use the Pearson correlation coefficient or to evaluate the standard deviation of the difference between the spectra. Still, it would be useful to know whether a more robust method can be applied (excluding derivatives): perhaps a spectral matching tool, like pattern matching for images?
Regards
  • asked a question related to Chemometrics
Question
4 answers
I am interested in analyzing the chemistry of oil production waters, with the purpose of refining calculations of mineral scales and their possible correlations with radioelements.
Relevant answer
Answer
Good day
You can download the reference below.
Sincerely
  • asked a question related to Chemometrics
Question
3 answers
I am looking for a Raman spectrum database from which carotenoids such as lycopene, β-carotene, and lutein can be downloaded as .spc files, so that they can be used in other software such as MATLAB.
Relevant answer
Dear Sir, concerning your issue about Raman spectrum databases of carotenoids such as lycopene, β-carotene, and lutein: I think the links below may help you in your analysis.
Thanks
  • asked a question related to Chemometrics
Question
7 answers
Hello,
Recently I have gained expertise in chemometric analysis. Chemometrics is a set of techniques in which statistical tools, especially multivariate analysis, are applied to chemical or biochemical data in order to interpret large volumes of data in a reduced dimension; during this dimension reduction there is no loss of significant information. Two popular chemometric methods are PCA and AHC (agglomerative hierarchical clustering). These two tools help you classify genotype, treatment effect, geographical origin, or degree of adulteration in a sample based on biochemical data.
Dear authors, if you are interested in working on chemometrics, I will help you.
For further query please contact 09369641602
Tanmay Kr Koley
Relevant answer
Answer
Dear Dr. Tanmay,
I would like to cooperate with you on a chemometric analysis of three essential oils (already analysed by GC-MS and shown to have antioxidant activity), as I told you before.
Please tell me what information you need about the three essential oils.
Awaiting your response,
Prof. Dr. El-ghorab
  • asked a question related to Chemometrics
Question
10 answers
Hi Scientists!!
I am trying to choose the best model for an experimental arrangement of four factors at three levels.
I don't know which design is more appropriate: I chose Box-Behnken, but I am also considering a central composite design.
I want to be sure of my choice!
Can you help me?
Relevant answer
Answer
Hi
First of all, why do you want to do an RSM (response surface methodology) design at all, be it a central composite design (or a composite design in general) or a Box-Behnken design?
- with this question I mean what do you want to achieve as a result from your experimental work:
A process/synthesis/unit operation optimization should be conducted in a few steps:
(1) Identify a procedure/method that works.
(2) Parameterize the procedure, i.e. identify the experimental variables that can be varied and the responses that can be measured (yield, selectivity, ...).
(3) Conduct an introductory screening design. At this stage you commonly decide on the solvent, catalyst, etc. that you should use. In your case you should consult the paper by Rolf Carlson and collaborators: Strategies for solvent selection, Acta Chem. Scand. 1985, B39, 79-91 (and the book by Carlson, see below).
In the screening design you should include as many variables as possible. For this purpose I suggest a fractional factorial design; with such a design you can identify which variables are the most important and locate a preliminary optimum.
(4) The preliminary optimum may subsequently serve as the centre of a new design that allows you to build a predictive model estimating (a) coefficients for the single variables, (b) coefficients for the square of each single variable and (c) coefficients for the two-factor interactions of the single variables. I strongly recommend a response surface design of the central composite type. This allows you to estimate a model of the form:
y(x1, x2, ..., xk) = b0 + b1x1 + b2x2 + ... + bkxk + b12x1x2 + b13x1x3 + ... + b11x1^2 + b22x2^2 + ... + bkkxk^2
Be aware that sometimes the star points of the central composite design cannot be realised; e.g. with reaction time coded -1 (10 min), 0 (25 min), +1 (40 min), the star point at -2 would correspond to a negative time of -5 min. In such cases you can either raise the value of the (-1) level (increase the shortest reaction time), shorten the step so that (0) falls at a smaller value, or change the value of the star point itself.
Categorical variables such as "type of solvent", "type of catalyst", "type of reactor", "type of base", etc. should not, in my view, be included in a response surface design. Such variables should be investigated and compared by means of a screening design, and decided on, before you run the response surface design (a composite design).
Some “bibles” for statistical experimental design, screening and response surface modelling and more:
Design:
Box, G. E. P.; Hunter, J. S.; Hunter, W. G. Statistics for Experimenters: Design, Innovation, and Discovery, 2nd ed.; Wiley: New York, 2005.
Box, G. E. P.; Draper, N. R. Empirical Model-Building and Response Surfaces; Wiley: New York, 1987.
Model building – regression:
Montgomery, D. C.; Peck, E. A. Introduction to Linear Regression Analysis; Wiley: New York, 1982.
Draper, N. R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley: New York, 1998.
Optimization in synthesis:
Design and Optimization in Organic Synthesis, Second Revised and Enlarged Edition
Rolf Carlson Johan Carlson
ISBN: 9780444515278, Elsevier Science, 8th April 2005
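For reference, the coded design matrix of a central composite design is easy to generate directly. This is a minimal sketch, not a replacement for dedicated DoE software; the rotatable alpha = (2^k)^(1/4) used as default is one common choice among several.

```python
import numpy as np
from itertools import product

def central_composite(k, alpha=None):
    """Coded design matrix for a central composite design in k factors:
    2**k factorial corners, 2*k axial (star) points and one centre point."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25   # rotatable alpha; alpha = 1 gives face-centred
    corners = np.array(list(product((-1.0, 1.0), repeat=k)))
    stars = np.vstack([alpha * np.eye(k), -alpha * np.eye(k)])
    centre = np.zeros((1, k))
    return np.vstack([corners, stars, centre])

D = central_composite(4)
print(D.shape)                     # (2**4 + 2*4 + 1, 4) = (25, 4)
```

If a star point at ±alpha falls outside the feasible region (e.g. a negative reaction time), shrinking alpha toward a face-centred design (alpha = 1) or re-centring the design are the usual remedies.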
  • asked a question related to Chemometrics
Question
3 answers
Dear RG Members,
I am trying to analyse multivariate infrared spectroscopy data using a PERMANOVA approach, and I'd like to know which distance metric (e.g. Mahalanobis, Euclidean, or another) is better for this kind of data set. Papers that support this are welcome.
Thanks,
ASANTOS
Relevant answer
Answer
Hi
There is a good example of a study using the Mahalanobis distance:
Rodriguez-Saona LE, Khambaty FM, Fry FS, Dubois J, Calvey EM. Detection and identification of bacteria in a juice matrix with Fourier transform-near infrared spectroscopy and multivariate analysis. J Food Prot. 2004 Nov;67(11):2555-9. PMID: 15553641.
but most groups use the Euclidean distance:
Cebi N, Yilmaz MT, Sagdic O. A rapid ATR-FTIR spectroscopic method for detection of sibutramine adulteration in tea and coffee based on hierarchical cluster and principal component analyses. Food Chem. 2017 Aug 15;229:517-526. doi: 10.1016/j.foodchem.2017.02.072. PMID: 28372210.
Zhang J, Huang P, Wang Z, Dong H. Application of FTIR spectroscopy for traumatic axonal injury: a possible tool for estimating injury interval. Biosci Rep. 2017 Jul 21;37(4):BSR20170720. doi: 10.1042/BSR20170720. PMID: 28659494; PMCID: PMC5567294.
or the squared Euclidean distance:
Fuenffinger N, Arzhantsev S, Gryniewicz-Ruzicka C. Classification of Ciprofloxacin Tablets Using Near-Infrared Spectroscopy and Chemometric Modeling. Appl Spectrosc. 2017 Aug;71(8):1927-1937. doi: 10.1177/0003702817699624. PMID: 28393531.
For general multivariate analysis of NIR data the following is quite helpful but doesn't address the distance matrix directly
Rosas JG, Blanco M, González JM, Alcalá M. Quality by design approach of a
pharmaceutical gel manufacturing process, part 2: near infrared monitoring of
composition and physical parameters. J Pharm Sci. 2011 Oct;100(10):4442-51. doi:
10.1002/jps.22607. Epub 2011 May 5. PubMed PMID: 21557224.
 best wishes,
Elaine
  • asked a question related to Chemometrics
Question
17 answers
What is the most suitable software tool for data processing and chemometrics applied to NIR/IR, Raman, X-Ray hyperspectral imaging? Please could you share your experience.
Relevant answer
Answer
As Edgar says, MATLAB, Python and R are all software environments with programs available for spectral analysis. The research group who are world leaders in spectral processing, including NIR and Raman, are based at Umeå University and have pretty much developed the field of chemometrics. They use a commercial software package (SIMCA) as well as R/Python, but their software algorithms are often published. If you take a look at the webpage of Johan Trygg, who is the head of the department (http://www.chemistry.umu.se/english/research/group-leaders/johan-trygg), you will find lots of articles on how to process and model these types of spectra. They are an extremely friendly group and welcome interactions with scientists.
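Whichever environment you choose, the core operation is usually the same. Below is a minimal, hedged Python/NumPy sketch of PCA on a synthetic spectral matrix (the data, seed and array shapes are invented for illustration; a real workflow would add preprocessing such as SNV or derivatives first):

```python
import numpy as np

# Synthetic "spectra": 6 samples x 50 wavelength channels (invented data).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(6, 50)) + np.linspace(0, 1, 50)

# Mean-center, then PCA via SVD -- the step behind most chemometric tools.
X = spectra - spectra.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * s                      # sample coordinates on the components
loadings = Vt                       # spectral loadings, one row per component
explained = s**2 / np.sum(s**2)     # fraction of variance per component
```

Plotting the first two columns of `scores` is the usual first look at sample patterns and outliers.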
  • asked a question related to Chemometrics
Question
12 answers
Hi 
The reviewer ask this question 
What's the proportion for the peak area of identifiable compounds based on NIST database to the total detected products used GC-MS?
What does he mean by this, and how should I answer him?
''''" The fraction peaks from GC-MS spectrum were identified via National Institute of Standards and Testing (NIST)  library. The identification of the major products was based on a probability match equal or higher than 95%". What's the proportion for the peak area of identifiable compounds based on NIST database to the total detected products used GC-MS? This also should be pointed out in the manuscript.''
Relevant answer
Answer
Hello,
I would also translate "proportion for the peak area of identifiable compounds based on NIST database to the total detected products used GC-MS" into the ratio of the sum of area counts for the identified peaks divided by the area counts of the TIC. This gives you some idea of how much (as a percentage of the summed amounts) of the compounds present was identified. In my opinion this is not "meaningless", as Charles has stated, but I agree it is a limited tool: the area responses of the various compounds vary considerably. However, EI ionization in GC-MS is a more universal ionization type compared to other instrumentation, so the comparison has its worth. You can support this additionally with GC-FID data, where the area responses of hydrocarbons are very similar. It is easy to criticize this approach, but to my knowledge there is no other practicable chromatographic approach available to estimate the extent of the identification work in a sample. If this is not the case, Wayne and Charles should provide us with better alternatives.
By the way, I support Asit that apart from NIST matching and authentic standards, it is recommendable to calculate retention indices. NIST provides the respective data.
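For what it's worth, the reviewer's ratio is straightforward to compute once you have a peak table. A small Python sketch with an invented peak list (compound names and area counts are hypothetical):

```python
# Hypothetical peak table: (identified name or None, area counts).
peaks = [("limonene", 5.2e6), (None, 8.0e5),
         ("linalool", 2.1e6), (None, 4.0e5)]

total_area = sum(area for _, area in peaks)   # approximates the TIC area
identified_area = sum(area for name, area in peaks if name is not None)
percent_identified = 100.0 * identified_area / total_area
```

The resulting percentage is the "proportion of identifiable compounds" the reviewer is asking for, with all the caveats about differing response factors noted above.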
  • asked a question related to Chemometrics
Question
1 answer
I possess a GPC 220 PL with a triple detection : RI  - Visco - LS( 15° and 90°)
thanks for your help.
Relevant answer
Answer
 Hi Sandrine,
Have you finally found the information somewhere since? I found out that PET/chlorophenol alone should be too viscous for GPC and that a blend with chloroform (25% chlorophenol/75% chloroform) should be better [1]. I looked into the literature to find the dn/dc of PET/chlorophenol but only have found the PET/chloroform couple (=0.143 @ 25°C if you are interested in) in this book [2] (which can be useful to have for GPC in general):
[1] S.A. Jabarin, D.C. Balduff, Gel permeation chromatography of polyethylene terephthalate, journ. of liq. chrom., 5 (10), 1825-1845 (1982).
[2] A. Theisen, C. Johann, M.P. Deacon, and S.E. Harding, Refractive Increment Data-Book for Polymer and Biomolecular Scientists (Nottingham University Press, Nottingham, UK, 2000),
Hope it will help.
If you have found another solvent for your GPC measurements, I would be interested to know how (different from HFIP that I cannot use).
Thanks,
Violette
  • asked a question related to Chemometrics
Question
2 answers
What type of chemometric method (or methods) do the researchers involved in the project intend to use to model the spectral data obtained? The project is very interesting, and from my point of view this research will be a major goal within a few years.
Regards
Relevant answer
Answer
Thank you for this nice remark. We have investigated spectral differences directly in order to identify discriminating spectral features and their relation to biochemical differences. For chemometric modeling we are obtaining good results with Principal Component Analysis in combination with Linear Discriminant Analysis. Details have been described in Santos et al, Anal Chem. 2016, 88(15):7683-8.
  • asked a question related to Chemometrics
Question
13 answers
I know these diagrams are very common in water research - a tool is required to plot fluoride, chloride and sulfate contents in the same graph.
Relevant answer
Answer
Try AquaChem.
  • asked a question related to Chemometrics
Question
4 answers
I have one question about using the ChemoSpec package in R. When doing infrared data analysis, which normalization method should be used when I want to do peak normalization (normalizing the spectra to a peak that is not the most intense)?
See the attached file, page 8, for the normalization explanation; I am still a little confused.
Relevant answer
Answer
Thanks Thomas, for the addition of the other normalization techniques.
To Liz: PQN is probabilistic quotient normalization; TotInt is area normalization; Range is similar to TotInt but uses only a defined range instead of the entire spectrum; zero2one is also called unit vector normalization.
As described by Thomas, you can calculate peak normalization in Excel or in R. For peak normalization, each intensity value of a single spectrum (i) is divided by the intensity of that spectrum at the chosen wavenumber (k):
x'_i = x_i / x_(i,k)
After this transformation you import the data into ChemoSpec.
However, it would be more comfortable to import the spectral data into ChemoSpec first, do the baseline correction or binning (see the workflow in figure 1, https://cran.r-project.org/web/packages/ChemoSpec/vignettes/ChemoSpec.pdf), and calculate the peak normalization afterwards. But for that you need to know how to get the single intensity values of each spectrum. Perhaps you can contact the package author directly and ask him.
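As a language-neutral illustration of that division, here is a hedged NumPy sketch of peak normalization on an invented spectral matrix (in ChemoSpec itself you would work with the Spectra object; the data and peak index here are made up):

```python
import numpy as np

# Invented example: 3 spectra x 5 wavenumber channels.
spectra = np.array([[1.0, 2.0, 4.0, 2.0, 1.0],
                    [2.0, 4.0, 8.0, 4.0, 2.0],
                    [0.5, 1.0, 2.0, 1.0, 0.5]])
k = 2  # column index of the chosen normalization peak

# Divide every intensity of spectrum i by that spectrum's intensity at k.
normalized = spectra / spectra[:, [k]]
```

After this step the chosen peak has intensity 1.0 in every spectrum, so the three scaled copies above become identical.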
  • asked a question related to Chemometrics
Question
7 answers
I am looking for a large dataset of NIR spectra (more than 500-1000 samples) including sample parameters and the results of the calibrations (errors in determining these parameters). Maybe some data have been published and the dataset is in supplementary materials?
Relevant answer
  • asked a question related to Chemometrics
Question
5 answers
In the paper "A statistical approach to determine fluxapyroxad and its three metabolites in soils, sediment and sludge based on a combination of chemometric tools and a modified quick, easy, cheap, effective, rugged and safe method", the authors found that the intensities of the compounds were affected by the solution in which they were prepared. What is the mechanism? In addition, the mobile phase composition (excluding the ratio of mobile phases A, B, ...) can affect the retention behavior and intensity. How?
Relevant answer
Answer
This is based on the adsorption coefficients of the compounds to be identified. This is to answer your question in brief.
  • asked a question related to Chemometrics
Question
5 answers
For instance, I prepared 20 mixtures of two different dyes and recorded their absorbance at 10 selected wavelengths to use for an ANN, but I don't have enough experience using MATLAB for this task. I have problems with: 1. preparing the data matrix (10 selected wavelengths) for the input and the concentrations for the output; 2. extracting the model algorithm so I can use it as a calibration curve between absorbance and concentration, and calculate the remaining dye concentration after treatment of water with different coagulants.
Relevant answer
Answer
 Hi A. Abdo,
You typically need an ANN dedicated to multivariate function approximation. Radial Basis Function networks (RBFs) are a good candidate solution for your issue (follow https://www.hindawi.com/journals/isrn/2012/324194/ ). They may be efficiently implemented as Convolutional Kernel Networks (CNNs, follow http://ceur-ws.org/Vol-1649/118.pdf ). From a practical point of view, there exists a MATLAB library supporting CNNs with an ad-hoc tutorial and code samples: http://www.uow.edu.au/~phung/docs/cnn-matlab/cnn-matlab.pdf
As an alternative, MATLAB Central provides an implementation of RBFs whose initialization is based on k-means ( https://fr.mathworks.com/matlabcentral/fileexchange/52580-radial-basis-function-neural-networks--with-parameter-selection-using-k-means-? ).
Further reading that may be helpful to you about ANNs (see alternatively SVM networks):
Regards
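To make the RBF idea concrete, here is a hedged NumPy sketch of an RBF calibration on synthetic data standing in for the 20-mixtures-by-10-wavelengths problem (all data, the center selection and the kernel width are invented for illustration; this is a sketch, not a tuned model):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in: 20 mixtures x 10 wavelengths, 2 dye concentrations.
C = rng.uniform(0, 1, size=(20, 2))                 # "true" concentrations
K = rng.uniform(0.2, 1.0, size=(2, 10))             # absorptivity-like matrix
A = C @ K + rng.normal(scale=0.01, size=(20, 10))   # Beer-Lambert + noise

# RBF network: Gaussian hidden layer with fixed centers, linear output layer.
centers = A[::4]     # crude center selection (every 4th sample); k-means is better
width = 1.0          # arbitrary kernel width for the demo

def rbf_features(X):
    # Squared distances of each sample to each center, then Gaussian kernel.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width**2))

H = rbf_features(A)
# Output weights by linear least squares (with a bias column).
W, *_ = np.linalg.lstsq(np.c_[H, np.ones(len(H))], C, rcond=None)

def predict(X):
    Hx = rbf_features(X)
    return np.c_[Hx, np.ones(len(Hx))] @ W

resid = predict(A) - C
```

In practice you would choose the centers and width by cross-validation and evaluate on mixtures not used for training.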
  • asked a question related to Chemometrics
Question
1 answer
Greetings,
Can anyone please suggest a source for finding the IR spectra of solvated sulfur mustard, specifically in water. I need the spectra for comparison and identification purposes.
Thank you in advance,
Ema Sh
Relevant answer
Answer
  • asked a question related to Chemometrics
Question
32 answers
Does anyone know how to open/run DPT files (NIR spectroscopic data) in The Unscrambler software?
Relevant answer
Answer
Step 1: Rename .dpt to .txt. Open the .txt file, replace every comma (,) with five spaces, and save it.
Step 2: Now replace every period (.) with a comma (,) and save it.
Step 3: Open it in Origin.
Good luck!
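Those manual find-and-replace steps can also be scripted. A hedged Python sketch (file names are placeholders; the five-space separator and decimal commas mirror the steps above, which assume a comma-separated .dpt export with decimal points):

```python
# Convert a .dpt file (comma-separated, decimal points) into a text file
# with space-separated columns and decimal commas.
def convert_dpt(src="spectrum.dpt", dst="spectrum.txt"):
    with open(src) as f:
        text = f.read()
    text = text.replace(",", "     ")   # step 1: commas -> five spaces
    text = text.replace(".", ",")       # step 2: decimal points -> commas
    with open(dst, "w") as f:
        f.write(text)
```

Run it once per file and then open the resulting .txt in Origin or The Unscrambler.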
  • asked a question related to Chemometrics
Question
3 answers
Actually, I am submitting jobs to the Polyrate 8.0 software for my reactants and products to calculate the rate coefficients for a particular reaction.
But in the output I am getting some non-zero imaginary frequencies. In principle, for reactants and products there should not be any imaginary (negative) frequency (NImag=0). I have made sure that my inputs are correct, including the Z-matrices for the reactants and products.
I am unable to understand why this is happening. Can somebody please help me with this?
I am also attaching the i/p files for your reference.
The files "r1.dat" and "r1.71" are the i/p files for Polyrate. And, the file "esp.fu82" is the Gaussian o/p file from the Polyrate software.
Relevant answer
Answer
I have done this also.
But still I am stuck with those negative frequencies. Can you please help me out with this problem?
  • asked a question related to Chemometrics
Question
2 answers
The mass spectrum of 1-amino-2-naphthol shows three major peaks: at m/z 159 (the molecular ion, 100% intensity), m/z 130 (70% intensity), and m/z 103 (15% intensity). What is the fragmentation mechanism of the compound, and what species are formed at m/z 130 and 103?
Kindly elucidate the fragmentation mechanism so that I can correlate it with other naphthalene derivatives.
Thank You
Relevant answer
Answer
Thank you, Dr. Michael, for sharing the report. I believe, after the mass loss of HCN and CO, 1-amino-2-naphthol will be converted into Benzocyclobutadiene (which is a stable molecule). 
  • asked a question related to Chemometrics
Question
6 answers
I found that almost all applied cases of pharmaceutical analysis by UV spectroscopy obey the Beer-Lambert law, which is a linear relationship. Is there any application in the pharmaceutical analysis field where non-linear models should be used rather than linear models on UV spectral data? Thanks.
Relevant answer
Answer
  • asked a question related to Chemometrics
Question
5 answers
As there are many different techniques available - it is difficult to understand which technique is a good one given a certain kind of data.
Relevant answer
Answer
Chemometrics is an interdisciplinary field that involves multivariate statistics, mathematical modeling, computer science and analytical chemistry. Its major application areas are: 1) calibration, validation and significance testing; 2) optimization of chemical measurements and experimental procedures; 3) extraction of the maximum chemical information from analytical data.
Considering the distribution of multiple variables simultaneously yields more information than considering each variable individually; this is the realm of the multivariate advantage.
Finally, chemometric work has six major parts: 1) data of various orders, 2) data preprocessing, 3) modeling, 4) validation, 5) prediction, and 6) assessing the significance of the prediction.
  • asked a question related to Chemometrics
Question
2 answers
I tried writing the script as follows.
from rdkit import Chem
from rdkit.Chem.AtomPairs import Pairs
from rdkit.Chem import MACCSkeys
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

suppl = Chem.SDMolSupplier('cdk2.sdf')
ms = [x for x in Chem.SDMolSupplier('cdk2.sdf')]
for m in suppl:
    if m is None:
        continue
    fps = [FingerprintMols.FingerprintMol(x) for x in ms]
    DataStructs.FingerprintSimilarity(fps[i], fps[j])
But it is showing the following error.
Use integer in not tuples in place of i, j.
I was wondering if it is possible to put all the molecules in a loop so that I don't have to give separate entries for different molecules.
Also, I want to calculate all the fingerprints simultaneously if it is possible.
Kindly, help me with this.
Relevant answer
Answer
Maybe not the approach that you were looking for but you could always use Knime.  It has readers for SD files and has nodes for RDKit... It is then a fairly trivial matter to create the fingerprints....
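Regarding the original error: i and j are never defined, so they need to come from loop indices. Below is a hedged pure-Python sketch of the pairwise loop, using a stand-in Tanimoto on bit sets instead of the RDKit objects (the fps list is invented; with RDKit you would build it once from FingerprintMols.FingerprintMol on each molecule and reuse the same two loops):

```python
# Stand-in fingerprints: sets of "on" bits. In RDKit these would be the
# objects returned by FingerprintMols.FingerprintMol for each molecule.
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 5, 9}]

def tanimoto(a, b):
    # Same definition DataStructs.FingerprintSimilarity uses by default.
    return len(a & b) / len(a | b)

# All unique pairs in one pass -- i and j are now defined loop indices.
similarities = {}
for i in range(len(fps)):
    for j in range(i + 1, len(fps)):
        similarities[(i, j)] = tanimoto(fps[i], fps[j])
```

Building the fingerprint list once outside the loop (instead of inside it, as in the original script) also avoids recomputing every fingerprint for every molecule.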
  • asked a question related to Chemometrics
Question
14 answers
Would it be the number of components, how to handle scattering, fixing outliers or something completely different?
Relevant answer
Answer
Number of components gets my vote.
  • asked a question related to Chemometrics
Question
8 answers
We are working on a device to measure fat in milk through NIR Spectroscopy. But no light (12W tungsten lamp) could pass even through 0.5mm milk. Is there any way to reduce the turbidity of the milk by dissolving proteins?
I am aware of the trypsin method to hydrolyze proteins, but is there any other inorganic chemical (which is more robust, insensitive to low temperature and fast)? Is EDTA method superior to Trypsin in terms of dissolving proteins?
Relevant answer
Answer
I think you can also try to disperse the casein micelles by adding a urea solution (e.g. 3 M or 7 M), but this can't overcome the turbidity contribution of the fat globules.
  • asked a question related to Chemometrics
Question
4 answers
I am currently working on non-invasive blood glucose measurement using photoacoustic spectroscopy in near IR region(905nm). While using the laser source, what optical power is advisable? Is there a limit on usage of laser on skin?
Relevant answer
Answer
You should calculate the Maximum Permissible Exposure of your light source, which is a metric based on wavelength and exposure time. Intense radiation can still be permissible for very short pulses for certain wavelengths.
The tables in the following link might be helpful. 
  • asked a question related to Chemometrics
Question
7 answers
I'm Katrul. Currently I am doing SIMCA for classification analysis in chemometrics using MATLAB. There are two classes in my data: 15 samples in class 1 and 150 samples in class 2. When I ran SIMCA, I obtained Q and T2 for each class. To present the classification results, it is suggested to use a Q vs T2 plot or a Coomans plot. For the Q vs T2 plot I should have two plots, but for class 1 I only have 15 values of Q and T2, whereas in the journal I refer to, the whole dataset is included for each class. What is your opinion on this matter?
To make the Coomans plot, I have to calculate each sample's distance to the model. Does anyone have an alternative way to produce a Coomans plot for SIMCA, such as a MATLAB script?
Relevant answer
Answer
'Q' is the equivalent of 'DmodX' in SIMCA I believe. It is the residual model distance. Not to be confused with 'Q2' of course. The Cooman's plot is simply the plot of the residual model distance (Q or DmodX) of one class versus the other. The critical distance may be calculated in many different ways, there are some methods in the literature others are proprietary.
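For reference, Q and Hotelling's T2 can be computed directly from a per-class PCA model. A hedged NumPy sketch on invented data (15 samples, like your class 1; the number of components and the score-variance scaling follow one common convention, not the only one):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 8))          # one class: 15 samples x 8 variables

# Class PCA model with a few components.
mu = X.mean(axis=0)
Xc = X - mu
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
ncomp = 3
P = Vt[:ncomp].T                      # loadings
T = Xc @ P                            # scores

# Hotelling's T2: score distance inside the model plane.
lam = (s[:ncomp] ** 2) / (len(X) - 1) # score variances per component
T2 = np.sum(T**2 / lam, axis=1)

# Q (squared prediction error): residual distance to the model plane.
E = Xc - T @ P.T
Q = np.sum(E**2, axis=1)
```

Plotting Q against T2 per class, or the Q of class 1's model against the Q of class 2's model for every sample, gives the Q-vs-T2 and Coomans plots respectively; samples from the other class are simply projected onto the model before computing their distances.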
  • asked a question related to Chemometrics
Question
9 answers
Is it better to have one compared to the other?
Relevant answer
Answer
Selectivity is the quality of a response that can be achieved without interference for any other substance. Sensitivity is how low can you detect the substance of interest. The best analytical method offers the highest sensitivity and highest selectivity. It is desirable to have both properties as high as possible; it depends on the circumstances of the analysis if one of the properties can be sacrificed for the other.
  • asked a question related to Chemometrics
Question
4 answers
In the multivariate significance test there are the statistics of Wilks, Pillai, Hotelling and Roy. What is the statistical meaning of each one? Which should I use for my data?
Relevant answer
Answer
In a multivariate context, Hotelling's T2 statistic may help you; it is based on Mahalanobis distances and related probability distances. Afterwards you can use MANOVA.
I work with multivariate data myself, mainly with PCA and PLS techniques.
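For completeness: all four MANOVA statistics are simple functions of the eigenvalues of E^-1 H, where H is the hypothesis SSCP matrix and E the error SSCP matrix. A hedged numeric sketch with invented eigenvalues:

```python
import numpy as np

# Invented eigenvalues of E^-1 H, for illustration only.
lam = np.array([0.9, 0.3, 0.05])

wilks = np.prod(1.0 / (1.0 + lam))      # Wilks' lambda (smaller = stronger effect)
pillai = np.sum(lam / (1.0 + lam))      # Pillai's trace (robust, conservative)
hotelling = np.sum(lam)                 # Hotelling-Lawley trace
roy = lam.max()                         # Roy's largest root (most powerful if the
                                        # effect lies on one dimension)
```

Pillai's trace is generally the most robust choice when MANOVA assumptions (multivariate normality, equal covariance matrices) are in doubt.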
  • asked a question related to Chemometrics
Question
3 answers
I wish I could get references for all the calculations pertaining to PLS, PCR, CLS and other methods.
Relevant answer
Answer
Alejandro Olivieri has published some papers dealing with figures of merit in multivariate calibration (including LOD). I've linked to two examples below. I would also suggest some of his other work, which can be found on his ResearchGate site (link below). 
  • asked a question related to Chemometrics
Question
11 answers
Dear All,
I'm looking for suitable and simple chemometrics software, as an effective tool for exploring chemical data in environmental analytical chemistry, for a study investigating soil pollution (identification of pollutant sources, ...).
Relevant answer
You can try using the Minitab software; it is easy to handle and comprises cluster analysis, principal component analysis, factor analysis and discriminant analysis.
  • asked a question related to Chemometrics
Question
4 answers
I would like to be able to overlay a variety of theoretical IR spectra with a single experimental IR spectrum and determine the best-matching theoretical spectrum. Is there a way to do this quantitatively in Excel, SigmaPlot, or any other plotting program?
Attached is an example spectrum where the black line represents the experimental spectrum. The red and blue areas/lines represent two different theoretical spectra. Both provide reasonably good matching, so I would like to quantitatively determine which matches the best.
Relevant answer
Answer
Mr. Batoon,
The applicability of any theoretical approach, including chemometric data processing of spectroscopic patterns, is evaluated with respect to: (a) the accuracy of the band/frequency positions, i.e., whether they are shifted significantly or not; and (b) the accuracy of the integral intensities, i.e., whether they are influenced significantly. This is because the intensity ratio of a pair of bands is constant under given experimental conditions, including the phases of the substances.
Please find an example dealing with the prediction of the theoretical IR spectrum of MgSO4 in the solid state, obtained by a quantum-chemical approach using crystallographic coordinates.
The intensity ratio of the experimental pair of bands is 0.3029, while the theoretical one is 0.2489. It is important to highlight that the experimental IR spectrum shows a broad band, which perturbs the pure integral intensity of the modes used, and that an artifact appears at 1159 cm-1. In this example the chemometric (mathematical) approach requires including this artifact in order to reach a higher r2 value (0.99967), i.e., a higher level of confidence. So one should take into consideration not only the complexity of the experimental data, but also the chemometric processing itself as a source of error in determining the integral intensity.
Given that, in your case it seems, qualitatively at least, that the first method can be regarded as more accurate than the second.
Please also pay attention to the following paper, where we validated a quantitative IR-spectroscopy approach accounting exactly for the contribution of various mathematical processing methods to points (a) and (b) above: B. Ivanova, D. Tsalev, M. Arnaudov, Validation of reducing-difference procedure for the interpretation of non-polarized infrared spectra of n-component solid mixtures, Talanta 69 (2006) 822-828.
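One simple quantitative option, complementary to the band-ratio analysis discussed in this thread, is to score each theoretical spectrum against the experimental one with a correlation coefficient on a common wavenumber grid. A hedged NumPy sketch with invented Gaussian-band spectra (peak positions and widths are made up):

```python
import numpy as np

# Invented spectra on a shared wavenumber grid.
wn = np.linspace(400, 4000, 500)
experimental = np.exp(-((wn - 1600) / 40) ** 2) + np.exp(-((wn - 2900) / 60) ** 2)
theory_a = np.exp(-((wn - 1610) / 45) ** 2) + np.exp(-((wn - 2890) / 55) ** 2)
theory_b = np.exp(-((wn - 1400) / 45) ** 2) + np.exp(-((wn - 3200) / 55) ** 2)

def match_score(a, b):
    # Pearson correlation between the two intensity vectors.
    return np.corrcoef(a, b)[0, 1]

r_a = match_score(experimental, theory_a)
r_b = match_score(experimental, theory_b)
# The theoretical spectrum with the larger r matches better.
```

If the two spectra come on different grids, interpolate one onto the other (e.g. np.interp) before computing the score; this is easy to reproduce in Excel as CORREL on the two intensity columns.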
  • asked a question related to Chemometrics
Question
10 answers
Turn-on fluorescence sensors, as you may know, are sensors that do not show any fluorescence emission peak at a specific wavelength until they bind the target analyte, which causes a peak to appear at that wavelength.
All of us are familiar with the 3S/m formula, but the question is: when you do not have any emission peak for the blank, what formula can be used to determine the LOD?
(I do not want to use a graph to find the LOQ and then convert it into the LOD.)
I would appreciate it if you could share any papers or references on this topic.
Relevant answer
Answer
Because LoD is an ambiguous term there is no one correct way to measure it. The question you first have to answer is - what will you use this LoD for?
If you just want to find the LoD to characterize your analysis method, and it won't play an important role later in the use of the method, I suggest the ICH-proposed method where LoD = 3.3*S/slope, with S the standard deviation of the residuals. This will give you a good conservative LoD (based on my experience with LC-MS). The LoD can vary between days/experiments, so measuring it on more than one day is a good idea.
If you need the LoD to really show you if some sample contains the analyte or not then determine the average value and standard deviation of blank samples. Then calculate decision limit (Lc) from this data: Lc=average(blank)+1.65*standard devation(blank). (I suggest using at least 10 separately measured blank sample results.) Any signal above Lc is at least 95% times the analyte and not noise. (NB! I suggest repeating sample measurements - if the concentration of the sample is at or close to Lc it is possible (even likely) to get a false negative results.)
If LoD is an important parameter for data interpretation and for characterizing your analysis method then I suggest you look at ISO 11843-2:2000 or IUPAC. They provide you with approaches to estimate decision limit and detection capability. Although estimating these will be more work, the result will be statistically correct and will give more confidence to your interpretations.
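The ICH-style estimate mentioned above is easy to compute from a calibration line. A hedged NumPy sketch with invented calibration data (concentrations and signals are made up):

```python
import numpy as np

# Invented calibration data: concentration vs. instrument signal.
conc = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])
signal = np.array([0.02, 1.05, 1.98, 4.10, 7.95, 16.05])

# Ordinary least-squares calibration line.
slope, intercept = np.polyfit(conc, signal, 1)
residuals = signal - (slope * conc + intercept)
s_res = residuals.std(ddof=2)        # residual std. dev. (n - 2 dof)

lod = 3.3 * s_res / slope            # ICH-style detection limit
loq = 10.0 * s_res / slope           # corresponding quantitation limit
```

The same arithmetic applies when you replace the residual standard deviation with the standard deviation of blank measurements, as in the decision-limit approach described above.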
  • asked a question related to Chemometrics
Question
13 answers
Hello,
We want to use upconversion nanoparticles that convert from the visible or NIR range into the UV range for a series of experiments.  If possible, we would like to avoid synthesizing and characterizing the particles as our lab is ill-equipped to do so.  I've looked around for a commercial source for purchasing UV upconversion nanoparticles.  I found three companies - Mesolight, American Elements, and Nanograde - that seemed to have what we need, but all either can't provide the particles or can only offer very small quantities.
I was hoping someone here has a source for these particles that they can point me to. 
Thanks!
Relevant answer
Answer
Hi Sam
What sizes of particles do you look for ?
What will be your excitation wavelength ? And what kind of emission do you expect ?
Merci
Olivier
  • asked a question related to Chemometrics
Question
11 answers
In validation studies we are obliged to work with a specific number of calibration levels.
Relevant answer
Sir Olivier Roussel,
If possible, could you please send me those documents (poster and presentation), even if they are written in French?
Best regards, and thank you for all the replies.
  • asked a question related to Chemometrics
Question
5 answers
Unscrambler or Solo? R or MATLAB?
Relevant answer
Answer
Dear Mugdha,
MathWorks has recently included a set of functions for SVM into the Statistics Toolbox of Matlab, which is rather commonly available with the academic versions of the software. It permits to:
- Train an SVM classifier;
- Classify new data with an SVM classifier;
- Tune an SVM classifier;
- Train an SVM classifier using non-linear or custom kernel functions;
- Cross validate SVM classifiers;
I personally used this set of functions several times and I have to admit it is really easy and intuitive. I would heartily recommend it to you.
For further information, please have a look at the MathWorks documentation (see the attached URL).
I hope this will help you.
Best regards,
Raffaele
  • asked a question related to Chemometrics
Question
2 answers
I need to know the binding energy, vertical ionization potential, vertical electron affinity and HOMO-LUMO gap of seven-atom lithium clusters of decahedral shape, both theoretical and experimental data.
Can anyone please help?
Relevant answer
Answer
Professor Dr. Béatrice Marianne Ewalds-Kvist
Thank you very much for your kind help
Best Regards
  • asked a question related to Chemometrics
Question
3 answers
I am working on a dimer system and use the molecular dynamics package AMBER to perform REMD simulations during my work. Can anyone help me with how to confine the dimer within an imaginary sphere to keep the molecules from flying apart from each other?
Relevant answer
Answer
There must exist an option in AMBER to perform simulations with spherical boundary conditions. This is possible with the NAMD molecular dynamics program.
  • asked a question related to Chemometrics
Question
6 answers
Preprocessing methods in NIR spectroscopy.
Relevant answer
Answer
Dear Rassol,
This book can be good for you :
Near-Infrared Technology: In the Agricultural and Food Industries (from Phil Williams and Karl Norris). 
Publisher: American Association of Cereal Chemists; 2nd edition (November 2001)
ISBN-10: 1891127241
ISBN-13: 978-1891127243
Regards.
Ludovic.
  • asked a question related to Chemometrics
Question
5 answers
I have been analysing Raman Spectroscopy data as a predictor of meat quality and in my latest data I have been getting R^2cv values which are significantly and consistently higher than the R^2cal values. For example I've gotten an R^2 Cal: 0.00102829 and
R^2 CV: 0.288515. I have been using MatLab Software with the PLS toolbox, leave one out cross validation and 20 maximum latent variables.
Any ideas as to why it's happening would be appreciated.
Relevant answer
Answer
I totally agree with Raffaele. It seems that you are talking about RMSEC and RMSECV. If you are talking about RMSE instead of R2, several comments:
- Not only in Raman but in general, the RMSECV will, by definition, be higher than the RMSEC, BUT they should be comparable in magnitude. In your case they are VERY different! :-)
- Having a very low RMSEC and a very high RMSECV indicates that your model may be overfitted. The main reason could be that you are choosing too many latent variables in calibration. Other reasons could be a wrong CV method or the pre-processing of the Raman spectra (maybe you have a strong influence of a fluorescence signal that varies arbitrarily from sample to sample and affects your model)...
And, as someone said: the best model is not the one that calibrates best, but the one that predicts best!
Cheers
Jose
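The relationship between RMSEC and RMSECV is easy to demonstrate: for leave-one-out cross-validation of a least-squares model, each held-out residual is at least as large in magnitude as the corresponding calibration residual, so RMSECV >= RMSEC. A hedged NumPy sketch with invented data (a simple univariate regression standing in for PLS):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 25)
y = 2.0 * x + rng.normal(scale=0.5, size=25)   # invented reference values
X = np.c_[x, np.ones_like(x)]                  # design matrix with intercept

# RMSEC: fit on all samples, error evaluated on the same samples.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rmsec = np.sqrt(np.mean((X @ beta - y) ** 2))

# RMSECV: leave-one-out -- each sample predicted by a model it never saw.
errs = []
for i in range(len(y)):
    mask = np.arange(len(y)) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    errs.append(X[i] @ b - y[i])
rmsecv = np.sqrt(np.mean(np.square(errs)))
```

When RMSECV is dramatically larger than RMSEC (rather than slightly larger, as here), that gap is the overfitting symptom described above.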
  • asked a question related to Chemometrics
Question
1 answer
A membrane consists of excipients and solvents blended together.
Relevant answer
Answer
I think it is possible, in principle. The devil is in details.
  • asked a question related to Chemometrics
Question
4 answers
I would like to get more literature on chemometrics.
Relevant answer
Answer
Chemometrics is the application of statistical and mathematical methods to chemistry and chemical analysis. With that in mind, I would look at real stats books written by real statisticians. I have copies of Brereton's books on chemometrics. They cover methods of analysis for poorly gathered data. You can do a lot better by creating a proper and good designed experiment. I like Optimal Design of Experiments: A Case Study Approach by Brad Jones and Peter Goos. Brad has actually won some prizes for his use of statistical methods in chemistry. 
What you will find when you read through a lot of these chemometrics books, and compare that to what is taught and known in statistics and mathematics, is that there are often big gaps. For example, Brereton's books cover all the Design of Experiments (DOE) topics in 3-4 pages. If I were given his data sets in one of my stats classes as an exam problem, and I analyzed the data the way he does, I would get those problems wrong.
As someone with a background in chemistry and stats, I asked Brereton why he spent so little time on DOE methods. He said, "Because they don't work." I have talked to a lot of people in industry that use these DOE methods. Their general take on DOE is a bad design from an inexperienced researcher  followed by a bad analysis ruins the experimental results. A reanalysis of a bad design by an experienced statistician can yield significant improvements in the model. A good design and a good analysis is obviously the best. 
  • asked a question related to Chemometrics
Question
4 answers
I want to extract the signal of a given protein from FTIR spectral data of a mixture of proteins. As the IR signatures of different chemical molecules are different, the signature of each protein will also be distinct.
Relevant answer
Answer
We have done such an experiment for the very ideal case of a mixture of lysozyme (LYZ, a-helix is the dominant secondary structure) and concanavalin A (CONA, b-sheet is the dominant structure). We adsorbed both proteins from their binary mixture onto positively or negatively charged surfaces. Even qualitatively one could see which of the proteins was dominantly adsorbed (CONA: 1630 maximum; LYZ: 1650 maximum).
The quantitative protein compositions were determined by factor analysis on the spectra of the binary protein layers and we used the spectra of the pure proteins as factor spectra. You can find this study under this reference:
M. Müller, B. Keßler, N. Houbenov, K. Bohata, Z. Pientka, E. Brynda,
pH dependence and protein selectivity of poly(ethyleneimine)/poly(acrylic acid) multilayers studied by in-situ ATR-FTIR spectroscopy,
Biomacromolecules, 7(4), 1285-1294 (2006)
Best regards,
Martin
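The factor-analysis step described above can be approximated with classical least squares: regress the mixture spectrum onto the pure-component (factor) spectra. A hedged NumPy sketch with invented Gaussian bands standing in for the LYZ and CONA amide I profiles (positions, widths and the 70/30 composition are made up for the demo):

```python
import numpy as np

# Invented pure-component "factor spectra" (rows) over the amide I region.
wn = np.linspace(1600, 1700, 200)
lyz = np.exp(-((wn - 1650) / 8) ** 2)     # alpha-helix-like band at 1650
cona = np.exp(-((wn - 1630) / 8) ** 2)    # beta-sheet-like band at 1630
S = np.vstack([lyz, cona])

mixture = 0.7 * lyz + 0.3 * cona          # mixture with known composition

# Classical least squares: solve mixture ~ c . S for the coefficients c.
c, *_ = np.linalg.lstsq(S.T, mixture, rcond=None)
fractions = c / c.sum()                   # normalized composition estimate
```

With real spectra the coefficients would additionally need a non-negativity constraint and careful baseline handling, but the idea of using the pure-protein spectra as factor spectra is the same.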