
Normalization - Science topic

Explore the latest questions and answers in Normalization, and find Normalization experts.
Questions related to Normalization
  • asked a question related to Normalization
Question
5 answers
Hi all,
my lab has been using the SequalPrep Normalisation Plates from Invitrogen for a couple of years now, however we have never been able to get the expected concentration of 1-2ng/ul when using 20ul elution volume. We usually get a concentration in the range of 0.2-0.8ng/ul. The starting material we used is >250 ng amplicon per well. We have contacted the customer support before but it could not be explained why our concentration was so low.
I would just like to hear from others who have used these plates about their experiences, and whether there are any tips or tricks for performing the normalisation?
Thanks.
Relevant answer
Answer
More notes from my experience...
The SequalPrep plate wells have a binding coating up to a volume of ~50 uL. To get a more concentrated sample:
1. Avoid touching the plate walls while pipetting, as this may scrape off the coating where the DNA binds.
2. Make sure the PCR product contains as few primer dimers as possible, since this is essentially not a size-selection process (generally, bead cleanup is used for size selection). Although the starting material is more than 250 ng, this includes the primers and free nucleotides that add to the concentration. More amplicons in = more amplicons out in this step.
3. If the total volume of PCR product + binding buffer is 50 uL, DNA is bound to the wells up to the 50 uL level. When we add only 20 uL of elution buffer, we probably do not recover the DNA bound to the region above the 20 uL level while eluting. To improve the concentration, we could gently vortex the plates.
  • asked a question related to Normalization
Question
1 answer
Greetings. Probably the question is not complex at all, but can't find an answer.
If I have RT-qPCR data of gene expression in a sample with multiple analytical replicates, then to compare it to data obtained in other experiments I need to normalize the expression of the genes of interest to the expression of a reference gene (which is constitutively expressed).
How do I perform this if there are replicates, and the expression of both the genes of interest and the reference gene is given as an expression value plus a standard error of the mean?
Is there a formula to adjust GOI SEM using RG SEM?
Relevant answer
Answer
Just think how you'd do it for, say...a western blot. You'd have all your densitometry for your protein of interest, all your densitometry for your loading control, and you'd...divide the former by the latter to get your per-sample protein expression. Then you'd look at your (now normalized) protein levels and see if anything interesting is happening.
You would never work out the SEM of your loading control and the SEM of your GOI and then...do stuff with those SEMs: that's madness. You normalize each sample first, THEN do your comparisons.
qPCR is no different: you take exactly the same approach and do exactly the same thing, with the exception that qPCR data is in log-space, so divisions become subtractions.
So: let's say you have five samples of treated cells, and five control samples.
Isolate RNA, spec it, QC it, make cDNA.
Now, run qPCR, using these ten samples, in triplicate (so three wells per gene, per sample) for two (or three) validated reference genes and your GOI.
This will give you replicate Cq values for everything.
First thing: manual QC.
Look at the replicate Cq values for each sample, and see if they agree. This is basically "did you pipette the same way three times, and also did anything weird happen": you might see something like 22.2, 21.9, 22.1 as replicate Cq values for GAPDH, and this is great. The final GAPDH Cq for that sample is the arithmetic average of those three Cqs.
If instead you have 22.2, 21.9, 26.8, then you can pretty safely assume well #3 just went weird, and discard that data. Use the mean of the two remaining (non-weird) Cqs.
So eventually you should have, for each sample, mean Cq values for your GOI, and mean Cq values for each reference.
Make your two reference gene Cqs into a "normalisation factor": the average of the two.
Then subtract this from your GOI Cq for that sample. This is the dCt value for that sample, for that GOI: that sample is now normalized.
Do this for all your samples. Now all your data is normalized, and is in the form of dCt values.
Note that dCt is somewhat counter-intuitive, as low numbers are high expression, and vice versa. You can just multiply everything by -1, because the statistical comparisons are exactly the same. "-dCt" is a perfectly valid metric.
dCt values are approximately normally distributed, and entirely statistically-tractable. Use these for your stats. From the example above, you will have, for your GOI, five treated dCt values, and five control dCt values: this can be a straight T-test. If paired data, paired T-test.
There are other ways to do this that are mathematically identical but less "OMG LOG SPACE SCARY", which I typically use because log space scares me, but this basic method works and is easy to work through.
So, TL:DR, ignore SEMs, they're not useful here. Go back to the raw data.
If you have data you're willing to share, I could workshop up a basic spreadsheet for you.
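For anyone wanting to see the arithmetic laid out, here is a minimal R sketch of the workflow described above, using hypothetical Cq values (replicate QC reduced to simple averaging for brevity):
# Hypothetical mean Cq values per sample (after replicate QC and averaging)
goi  <- c(24.1, 23.8, 24.5, 23.9, 24.2)   # gene of interest, five treated samples
ref1 <- c(18.0, 17.9, 18.2, 18.1, 18.0)   # reference gene 1
ref2 <- c(20.1, 20.0, 20.3, 20.2, 20.1)   # reference gene 2
norm_factor <- (ref1 + ref2) / 2   # normalisation factor: average of the two reference Cqs
dCt     <- goi - norm_factor       # normalized values (low dCt = high expression)
neg_dCt <- -dCt                    # optional sign flip so that high = high expression
# With a matching vector of control-sample values, a plain t-test applies:
# t.test(neg_dCt, neg_dCt_control)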
  • asked a question related to Normalization
Question
1 answer
1. For gene expression data (a microarray dataset extracted from the Gene Expression Omnibus (GEO) platform), which of the following normalisation techniques is suggested as the best for handling outliers: quantile, log, z-score,…? I was following articles in which quantile and log normalisation were combined, but when I check the dataset I am working on, there are outliers which are negatively skewed after normalising. Is it normal to have skewness even after normalisation? If not, are there other ways to normalise without any skewness?
2. I was using the Student t-test and fold change values to identify the DEGs for two different cores, and I ended up getting 202 genes in total, of which 44 are common between the two cores. Is it normal to get some common differentially expressed genes for two different conditions? If not, what mistake could have occurred?
3. Is there a precise formula to calculate fold change values from gene expression values? All over the internet there are plenty of formulas, so I am confused about which one to use.
Relevant answer
Answer
1. It is not uncommon to get skewness even after normalization of the data. Some degree of skewness is inherent to RNA-seq data due to the natural variation in gene expression.
2. It is normal to get some common differentially expressed genes between two different conditions. This might occur due to perturbation in common biological processes or pathways.
3. You do not require any sophisticated formula to calculate fold change. The simplest way is:
a. For each gene, you calculate the average expression value across all replicates in each condition (conditions 1 and 2).
b. Divide the mean expression in condition 2 by the mean expression in condition 1.
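To make that concrete, a minimal R sketch with hypothetical expression values; the log2 of the ratio is often reported alongside the raw fold change:
# Hypothetical expression values for one gene across replicates
cond1 <- c(10.2, 11.5, 9.8)    # condition 1 (e.g., control core)
cond2 <- c(20.4, 22.1, 19.7)   # condition 2 (e.g., disease core)
fold_change <- mean(cond2) / mean(cond1)   # ratio of condition means
log2_fc     <- log2(fold_change)           # symmetric around 0
fold_change
log2_fc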
  • asked a question related to Normalization
Question
1 answer
This might be basic, but I have a question on how to normalise a western blot when you have multiple controls.
I have some western blots using human samples and they go roughly like this:
Control, Control, disease 1, disease 1, disease 1, disease 2, disease 2, disease 2.
I have 2 controls on there, and I want to normalise to HSP90 as a loading control. As I have 2 control samples, do I take the average of these? And when I calculate fold change, is it again from the average of the 2 normalised controls?
Thanks
Relevant answer
Answer
Hi El Ro ,
So if I'm understanding your question correctly, you want to normalise every blot to HSP90 as a loading control, then use those normalised values across blots for comparison?
You would need to normalise all bands to loading control, then average normalised values across your control and experimental conditions
So average 2 controls, average 3 disease 1, average 3 disease 2
Then you can compare fold change from the average of your controls vs disease 1 vs disease 2
Does that make sense?
Best wishes,
Steven
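As an illustration of the calculation Steven describes, a minimal R sketch with hypothetical densitometry values (lane order as in the question):
# Hypothetical band intensities: Control, Control, disease1 x3, disease2 x3
target <- c(1200, 1350, 2100, 1980, 2250,  900,  850,  950)
hsp90  <- c(1000, 1100, 1050,  990, 1120, 1010,  980, 1060)
norm <- target / hsp90                       # normalise every lane to its HSP90 signal
ctrl_mean   <- mean(norm[1:2])               # average of the two normalised controls
fc_disease1 <- mean(norm[3:5]) / ctrl_mean   # fold change vs. the control average
fc_disease2 <- mean(norm[6:8]) / ctrl_mean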
  • asked a question related to Normalization
Question
1 answer
Dear ResearchGate community,
I am fairly new to RNASeq analysis & wanted to ask for your input regarding accounting for different sequencing depth across my samples. I am aware that there are several normalization techniques (e.g. TMM) for this case, however, some of my samples have considerably higher sequencing depths than others. Specifically, my samples (30) range from 20M to 46M reads/sample in sequencing depth (single-end). Can I still normalize this using the tools provided in the various packages (DESeq2, limma etc) or is it preferable to apply random subsampling of the fastq files prior to alignment (I am using kallisto)?
Many thanks in advance!
Best,
Luise
Relevant answer
Answer
It is preferable to avoid downsampling (randomly reducing the number of reads in some samples).
Normalization techniques like DESeq2, edgeR, and limma are preferred for handling differences in sequencing depth across samples, preserving the maximum amount of information. Downsampling, on the other hand, can be used to equalize sequencing depth across all samples but may reduce statistical power, especially in samples with lower read depths. Downsampling may be considered if computational capacity is limited, data size is significant, or extreme cases have vast differences in read depths. In most cases, normalization methods are preferable.
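For orientation, a minimal sketch of depth normalization in DESeq2, assuming a gene-level count matrix `counts` and a sample table `coldata` with a `condition` column are already in hand (hypothetical object names; kallisto output would typically be imported with tximport first):
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)      # estimates per-sample size factors, dispersions, and fits the model
sizeFactors(dds)       # normalization factors absorbing the 20M-46M depth differences
res <- results(dds)    # differential expression on depth-normalized data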
  • asked a question related to Normalization
Question
6 answers
In data clustering, what is the best normalization method? And, what is the influence of each method on the results?
Relevant answer
Answer
In my opinion, the best normalization technique varies depending on the applied data and the equation. Sometimes it can be Min-Max, sometimes vector, sometimes total normalization, etc. Because the power of a technique's ability to transform data fairly may decrease or increase depending on the circumstances.
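To see how the choice changes the input to a clustering algorithm, here is a small R sketch contrasting min-max and z-score scaling before k-means (simulated data, purely illustrative):
set.seed(1)
x <- data.frame(income = rnorm(100, 50000, 10000),   # large-scale feature
                age    = rnorm(100, 40, 10))         # small-scale feature
minmax <- function(v) (v - min(v)) / (max(v) - min(v))
x_minmax <- as.data.frame(lapply(x, minmax))   # every feature squeezed into [0, 1]
x_zscore <- as.data.frame(scale(x))            # every feature with mean 0, sd 1
km_raw    <- kmeans(x,        centers = 3)     # distances dominated by 'income'
km_minmax <- kmeans(x_minmax, centers = 3)     # features contribute more evenly
km_zscore <- kmeans(x_zscore, centers = 3)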
  • asked a question related to Normalization
Question
4 answers
As is known, for AI/ML applications, operations such as data transformations/normalization are performed in the data preprocessing stage. In fact, normalization aims to transform data so that features/values of different scales/sizes are on the same scale. At first, this is an innocent start. Generally, in AI and ML applications, Min-Max or Z-score data transformation techniques are frequently preferred for data normalization. Although the traditional trend is "Which normalization method to choose may vary depending on the characteristics of the data set and the context of the application", one of these two methods is selected automatically/without questioning. However, when there are alternative solutions such as Vector, Max, Logarithmic, Sum, why are these two enough?
Relevant answer
Answer
Dear Suraj, First of all, thanks for your reply.
In your opinion, "Min-Max scaling and Z-Score normalization are often used in AI/ML applications because they are simple, easy to implement, and often effective in many scenarios". But Max, Vector, Sum, or Logarithmic normalization techniques are also simple and easy to implement. These methods can also be effective in different scenarios.
Secondly, again according to you "Also, many machine learning algorithms are not very sensitive to the particular normalization technique used and these methods usually work well as a starting point". Yes, several types of ML algorithms may not need normalization. But usually, normalization is used in many other methods and seen as a necessity. For example, for a house price prediction, there are many features (criteria) in different units and these are normalized for many ML algorithms.
Again in your view "There are other scaling techniques such as Min-Max scaling and Robust Scaling which are less sensitive to outliers than Z-Score normalization". In fact, apart from these two, there are at least 15-20 other normalization techniques.
I agree with what you said yes "different datasets and algorithms can benefit from different normalization strategies". But it is generally not used and Min-Max and Z-Score are generally used. This is a choice open to criticism. In my view, you have to take into account that some types of normalization may yield different results, especially in cases of decision-making among alternatives for ML.
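For reference, small R implementations of several of the alternatives discussed in this thread (one common formulation each; definitions vary slightly across the literature):
minmax_norm <- function(x) (x - min(x)) / (max(x) - min(x))   # Min-Max: rescale to [0, 1]
zscore_norm <- function(x) (x - mean(x)) / sd(x)              # Z-score: mean 0, sd 1
max_norm    <- function(x) x / max(x)                         # Max: divide by the maximum
vector_norm <- function(x) x / sqrt(sum(x^2))                 # Vector: divide by the Euclidean norm
sum_norm    <- function(x) x / sum(x)                         # Sum: proportions of the total
log_norm    <- function(x) log10(x)                           # Logarithmic (positive values only)
x <- c(120, 500, 960, 2500)   # hypothetical criterion values for four alternatives
sapply(list(minmax_norm, zscore_norm, max_norm, vector_norm, sum_norm, log_norm),
       function(f) round(f(x), 3))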
  • asked a question related to Normalization
Question
1 answer
Hi,
I have a question on qRT-PCR gene expression data analysis. I have a gene panel with 300 genes (292 test genes) + 8 housekeeping genes. The samples were run in batches over a period of time (first batch = 90 samples + 3 pooled controls, second batch = 50 + 3 pooled controls, third batch = 70 + 3 pooled controls). Please let me know how I should normalize this type of data and handle the batch effect.
Does the approach below make sense?
  • Combine the data (Ct values) from all the 3 batches (90 + 50 +70 samples) and save in *.csv file.
  • Calculate Delta Ct = difference between the gene of interest and the arithmetic mean of the 8 housekeeping genes, or Negative Delta Ct = Ct(reference genes) - Ct(gene of interest).
  • Plot PCA, heatmap etc.
Thank you,
Toufiq
Relevant answer
Answer
Hey. Did you get any answers?
  • asked a question related to Normalization
Question
9 answers
We are trying to nanocoat cells to protect them. Normally the cell surface carries a negative charge, so we treated cells with a positively charged material as the first layer, then with a negatively charged material as the second layer. How can I measure the charge of the cell surface to demonstrate that the nanocoating of the cells is successful? We have tried using zeta potential to measure the cell surface charge, but it didn't work.
Relevant answer
Answer
The surface charge of a cell can be measured using zeta potential measurements and electrophoresis. Zeta potential measurements are used to measure the cell membrane surface charge of fixed cells in solution. Surface charge density is used to describe the charge distribution on the surface
  • asked a question related to Normalization
Question
4 answers
Dear all,
I am currently trying to determine the number of archaea present in an ecosystem with the 16S rRNA gene.
I am normalizing the mcrA gene to the total archaeal 16S rRNA gene. The mcrA gene measures methanogenic activity.
The problem that I am currently facing is that some archaea have multiple 16S rRNA genes, which makes it difficult to normalize to the mcrA gene.
Does anyone know how to solve this problem?
Thanks in advance!
Relevant answer
Answer
If this is being done by qPCR, you can use an average 16S rRNA gene copy number for all known archaeal genomes and divide your total archaeal 16S qPCR count by this number. This average (16S operons per archaeal genome) is routinely updated as new genomes are sequenced and can be found in the following database: https://rrndb.umms.med.umich.edu/ Current average = 1.7 operons/genome.
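A quick numerical illustration of that correction (hypothetical qPCR numbers, using the 1.7 operons/genome average quoted above):
total_16S_copies <- 3.4e7   # archaeal 16S rRNA gene copies per g (hypothetical qPCR result)
avg_operons      <- 1.7     # average 16S operons per archaeal genome (rrnDB)
archaeal_genomes <- total_16S_copies / avg_operons   # estimated genome (cell) equivalents
mcrA_copies      <- 8.0e6                            # mcrA copies per g (hypothetical)
mcrA_per_genome  <- mcrA_copies / archaeal_genomes   # mcrA normalized to archaeal genomes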
  • asked a question related to Normalization
Question
2 answers
Set (batch) effects are usually quite prominent and mask sample characteristics when dealing with human samples and TMT-labelled non-targeted proteomics.
What is, in your view, the best approach to preserve the experimental differences while flattening out set effects (technical artifacts)?
Relevant answer
Answer
Roberto,
Here is the article partly covering this issue.
  • asked a question related to Normalization
Question
2 answers
The Background of the Question and a Suggested Approach
Consider that, e.g., a tensile strength test has been performed with, say, three replicate specimens per specimen type on an inhomogeneous or anisotropic material like wood. Why do the strength property determinations typically not consider the number of collected data points? As a simplification, imagine, e.g., that replicate specimen 1 fails at 1.0 % strain with 500 collected data points, replicate 2 at 1.5 % strain with 750 data points and replicate 3 at 2.0 % strain with 1 000 data points. For the sake of argument, let us assume that the replicates with a lower strain are not defective specimens, i.e., they are accounted for by natural variation(s). Would it not make sense to use the ratio of the collected data points per replicate specimen (i.e., the number of data points a given replicate specimen has divided by the total number of data points for all replicates of a given specimen type combined) as a weighting factor to potentially calculate more realistic results? Does this make sense if one were to, e.g., plot an averaged stress-strain curve that considers all replicates by combining them into one plot for a given specimen type?
Questioning of the Weighting
Does this weighting approach introduce bias and significant error(s) in the results by emphasising the measurements with a higher number of data points? For example, suppose the idea is to average all repeat specimens to describe the mechanical properties of a given specimen type. In that case, the issue is that the number of collected data points can vary significantly; therefore, the repeat specimen with a higher number of data points is emphasised in the weighted averaged results. Then again, if no weighting is performed, then, e.g., there are 500 more data points between replicates 1 and 3 in the above hypothetical situation, i.e., the averaging is still biased since there is a 500 data point difference in the strain and other load data and, e.g., replicate 3 has some data points that neither of the preceding replicates has. Is the “answer” such that we assume a similar type of behaviour even when the recorded data vary, i.e., the trends of the stress-strain curves should be the same even if the specimens fail at different loads, strains, and times?
Further Questions and Suggestions
If this data-point-based weighting of the average mechanical properties is by its very nature an incorrect approach, should at least the number of collected data points or the time taken in the test per replicate be reported to give a more realistic understanding of the research results? Furthermore, when averaging the results from repeat specimens, the assumption is that the elapsed times in the recorded data match the applied load(s). However, this is never the case with repeat specimens; matching the data meticulously as an exact function of time is tedious and time-consuming. So, instead of just weighting the data, should the data somehow be normalised with respect to the elapsed time of the test in question? Consider that the overall strength of a given material might, e.g., have contributions from only one repeat specimen that simply took much longer to fail, as is the case in the above hypothetical example.
Relevant answer
Answer
yes of course, because our target is to produce strong, rigid materials and avoid carbon as much as possible.
  • asked a question related to Normalization
Question
5 answers
I have quantified the intensity of bands obtained in gels in which I have loaded experimental samples with WT enzyme and several mutants. My goal is to compare whether the WT enzyme or the mutants are better based on the intensity obtained for each band (the higher intensity I measure, the more activity the enzyme has).
I have several replicates of the experiments, I have normalized the values with respect to WT in each case, and then I have calculated the mean and standard error values. So I get a value of 1 for the WT enzyme and values below or above 1 for the mutants. Would this be a correct way to express the results?
I wonder if it would be more appropriate to apply log2 to the values obtained, so that values close to 0 indicate that the mutants are similar to WT, positive values indicate that a mutant is better, and negative values that it is worse.
ADVANTAGE:
Applying log2 may be useful because the same absolute value is obtained when an enzyme is 10 times worse or 10 times better (compared to the case of not applying log2, where the enzyme 10 times worse will have a value of 0.1 and the one 10 times better will have a value of 10).
PROBLEM:
There are cases in which the mutants have a value of 0 because no band to quantify is obtained. So I can't apply log2 to 0 and the graph is incomplete for those mutants. And there are also times when the intensity is very low and when applying log2 a very high value comes out that distorts the graph.
ADDITIONAL PROBLEM: HOW TO NORMALIZE WHEN REFERENCE VALUE IS 0
Under certain conditions I obtain values of 0 or very close to 0 for the WT enzyme, which is the one I use as reference to calculate the ratios. What can I do in this case? Would it be correct to do this: I normalize the data with respect to a mutant that has high values, calculate the mean of the ratios, and normalize the means with respect to WT. In the Excel file “Problem” I have put an example of this under the heading “Situation 1”.
I'm not sure if it's correct to work with the data in this way. In the spreadsheet I have written an example called "Situation 2" in which I have normalized the data in both ways (with respect to WT and with respect to mutant 1) and I see that the final results are completely different.
What do you recommend to do to display the data correctly?
Relevant answer
Answer
I would calculate the ratio, as you did, of the mutant to the wild-type. Mutants with less of whatever you are measuring would have values less than 1 (possibly 0), and mutants with more would have values greater than 1.
If the numbers vary over a wide range, then you can plot them on a logarithmic scale to spread them out so all the data points can be seen. I would use a log10 scale, personally, but I don't see anything wrong with using a log2 scale. Since you can't show zero on a log scale, you could omit that data point and simply state in the figure legend or text that no point is shown for that mutant because its value is zero.
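A minimal R sketch of that presentation, with hypothetical mutant/WT ratios; the zero-valued mutant is omitted from the log-scale plot and noted separately:
ratios <- c(WT = 1, mut1 = 0.12, mut2 = 3.5, mut3 = 0, mut4 = 10.4)   # hypothetical means
plottable <- ratios[ratios > 0]   # zero cannot be shown on a log scale
barplot(plottable, log = "y", ylab = "Activity relative to WT (log10 scale)")
# State in the legend that mut3 is omitted because no band was detectable (ratio = 0).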
  • asked a question related to Normalization
Question
4 answers
I'm doing a study comparing data from two orbital sensors, and in the study I'm basing mine on there is this normalization formula for the rasters: ((Bi<= 0) * 0) + ((Bi >= 10000) *1) + ((Bi >= 0) & (Bi < 10000)) * Float((Bi)/10000), where "Bi" means "band". Is there someone who understands and could explain this formula? Thanks very much.
Relevant answer
Answer
It just maps values less than or equal to 0 to 0 and values greater than or equal to 10000 to 1, while anything in between falls between 0 and 1, as the value is divided by 10000.
This way of writing the formula is similar to a conditional, but without using if statements.
For example: (x>=0)*0+(x<0)*1 will evaluate to 0 for positive numbers and 1 for negatives. The expression (x>=0) will be either 1 or 0, depending on the value of x.
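A small R sketch showing how the logical terms act as the if-conditions (hypothetical band values):
Bi <- c(-250, 0, 4300, 10000, 12500)   # hypothetical band values
normalized <- (Bi <= 0) * 0 +
              (Bi >= 10000) * 1 +
              ((Bi >= 0) & (Bi < 10000)) * (Bi / 10000)
normalized
# 0.00 0.00 0.43 1.00 1.00  -> clamped to [0, 1], divided by 10000 in between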
  • asked a question related to Normalization
Question
3 answers
I'm planning to implement a controlled cooling method for forged parts instead of the normalizing process. I need to know how to design the metallic box and the implementation method.
Kindly share your valuable feedbacks.
Regards,
Vignesh
Relevant answer
Answer
The exposure time in vermiculite depends on many factors: the material, the dimensions of the parts, their temperature, the number of parts in one box, etc. Usually the exposure is several hours.
  • asked a question related to Normalization
Question
7 answers
My question concerns the problem of calculating odds ratios in logistic regression analysis when the input variables are on different scales (i.e.: 0.01-0.1, 0-1, 0-1000). Although the coefficients of the logistic regression look fine, the odds ratio values are, in some cases, enormous (see example below).
In the example, there were no outlier values in any of the input variables.
What is the general rule: should we normalize all input variables before the analysis to obtain reliable OR values?
Sincerely
Mateusz Soliński
Relevant answer
Answer
You need to interpret the OR using the exponential of the estimates.
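A short R sketch of both points, on simulated data: the odds ratio is exp(coefficient), and rescaling a predictor changes the per-unit OR without changing the model fit:
set.seed(1)
x <- runif(200, 0, 1000)                        # predictor on a 0-1000 scale
y <- rbinom(200, 1, plogis(-1 + 0.004 * x))     # simulated binary outcome
fit_raw    <- glm(y ~ x,        family = binomial)
fit_scaled <- glm(y ~ scale(x), family = binomial)   # standardized predictor
exp(coef(fit_raw))      # OR per 1-unit change: very close to 1, looks tiny
exp(coef(fit_scaled))   # OR per 1-SD change: easier to compare across predictors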
  • asked a question related to Normalization
Question
5 answers
I am working on a project where it is necessary to build an LED board to replace UVA-340/UVA-351 fluorescent tubes (represented in the graphic by the blue line).
I have to use LEDs with wavelengths from 280 to 420 nm and achieve a maximum point of 1 W/m² at 340 nm.
How can I normalize and distribute my LEDs in order to achieve the goal?
Relevant answer
Answer
Raquel Andrade, there are two possible solutions to your task. The simplest is to put your LEDs close to the focal point of a lens, which images them onto your plane of illumination. A more homogeneous distribution can be obtained with a two-lens system: the first lens images the LEDs onto the plane of the second lens, which then images its own surface onto the plane of illumination. Because of the UV light, your lenses have to be made of silica.
Good luck!
  • asked a question related to Normalization
Question
13 answers
EDIT: Please see below for the edited version of this question first (02.04.22)
Hi,
I am searching for a reliable normalization method. I have two ChIP-seq datasets to be compared with a t-test, but the RPKM values are biased, so I need to fix this before the t-test. For instance, when a value is high, it doesn't mean it is high in reality; there can be another factor making the value appear high, and in reality I should see a value closer to the mean. Likewise, if a value is low and the factor is strong, we can say that is the reason why we see the low value; we should have seen a value much closer to the mean. In brief, what I want is to eliminate the effect of this factor.
In line with this purpose, I have another dataset showing how strong this factor is for each value in the ChIP-seq datasets (again as RPKM values). Should I simply divide my RPKM values by the corresponding factor RPKM to get unbiased data? Or is it better to divide the RPKM values by the ratio RPKM/mean(RPKMs)?
Do you have any other suggestions? How should I eliminate the factor?
Relevant answer
Answer
Actually, the log transformation in the figure I attached was done according to the formula: log((#1+1)/(#2+1)). Just later, I thought that I added "1" to my values to be able to carry out log transformation (not to eliminate zero values). So I considered that maybe, it would be more correct to add "1" to adjusted values just before the transformation.
Thanks again :) Jochen Wilhelm
  • asked a question related to Normalization
Question
3 answers
Dear colleagues,
I am trying to build a neural network. I normalized the data with the minimum and maximum:
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
maxmindf <- as.data.frame(lapply(mydata, normalize))
and the results:
results <- data.frame(actual = testset$prov, prediction = nn.results$net.result).
So I can see the actual and predicted values only in normalized form.
Could you please tell me how to scale the real and predicted data back into the "unscaled" range?
P.S. This is what I tried:
minvec <- sapply(mydata, min)
maxvec <- sapply(mydata, max)
denormalize <- function(x, minval, maxval) {
x*(maxval-minval) + minval
}
but it doesn't work correctly in my case.
Thanks a lot for your answers
Relevant answer
Answer
It actually works (but you have to consider rounding errors):
normalize <- function(x, min, max) (x-min)/(max-min)
denormalize <- function(x, min, max) x*(max-min)+min
x <- rnorm(1000)
r <- range(x)
nx <- normalize(x, r[1], r[2])
dnx <- denormalize(nx, r[1], r[2])
all(x == dnx)
# FALSE ---> rounding errors
all(abs(x - dnx) < 1E-8)
# TRUE ---> identical up to tiny rounding errors
  • asked a question related to Normalization
Question
8 answers
I have raw data for global/untargeted mass spectrometry metabolomic data. I have processed that data and now have with me the peak intensities of all the m/z values. I had also spiked the samples with an internal standard. Can anyone tell me how can I normalize my data using the internal standard?
Relevant answer
Answer
Thank you very much.
  • asked a question related to Normalization
Question
2 answers
Hi, I have some DTI data and I want to use ANTs to build a b0 group-wise template to normalize my data to the IIT_mean_b0 image. I want to use buildtemplateparallel.sh and the SyN algorithm, but I'm an amateur at this. How am I supposed to run this, and what are the inputs?
Relevant answer
Answer
Joan Jiménez-Balado Thank you Joan
  • asked a question related to Normalization
Question
4 answers
Apologies if this has been answered, I was not able to find anything similar to my question.
On 3 separate days, I have analysed cells from 10 WT and 10 KO animals by flow cytometry. Below I show some made-up MFI values of a fluorophore to make my point. You can see that between experiments the readings are shifted, but in all 3 experiments there is an approx. 40% increase in the MFI values of the KO compared to the WTs. For this experiment, a 40% increase is biologically relevant, but without any normalization a t-test does not show statistical significance.
If I was comparing two different treatments made on the same cells, I would run a paired T-test in the unnormalized data, but since here my KO and WT cells are not coming from the same animal, can I also do that?
Since I cannot repeat the experiments, my way of getting around this is to divide all values from each experiment by the average of the controls of that experiment (last 3 columns). The average of the control group will always be one, but there is an SD as well. The t-test run with these normalized values now shows statistical significance, but is this correct?
Relevant answer
Answer
Andrey, thanks for your detailed reply.
As I tried to explain in my question, WT and KO are not paired groups as these samples are coming from different animals.
As a result, in the calculations to make the distribution of differences more symmetric, I don't know which log(MFIs) to subtract. Would it make sense to average the MFI WT and KO used in each separate day and run a paired T-test? If there is evidence the groups are different, can I then do the analysis I propose in the last paragraph of my question?
  • asked a question related to Normalization
Question
2 answers
I'm trying to follow through using the hyperbolic tangent for score normalization as here:
It states there that the final values should be between 0 and 1; however, my final output is in the range of 0.47 to 0.51 for a number of sets of scores.
Most of these sets are already within the range [0, 1], although some have quite a different range of separability between genuine and mismatch scores.
The process I am performing is to calculate the mean of all genuine match scores, and the standard deviation of the entire set (as described in the paper) - and then I parse it into a tanh normalization function. I notice some other papers use a different set of means/standard deviations, but all combinations I try end up with similar results.
Here is my normalization code (written in Rust). Constants is just a struct containing all stats for genuine/mismatch/all.
pub fn tanh_normalization(score: f32, constants: &ScoreConstants) -> f32 {
    // tanh estimator: 0.5 * (tanh(0.01 * (score - genuine mean) / overall sd) + 1)
    let internal = 0.01 * (score - constants.genuine.mean) / constants.all.standard_deviation;
    return 0.5 * (internal.tanh() + 1.);
}
Does anyone have any ideas that could help me? Or any other papers related to this that might help?
Thanks in advance.
Relevant answer
Answer
  • asked a question related to Normalization
Question
1 answer
Hey,
I have recently become confused about single-cell ATAC-seq integration analysis between samples. I have read many discussions about this issue, so I have summarized them into two solutions as follows:
SOLUTION 1. (data QC ignored here) find the union feature set from different samples -> generate count matrix for each sample -> merge them into one large count matrix -> normalization/Scaling/cell clustering/ cluster annotations……
SOLUTION 2. generate the count matrix for each sample -> normalization/Scaling/cell clustering/ cluster annotations for each sample -> find common features among all samples -> generate count matrix against the selected common features for each sample -> merging data using pipelines, e.g. Signac/Harmony, to perform cell clustering, cluster annotation and other following analysis (which usually with give a new assay for common features).
My questions:
Whichever one is selected, I will then have cell clusters. So my next step is retrieving differential features for each cell type/cluster, which will be the key to further investigation of biological functions.
Q1. I know that a batch effect does exist between samples, but for SOLUTION 1, will normalization and scaling of a single large count matrix work for differential enrichment analysis between samples?
Q2. If SOLUTION 1 is not reasonable, SOLUTION 2 will give rise to a new assay that only contains the selected common features, based on which the batch effect should be well corrected and the cells might be better clustered. However, how do I perform the differential analysis for non-common features in each cluster? (That is to say, will the batch-effect correction in the newly integrated assay from SOLUTION 2 work for total differential feature detection in the raw assays at the sample level?)
Thanks and best regards!
Relevant answer
Answer
  • asked a question related to Normalization
Question
3 answers
The interactive wavelet plot that was once available on the University of Colorado webpage (C. Torrence and G. P. Compo, 1998) does not exist anymore. Are there any other trusted sites to compare our plots against? And in what cases do we normalize our data by the standard deviation before performing a continuous wavelet transform (Morlet)? I have seen that it is not necessary all the time. A few researchers also transform the time series into a series of percentiles, believing that the transformed series reacts 'more linearly' to the original signal. So what should we actually do? I expect an explanation mainly focusing on data-processing techniques (standardization, normalization, or leaving the data as it is).
Relevant answer
Answer
Thank you Abbas Thajeel Rhaif Alsahlanee and Aparna Sathya Murthy for addressing the question. It was of great help to me. I figured it out through the documentation of statistical methods in python.
  • asked a question related to Normalization
Question
6 answers
If in a multivariate model we have several continuous variables and some categorical ones, we have to change the categoricals to dummy variables containing either 0 or 1.
Now to put all the variables together to calibrate a regression or classification model, we need to scale the variables.
Scaling a continuous variable is a meaningful process. But doing the same with columns containing 0 or 1 does not seem to be ideal. The dummies will not have their "fair share" of influencing the calibrated model.
Is there a solution to this?
Relevant answer
Answer
Monika Mrozek I think that, based on what Johannes Elfner shared, it makes sense NOT to scale the discrete variables.
  • asked a question related to Normalization
Question
2 answers
I am running an RNA-sequencing experiment where I am analyzing the differential gene expression of oysters collected from different locations. I plan on using the Cyverse DNA Subway Greenline platform which utilizes Kallisto and Sleuth. Since I will be conducting multiple comparisons (ie oysters from Site 1 vs oysters from Site 2 vs oysters Site 3 etc.), I understand that this could run into significant statistical issues involving inferential and individual variation of each sample. Will the Kallisto and Sleuth algorithms correct for this? I imagine I will need to run all of my samples simultaneously through Kallisto so that normalization is done across all samples. Will this be sufficient to mitigate the noise from individual sample variation and make biological variation more significant? Or would I need to employ normalization methods such as TMM via edgeR? I am pretty new to this and learning along the way so any feedback is much appreciated!
Thanks in advance.
Relevant answer
Answer
Hi there Mark, thanks for the reply! That's good to know, I'll go ahead and run the data through normalization.
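For anyone landing on this thread later: the TMM normalization mentioned in the question is only a few lines in edgeR, assuming a gene-level count matrix `counts` and a `group` factor for the sites (hypothetical object names):
library(edgeR)
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y, method = "TMM")   # TMM scaling factors correct for library composition
cpm_tmm <- cpm(y, log = TRUE)             # TMM-normalized log-CPM values for exploration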
  • asked a question related to Normalization
Question
5 answers
Hello every one
Based on my studies, I have found that it is not possible for all researchers to create the same conditions for producing samples. Therefore, each researcher, reporting their working conditions and effective parameters, tries to produce a sample and perform a series of experiments to extract response data based on their experimental design. The issue with such reports arises when one intends to study and then compare the results: due to differences in several parameters at once, comparisons between studies are not possible. My question is, is there a general way to normalize response data based on multiple independent parameters?
Relevant answer
Answer
Hello Milad,
If I understand your query, you're asking whether some adjustment(s) could be made to results such that comparisons of results of one study with those of another could be "apples to apples" (e.g., as if equivalent). The most optimistic answer is a qualified "maybe."
If a large corpus of such studies existed, and you could evaluate the singular and combinatorial effects of each and every relevant nuisance variable with respect to observed outcomes (as sometimes people attempt in a crude way when they regress meta-analysis results on study characteristics), then you might be able to impose statistical adjustments. Otherwise, you'd only be able to impose partial "control" for study differences, and possibly these adjustments would over- or under-correct. This is definitely a case in which "one size will not fit all (cases)."
Good luck with your work.
  • asked a question related to Normalization
Question
4 answers
I am a layman in metabolomics and LC-MS, and I am confused about which internal standards are suitable for a metabolomics study, e.g. in urine, faeces or plasma samples, if I don't use isotope-labelled metabolites as internal standards. Alternatively, how can I normalize peak intensities without any standards? Sorry for the stupid questions. Thank you all!
Relevant answer
Answer
1) Internal standards (IS) are used to normalize the data for structurally similar compounds in targeted or semi-targeted analysis. It is not a good idea to use them to normalize your data in metabolomics, as the structural variability is an important factor influencing detection.
2) In metabolomics, normalization is performed using QC samples: every 5 samples you should inject a pooled sample (quality control sample = QC), and then you use the data from those injections to normalize your data, using normalization software (I suggest NOREVA --> http://idrblab.cn/noreva/).
3) By the way, I also suggest using the IS to check the quality of injection of each sample. If the IS signal is missing or too intense in any sample, you should remove that sample before proceeding to normalization.
  • asked a question related to Normalization
Question
3 answers
I have used geolocation grid data and interpolated the values between to get the incidence angles, but I am not able to assign the incidence angles to specific pixels on the image. Is there any way to get a 2d matrix of the incidence angles or any other way that may help me get to this?
Relevant answer
Answer
The points need to have x, y coordinates.
  • asked a question related to Normalization
Question
4 answers
Hi everyone,
I would really appreciate it if someone could tell me when and why a raw fluorescence signal starts below zero. In the picture attached below, the blue curve starts from a negative value. My delta Rn values are quite low, and I believe the negative start of the blue curve (raw data) is affecting the normalisation.
Can anyone give some insight?
Relevant answer
Answer
The only way I can think of to get a negative fluorescence reading is if the intensity of a reference is being subtracted, and the intensity of the reference is higher than the intensity of the sample.
  • asked a question related to Normalization
Question
2 answers
I am trying to analyze Alzheimer's and healthy control human brain slices. After loading the data and normalizing it through the default options of the SCTransform command, when I try to plot the expression levels of some genes with VlnPlot I see that the expression values are changed to integers, or rather they are categorized into some defined levels and not a continuous range of numbers.
I have rechecked this issue with the default mouse brain tutorial and I get the same problem. Here is the VlnPlot from the mouse brain dataset provided by Seurat as a tutorial, so it should be reproducible for you as well.
> brain<-Load10X_Spatial(Directory,filename = "filtered_feature_bc_matrix.h5", assay = "Spatial", slice = "slice1", filter.matrix = TRUE, to.upper = FALSE)
>brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE, do.scale=TRUE)
>VlnPlot(brain, features = 'APP')
here is the result of my sessioninfo():
R version 4.0.2 (2020-06-22) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)
sctransform_0.3.2 dplyr_1.0.2 patchwork_1.1.1 ggplot2_3.3.2 stxBrain.SeuratData_0.1.1 panc8.SeuratData_3.0.2 SeuratData_0.2.1 Seurat_3.2.3
Relevant answer
Answer
Hi Elyas,
I can't really tell you why this classification is done; however, this transformation to categorized data is simply integrated into the function of the package - so it's not a problem - it is part of the analysis itself (using anchor weighting and PCA).
You can find more information about how this works here:
(page 191 and the following - transferData function; and page 78 - find transfer anchor function)
and here:
(page 19 and the following)
hope that helps :)
  • asked a question related to Normalization
Question
4 answers
I have metabolite concentrations from mammalian cells, and also total protein concentration in each replicate. I do not have an internal standard.
Please could someone describe or provide a literature reference/software method etc. of the best statistical method to normalize metabolite concentrations to total protein?
Many thanks
Relevant answer
Answer
Hello, Bibek. Have you solved this problem? I now have an IS and total protein, but I am still confused about how to normalize the metabolomics data. Should I use both the IS and the protein, or only one of them? Could you please help me? Thank you.
  • asked a question related to Normalization
Question
15 answers
Dear to whom it may concern,
I would like to ask you about the normalization methods used to remove non-biological variations from the metabolomics data.
Because many normalization methods have been reported so far, I am confused about the criteria for selecting the best normalization method for a particular metabolomics dataset. Also, what is the meaning of each normalization method?
I hope you may spend a little time clearing up my questions, and if convenient for you, could you point me to documents or tips on this topic, please?
Thank you so much,
Pham Quynh Khoa.
Relevant answer
Hi Pham Quynh Khoa,
I found the following article very useful.
Misra BB. Data normalization strategies in metabolomics: Current challenges, approaches, and tools. European Journal of Mass Spectrometry. 2020;26(3):165-174. doi:10.1177/1469066720918446
Do have a read.
Hope this helps.
jaime
  • asked a question related to Normalization
Question
3 answers
Hello,
This question pertains to comparing peaks from the spectra of 2 or more different samples in a fair way through normalization.
I know Origin normalizes based on the highest peak; however, if I have PL or EL spectra with two peaks for many samples, what is the best way to compare them?
One method I observed was dividing the spectral intensity across the wavelengths by the area (after integration) of the peak. Would this be the best way?
Relevant answer
Answer
Dear Noor,
the difference between PL and EL consists in the occupation of the excited energy level (here, the conduction band). This is important if you investigate an indirect semiconductor. With PL, only a small momentum is transferred by the photons; therefore, the momentum for occupation of the conduction band is generated by interaction with phonons. This process strongly depends on temperature. In contrast, the interaction with electrons immediately allows a high momentum transfer. I know this problem from the investigation of GaP.
Of course, it depends on the question you want to discuss. I would select the intensity of the first phonon satellite, hv = Eg - hw (hv emission energy, Eg gap energy and hw phonon energy). This emission line largely depends on the sample properties.
With regards,
R. Mitdank
  • asked a question related to Normalization
Question
2 answers
In our hands, Drosophila spike-in often fails to generate enough reads for a statistically relevant normalization, and it is pretty expensive. This computational spike-in-free method has recently been published and seems to produce results similar to the Drosophila spike-in and ChIP-Rx data it was benchmarked against. Does anyone else have an opinion of this method? Any thoughts or concerns? Is anyone willing to try it against their current ChIP-seq normalization methods?
Relevant answer
Answer
I agree with your point that it is difficult to rationalize the application of one drosophila read-derived correction factor genome wide. Our biggest problem is that Active motif suggests 50ng of spike in chromatin is sufficient to achieve ~2-5% drosophila reads, and we rarely do better than 0.05% drosophila reads for most ChIPs. It seems that I would need to make my own drosophila chromatin, because I cannot afford the amount of spike-in chromatin that I think I would need to get to 2%.
I also agree with your point that for a case like H3K27me3 chip in the presence of EZH2i, there should be very little H3K27me3 available to IP anyways. A case such as this is really the only time we see the drosophila spike in working. This is largely due to how we make the libraries. We balance the ng of DNA going in, even though we know there should be less from the EZH2i sample. Perhaps it is valid to just library prep the same volume of DMSO vs EZH2i rather than the same mass? Then I doubt we would need downstream normalization.
Our real trouble is this : We have a situation where in condition A we see many superenhancer-like sites of H3K27Ac that are lost in condition B. In condition B we also see a massive gain of smaller enhancer and promoter-like sites of H3K27Ac that are not present in A. We know by western that total H3K27Ac is higher in B than in A. When we sequence these ChIPs we have trouble calling the small gains in H3K27Ac in B, and we have trouble calling decreased H3K27Ac at superenhancers in condition A. Drosophila spike in fails miserably here because the H3K27Ac is abundant in both samples, and the antibody is great. The H3K27Ac is locally redistributed and globally increased.
So out of 100M reads, we only get like 200k drosophila reads, which is not statistically reliable. So we try this computational method for normalization and we see exactly what we would expect and what all our other data (ChIP-qPCR, western, etc) suggest we should see.
I have spent quite some time trying to understand this ChIPseqSpikeInFree method. It is true that it falls back on your point that the assumptions you make about the data must be orthogonally validated by other experiments. I think that you must have a hypothesis-driven reason to use this computational method or the drosophila spike-in method. For the EZH2i H3K27me3 ChIP, I suppose a western blot is really sufficient to say that H3K27me3 is depleted, but for H3K27Ac, we need to know where it is depleted and enriched, and we need to account for it being globally increased in abundance.
Our worry is that although this computational normalization idea has been published by bioinformaticists, it has not been vetted by biologists for real world application. I am just worried that even though this seems to work for my experiment, I will get killed by reviewers whether this method is valid or not, because drosophila spike-in is a lot easier to understand and explain.
  • asked a question related to Normalization
Question
3 answers
Hi all,
I have done qPCR with a housekeeping gene and a GOI.
My goal here is to check the expression of a particular gene, not to compare it with a treated or untreated control.
Is there a way, other than absolute quantification, to analyse the CT values using the housekeeping gene and GOI data?
Thanks in advance.
Relevant answer
Answer
Hello
Andrei S. Babenka is right, and I don't think there is another option!
  • asked a question related to Normalization
Question
5 answers
Hi.
How should I normalize noisy spectra? Normalization to 0-1 will change the spectral shape, while normalization by dividing by the maximum will not change the spectral shape; however, the spectra will remain noisy and difficult to combine with other spectra (e.g., spectra for the NIR and VIS regions). Thank you.
Relevant answer
Answer
Spectra normalization is commonly used for reducing the variance of devices (such as laser fluctuation) and environmental factors. It is not designed for reducing noises. You need to perform noise reduction prior to that. You can try smoothing by using digital filters.
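A minimal base-R sketch of that order of operations (smooth first, then divide by the maximum), using a simple moving average as a stand-in for whichever digital filter is preferred:
set.seed(1)
spectrum <- dnorm(seq(-3, 3, length.out = 500)) + rnorm(500, sd = 0.02)    # hypothetical noisy spectrum
smoothed <- as.numeric(stats::filter(spectrum, rep(1/11, 11), sides = 2))  # 11-point moving average
normalized <- smoothed / max(smoothed, na.rm = TRUE)   # divide by maximum: spectral shape preserved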
  • asked a question related to Normalization
Question
1 answer
I am planning to apply EMS to upper extremity muscles and observe the muscles using sEMG and MMG. I am wondering how to do the normalization in the case of EMS: is it the same as in the case of normal sEMG and MMG, i.e. dividing the sub-maximal signals by the signals obtained during MVC?
Relevant answer
Answer
Hi Jawad, I've always felt it was ideal to at least stick with the same analysis method for both the submaximal and maximal contractions. So if the submaximal stimulation is a series of spikes at a certain frequency, and peak-to-peak (P-P) is the most appropriate method to get the amplitude, then you'd want the same for the maximal signal. However, P-P does not work well with MVCs, so a maximal M-wave from electrical nerve stimulation would be optimal. If that is not an option, and normalization can only occur with an MVC, then I'd use RMS as your analysis method for both the muscle stimulation and MVC (and not P-P).
  • asked a question related to Normalization
Question
5 answers
I built a collaborative filtering recommender system using the surprise library in Python. My dataset consists of three columns ('ReviewerID', 'ProductID', 'Rating'), with a rating scale of [-30, 40]. I calculated the RMSE and it equals 0.9. Then I normalized the ratings to change the scale to [-0.4, 0.4], and when I calculate the RMSE it equals 0.003. The difference in RMSE is big and does not seem reasonable; is it wrong to normalize the rating scale in CF?
Relevant answer
Answer
Dear Sumaia Mohammed Al-Ghuribi, in addition to the answers given above, I would say there has never been any mistake with the normalization, as we already have some datasets with normalized ratings in CF. Just compare all evaluation criteria on the same ratings. BTW, the normalization has a great (either positive or negative) impact on the behaviour of some measures like Jaccard, which is effective with numeric ratings [1-5] and non-effective with normalized values, chiefly in highly sparse datasets. Finally, you can just google the topic of normalization of ratings in CF and you will surely find a great number of resources on this matter.
Good luck with your work.
  • asked a question related to Normalization
Question
2 answers
Hi,
In the LogNormalize function, feature counts for each cell are divided by the total counts for that cell and multiplied by the scale.factor prior to log transformation. My question is: what exactly does multiplying by the scale factor do, and why is it 10000 by default?
Thank you
Relevant answer
Answer
Hi Elyas,
To me, you get an ideal difference between your counts (especially low counts). If a gene is not expressed in a sample and lowly expressed in another, log2(counts) would give much higher log-ratios than log2(counts/10000), which would give many more significant genes among these low expressed.
It is like CPM/RPKM in bulk RNA-seq data, where you scale counts to per-million (because you usually get millions of counts for bulk RNA), while with UMI single-cell data you get thousands to tens of thousands of counts per cell (which is why LogNormalize uses 10,000 by default for single-cell UMI data).
For instance, consider a given gene across 4 cells, each with 10,000 total counts, with 0, 1, 2, and 3 counts respectively. Below you can see the difference between scaling to 1M and to 10K:
Using log(1+counts/million) gives:
[0, 1, 2, 3] -> [0, 4.6, 5.3, 5.7]
Using log(1+counts/10k) gives:
[0, 1, 2, 3] -> [0, .7, 1.1, 1.4]
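The same numbers can be reproduced in a couple of lines of R (natural log, as in the example above):
counts <- c(0, 1, 2, 3)
total  <- 10000                           # total counts for the cell
round(log(1 + counts / total * 1e6), 1)   # scale to 1M:  0.0 4.6 5.3 5.7
round(log(1 + counts / total * 1e4), 1)   # scale to 10K: 0.0 0.7 1.1 1.4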
  • asked a question related to Normalization
Question
3 answers
I have a keystroke model which is one of the modes in my multimodal biometric system. The keystroke model gives me an EER of 0.09 using Manhattan Scaled Distance. But then I am normalizing this distance to fit in the range of [0, 1] using tanh normalization. And when I run a check on the normalized scores I am getting an EER of 0.997. Is there something I am doing wrong? The tanh normalization I am calculating based on the mean and std dev of matching scores for genuine users.
  • asked a question related to Normalization
Question
3 answers
I would like to use ordinary one-way ANOVA. I have 4 groups with 4 samples and more than 10 variables. Most variables show nonskewed distribution based on the QQ plot but some do not fit well.
I get the expected results if I logarithmize only the skewed data. Is it right to transform only those variables - of course in each group - that are skewed by the QQ plot?
In this case, I will have log-transformed variables (where necessary) and variables in their original form. The samples and variables are independent. Is this statistically acceptable?
Can anyone show me a similar situation in the published literature?
(MANOVA and Kruskal-Wallis are excluded.)
Thanks for the help!
Relevant answer
Answer
1)
Neither data nor variables are "parametric" or "non-parametric". A distribution model can be parametric, which means that the density, mass or distribution function has parameters that define the shape (or the moments).
There are many skewed distribution families that can be expressed in a parametric form (see e.g. the Handbook of statistical distributions: shorturl.at/cAR47). The log-normal is also a parametric distribution.
2)
The distribution of the variables is not (directly) relevant for the validity of the ANOVA. Concretely, the residuals of the entire model should be approximately normally distributed to get a valid interpretation (that's the conditional distribution). For an ANOVA model this automatically means that the distribution within each group should be approximately normal. But it is usually better to have a look at residual diagnostic plots using the data from all groups together. This generalizes to models that include metric predictors (regressions, ANCOVA models, multilevel models).
3)
Whether using logarithms is meaningful depends on the kind of variables you have and how you want to model effects. If effects are proportional to the expectation (mean), using logarithms is actually recommended. An alternative is a gamma model with a log link (shorturl.at/bily4, https://stats.stackexchange.com/questions/67547/when-to-use-gamma-glms).
  • asked a question related to Normalization
Question
3 answers
Is there any standard procedure/sequence of tools to process hyperspectral tabular data before PLSR regression modeling?
Examples of tools are: 1) de-resolve, 2) second derivative, 3) normalize, 4) de-trending, 5) baseline correction, etc.
The application is for field spectroradiometer data of soil and crops.
Or does the sequence of tools differ for different datasets?
Relevant answer
Answer
The pre-treatments you use will depend on the sample set: noise, level(s) of analyte, number of samples, ranges, etc. In many cases, it is a case of start small (Absorbance data) and add treatments, as needed. Just remember, every treatment can hide/generate peaks and will have an effect on sensitivity.
  • asked a question related to Normalization
Question
5 answers
Hi all,
I have gone through previous threads and numerous publications trying to find the best gene to use for qPCR with LPS activated macrophages (BMDMs) and microglia, and I understand everyone has their own preferences.
Between HMBS, ActB, HPRT, and 18S rRNA, which one is the most popular and dependable among macrophage and microglia researchers? I work primarily with primary mouse cells and sometimes iPSC-derived microglia.
I have even noticed some recent publications still using GAPDH..
Any advice will be appreciated!
Tanya
Relevant answer
Answer
I agree with the aforementioned comments. However, it strongly depends on your treatment which gene is not affected under the specific conditions and may therefore be used as a "housekeeper/control". This can be tested by trial and error.
  • asked a question related to Normalization
Question
4 answers
I am using the Lagrange multiplier for maximizing the Normal/Gaussian distribution of objective functions under some constraints, so I am searching for better techniques than the Lagrange multiplier.
Relevant answer
Answer
See ICA papers... FastICA, infomax,...
  • asked a question related to Normalization
Question
5 answers
Hello Research Community,
Thank you very much for your help.
Relevant answer
Answer
Peter G Hains This is what I am thinking as well. The next samples I am sending for sequencing will have the same amount of protein, so this won't happen again. Indeed I have to work with what we have, and a qualitative analysis will be best. Those 7 proteins are not an accurate representation, perhaps a representation of the most abundant; it is something to discuss.
I will read the manual thoroughly.
Thank you Peter, you have been a true help.
  • asked a question related to Normalization
Question
3 answers
I have a task that delivers several measures of metacognition. I want to check that the task aligns to my model of metacognition using CFA. However, the results are on very different scales - 0-1; 0-10,000; 0-10; etc. SEM will not work with such widespread scales. What is the best way to prepare the data for CFA? Thank you for your help.
Relevant answer
Answer
Hello Kym,
Classical factoring algorithms start with a correlation matrix; Pearson correlations standardize all variables as if you had transformed them to z-scores at the outset (indeed, that is the definitional version of the formula for the Pearson product-moment correlation--the product of the first moments about the mean).
The only technical concerns with variables having wildly different scales are: (a) possible loss of computational accuracy due to needing lots of digits (most software packages are likely to use 15.5 digits of accuracy in the internal representation of real number values); and (b) as Holger Steinmetz implies, the interpretation of unstandardized coefficients if you rescale/transform scores.
Starting with LISREL, the trend for SEM programs was to use covariance (not correlation) matrices as the scale information is considered important (e.g., for unstandardized coefficients), as is location (so, means are important as well). However, if your interpretations are based on standardized coefficients (in my experience, most published studies in my areas are), then correlations are fine. Hence, rescaled values are fine.
In general, if a specific variable manifested a spread in my data set of many orders of magnitude, I'd consider rescaling/transforming almost automatically.
Good luck with your work!
  • asked a question related to Normalization
Question
4 answers
Hi, I'm running an antibody-dependent phagocytosis assay, where we add serum to fluorescent antigen-coated beads, then donor neutrophils, and subsequently measure the amount of phagocytosis as the MFI (fluorescence) within our neutrophil population to give us a phagocytosis score.
As we run the assay on different days with different neutrophil donors we are also running a titration of a serum standard. I was wondering how best to utilise this standard curve?
One method we're using is similar to what you would do with an ELISA, whereby you interpolate the values of the diluted sample from the standard curve. But this only gives me a somewhat arbitrary value for the dilution required (which I can convert to phagocytic units, but again this is done arbitrarily). The other method I've tried is min-max normalisation using the top and bottom values of the standard curve.
What I would ideally like is a method of normalisation that gives an output as phagocytic score, as this seems biologically most relevant. Any input is appreciated.
Cheers,
Mari
Relevant answer
Answer
I expect you may be interested in the z-score, also called the standard score.
Z = (observation − average) / standard deviation.
Standard scores
Raw scores (data) obtained from tests and measurements do not, by themselves, tell us where a tested individual stands relative to the mean, nor do they allow comparisons between the individuals themselves. People may be subjected to a set of tests (a test battery), each with a different unit of measurement: some are measured in minutes and fractions thereof (time), some in metres and fractions thereof (distance), and some in degrees or in number of repetitions. To eliminate these units of measurement, so that test scores can be combined and expressed on a uniform scale, we use the standard score.
To compute the standard score, it is necessary to extract the mean and standard deviation of each test; from these the standard score can be calculated.
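A minimal Python sketch of the standard-score calculation described above, using hypothetical phagocytosis scores from one assay run:

import numpy as np

# Hypothetical phagocytosis scores (MFI) from one assay run.
scores = np.array([1200.0, 950.0, 1800.0, 1430.0, 1010.0])

# Standard (z) score: how many standard deviations each observation
# lies from the mean of its own run, removing the unit of measurement.
z = (scores - scores.mean()) / scores.std(ddof=1)
print(z)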
  • asked a question related to Normalization
Question
4 answers
Does normalizing observed variables (i.e. bringing them to zero mean and unit variance) influence the fit statistics in SEM? I am running maximum likelihood confirmatory factor analysis with the Satorra-Bentler correction for the measurement model. SB RMSEA is 0.042 (non-corrected RMSEA is 0.055) and SRMR is 0.056, which, with a sample size of 267, signal good model fit (Hu & Bentler, 1998, 1999), but my SB CFI is 0.92 (non-corrected CFI is 0.896) and SB TLI is 0.906 (uncorrected value is 0.877). The latter two values remain under the 0.95 threshold. I'm looking for ways to improve model fit. I have already looked at MIs and nothing can be changed. Any suggestions? Can normalization of the observed variables help? Can you direct me to any readings relevant to this question? Thank you for your answers.
Relevant answer
Answer
Hello Andac,
The issue with using standardised/normalised variables is that standardised and unstandardised coefficients have different sampling distributions. Stated another way, if the observed variables do not all have the same (similar) variances, then there can be differences in the modeled coefficients (and fit statistic) using observed versus standardised data.
Some of the following may be informative:
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603-617.
Baguley, T. (2010). When correlations go bad. The Psychologist, 23(2), 122-123.
Greenland, S., Schlessman, J.J., & Criqui, M.H. (1986). The fallacy of employing standardized regression coefficients and correlations as measures of effect. American Journal of Epidemiology, 123, 203–208.
Greenland, S., Maclure, M., Schlessman, J.J., Poole, C., & Morgenstern, H. (1991). Standardized regression coefficients: A further critique and review of some alternatives. Epidemiology, 2(5), 387-392.
Criqui, M.H. (1991). On the Use of Standardized Regression Coefficients. Epidemiology, 2(5), 393.
Happy researching,
Tim
  • asked a question related to Normalization
Question
5 answers
I'm currently preparing for some in situ hybridisation, which requires Maleic Acid Buffer with Tween (MABT) for some of the wash steps. The recipe I have been given is for 1 L of 5x concentrated MAB, but it seems to be taking ages to dissolve. I have to have the solution prepared for autoclaving by 10 am, as my lab can only autoclave once per day at the moment. Is it normal for MAB to dissolve slowly? Does anyone have any recommendations?
Relevant answer
Answer
Is there any other buffer system which can replace the maleic acid buffer?
  • asked a question related to Normalization
Question
5 answers
I am working on some code where I am using intervals and I have to pass these intervals into a softmax function, e^x/sum(e^x). If I pass the intervals as they are, I get infinite primals and infinite partials. So I want to know how I can normalize them.
Relevant answer
Answer
To your point 3: this was my understanding of your somewhat fuzzy description.
To your point 5: this is the result of the function's non-linearity, nothing else.
I am neither a "C" nor a "C++" programmer, thus I am unable to follow the calculations shown in your illustration, especially when non-standard data types are in use. Sorry.
Anyway, maybe another hint will be useful. As you probably know, the range obtained during interval calculations certainly contains all possible results (it is guaranteed) - but it may be oversized. To minimize the chances of such an outcome one should use SUE (Single Usage Expressions). In a few words: try to rearrange your expression in such a way that no variable is repeated in its new form. For example, the equivalent resistance of two resistors connected in parallel is usually written as R = R1R2 / (R1 + R2), while for the interval approach it is much better to use an equivalent form of this formula, namely R = 1 / (1/R1 + 1/R2). Or even simpler: "x * x" will usually produce a wider interval than "x^2". The appropriate transformation of your formulas may be tricky, hard to guess, or even impossible, but try to do your best and eliminate repeated variables as much as you can.
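To illustrate the single-usage-expression point with the resistor example, here is a minimal Python sketch using a toy interval type (not the dual-number types from the question); the 10% tolerances are made up:

from itertools import product

class Interval:
    """Toy closed interval [lo, hi] with conservative arithmetic."""
    def __init__(self, lo, hi):
        self.lo, self.hi = min(lo, hi), max(lo, hi)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        ps = [a * b for a, b in product((self.lo, self.hi), (other.lo, other.hi))]
        return Interval(min(ps), max(ps))

    def __truediv__(self, other):
        # Assumes the divisor interval does not contain zero.
        return self * Interval(1.0 / other.hi, 1.0 / other.lo)

    def __repr__(self):
        return f"[{self.lo:.4f}, {self.hi:.4f}]"

ONE = Interval(1.0, 1.0)
R1 = Interval(9.0, 11.0)   # 10 ohms +/- 10 %
R2 = Interval(18.0, 22.0)  # 20 ohms +/- 10 %

# R1 and R2 each appear twice: wider (over-conservative) result.
naive = (R1 * R2) / (R1 + R2)

# Single-usage form: each variable appears once, tighter result.
sue = ONE / (ONE / R1 + ONE / R2)

print(naive, sue)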
  • asked a question related to Normalization
Question
3 answers
I want to measure BCA Protein for normalization purposes.
Let's say I have two flasks: cells irradiated with a big dose (>10 Gy) and non-irradiated cells.
The number of cells in the non-irradiated flask will increase,
and a lot of cells in the irradiated flask will certainly be dead.
So if I measure BCA, for example after 3 days of incubation, will I also measure protein from cells that are already dead?
Relevant answer
Answer
In my opinion, yes. I do not think you can distinguish between live and dead cells when measuring total protein. When you lyse your samples for BCA, all cells would lyse regardless of whether they are alive or dead. It may not matter in the grand scheme of things, as irradiated cells would not proliferate/some might be dead (and disappear from culture by apoptosis) and so your total protein levels should still be lower than non-irradiated cells that continue proliferating. The only way you can distinguish live/dead cells at the protein level is if you are measuring a protein that is expressed exclusively in the live/dead cells. If you want to eliminate dead cells as a variable, use cell viability for normalization purposes instead.
  • asked a question related to Normalization
Question
3 answers
Dear experts,
I am dealing with the synthesis and characterization of PIR foams. In particular, I am monitoring the kinetics of these foams by FTIR (please find the spectra attached below). As is well known from the literature, the asymmetric CH stretching band at 2972 cm-1 (which remains constant during the reaction) is typically used as an internal reference band to correct for the density changes during the foaming process. In the same way, my question is whether you know from the literature of some reference band that may be used for PIR for the same purpose.
Please note that for PU a polyether polyol is used, while for PIR is used a polyester polyol.
Thanks in advance.
Relevant answer
Answer
You can use any band arising from chemical groups that do not change during the chemical or physical reactions. However, the majority of IR bands are contributions from multiple groups and thus tend to change in intensity through either chemical reactions or physical structural changes. Aliphatic CH stretching bands, in the region of 3000-2800 cm-1, are usually very stable, as the potential energy distribution of these bands is nearly 99% pure. In the case of PU, you have a relatively well isolated 2972 cm-1 band. However, PIR lacks this band. Yet there are multiple bands between 2950 and 2850 cm-1 that appear to be relatively stable. Ideally, you should curve-resolve these heavily overlapped bands (there are at least three, maybe four) to be more accurate, but by measuring the peak height (absorbance) you can obtain a quasi-quantitative comparison. Take the intensity of one of these bands as the reference (the one near 2850 cm-1 might be good), divide the absorbance of the band you want to compare by it to normalize the intensity, and then compare the normalized intensities of two different conditions, such as different reaction times.
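A minimal sketch of that band-ratio normalization in Python, assuming one spectrum is available as wavenumber and absorbance arrays; the band positions used here are placeholders and should be replaced by the bands identified in your own spectra:

import numpy as np

def band_height(wavenumbers, absorbance, center, window=5.0):
    """Peak-height estimate: maximum absorbance within +/- window cm-1 of center."""
    mask = np.abs(wavenumbers - center) <= window
    return absorbance[mask].max()

# Hypothetical spectrum arrays (same length), e.g. read from your FTIR export.
wavenumbers = np.linspace(4000, 400, 3600)
absorbance = np.random.default_rng(2).random(3600) * 0.01  # placeholder data

ref = band_height(wavenumbers, absorbance, center=2850.0)     # internal reference band
target = band_height(wavenumbers, absorbance, center=1410.0)  # band of interest (placeholder)

print(target / ref)  # density-corrected, normalized intensity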
  • asked a question related to Normalization
Question
4 answers
Hi,
I have a question on qRT-PCR data analysis. The gene panel has more than 200 genes along with the multiple reference/housekeeping genes. I have 50 paired samples (i.e Before vs After Vitamin supplementation) and 6 pooled controls (3 Pooled controls in one plate and 3 in another plate as interplate controls). In other words, 50 samples in plate 1 (Before) + 3 Pooled Controls and 50 samples in plate 2 (After) + 3 Pooled Controls.
I would like to see the gene expression changes in the Before vs After Vitamin supplementation. Please let me know if the workflow followed looks fine.
  1. Calculation of Delta Ct = Difference between the Gene of interest and Geometric Mean of Multiple housekeeping genes
  2. Calculation of Delta Delta Ct = Difference between the samples (before and after vitamin supplementation) and average of pooled control samples
  3. Calculation of the 2 to the power of (negative Delta Delta Ct) to evaluate fold gene expression levels
Following the calculation of the Delta Ct, does the Delta Delta Ct calculation look fine? I am a bit confused here: should the average of the pooled controls be subtracted from the individual samples, or should the difference between the Before and After samples be taken individually?
In addition, which values should be used for the statistical analysis and visualization (paired t-test, PCA, scatter plots, etc.): the negative Delta Ct or the Delta Delta Ct?
Best Regards,
Toufiq
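For reference, a minimal Python sketch of steps 1-3 as written above, using made-up Ct values; it only illustrates the arithmetic and does not settle the question of whether to subtract the pooled-control average or to compare Before and After directly:

import numpy as np

# Hypothetical Ct values for ONE gene of interest (GOI) in ONE sample,
# before and after supplementation, plus pooled controls on each plate.
ct_goi_before, ct_goi_after = 24.1, 22.8
ct_hk_before = np.array([18.2, 20.1])            # two housekeeping genes, plate 1
ct_hk_after = np.array([18.4, 20.0])             # two housekeeping genes, plate 2
dct_pooled_controls = np.array([5.9, 6.1, 6.0])  # delta-Ct of GOI in pooled controls

# 1) delta-Ct = Ct(GOI) - mean Ct of the housekeeping genes.
#    Because Ct is already on a log2 scale, the arithmetic mean of the Cts
#    corresponds to the geometric mean of the relative quantities.
dct_before = ct_goi_before - ct_hk_before.mean()
dct_after = ct_goi_after - ct_hk_after.mean()

# 2) delta-delta-Ct relative to the average of the pooled controls.
ddct_before = dct_before - dct_pooled_controls.mean()
ddct_after = dct_after - dct_pooled_controls.mean()

# 3) Fold expression relative to the pooled controls.
print(2.0 ** (-ddct_before), 2.0 ** (-ddct_after))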
Relevant answer
Answer
The attached PDF contains several mistakes.
  • asked a question related to Normalization
Question
5 answers
I have a number of time series (different lengths, different amplitudes) that I want to cluster according to their similarity. Does anyone know of any method for normalizing DTW scores to make them comparable? Thanks.
Relevant answer
Answer
There is also a soft-DTW version, similar to a kernel. This gives you values between 0 and 1, like a softmax output. It can be found in the tslearn library (Python).
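As a rough illustration of making DTW scores from series of different lengths more comparable, one simple heuristic (not the only option, and not what the answer above describes) is to divide the accumulated DTW cost by the length of the optimal warping path; tslearn's dtw_path returns both. A sketch, assuming this length normalization is acceptable for your clustering:

import numpy as np
from tslearn.metrics import dtw_path

# Hypothetical time series of different lengths and amplitudes.
s1 = np.sin(np.linspace(0, 4 * np.pi, 80))
s2 = 3.0 * np.sin(np.linspace(0, 4 * np.pi, 120))

path, score = dtw_path(s1, s2)

# Heuristic: normalize the accumulated cost by the warping-path length.
print(score, score / len(path))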
  • asked a question related to Normalization
Question
6 answers
Hello all,
I am data mining Wikipedia to discern which titles are edited in the most countries by geolocating edits performed with IP addresses. I am only interested in the top 100 titles edited in the most countries. I am arguing that these titles represent global ideas because their edits are the most spatially widespread. With these counts, I can then measure per country how many of these global titles are edited in that particular country. This can then be used to create a type of globalization index per country (e.g. Germany edited 95 of the titles edited in the most countries). I eventually would like to do a correlation of this index with a well-established globalization index that relies on counting objects crossing borders (e.g., import/export). My argument is that the higher the connectivity of a country, the higher the globalized title index. I am only interested in the subject matter and discourses in the top 100 titles, so I need my sample to be manageable.
My question is regarding the normalization of data. The number of individual editing IPs does affect the number of titles edited per country. However, this is not a normal per capita situation; for example, a murder rate is all murders/population. In my case, I am arbitrarily selecting only the top one hundred titles on a list of titles ranked by the number of countries in which they are edited. It would be analogous to setting a murder rate per capita to the 100 most gruesome murders/population. A title that might be 101st in rank on the list could still be considered global in this aspect, but it just didn't make it into the top 100. So, I am uneasy about normalizing the data.
What would be the best way to normalize/standardize this data by the number of individual editing IPs within Wikipedia per country, given that the numerator is an arbitrarily delimited group of a phenomenon?
Your help is greatly appreciated, Tom
Relevant answer
Answer
Thank you
  • asked a question related to Normalization
Question
3 answers
Do you think 'data value conflict' issue can be resolved using data normalization techniqueS? From my understanding de-normalization is a suggested by practitioners for DW development, but, normalizing a database includes amongst other aspects arranging data into logical groupings such that each part describes a small part of the whole, also normalization implies modifying data in one place will suffice, it also minimizes the impact of duplicate data. What do you suggest?
Relevant answer
Answer
I want to answer this discussion in a four-part series; in the first part I will focus on when and why to use denormalization.
When and Why to Use Denormalization
As with almost anything, you must be sure why you want to apply denormalization. You need to also be sure that the profit from using it outweighs any harm. There are a few situations when you definitely should think of denormalization:
  • Maintaining history: Data can change during time, and we need to store values that were valid when a record was created. What kind of changes do we mean? Well, a person’s first and last name can change; a client also can change their business name or any other data. Task details should contain values that were actually at the moment a task was generated. We wouldn’t be able to recreate past data correctly if this didn’t happen. We could solve this problem by adding a table containing the history of these changes. In that case, a select query returning the task and a valid client name would become more complicated. Maybe an extra table isn’t the best solution.
  • Improving query performance: Some of the queries may use multiple tables to access data that we frequently need. Think of a situation where we’d need to join 10 tables to return the client’s name and the products that were sold to them. Some tables along the path could also contain large amounts of data. In that case, maybe it would be wise to add a client_id attribute directly to the products_sold table.
  • Speeding up reporting: We need certain statistics very frequently. Creating them from live data is quite time-consuming and can affect overall system performance. Let’s say that we want to track client sales over certain years for some or all clients. Generating such reports out of living data would “dig” almost throughout the whole database and slow it down a lot. And what happens if we use that statistic often?
  • Computing commonly-needed values up front: We want to have some values ready-computed so we don’t have to generate them in real time.
It’s important to point out that you don’t need to use denormalization if there are no performance issues in the application. But if you notice the system is slowing down – or if you’re aware that this could happen – then you should think about applying this technique. Before going with it, though, consider other options, like query optimization and proper indexing. You can also use denormalization if you’re already in production but it is better to solve issues in the development phase. Source: Emil (2016)
Any thought: Dennis Mazur , Nawroz Abdul-razzak Tahir Jeyris Martínez Gutiérrez Nada QASIM Mohammed
Dr.Hikmat Abdulkarim Almadhkhori
Ahmad Saad Ahmad Al-Dafrawi
Dr R Senthilkumar Lilianna Wojtynek ?
  • asked a question related to Normalization
Question
4 answers
Hi,
I am using geNorm, NormFinder, RefFinder, and BestKeeper to identify and select appropriate reference genes for normalization in RT-qPCR analysis. I am trying to identify tools to calculate the pairwise variations (Vn/n+1) that determine the optimal number of reference genes, i.e. V values such as V2/3, V3/4, etc. Do these calculations need to be performed manually, and if so how, or is there any tool that can specifically perform this task? Please assist me with this.
Best Regards,
Toufiq
Relevant answer
Answer
geNorm software (sorry, Excel sheet) will do it for you. If only you can find the original sheet.
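In case the original sheet cannot be located, below is a rough Python reconstruction of the pairwise variation V(n/n+1) as defined in the geNorm paper (Vandesompele et al., 2002): the standard deviation, across samples, of the log2 ratios of the normalization factors based on the n and n+1 most stable genes. This is a sketch of the published definition, not the geNorm code itself, and the input array here is simulated:

import numpy as np

def pairwise_variation(rq):
    """V(n/n+1) for n = 2 .. n_genes-1.
    rq: samples x genes array of relative quantities, columns ordered
    from most stable to least stable reference gene."""
    n_samples, n_genes = rq.shape
    v = {}
    for n in range(2, n_genes):
        nf_n = np.exp(np.mean(np.log(rq[:, :n]), axis=1))       # geometric mean of n genes
        nf_n1 = np.exp(np.mean(np.log(rq[:, :n + 1]), axis=1))  # geometric mean of n+1 genes
        v[f"V{n}/{n + 1}"] = np.log2(nf_n / nf_n1).std(ddof=1)   # SD of log2 NF ratios
    return v

# Hypothetical relative quantities for 10 samples and 5 candidate genes.
rq = np.random.default_rng(3).lognormal(mean=0.0, sigma=0.3, size=(10, 5))
print(pairwise_variation(rq))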
  • asked a question related to Normalization
Question
3 answers
I measured behavioral data (human body inclination) from two different groups (15 subjects each) at three different moments (sessions). The data were not normal (Shapiro-Wilk test) and did not respect homogeneity of variance (Levene's test). I ran non-parametric tests (Mann-Whitney U test for independent comparisons, and Friedman and Wilcoxon signed-rank tests for paired-sample comparisons).
The reviewer of my article asked me to transform the data to be normal and then run a parametric test (rmANOVA). I applied many transformations; only one of them was good: the LMS approach proposed by Cole and Green (1992). It is also known as LMS quantile regression with the Box-Cox transformation to normality, as well as the Box-Cox Cole-Green (BCCG) transformation. The formula is: Z = ((y/μ)^L − 1) / (S*L), where L is a constant parameter, μ is the mean value and S is the generalized coefficient of variation (i.e., σ/μ, where σ is the standard deviation). It is not a very common transformation.
My question is: can I use any kind of transformation from the literature to transform data to meet the assumptions of a parametric test such as rmANOVA? Or should I only use the transformations that are well established in the field I am working in?
Many thanks in advance for your comments!
Cole TJ, Green PJ. Smoothing reference centile curves: the LMS method and penalized likelihood. Stat. Med. 1992;11:1305–1319.
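For concreteness, the transformation as written in the question can be applied like this in Python (a sketch only; the L value below is a placeholder, since in the LMS method L is estimated from the data rather than fixed):

import numpy as np

def bccg_z(y, L):
    """Box-Cox Cole-Green style z-scores, following the formula in the question:
    Z = ((y/mu)**L - 1) / (S * L), with mu the mean and S = sigma/mu."""
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    s = y.std(ddof=1) / mu          # generalized coefficient of variation
    if np.isclose(L, 0.0):
        return np.log(y / mu) / s   # limiting case as L -> 0
    return ((y / mu) ** L - 1.0) / (s * L)

# Hypothetical body-inclination measurements and a placeholder L.
data = np.array([2.1, 2.4, 3.8, 2.2, 5.9, 2.7, 3.1])
print(bccg_z(data, L=0.3))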
Relevant answer
Answer
If you have a large enough sample size, then yes.
  • asked a question related to Normalization
Question
3 answers
I do not think the data I have require log-transformation to carry out PCA, as the distribution is not skewed. Should I normalise my data using scale = TRUE (in RStudio) before carrying out PCA? This lowers the variance compared to the raw data and does not seem to affect the separation drastically.
Relevant answer
Answer
Mehmet, the PCA package, as a complete package, contains a normalization process. There is therefore no need to normalize the inputs to it.
  • asked a question related to Normalization
Question
4 answers
My goal is to filter a polyline shapefile, eliminating all the features with a complex shape. So I calculated, for each feature, the length and the number of dangling endpoints. My goal is now to create a new field that relates these two parameters, so that, for example, a feature with 3 dangling endpoints and a length of 1 meter is deleted, but a feature with 3 dangling endpoints and a length of 50 meters is maintained. I think I need to normalize the values, but I'm not an expert in statistics.
Relevant answer
Answer
Thank you Henry Chaplin. I already filtered the features by sinuosity, but there was still noise and I wanted to try the dangling-endpoints approach. Anyway, in the end I filtered using:
1 Trim
2 Unsplit line
3 Sinuosity parameter
4 Merge all the lines with the buffer-Join method
5 Dissolve
6 Length parameter
It's a bit complex, but my file (a stream network) was really full of noise.
  • asked a question related to Normalization
Question
5 answers
How can I normalize numeric ordinal data from [1, Infinity) to [1, 5]? I want to normalize relative scores, which are ordinal by nature, but the range of the scores can be [1, Infinity), so I need to bring them onto a scale of [1, 5]. Can anybody help me figure it out?
Data values are double (float) type values.
Relevant answer
Answer
We had a similar problem in the context of educational assessment, where very often people work with "judgement scales" of both types, i.e. two-sided bounded (like your [1,5]) as well as one-sided bounded (like your [1,inf] or [0, inf]), and they have to merge those distinct scales in some way. Fortunately, there are not so many ways to do such mergings, if you want to keep things simple and choose the right mathematical framework which suggests the appropriate transformation. Just picking one or another transformation to fit your data e.g. to a normal (or some other) distribution is IMHO a brute statistical approach which doesn't do justice - plausibly - to the inherent structure of your data. For instance, in our case, it was plausible to assume a sub-additive operation on scores, something of the form a + b - a*b, and derive a structure which turns out to be a semi-module. From here, everything else could be derived - without guesswork. The 1-1 transformation from a bounded to an unbounded interval we used had the relatively simple form: 2-log ( score / (1 - score) ), where 2-log is the binary log function, and scores are numbers between 0 and 1. The resulting numbers are between -inf and +inf, but of course we can adapt it such that the resulting numbers start from 0 onwards. Because the transformation is 1-1, it has an inverse which goes the other way back. Works great! Probably there are statistical techniques for doing tests with your data, but that was not a concern in our context.
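If a simple monotone, invertible squashing of [1, Infinity) onto [1, 5) is all that is required (rather than the assessment-specific construction above), one hypothetical choice is f(x) = 5 − 4/x; because it is strictly increasing, the ordinal ranking of the scores is preserved. A minimal Python sketch:

import numpy as np

def squash(x):
    """Monotone map from [1, inf) onto [1, 5): f(x) = 5 - 4/x."""
    x = np.asarray(x, dtype=float)
    return 5.0 - 4.0 / x

def unsquash(f):
    """Inverse map from [1, 5) back to [1, inf)."""
    f = np.asarray(f, dtype=float)
    return 4.0 / (5.0 - f)

scores = np.array([1.0, 1.5, 3.0, 10.0, 1000.0])
print(squash(scores))            # approx. [1.0, 2.33, 3.67, 4.6, 4.996]
print(unsquash(squash(scores)))  # recovers the original scores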
  • asked a question related to Normalization
Question
2 answers
Do I understand correctly that scRNA-seq data in RPKM format are normalized within one sample, but not normalized between samples? And, therefore, that they can still contain all the copies of one gene arising from the PCR amplification step?
Do you know of any good recommendations or articles which could help with RPKM data normalization and downstream analysis?
Thank you in advance!
Relevant answer
Answer
Yes, the RPKM metric is used to normalize for sequencing depth and transcript length within one sample. You cannot use it to compare expression profiles across different samples. For between-sample comparisons, the TPM (Transcripts Per Kilobase Million) metric is preferable. Here is a link to a short article on how TPM and RPKM are related (https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/). And here is a useful guide on how to calculate these metrics using scater from the Bioconductor package, see part 7.8.8 "Normalization for gene/transcript length" (https://hemberg-lab.github.io/scRNA.seq.course/cleaning-the-expression-matrix.html).
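For what it's worth, TPM can be obtained from an RPKM matrix by rescaling each cell so that its values sum to one million; a minimal sketch, assuming a hypothetical genes × cells data frame of RPKM values:

import numpy as np
import pandas as pd

# Hypothetical genes x cells matrix of RPKM values.
rpkm = pd.DataFrame(
    np.random.default_rng(4).random((5, 3)) * 100,
    index=[f"gene{i}" for i in range(5)],
    columns=[f"cell{i}" for i in range(3)],
)

# Per-cell rescaling: TPM = RPKM / sum(RPKM) * 1e6, so every cell sums to
# one million, which makes relative expression more comparable across cells
# (library composition effects still remain, though).
tpm = rpkm / rpkm.sum(axis=0) * 1e6
print(tpm.sum(axis=0))  # each column sums to 1e6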
  • asked a question related to Normalization
Question
1 answer
Hello everyone, I performed an interferon-γ ELISA after a co-culture of my cell lines transduced with vectors containing antigens and PBMCs transduced with 8 T-cell receptors recognizing the antigens. My problem is that the transduction efficiency between the TCRs is very different, ranging from 1% to 94%. I would like to analyze the interferon-γ secretion on the basis of transduction efficiency, but I think I have to normalize my data. How can I do this, and what is the easiest way to do it? I am writing my master's thesis and I don't have any experience with normalization yet. So it would be great if someone has an idea. Thank you so much!
Relevant answer
Answer
Convert your raw interferon-γ ELISA data into a transduction-normalized value: the normalization can be done by dividing the IFN-γ values by the corresponding TCR transduction-efficiency values. Then calculate the fold change between the different control and treatment groups.
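A minimal Python sketch of that division-based normalization, using a hypothetical table of IFN-γ readouts and transduction efficiencies (the column names and the choice of control are placeholders):

import pandas as pd

# Hypothetical data: IFN-gamma readout and transduction efficiency per TCR.
df = pd.DataFrame({
    "tcr": ["TCR1", "TCR2", "TCR3"],
    "ifng_pg_ml": [850.0, 120.0, 2300.0],
    "transduction_eff": [0.94, 0.01, 0.55],   # fraction of transduced PBMCs
})

# Normalize the readout by transduction efficiency, then fold change vs. a control.
df["ifng_normalized"] = df["ifng_pg_ml"] / df["transduction_eff"]
control = df.loc[df["tcr"] == "TCR1", "ifng_normalized"].iloc[0]  # hypothetical control
df["fold_change"] = df["ifng_normalized"] / control
print(df)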
  • asked a question related to Normalization
Question
9 answers
HI,
I am working on breast cancer and I'm trying to use MCF12F as a normal breast cell line. I purchased it from ATCC and I am having problems growing it. It doesn't seem to attach, and the few cells that have attached are floating after a few days. The cells look like this, and I have added everything to the medium as recommended by ATCC. They replaced the cells twice for me, but it is still the same. Does anyone have this problem? Suggestions are welcome.
Relevant answer
Answer
MCF-12F is a floating culture and not supposed to adhere; that's what the "F" stands for in MCF-12F. Sorry so late to reply, just saw this, lol
  • asked a question related to Normalization
Question
3 answers
Hello,
I am trying to normalize the GSE8397 data with MAS5.0 using R:
setwd("D:/justforR/GSE8397")
biocLite()
library(affy)
affy.data = ReadAffy()
However, the data come from 2 platforms: the Affymetrix Human Genome U133A and the Affymetrix Human Genome U133B Array.
The code gave me the warning message: "Error in affyio::read_abatch(filenames, rm.mask, rm.outliers, rm.extra, :
Cel file D:/justforR/GSE8397/GSM208669.cel does not seem to be of HG-U133A type"
So, how can I keep normalizing the data when they are from both U133A and B? Should I try another method of normalization (RMA or GCRMA)?
Do you have any ideas about this problem?
Thank you so much!
Relevant answer
Answer
Hi Phung,
I guess it's impossible to analyze these two types of Affymetrix arrays with simple command lines. In fact, U133A and U133B share only 168 probes among the ~22k probes in each design. Take a look at this post (https://www.biostars.org/p/283639/). In fact, both designs are sold to be used as complementary and can't be compared.
fred
  • asked a question related to Normalization
Question
8 answers
I am working on a dataset in which almost every feature has missing values. I want to impute the missing values with the KNN method. But as KNN works on distance metrics, it is advised to normalize the dataset before using it. I am using the scikit-learn library for this.
But how can I perform normalization with missing values?
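One possible way to do this in scikit-learn, assuming its scalers' NaN handling (missing values are ignored when the scaler is fitted, in recent versions) is acceptable for your data, is to scale first and impute afterwards; a sketch:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

# Hypothetical feature matrix with missing values in several columns.
X = np.array([
    [1.0,    200.0,  np.nan],
    [2.0,    np.nan, 0.30],
    [np.nan, 180.0,  0.25],
    [4.0,    260.0,  0.40],
])

# The scaler skips NaNs when computing its statistics, so the features
# are brought to a comparable range before the distance-based imputation.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("impute", KNNImputer(n_neighbors=2)),
])
print(pipe.fit_transform(X))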
Relevant answer
Answer
Thank you everyone for your valuable suggestions. I will work on these points.
  • asked a question related to Normalization
Question
4 answers
During meta-analysis, we have to normalize the data in the preprocessing steps and then analyze the differentially expressed genes. But after we do the normalization, how can we know whether the data we have are good or not? I mean, is there any validation step for normalization? If yes, how do we validate the normalization?
Thank you!
Relevant answer
Answer
Are you using microarray data? One common way to check if your data has been normalized correctly is to compare expression distributions with boxplot. If it's normalized correctly the boxes should be even.
If you're new, I think this slide might help. (It did for me :) : http://barc.wi.mit.edu/education/bioinfo2007/arrays/slides/arrays_lecture1-color.pdf
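To make that concrete, a minimal Python sketch of the boxplot check (the matrix here is simulated and stands in for your normalized, log2-scale expression values):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical probes x samples matrix of log2 intensities after normalization.
rng = np.random.default_rng(5)
expr = rng.normal(loc=8.0, scale=2.0, size=(1000, 6))

plt.boxplot(expr, labels=[f"sample{i + 1}" for i in range(6)])
plt.ylabel("log2 intensity")
plt.title("Per-sample distributions should look roughly even after normalization")
plt.show()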
  • asked a question related to Normalization
Question
3 answers
I am studying how to do normalization. At first, I thought that people calculate the log2 ratio as a primary step to convert the T = R/G ratio in a microarray into a simpler number showing the up- and down-regulation of the genes, but then I saw that in some cases people do log2 transformation as a step of normalization. So, is log2 transformation a normalization method? Is log2 transformation the same as the log2 ratio calculation, or are they different?
Thank you for your attention. Please ask me if my question confuses you!
Relevant answer
Answer
Taking log2 is a transformation, not a normalization.
Normalizations make the distributions more similar. They can be applied to untransformed or to transformed data (the choice is not arbitrary; what makes sense depends on the kind of normalization).
Note that log(R/G) = log(R) - log(G).
In dual-color hybridizations, a frequently used normalization is the lowess-normalization that actually detrends the log(R/G) against the log(R*G) data. It therefore makes sense to calculate log(R/G) first and then normalize.
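A rough sketch of that M-versus-A lowess detrending on simulated two-colour data, using the lowess smoother from statsmodels (the simulated R and G intensities are placeholders for background-corrected intensities):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical background-corrected red/green intensities for one array.
rng = np.random.default_rng(6)
G = rng.lognormal(mean=7.0, sigma=1.0, size=2000)
R = G * np.exp(rng.normal(scale=0.3, size=2000))  # mostly unchanged genes

M = np.log2(R / G)        # log-ratio
A = 0.5 * np.log2(R * G)  # average log-intensity

# Fit the intensity-dependent trend of M against A and subtract it.
trend = lowess(M, A, frac=0.3, return_sorted=False)
M_normalized = M - trend
print(M_normalized[:5])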
  • asked a question related to Normalization
Question
4 answers
It's a comparison of liposomal vitamin C to non-liposomal vitamin C. There were 21 geometric mean data points taken from each test group. I have the SD for each data point; there is some significant inter-subject variation. Do I need to dose-normalise these data before presenting the geometric mean and SD datasets on the graph? Is normalisation of the data necessary in this case, and why?
Relevant answer