Data Quality - Science topic

Questions related to Data Quality
  • asked a question related to Data Quality
Question
1 answer
Dear ResearchGate community,
I'm currently working with a large dataset of phytoplankton species, and I'm looking for a way to routinely update the taxonomy of these species. Specifically, I'm interested in identifying changes to scientific names, accepted names, synonyms, and higher-level taxonomic classifications.
I've tried using various R packages such as worrms to retrieve taxonomic information from online databases, but I've encountered some challenges with data quality and consistency.
I would appreciate any suggestions or advice on how to efficiently and accurately update the taxonomy of my dataset on a regular basis. Are there any reliable databases or APIs that I should be using, like AlgaeBase? Are there any best practices for handling taxonomic data in R?
Thank you for your help!
Nicolas
Relevant answer
Answer
Maybe you can use taxonomy files from PR2? It is regularly updated and there is an R package: https://pr2database.github.io/pr2database/index.html
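For the WoRMS route mentioned in the question, a minimal sketch with the worrms package might look like the following. This is an assumption on my part: the function and field names follow the package documentation, and the species vector is hypothetical.
```r
# Minimal sketch: look up current accepted names via WoRMS (worrms package).
# Assumes wm_records_names() returns records with 'status' and 'valid_name'
# fields, as described in the package docs.
library(worrms)
library(dplyr)

species <- c("Skeletonema costatum", "Chaetoceros socialis")  # hypothetical input list

records <- wm_records_names(name = species) |>
  bind_rows() |>
  select(scientificname, status, valid_name, valid_authority, rank)

# Names whose status is no longer "accepted" need updating in your dataset
filter(records, status != "accepted")
```
Running this periodically (e.g. as a scheduled script) and joining the result back to your species table is one way to keep the taxonomy current.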
  • asked a question related to Data Quality
Question
5 answers
Hello everyone, I am checking the quality of some RNA-seq data with FASTQC and I am getting results that are not clear to me. Is this kind of result normal?
Relevant answer
Answer
The plot shows that the average per-base quality along your 150 bp reads is very high, which is good. This kind of result is normal when the sequencing has been outsourced. Most companies will give you prefiltered fastq files containing only reads with high quality. You can ask your sequencing provider if that was the case, although sometimes you can also find that info in the report they send together with your fastq files.
  • asked a question related to Data Quality
Question
4 answers
How can one do case-control matching to randomly match cases and controls based on specific criteria (such as sociodemographic matching - age, sex, etc.) in order to improve data quality?
Relevant answer
Answer
In case-control studies, you can do it in multiple ways. I assume you already know how to get the controls. Imagine you are performing a secondary analysis and have a dataset with 200 patients: 30 people with SLE (cases) and 170 people without SLE (controls). As we all know, matching 1 case with 1 to 4 controls provides adequate power, so don't match more than 4 controls to 1 case.
First, match 1:1, that is, 1 control for 1 case. As sex is almost always binary, you don't have to manipulate this variable. But age is not binary, so you have to determine the age range you would accept. For instance, a 20-year-old man could be matched with 18-25-year-old men in your dataset. Create a categorical variable for age by specifying that range. Also, create a binary variable indicating who are cases and who are controls. Then export these 60 people from the dataset and perform your analysis. If you want 1:2 matching, the procedure is the same (export 90 people), but whatever you do, prespecify the hypothesis, sample size, and analysis plan at the beginning. Best of luck!
You can also look at propensity score matching or inverse probability of treatment weighting, but those procedures are a little complex and you need somebody with advanced statistical analysis knowledge to do them.
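As a complement to the manual procedure above, here is a minimal sketch using the MatchIt package (an assumption on my part, not part of the answer above; the data frame and variable names are hypothetical).
```r
# Minimal sketch of 1:1 nearest-neighbour matching with MatchIt.
# 'dat' has: sle (1 = case, 0 = control), age (numeric), sex (factor).
library(MatchIt)

m <- matchit(sle ~ age, data = dat, method = "nearest",
             ratio = 1,          # use ratio = 2, 3 or 4 for 1:k matching
             exact = ~ sex)      # only match within the same sex
matched <- match.data(m)         # export the matched subset for analysis
summary(m)                       # check covariate balance before analysing
```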
  • asked a question related to Data Quality
Question
3 answers
Hello everybody,
I am currently writing my final thesis about data quality, in particular about consistency. Therefore I am looking for a labeled IoT time-series dataset for consistency checking. Does anybody know where I can find such a dataset?
Or does anybody know where I can get a labeled IoT time-series dataset for anomaly detection?
Thank you for your help!
Relevant answer
Answer
Hello,
For the general case of time series anomaly detection, several benchmarks have been recently proposed. These two benchmarks contain labeled time series from different domains (for instance, room occupancy detection from temperature, CO2, light, and humidity [1] or accelerometer sensors of a wearable assistant for Parkinson's disease patients [2]).
You may find the links to the benchmarks below. Good luck with your thesis!
[1] Luis M. Candanedo and Véronique Feldheim. 2016. Accurate occupancy detection of an office room from light, temperature, humidity, and CO2 measurements using statistical learning models. Energy and Buildings 112 (2016), 28–39. https://doi.org/10.1016/j.enbuild.2015.11.071
[2] Marc Bächlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M. Hausdorff, Nir Giladi, and Gerhard Tröster. 2010. Wearable Assistant for Parkinson's Disease Patients With the Freezing of Gait Symptom. IEEE Transactions on Information Technology in Biomedicine 14, 2 (2010), 436–446. https://doi.org/10.1109/TITB.2009.2036165
  • asked a question related to Data Quality
Question
1 answer
Hello, I'm trying to calculate the results for a product system by selecting the following options:
  • Allocation method - None;
  • Impact assessment method - ReCiPe Midpoint (I) / ReCiPe 2016 Endpoint (I);
  • Calculation type - Quick results / Analysis;
  • Include cost calculation and Assess data quality.
Well, the results are always a list of zeros for every item in the LCI. I've already tried to do the following actions to solve the problem, however I didn't have any success:
  • Increased the maximal memory to 5000 MB;
  • Validated the database (it returned back with zero errors);
  • Opened the SQL editor and executed the query: select p.ref_id, p.name from tbl_processes p inner join tbl_exchanges e on p.id = e.f_owner where e.id = 7484497 (got the reference ID and the name of the process where the exchange occurred and searched for it, opened the process and didn't find any error message with more details or a warning popup).
The openLCA version I'm working on is 1.11.0. Thank you very much for all the help. Best regards, Beatriz Teixeira
Relevant answer
Answer
You might try deleting the current product system and creating new flows, processes, and a new product system. It seems that some mistake was made in a previous step.
  • asked a question related to Data Quality
Question
3 answers
Answers from different disciplines are welcome.
Relevant answer
Answer
Dear Rami Alkhudary,
You may want to review the material below:
  • Factual Accuracy and Trust in Information: The Role of Expertise
  • How Do You Know If Information Is Accurate? How To Evaluate Information Sources
  • What Should I Trust? Individual Differences in Attitudes to Conflicting Information and Misinformation on COVID-19
  • asked a question related to Data Quality
Question
1 answer
I am currently working with GANs, or to be specific CGANs (Conditional Generative Adversarial Networks), for synthetic signal generation. To improve the generated data quality, i.e. to increase the similarity of the synthetic data to the original data, I have already analyzed and observed improvement from several hyperparameter-tuning combinations for the discriminator and generator, such as modified momentum, iterations, and learning rate. On top of that, batch normalization and manipulating the number of layers helped achieve additional improvement. My question is: beyond the parameters of general neural networks, what other parameters are a must to look into?
Relevant answer
Answer
Hi,
Quantitative techniques for evaluating GAN generator models are listed below.
  • Average Log-likelihood.
  • Coverage Metric.
  • Inception Score (IS)
  • Modified Inception Score (m-IS)
  • Mode Score.
  • AM Score.
  • Frechet Inception Distance (FID)
  • Maximum Mean Discrepancy (MMD)
for more info:
Best wishes..
  • asked a question related to Data Quality
Question
1 answer
In the era of information and reasoning, we are shown many pieces of scientific information, either in print or online, globally. Despite this appreciable access to information, the originality, novelty, and quality of information are often substandard. For example, a large amount of the research done in the developing world is either published in reputable journals or left on the shelf. However, implementation of these research findings is scarce.
This could be due to data quality or to the quality and quantity of the research team involved. The issues that could affect the quality of research in developing countries include, but are not limited to:
· Availability of limited resources to support research projects
· Inadequate time devoted to research projects because people who teach at the university level in developing countries are rarely full-time professors and usually have several jobs.
· The theoretical nature of research methodology in the curriculum, so students become professionals without the practical knowledge of how to do research.
· Limited access to journals, search engines, and databases and high subscription cost that is beyond the reach of the budgets of both individual professionals and university libraries.
· Weak ethical review committee to verify ethical treatment of human subjects.
· Rationing research funds across several colleges and departments, which leads to limited competition and an increased chance of doing weak research
· Weak institutional structure and lack of empowerment to research staff
· Poor data management systems and lack of databases
· Availability of poor research guidelines and poor composition of the research team (i.e. failure to involve all relevant expertise in developing proposals and conducting analysis and interpretation of findings)
In the face of the above challenges, can using real-world health data be a solution to data quality problems? If so, what are the possible changes from using real-world health data in developing countries?
Relevant answer
Answer
Developing countries can make use of a lot of the health research conducted in developed countries, since that research is scientifically well supported, while research from developing countries is often short on data because the universities there do not pay enough to enable excellent research. Regards.
  • asked a question related to Data Quality
Question
14 answers
How to maintain data quality in qualitative research? How to ensure quality in qualitative data collection as well as data analysis?
Relevant answer
Dear Prof. Dr. Devaraj Acharya , several methods can be used to ensure quality in qualitative research analysis, such as Guba and Lincoln's concepts for defining and investigating quality in qualitative research; see the following RG link. Kindly visit.
Kind Regards,
  • asked a question related to Data Quality
Question
7 answers
Hello, I am trying to get a metagenomic analysis done and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core). Does anyone have any experience with this company, about the data quality or reliability?
Please let me know,
Thank you,
Relevant answer
Answer
Hi Alba.
I had experience with Novogene, both for whole-genome and metagenomics.
The sequencing is very good, and I had no problems at all in terms of quality.
I don't recommend subscribing to their bioinformatic analysis, since the analyses performed are pretty basic and can be done easily by the researcher, and many times you need more complex analyses that are not included in the price.
Let me know if you need more information.
Ricardo
  • asked a question related to Data Quality
Question
3 answers
What metrics do people use for quantifying the quality of time-series data obtained from sensors, such as speed, acceleration, relative velocity, and lidar point clouds? Also, how do you define the quality of such time-series data?
#timeseries #timeseriesdata #datanalysis #ts #quality #dataquality #metric #sensors
Relevant answer
Then in your case, I would do a comparative analysis of the variables. For example, I would build a scatterplot between variables on the reference data, and I would also build the same scatterplot between variables in the data under study.
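A minimal base-R sketch of that side-by-side comparison (the data frame and column names are hypothetical):
```r
# Build the same scatterplot between two variables for the reference data
# and for the data under study, and compare the relationships visually.
par(mfrow = c(1, 2))
plot(acceleration ~ speed, data = reference_data, main = "Reference")
plot(acceleration ~ speed, data = study_data,     main = "Under study")
```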
  • asked a question related to Data Quality
Question
3 answers
Dear Han,
After comparing annual flux ET data with CMFD precipitation, it seems that ET at almost all six sites is even greater than precipitation. Therefore, I wonder if it is possible that runoff or groundwater played a crucial role at these sites. Otherwise, it may be a flux data quality problem.
Please refer to the attachment for details
Cheers,
Ziqi
Relevant answer
Answer
I think you need to analyze the hydrology of the study area. At the same time, I think there may be errors in the flux data as well.
  • asked a question related to Data Quality
Question
4 answers
Dear community,
We are working with attention check items in several questionnaires to improve the data quality (compare ). We are also measuring several constructs (such as embodiment, immersion, presence, motivation, self-efficacy etc.) which established questionnaires recommend measuring with many items (sometimes >30). This length makes it infeasible given participants' limited attention and time. Thus, we have shortened some scales. I would like to justify why this is acceptable given the higher data quality due to the attention check items. Unfortunately, I could not identify any literature that indicates this. Are you aware of anything in this direction? Please also feel free to point out any literature regarding shortening scales or the trade-off of long scales and time.
Thank you!
Relevant answer
Answer
This paper and its reference list should help:
Marsh, H. W., Martin, A. J., & Jackson, S. (2010). Introducing a short version of the physical self description questionnaire: new strategies, short-form evaluative criteria, and applications of factor analyses. Journal of Sport and Exercise Psychology, 32(4), 438-482.
  • asked a question related to Data Quality
Question
3 answers
Usually taqman qPCR gene expression assays using taqman fast advanced mastermix require 20ul reaction volumes. I wanted to scale down reaction volumes to 10 or 5ul to save on reagents but I was wondering if this is possible without sacrificing data quality.
Relevant answer
Answer
Yes, you can make a 10 ul reaction mixture for TaqMan! However, when running the PCR you need to make sure that you have changed the total volume to 10 ul instead of 20 ul.
Best of luck
  • asked a question related to Data Quality
Question
4 answers
Hello,
Looking for studies about data quality improvement considering a fixed model. What are the options in the case of limited data (<10,000 samples)?
Regards
Relevant answer
Answer
Hi Johnson Masinde , thank you for your answer. I requested the file to get more details, but it seems to answer the question only partially. I'm more interested in data improvement methods for the case of limited datasets.
  • asked a question related to Data Quality
Question
3 answers
I have found reports mentioning the limits of both, but apart from answers on the scope (EF is broader, CF is more specific, and they are also quite interrelated), I can't find a comparison of the quality of the data itself.
Relevant answer
Answer
The ecological footprint is a more suitable representation of environmental degradation; the carbon footprint is only a small portion of it.
  • asked a question related to Data Quality
Question
6 answers
Is there any dataset of smart-meter (energy) data that is labeled with data quality flags?
Thanks
Aslam
Relevant answer
Answer
Collection of real time and crisp data and its validation.
  • asked a question related to Data Quality
Question
5 answers
Hi!
I'm working on a research project and we have several kinds of hydrologic datasets (e.g. river discharge, precipitation, temperature, etc.).
We need to be sure about the quality of our datasets.
Is there any test (preferably a recent one) to assess the quality of these datasets?
Thanks
Relevant answer
Answer
I have a few ideas. Try to find out who installed the equipment and collected the data, and who interpreted and edited the data for final data set entry. What were their methods and quality control? Do you have just digital data, or examples of strip charts, punch tapes, site visits with comments, etc.? Who worked up the data, and how did they deal with missing records? What equipment was used, and was it installed correctly for research? Are there documents such as a data library, publications, pictures? Can you visit the sites of data collection and look over any remnants such as staff gauges, benchmarks of cross sections or bridge sites, historical aerial photos covering the area, streamflow records used for stage-discharge curves, and adjustments over time? Consider the location of rainfall sites relative to the watershed measured and the topography. I think you would find, in discussing this topic with US Geological Survey professionals and research hydrologists or others involved with this work, that a substantial amount of time is used in development of a quality data set. Someone should have documented the specifics, but realizing how sometimes this is not done, or personnel change and may clean out and dump files or records, even reports, much can unfortunately be lost with time. Some research-grade stations like the USFS Coweeta Hydrologic Experiment Station, or agencies like the USGS, take much effort in cataloging, filing, logging and storing past records, notes, etc. for posterity.
Also, plot the time series data out. Do the data from the various stations seem to correlate with each other? Are the readings frequent, as in many times per day, which is a good sign, or infrequent, as in daily readings, which is not as good for quality? Do the flow hydrographs coincide with rainfall events? Do temperatures change with day and night, rainfall events, etc.? Are there other streamflow stations in the vicinity whose records correlate well, taking into account some time differences in storms and response?
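A minimal base-R sketch of the plotting and correlation checks described above (the data frames and column names are hypothetical: 'q' holds daily discharge per station, 'p' holds daily rainfall):
```r
# Do nearby stations track each other, and do flows respond to rainfall?
plot(q$date, q$station_a, type = "l")
lines(q$date, q$station_b, col = "red")               # overlay a second station
cor(q$station_a, q$station_b, use = "complete.obs")   # correlation between stations
plot(p$rain, q$station_a)                             # flow response to rainfall events
```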
  • asked a question related to Data Quality
Question
6 answers
I used a Zetasizer Nano ZS to measure my polystyrene microparticles, and the data quality report suggests the zeta potential distribution is poor. I've checked the SFR spectral quality data: all of my 6 replicates for the same sample are below 1. In the technote, a poor zeta potential distribution could originate from improper sample concentration, high-conductivity samples, or fewer than enough sub runs. I've checked the derived count rate, which is 13,000-18,000 kcps with an attenuator setting of 7. The sub run number is 22 for all samples. I saw a blackening effect on the electrode after measurement, which suggests electrode degradation; therefore, I suspect the conductivity is relatively high. However, the dispersant composition only includes 50 mM Tris buffer with some trehalose and glycerol. I'm not sure whether monomodal mode will be required in this case and whether more sub runs would be helpful. Your answers will be highly appreciated.
Relevant answer
Answer
Hi Juelu Wang , you can see the conductivity of your measurements on the Zeta Report or also listed in the Zeta workspace. Yes, it appears that you have enough scattering intensity signal. If your overall mean zeta potential is very close to zero then it may be difficult to improve the data quality. The diffusion barrier method may improve. Here are a few comments that may be useful:
  • asked a question related to Data Quality
Question
6 answers
My research team and I collected a batch of data this month(~150 workers) from Mturk. Despite having many quality checks embedded in the survey (e.g., multiple attention checks, reverse coded items) we still feel that the data are suspicious. We ran a similar study one year ago and one of our measures assesses sexual assault perpetration rates. We used the same measure in our current study and the perpetration rates are unusually high this time. Is anyone else having trouble finding quality participant responses in Mturk? Does anyone have suggestions for how we could target better participants? Are there any forums or blog posts that we should be aware of that will help us better understand what is going on? Any information would help and be greatly appreciated!
Thanks in advance!
Relevant answer
Answer
Unfortunately, I found another recent study on the quality decrease of MTurk data: An MTurk Crisis? Shifts in Data Quality and the Impact on Study Results by Michael Chmielewski, Sarah C. Kucker.
  • asked a question related to Data Quality
Question
5 answers
Kindly suggest the means of removing outliers from a data set so as to improve data quality.
Relevant answer
Answer
You can apply some basic clustering techniques, e.g. k-means, DBSCAN, and so on, for outlier detection and removal.
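A minimal R sketch of both ideas (an illustration, not a recommendation for any specific dataset; numeric data are assumed in a data frame 'x', and the cluster count, quantile cut-off, and DBSCAN parameters are assumptions that need tuning):
```r
# Flag points far from their k-means centroid, and points DBSCAN labels as noise.
library(dbscan)

xs <- scale(x)                              # standardise before distance-based methods

km <- kmeans(xs, centers = 3)
d  <- sqrt(rowSums((xs - km$centers[km$cluster, ])^2))
kmeans_outliers <- d > quantile(d, 0.95)    # points unusually far from their centroid

db <- dbscan(xs, eps = 0.5, minPts = 5)
dbscan_outliers <- db$cluster == 0          # DBSCAN labels noise points as cluster 0
```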
  • asked a question related to Data Quality
Question
16 answers
In a text classification task, if data quantity is low but data quality is not low, we could use data augmentation methods for improvement.
But my situation is that data quantity is not low and data quality is low (noise in the labels, i.e. low training-data label accuracy).
The way I obtain the low-quality data is through unsupervised or rule-based methods. In detail, I deal with a multi-label classification task. First I crawl web pages such as wiki pages and use regex-based rules to mark the labels. The model input is the wiki title and the model output is the rule-matched labels from the wiki content.
Relevant answer
Answer
Low-quality data may be noisy and, as you have written, such data could be reprocessed; but if the data are irrelevant, out of context, and do not reflect the problem and requirements, then the data would be futile. Unsupervised or rule-based methods can help by building an understanding of the new problem and providing rules that support labelling decisions.
  • asked a question related to Data Quality
Question
7 answers
Hello, I am trying to get RNA-Seq done on a lot of samples and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core). Does anyone have any experience with this company, about the data quality or reliability?
Please let me know,
Thank you,
Relevant answer
Answer
Excellent quality. I have run RNAseq, BSseq and genome sequencing (our own libraries and Novogene-built libraries) through their facility and always get fantastic data.
  • asked a question related to Data Quality
Question
3 answers
Good-quality data is an essential indicator in clinical practice or research activities. There are various attempts to define data quality, which are heterogeneous, domain-specific. I am looking for current and published data quality evaluation frameworks particular to data from electronic health records.
  • asked a question related to Data Quality
Question
5 answers
To perform data quality assessment in the data pre-processing phase (in a Big Data context), should data profiling be performed before data sampling (on the whole data set), or is it OK to profile only a subset of the data?
If we consider the second approach, how is sampling done without having information about the data (even some level of profiling)?
Relevant answer
Answer
Hadi Fadlallah , yes. That should decrease computational expense and let you investigate a subset instead of the whole set. That is similar to a data science process where a small dataset is analyzed first and the methods are afterwards applied to the big dataset.
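A minimal R/dplyr sketch of profiling a simple random sample first (the data frame 'big_data' and the 1% sampling fraction are assumptions for illustration):
```r
# Profile a small random sample; repeat the profiling on the full data only
# for the columns and checks that turn out to matter.
library(dplyr)

smp <- slice_sample(big_data, prop = 0.01)

profile <- summarise(smp, across(everything(),
                                 list(missing  = ~ mean(is.na(.x)),
                                      distinct = ~ n_distinct(.x))))
```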
  • asked a question related to Data Quality
Question
4 answers
Given the data set, I want to identify the faulty records/data points values in the original data and then try to rectify it. The data-set is a mixture of numerical and categorical variables ( Total: 200 variables) with 2 million records.
I have tried Frequent pattern mining to achieve this which gives rules for variables and values in the data. ( Works well but takes time.)
I wanted to understand if something similar to this can be achieved by deep learning capabilities with some more insights.
Relevant answer
Answer
Sorry I'm late to the party, but if you know which datum is faulty you should be able to solve the problem. If you don't know, how do you decide? Which came first, the chicken or the egg?
David Booth
BTW, outliers often get a bad rep but are actually useful. How do you tell the outliers from the approx. 2.5% of data that lie in a tail? See the attached.
  • asked a question related to Data Quality
Question
5 answers
We often describe the 3Bs of data quality issues, namely Broken, Bad, and Background. "Broken data" means most data are collected at different times by different people; sometimes the data history has missing data sets. "Bad data" means the data have outliers, which might be caused by noise, a wrong collection setup, or degraded/broken sensors, etc. "Background of data" means the collected data lack working-environment information. For example, with jet engine data that lack weather, wind speed, and air density data, it will be difficult to analyze fuel consumption issues. We also need a closed-loop data system which allows users to quickly assess whether the data are useful and usable. If not, users can further improve the data collection system to avoid collecting useless data.
Relevant answer
Answer
This is like saying a road grader is poor because it doesn't do well on a concrete pavement. It is like saying the fish tastes red: red is not a property of fish tasting, and machine learning is not a form of data. WORDS HAVE MEANINGS. Thanks to Prof. Arnold Insel for the fish tasting example. David Booth
  • asked a question related to Data Quality
Question
14 answers
Data quality impacts the accuracy and meaning of machine learning for industrial applications. Many industry practitioners do not know how to find and use the right data. There are two levels of data: visible vs. invisible. Most of the visible data are from problem areas or based on our experiences. General questions for the visible data are: First, how do we find the useful data? Second, how do we evaluate which data are usable? Third, which data are most critical? As for the invisible data, very often people use either an ad-hoc or trial-and-error approach to find them, but often the work cannot be reproduced by others. We need a systematic approach to address data quality issues in AI for industrial applications. We welcome people to share their research experiences and thoughts.
Relevant answer
Answer
To address data quality, it should be considered in its totality irrespective of its usage. Once the data are complete they will certainly serve the purpose, no matter whether it is for AI, ML or any other application.
To achieve the highest level of data quality we should try to cover one or all of the following dimensions of data quality:
1. Uniqueness
2. Completeness
3. Timeliness
4. Consistency
5. Accuracy
6. Validity
The successful implementation of these dimensions in the overall data quality process will decide the ultimate shape of your data.
- Another key point to note is that data quality is not really an IT challenge to solve alone; there should be equal participation from the actual data consumers or data creators who are responsible for it. Without a complete understanding of the data it would be impractical to decide whether data are right or wrong.
To implement data quality, one should not try to "boil the ocean"; it is better to go with an iterative approach.
Step 1: Run data profiling to discover the high-level data issues.
Step 2: Select the most critical data elements (CDEs) that are candidates for data quality.
Step 3: Decide the DQ dimensions that make sense for each given CDE.
Step 4: Define and develop data quality rules to check for data errors (a minimal sketch of such rules follows after these steps).
Step 5: Produce the exception report (with details of the records not satisfying the rule criteria) and share it with the data users or owners.
Step 6: Develop a process to correct the data issues and monitor the progress.
Step 7: Repeat Steps 1 to 6 until all the candidate data are covered by the data quality process.
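Referring to Step 4 above, a minimal base-R sketch of simple data quality rules and an exception report (the data frame 'dat' and its column names are hypothetical illustrations of completeness, uniqueness and validity checks):
```r
# Each rule extracts the records that violate it; the counts form a simple report.
exceptions <- list(
  missing_customer_id   = dat[is.na(dat$customer_id), ],                          # completeness
  duplicate_customer_id = dat[duplicated(dat$customer_id), ],                      # uniqueness
  age_out_of_range      = dat[!is.na(dat$age) & (dat$age < 0 | dat$age > 120), ]   # validity
)
sapply(exceptions, nrow)   # how many records fail each rule
```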
  • asked a question related to Data Quality
Question
6 answers
Dear community,
I would be very grateful if someone could advise me on a package in R / RStudio to analyse the ddCt values from a qPCR. Ideally, the package has some example data and tools for the whole pipeline of analysis (import of data, quality control, analysis and visualization).
I checked a few but was so far not very satisfied. Thanks for the help!
Relevant answer
Answer
Dear Ruslan,
I think it is worth checking out EasyqpcR (https://www.bioconductor.org/packages/release/bioc/html/EasyqpcR.html). It is a very comprehensive and well documented package and has been used in a variety of recent publications. A major advantage is that it can be run via a GUI (using gWidgets; see: https://bioc.ism.ac.jp/packages/release/bioc/vignettes/EasyqpcR/inst/doc/vignette_EasyqpcR.pdf), which makes it very easy to use.
If you are looking for a more advanced tool to analyze your qPCR data which can cope with raw intensity values and perform melting curve analysis etc. I recommend the qpcR package (https://academic.oup.com/bioinformatics/article/24/13/1549/238435) which can do quite a lot of cool stuff.
If you are running a high-throughput experimental test design, maybe check out this tool: HTqPCR (https://academic.oup.com/bioinformatics/article/25/24/3325/235116)
To facilitate qPCR data import there is another nice tool available called ReadqPCR (https://www.bioconductor.org/packages/release/bioc/html/ReadqPCR.html).
For more options, check out the comprehensive review from 2014 about qPCR analysis tools, which compares, among others, 9 R packages. Maybe you will find something more interesting ;) Good luck!
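Independently of the packages above, a minimal dplyr sketch of the ddCt calculation itself (an illustration only; the data frame 'qpcr' and its columns are hypothetical, and it assumes one Ct per gene per sample, i.e. technical replicates already averaged):
```r
# qpcr: columns sample, group ("control"/"treated"), gene ("target"/"reference"), ct
library(dplyr)

res <- qpcr %>%
  group_by(sample, group) %>%
  summarise(dct = ct[gene == "target"] - ct[gene == "reference"],
            .groups = "drop") %>%
  mutate(ddct        = dct - mean(dct[group == "control"]),   # relative to control mean
         fold_change = 2^(-ddct))
```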
  • asked a question related to Data Quality
Question
1 answer
I am comparing datasets of different quality (uncertainty) by assessing model performance of species distribution models.
Would it be correct to use cross-validation in this case?
Since training and testing data contain the same level of uncertainty, I would expect model performance to be inflated in case of bad quality, and I doubt that the difference in model performance between two different datasets will represent the difference in data quality correctly, unless both are validated with the same validation set.
I am aware that validation with external structured data is always the best option in this case, but suppose that this is not available.
Kind regards,
Camille
Relevant answer
Answer
  • asked a question related to Data Quality
Question
4 answers
Dear all,
I would like to invite you to participate in a research project, about the development of a data governance framework for smart cities. The aim of this research project is to analyse the data requirements of smart cities and to propose a suitable data governance framework that will facilitate decision making, improve operational efficiency as well as ensure data quality and regulatory compliance. Once completed, the framework will be applied to NEOM (www.NEOM.com), a greenfield project involving the design and build of a mega smart city from the ground up, where cutting edge technology will form the backbone of the city’s infrastructure.
To participate in the survey, please click on the following link.
Your support and input will be very much appreciated.
Yours sincerely,
Topé
Relevant answer
Answer
Thank you all for the support!!
  • asked a question related to Data Quality
Question
1 answer
At our core, we are concerned with data quality as well as the quality of our statistical analyses.
  • What kind of processes have you implemented, just for yourself or within a research team, to check for transparency and reproducibility of your results?
  • Have you any formal or informal peer-reviewing experience with regard to your statistical analysis?
  • What kind of info is mandatory in your syntax files?
Thanks!
Relevant answer
Answer
In some instances we use other software packages to perform the statistical analysis based on the syntax developed, and we report this in the write-up for data quality and assurance purposes.
  • asked a question related to Data Quality
Question
3 answers
Hi, we are taking some DLS measurements in a Zetasizer Nano ZS for starch nanoparticle size. We know that we could measure in the Nano ZS90 but we don't have one. We have seen some works in the literature that claim to have measured starch nanoparticles with the same Zetasizer. We know that it is difficult to measure starch nanoparticles due to the dry nature of the samples and the difficulty of dispersing them in water. The main problem is that there is aggregation in the sample and the data quality is poor.
Is there any suggestion for the sample preparation, or an adjustment that should be made in the Zetasizer that we are not aware of?
Also, we are aware of the fact that the particle size should also be confirmed with an SEM measurement.
Relevant answer
Answer
Panagiotis Loukopoulos If you're starting with a dry powder you have a fused collection of sub- and post-micron aggregates and agglomerates. Have you measured the SSA of your starch? If so, what is it? The behavior of the material will reflect the bulk size in terms of properties such as flowability, dusting tendency, filter blockage, etc. Look at Figure 2 in the attached. Nowhere do you see free, independent, discrete particles < 100 nm. Thus your measurements should be taken by laser diffraction and not DLS. Many starches are also in the 20 micron region for mean size and thus, again, not suitable for DLS. Take a look at this recent webinar:
May 28th, 2019 Dispersion and nanotechnology
  • asked a question related to Data Quality
Question
5 answers
I have got a slag sample that contains a high amount of iron. I conducted XRD using a Cu tube on the powdered sample and obtained a very high background (because of iron fluorescence). Then, to improve the data quality, I got it done with a Co tube, but the signal-to-noise ratio is very low. I have attached the .raw files and the screenshots of the data from both the Cu and Co tubes. I know of only one option, i.e. increasing the time per step. Are there any other settings that can improve the signal-to-noise ratio? Please let me know if any additional information is required.
Relevant answer
Answer
Your Co tube case apparently suffers from low photon flux at the detector (compared to the Cu tube case). Did you measure the same sample?
It seems to me that you might have got an experimental issue here.
For example the sample height adjustment needs to be improved.
Please play around with the height position of the sample with respect to the x-ray beam (rotation axis of your set up).
Good luck
  • asked a question related to Data Quality
Question
5 answers
While developing a questionnaire to measure several personality traits in a somewhat unconventional way, I now seem to be facing a dilemma due to the size of my item pool. The questionnaire contains 240 items, theoretically deduced from 24 scales. Although 240 items isn't a "large item pool" per se, the processing time for each item averages ~25 seconds. This yields an overall processing time of over 1.5 hours - way too much, even for the bravest participants!
In short, this results in a presumably common dilemma: what aspects of the data from my item analysis sample do I have to jeopardize?
  • Splitting the questionnaire into parallel tests will reduce processing time, but hinder factor analyses.
  • Splitting the questionnaire into within-subject parallel tests over time will require unfeasible sample sizes due to a) drop-out rates and b) eventual noise generated by possibly low stability over time.
  • An average processing time over 30 minutes will tire participants and jeopardize data quality in general.
  • Randomizing the item order and tolerating the >1.5 hours of processing time will again require an unfeasible sample size, due to lower item-intercorrelations.
I'm aware that this probably has to be tackled by conducting multiple studies, but that doesn't solve most of the described problems.
This must be a very common practical obstacle and I am curious to know how other social scientists tackle it. Maybe there even is some best practice advise?
Many thanks!
Relevant answer
Answer
Sounds like you've created an instrument which is sound as it is based on theory - but this version should only be the pilot version, rather than the final instrument. As you see - there are too many items, which will affect the quality of the data collected. Have you conducted any sort of pilot study - factor analysis - and tested to see how those items relate? You'll probably find some items are redundant, and possibly even some scales ... use EFA to explore how the items load - then as you delete items - the number of items in each scale will decrease - you can then delete scales on the basis of insufficient items.
I find David De Vaus' work on survey design and validation very useful:
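Following the EFA suggestion above, a minimal base-R sketch of an exploratory factor analysis on pilot data (the data frame 'pilot_items' and the number of factors are assumptions; inspect the loadings to spot redundant items and thin scales):
```r
# Exploratory factor analysis with base R's factanal(); items in columns.
efa <- factanal(pilot_items, factors = 6, rotation = "varimax")
print(efa$loadings, cutoff = 0.3, sort = TRUE)   # hide small loadings for readability
```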
  • asked a question related to Data Quality
Question
14 answers
Data quality is crucial in a data science project. How can we improve data quality before it goes into the analysis?
Relevant answer
Answer
Sampling is a basic part of data science. In statistical quality assurance a few empirical formulae are available, of which the variance is the most important.
  • asked a question related to Data Quality
Question
14 answers
I'd like you to participate in a simple experiment.  I'd like to limit this to simple linear and multiple linear regression with continuous data.  Do you have such a regression application with real data?  Some cases with several independent variables might be helpful here.  If you have done a statistical test for heteroscedasticity, please ignore that, regardless of result.  We are looking at a more measureable degree of heteroscedasticity for this experiment.
To be sure we have reasonably linear regression, the usual graphical residual analysis would show no pattern, except that heteroscedasticity may already show with e on the y-axis, and the fitted value on the x-axis, in a scatterplot.
///////////////////////////////
Here is the experiment:
1) Please make a scatterplot with the absolute values of the estimated residuals, the |e|, on the y-axis, and the corresponding fitted value (that is, the predicted y value, say y*),  in each case, on the x-axis.  Then please run an OLS regression through those points.  (In excel, you could use a "trend line.")  Is the slope positive?  
A zero slope indicates homoscedasticity for the original regression, but for one example, this would not really tell us anything.  If there are many examples, results would be more meaningful. 
2) If you did a second scatterplot, and in each case put the absolute value of the estimated residual, divided by the square root of the fitted value, that is  |e|/sqrt(y*), on the y-axis, and still have the fitted value, y*, on the x-axis, then a trend line through those points with a positive slope would indicate a coefficient of heteroscedasticity, gamma, of more than 0.5, where we have y = y* + e, and e = (random factor of estimated residual)(y*^gamma).  Is the slope of this trend line positive in your case? 
If so then we have estimated gamma > 0.5.  (Note, as a point of reference, that we have gamma = 0.5 in the case of the classical ratio estimator.) 
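For participants who work in R, a minimal base-R sketch of the two scatterplots/trend lines described above (it assumes an existing fit such as 'fit <- lm(y ~ x, data = dat)' with positive fitted values, which are needed for the square root):
```r
e    <- abs(resid(fit))    # |e|, absolute estimated residuals
yhat <- fitted(fit)        # fitted values y*

plot(yhat, e); abline(lm(e ~ yhat))      # (1) positive slope suggests heteroscedasticity
r <- e / sqrt(yhat)
plot(yhat, r); abline(lm(r ~ yhat))      # (2) positive slope suggests gamma > 0.5

c(slope1 = coef(lm(e ~ yhat))[2],
  slope2 = coef(lm(r ~ yhat))[2])
```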
I'd also like to know your original equation please, what the dependent and independent variables represent, the sample size, whether or not there were substantial data quality issues, and though it is a very tenuous measure, the r- or adjusted R-square values for the original linear or multiple linear regression equation, respectively.  I want to see whether or not a pattern I expect will emerge.
If some people will participate in this experiment, then we can discuss the results here. 
Real data only please. 
Thank you!
Relevant answer
Answer
Thanks @David -
I take it that it is OK for me to use the data in Appendix 1? Would you happen to have an excel file of those data?
Thanks again - Jim
  • asked a question related to Data Quality
Question
3 answers
Dear colleagues,
I've been working on the qualities of political participation for a while (theoretically as well as empirically).
In one of our latest projects we found that political participation can have negative impacts on trust in institutions and self-efficacy under certain circumstances. Does anyone have similar data or studies? Especially interesting would be data from countries in transition or young democracies. Thank you.
Relevant answer
Answer
Political anything can have negative connotation to many people.
  • asked a question related to Data Quality
Question
4 answers
The study by Josko and Ferreira (2017) explained a use for Data Visualization (DV) in quality assessment paradigms, which they call a "Data Quality Assessment process" (DQAp):
  • They highlight that the problem with using DV in this manner is not in the value of what it can provide visually, but the complexity and knowledge required. 
  • They indicate the need for DV tools to be contextually aware of what is considered “Quality” vs “Defect” therefore requiring such methods to be constructed based on specific requirements which will not be possible for all sources of data.
What is your thought regarding the use of Data Visualization tools as a DQAp? Let's discuss!!
Relevant answer
Answer
Thank you for your detailed answer. May I assume that you agree, but that more is needed for data quality assessment?
  • asked a question related to Data Quality
Question
1 answer
I need to know what is the required quality for different kinds of water demand in a building, like drinking, showering, etc. A quality index for domestic water and the required amount for every application would be very helpful. What is the index that I should search for, and is there any reference for that?
Relevant answer
Answer
I think you should refer to standard number 1053. You can find the quality requirements for drinking water there.
  • asked a question related to Data Quality
Question
1 answer
I wonder if anyone can help with references to the health status of professional Tea tasters, given that Fluoride is absorbed directly through the oral cavity?
I am looking for data on the quality of their teeth and any oral cancer, plus wider potential impacts on the rest of the tasters' bodies.
Is the development of electronic "tongues" related to health issues as well as seeking objectivity?
Relevant answer
Answer
Tea tasting is the process in which a trained taster determines the quality of a particular tea. Due to climatic conditions, topography, manufacturing process, and different clones of the Camellia sinensis plant (tea), the final product may have vastly differing flavours and appearance
  • asked a question related to Data Quality
Question
4 answers
Dear colleagues,
I'm currently researching supplier sustainability assessment and am looking to find any literature on data/response validation
Companies (but also individuals) are often asked to report on sustainability practices and progress. These responses are subjected to varying degrees of validation; ranging from no further checks to complete validation of every response.
Do you have some recommendations (also from non-SSCM channels) or examples of cases / findings where the respondents' data quality was improved through validation?
Many thanks in advance!
Iain Fraser
Relevant answer
Answer
The ISIMIP project has a lot of data on sustainability:
  • asked a question related to Data Quality
Question
2 answers
Hello,
I've completed sequencing my library and am starting my analyses. I'm new to using mothur, and I request assistance with evaluating my data's quality. I've included a subset from one of my samples containing the output of the fasta, fastq, and qual files. The subset contains the same three individual organisms among all files.
I understand the fasta files contain DNA sequences for later identification. It's the output from the fastq and qual files that is giving me the most confusion. For example, what do the numbers signify in the qual file?
Please help me understand what the fastq and qual files are saying about the fasta file.
Thank you,
Gary
Relevant answer
Answer
* Fastq is the sequence + quality information. This file is basically what the sequencing company provides. It is used to control the quality of the reads going into the further pipeline. The fastq file is converted to fasta after adapter trimming and quality assessment.
* Fasta files are the sequence files, in which each sequence is attached to a specific identifier. These are generated after the quality filtering, and the file size is smaller than the fastq.
* Qual files are the fastq minus the fasta; they contain only the quality information and are generated after splitting the fastq file.
If you are using a pipeline which filters out good-quality reads (at your threshold), you don't have to think about what the fasta and qual files contain. Fasta files become easier to interpret after dereplication and clustering the reads into OTUs. Therefore, just follow the protocol. Meanwhile, please go through the manual of mothur to understand how the process works. I suppose you should do that before you start the actual analysis.
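To make the numbers concrete, a minimal base-R sketch: a .qual file stores Phred quality scores, while a fastq stores the same scores encoded as ASCII characters (offset 33 for modern Illumina data). The example quality string here is hypothetical.
```r
qual_chars <- "IIIIHHFF##"                  # hypothetical quality string from a fastq read
phred      <- utf8ToInt(qual_chars) - 33    # numeric Phred scores, as listed in the .qual file
error_prob <- 10^(-phred / 10)              # per-base probability that the call is wrong
```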
  • asked a question related to Data Quality
Question
9 answers
Hi everyone,
I am pretty new in SEM and I am seeking clarification regarding model fit through AMOS.
While performing the analysis I obtained good model fit index values except for RMSEA and CMIN/DF.
For a further comprehension of the model: I used Maximum Likelihood estimation; the model has 4 exogenous (C, D, E, F) and 2 endogenous (A, B) variables; the total number of items (responses provided on a Likert scale) in the questionnaire is 36; the scores of the items were averaged for each variable.
These are other details:
N= 346
CMIN/DF=6.9
NFI=0.98
CFI=0.98
SRMR=0.04
GFI=0.99
RMSEA=0.13
The standardised parameter for the model are:
A <-- C* 0.11
A <-- D 0.08
A <-- E*** 0.36
A < -- F*** 0.22
B <-- A*** 0.46
B <-- C 0.05
B <-- F 0.09
Significance of Correlations:
*** p < 0.001
** p < 0.010
* p < 0.050
The model explaines 28% of the variance in B and 38% in A
Can anyone please suggest how to overcome this problem of the inadequate (poor) RMSEA value?
Can it be a problem related to data quality? I also read in Kenny et al. (2015) that in the case of low d.f. (in my case d.f. = 2) RMSEA is more likely to be over 0.05.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research, 44(3), 486-507.
Relevant answer
Answer
CMIN/df is not really used in these large sample studies, so we can safely ignore it. This leaves the outlier of the RMSEA. In my experience, simple models (1 or 2 factors and a low number of items) tend to follow the pattern in your results.
The main difference of RMSEA compared to the CFI (and other measures) is that RMSEA is a measure of absolute fit and does not have any corrections based on model complexity. It appears you may have a very simple model (2 df), although you don't specify the exact model here, so that assumption may be wrong.
Now, the SRMR is also an absolute measure of fit, but you have a good SRMR. However, according to David A. Kenny (http://davidakenny.net/cm/fit.htm), the SRMR is biased in favor of low df, which you certainly have.
What does this mean? Well, overall you have a simple model that captures a good amount of the variance considering how simple it is, but there's still a large absolute error. Do you need to alter your model? Well, maybe. If there are clear theoretical reasons to do so, you should. However, considering all fit metrics, it is probably "acceptable," but not "good." So I wouldn't change the model unless there is a good theoretical rationale. Then you'll want to provide a good explanation of the fit metrics and any biases in any article or paper you write based on this model.
So is it a problem? Maybe. Depends on how complex your model really is and if there is a theoretical rationale to update your model.
Is it a problem of data quality? Maybe, but probably not. Most SEM techniques can adapt for variance, missing data and so on. It depends on the estimator you use and the degree of these issues. You don't say what estimator you are using or whether the data are continuous, categorical or interval.
Is it because of the low df? Probably, inasmuch as you probably have a simple model and there are some inherent biases in some of the measures.
  • asked a question related to Data Quality
Question
3 answers
I am trying to develop a machine learning algorithm for data quality. The goal is, given a data set, to be able to identify the "bad" records in that data set.
I tried a one-class SVM, and I am a little confused. I assume that I need to train the model with "good" data only, but if I have to do this then I would need to know in advance which records are good, and that is not the goal.
If I train the model with both good and bad classes of records, then the one-class SVM does not give good accuracy. Is this a good approach?
Relevant answer
Answer
No worries. You're welcome.
If the target labels are in your data, then you have a classification task (not clustering).
At the very early stage, make sure that your dataset is prepared well. Then, try many classification algorithms, where each of them needs to be evaluated with k-fold cross-validation, for instance.
You need sometimes to tweak the classifier based on the nature of the data, such as selecting the right kernel.
Cheers,
Dr. Samer Sarsam
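For reference, a minimal R sketch of the one-class SVM variant discussed in the question, using the e1071 package (an assumption on my part; 'clean_records' and 'all_records' are hypothetical numeric data frames, and the model is trained only on records believed to be clean):
```r
library(e1071)

oc <- svm(clean_records, y = NULL, type = "one-classification",
          kernel = "radial", nu = 0.05, scale = TRUE)

is_good <- predict(oc, all_records)   # TRUE = inside the learned "good" region
bad     <- all_records[!is_good, ]    # candidate bad records to inspect
```
If no clean subset is known in advance, a supervised classifier with labelled good/bad records and k-fold cross-validation, as suggested above, is usually the more direct route.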
  • asked a question related to Data Quality
Question
4 answers
I have heard people suggest that I track the dominant eye when doing single-eye eye-tracking, because the "dominant eye will give more accurate data". However, in normal subjects the fixation locations of the two eyes should be quite consistent. Is the data quality significantly different between dominant-eye tracking and non-dominant-eye tracking?
Relevant answer
Answer
No, unfortunately. I am still working on it. However, take a look at the study attached. In the opposite way, they consider a fixation accurate where the dominant eye is located on the target. Also, what eye tracker are you using? I am using the tobii x3-120 and in their manual they recommend using the dominant eye for more accurate precision. I am finding in my study the dominant eye is more accurate too.
Paterson, K. B., Jordan, T. R., & Kurtev, S. (2009). Binocular fixation disparity in single word displays. Journal of Experimental Psychology: Human Perception and Performance, 35(6), 1961-1968. doi:10.1037/a0016889
  • asked a question related to Data Quality
Question
4 answers
Hello all,
I want to create a composite index that will measure the well-being of EU countries. My data set is a panel (15 years, 28 countries and 15 indicators). I want to create the index based on the principal components technique, but I am not sure what I should do beforehand.
What I was thinking of applying is:
-> removing outliers: I do not think it is necessary to remove them for this analysis. Is this right?
-> check the seasonality of the data
-> check the stationarity of the data with a unit root test: some variables need to be standardized. Should I standardize them by taking logs? If I use the diff technique, I will lose some rows.
One mention here is that SPSS has the option to standardize variables, but when I take them into EViews in order to run the unit root test, I see that they still need to be standardized (my guess is that SPSS standardization means normalization).
-> check the Cronbach's alpha values (they should be over 0.6 to continue applying PCA)
-> PCA
Thank you!
Relevant answer
Answer
The index will be used for classification and for prediction (probably with KNN).
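For the PCA step itself, a minimal base-R sketch (the data frame 'indicators', with one column per indicator, is hypothetical; scale. = TRUE standardises the indicators):
```r
pca <- prcomp(indicators, scale. = TRUE)
summary(pca)            # variance explained per component (decide how many to keep)
index <- pca$x[, 1]     # first principal component as a simple composite score
```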
  • asked a question related to Data Quality
Question
4 answers
Hi! Can anyone share some practical experience with bacterial WGS on HiSeq 2500 technology (average depth of 200X coverage)?
I would like to know mainly about: 1. data quality, 2. average read length, 3. error rate, 4. advantages over MiSeq.
Thanks.
Relevant answer
Answer
agree with @ Ajit kumar Roy
regards
  • asked a question related to Data Quality
Question
3 answers
I am looking for data analysis tools that can be added into a new database (legacy system migration, without database from legacy system) which takes structured data (pre-determined format, seen as correct) as an input.
Relevant answer
Answer
There are many data analytic tools (usually commercial) that claim that they can provide information about the quality of your files (and some also claim that they can subsequently provide files where the errors are 'corrected'). The methods are called 'profiling' tools. I am not aware of any that work in a minimally effective manner. Difficult errors include determining 'duplicates' in files using quasi-identifying information such as name, address, date-of-birth, etc. Two records may be duplicates (represent the same person or business) even when the quasi-identifying information has representational or typographical error. Any quantitative information from the two records representing the same entity may have slight (or major) differences. If the records have missing values associated with the data that an individual wishes to analyze, then the missing values should be filled in with a principled method that preserves joint distributions (e.g., Little and Rubin's book on missing data, 2002). The 'corrected' data may also need to satisfy edit constraints (such as a child under 16 cannot be married).
https://sites.google.com/site/dinaworkshop2015/invited-speakers
  • asked a question related to Data Quality
Question
3 answers
How can I check climatic data quality aside from comparison with other stations' data or checking for outliers?
  • asked a question related to Data Quality
Question
6 answers
We have a wearable eye-tracker in our lab (SMI ETG-2w), and I hope to start some experiments using it. I notice that many desktop eye-trackers provide a sampling rate of 500-1000 Hz and allow 9-point calibration before the experiment, whereas ours only has a sampling rate of 60 Hz (though it can be upgraded to 120 Hz if we pay some extra money) and only allows 3-point calibration.
We hope to do some serious neuroscience & psychophysics experiments, analyzing saccades, fixations, and pupil size, and the subjects will sit in a lab. No outdoor experiments are currently planned. Now I have some doubt about whether our eye-tracker can provide enough precision & accuracy, as in pilot runs, when we show dots at random locations on the screen and let our subjects fixate on them, our eye-tracker could not reliably give the correct fixation locations for some dots.
Do wearable eye-trackers always provide worse data than desktop eye-trackers? I hope someone with experience of both kinds of eye-trackers can help me know what level of data quality I can expect at most.
Relevant answer
Answer
Hey Zhou,
I would add that it critically depends on what you mean by "serious neuroscience & psychophysics experiments". In the linked question, I have collected some related questions on ET use.
In general, if you don't need subject mobility, more restraints are better (head/chin rest, bite bar, etc.) and will yield better results (in terms of accuracy and precision).
In addition, stationary ETs are currently still faster by a large margin. The fastest VOG-based head-mounted ET is around 300 Hz, whereas for stationary, head-fixed VOG-ETs you can go up to 2000 Hz. Since head-mounted ETs use a (low frame rate) scene camera to record the participant's environment, many systems don't see the need to go to higher sampling rates.
For fixation analyses, which is more common in psychology & related fields, low sampling rates are not a problem.
But as soon as you go for saccades, sampling rate becomes a critical factor:
  • Small saccades, or even worse, microsaccades, have very short durations (<30 ms) which you simply cannot record with low-sampling systems (see Johannes' example).
  • Peak velocities of all saccades except the largest ones get severely distorted at sampling rates below 500Hz. Upsampling can get you down to 250 Hz or even 120 Hz (see Mack2017). With some elaborate bulk of signal processing procedures you might get acceptable results even at 60 Hz (see Wierts2008).
  • Onset latencies, and thus durations, also have increased jitter with lower sampling rates due to onsets between samples (see Andersson2010).
Finally, the calibration makes a difference (see Nystroem2013).
TL;DR: Provide more info on what you plan to do. If you analyse saccade peak velocities, get a stationary ET.
Hope that helps, Greetings, David
  • asked a question related to Data Quality
Question
11 answers
I'm new to NGS analysis. There are many QC (quality control) software packages for NGS data; which one is best? I'm using https://usegalaxy.org so it's better if it is available in this web server. Or is there any better program for QC to run on my laptop? Is it too heavy a computation for a laptop (Core i3 CPU and 4 GB RAM)?
Relevant answer
Answer
There are many tools actually. If you find my book chapter useful, it could hopefully help you. Although the title says it's basically about RNA-Seq data, you would get an idea about many tools that apply to NGS data QC. Below is the link to my chapter.
  • asked a question related to Data Quality
Question
3 answers
I've used REDCAP for several years but we're initiating data collection on a new project that includes 10 parent survey instruments and 13 forms that research assistants are entering standardized assessment data into. I'm trying to figure out if there is a way to restrict double data entry to just the assessment data (not the surveys) and if there are recommendations about user permissions / data access groups. Any advice from REDCAP users with experience with double entry would be greatly appreciated!
Relevant answer
Answer
The REDCap administrator is able to enable double data entry under User Rights for your project. You can further define whether one or two users will capture the data. At our university it is fairly easy to get the REDCap administrator to make changes to a project.
All the best with your REDCap project.
  • asked a question related to Data Quality
Question
5 answers
Fellow researchers,
I am looking for an article (review?) that investigates if quick/timely/near real-time reporting of health care quality data for benchmarking / quality improvement purposes results in higher acceptance of such data by providers.
I want to make the argument that if you want to use such data for quality management you need to have it quickly, i.e., not wait months or years before reports are published. The reviews that I found emphasize all sorts of contextual factors for the success of quality improvement interventions in healthcare, but not timeliness of reporting. Can anybody help? Many thanks!
Relevant answer
Dear Christoph Kowalski, it is not exactly what you are asking for (i.e., a review or study reporting on increased acceptance of or satisfaction with the reporting system), but 'timeliness' and 'punctuality' in the provision of analytical data are considered quality criteria for any monitoring system (even more so in the case of clinical performance analysis systems).
As an example, the Eurostat Quality Assurance Framework establishes both 'timeliness' and 'punctuality in the delivery of metrics' as its 13th principle (page 30, http://ec.europa.eu/eurostat/documents/64157/4372717/Eurostat-QAF-V1-2.pdf), defining 'timeliness' as 'the availability of the information needed to take action at the appropriate moment of decision making'. This means the information does not necessarily have to refer to a precise period, but it must be ready and available to the professional (health or otherwise) to facilitate an informed decision. In clinical settings, it is defined according to contextual needs. One of the best examples you could use to argue your point is the analogy of laboratory testing in emergency care as an information system allowing timely decision making based on current data on the patient, thus prompting action to change an urgent situation. This example can be extended to other health information systems already mentioned in the other answers above.
Similarly, 'punctuality' in delivering those metrics according to a predefined delivery plan is also stated as a data quality principle, as it sets the expectation of the user of the information system regarding data availability within acceptable margins, facilitating the use of the information provided.
Both 'timeliness' and 'punctuality' are quality dimensions that you should take into account when trying to create an information system that is useful in the clinical context.
  • asked a question related to Data Quality
Question
2 answers
The basic concern is whether the existence of a regulated environment improves data quality and reproducibility.
Relevant answer
Answer
I don't know. First I'd want to see the data, and then see what makes sense... however, since I haven't seen anything, no worries.
thanks for replying,
best regards,
joe
  • asked a question related to Data Quality
Question
3 answers
Which machine learning techniques are used for SAR data quality assessment?
  • asked a question related to Data Quality
Question
3 answers
Background: I'm using an online survey company to launch a questionnaire nationally. The survey is to identify factors affecting online shopping intention, so there will be a lot of scale questions, as well as respondents' demographic information. I'm asking the company to get 50 responses before carrying out the full survey.
Question: Would you suggest some quick stat tests using the pilot data, to check if the responses are of high quality? (whether the profile is diversified / whether they answered the questions carefully / etc)
Many thanks for your help!
Li
Relevant answer
Answer
Would you suggest some quick stat tests using the pilot data, to check if the responses are of high quality? (whether the profile is diversified / whether they answered the questions carefully / etc)
You should always conduct a pilot study / test before going for "big bang" data collection. The purposes of the pilot study include the following (not an exhaustive list):
  1. checking the reliability of your survey questionnaire, especially for new questions you developed on your own to measure certain constructs / variables, e.g. Cronbach Alpha Reliability / Composite Reliability scores (see the sketch after this list).
  2. evaluating whether your future respondents can fully understand your survey questionnaire - if not, issues can be rectified before going for "big bang" data collection.
  3. giving you an initial glimpse of your results after preliminary data analyses based on the hypotheses you have formulated, so that you can adjust your research model, the operationalization of your constructs / variables, your hypotheses, etc., as appropriate.
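As a quick reliability check on the pilot responses, here is a minimal sketch of Cronbach's alpha (my own illustration; the item matrix and the 1-5 scale are made-up assumptions):
```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of numeric scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of individual item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of respondents' total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical pilot data: 5 respondents x 4 items on a 1-5 scale.
pilot = np.array([[4, 5, 4, 4],
                  [3, 3, 2, 3],
                  [5, 5, 5, 4],
                  [2, 2, 3, 2],
                  [4, 4, 4, 5]])
print(round(cronbach_alpha(pilot), 2))
```
Values above roughly 0.7 are conventionally taken as acceptable internal consistency, though the exact cut-off is a judgment call.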
  • asked a question related to Data Quality
Question
3 answers
I am looking for information on existing research on methodology and tools intended to collect data samples representing instances of a novel, ill-structured subject domain. This task is, in some sense, the inverse of Big Data processing: I have to collect, as far as possible, data instances of a novel domain that I am not very familiar with and for which I can formulate only a basic set of ontology concepts. I need this data set (let it be Big Data of any type) to afterwards design an empirical model of the dependencies among the domain attributes, using a Big Data processing technology (in fact, the machine learning technology I have) to mine the collected data set. This is a kind of Ontology-driven Data Search and Collection problem. Could someone help me?
Relevant answer
Answer
Dear prof. Gorodetsky,
I'd suggest you analyze the existing literature on your "ill-structured" domain by means of topic modeling techniques. By inferring relationships across topics, you can build the dependency model you are after. In the attached paper you can find a methodology we have designed for performing such an analysis.
Regards,
Andrea De Mauro
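To illustrate the topic-modeling suggestion above, here is a minimal sketch (my own, not the methodology from the attached paper; the documents are made-up stand-ins for domain literature) using LDA from scikit-learn:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # hypothetical abstracts from the ill-structured domain
    "sensor networks collect environmental data for monitoring",
    "ontology concepts describe relationships between domain entities",
    "machine learning models predict dependencies from collected data",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-5:][::-1]]  # five highest-weight terms
    print(f"topic {i}: {', '.join(top)}")
```
Inspecting which ontology concepts co-occur within the same topics is one simple way to seed the dependency model.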
  • asked a question related to Data Quality
Question
6 answers
I have a particle sample which will aggregate to form bigger, micrometer-sized clusters as I heat it up. I want to use dynamic light scattering (DLS) to determine the size of the aggregates during the heating process. However, given the polydispersity of the aggregates, the DLS instrument always tells me that the "data quality is too poor because the sample is too polydisperse". I am wondering: is the Z-average size I get still reliable? Can I rely on the size distribution from the instrument, or do I need to use other methods to derive the size distribution of the polydisperse aggregates? Thanks very much!
Relevant answer
Answer
I am afraid that a simple Z-sizer is not an appropriate instrument for such measurements. To be confident in the results, the measurements should be done at 2 or 3 scattering angles, and it should be checked that the correlation function processing program reproduces the location of the size distribution peaks. Note also that when particles are large (and their number is small), the scattering volume may be occupied by different particles at different times, so several measurements are needed at each scattering angle. So for DLS measurements of a complex size distribution you should use a Malvern DLS sizer with the option to change the scattering angle, an analogous Nanotrec sizer, or a Photocor FCN sizer with a good program for expanding the correlation function into exponentials. For Photocor the program DynaLS is sufficient; for other sizers compatible software is needed. Look for a DLS setup of this class.
  • asked a question related to Data Quality
Question
9 answers
My data will be the quality audit reports of 45 HEIs in the Sultanate of Oman.
How can I validate/confirm the proposed model/framework?
Thanks in advance.
Relevant answer
Answer
Yes, I think SEM should help. But a model should enumerate endogenous and exogenous variables with well-grounded hypotheses, and should avoid both type-I and type-II errors.
It would also be necessary to validate it with case-based inputs.
  • asked a question related to Data Quality
Question
3 answers
I am looking to create a system that proactively monitors data stores for data quality, where the data quality rules are expressed in a language understandable by a lay-person (e.g, Attempto). The language constructs could then be parsed into the equivalent data store-native language (e.g., SQL, NoSQL, etc.).
If you know of any research in this area, I would be most appreciative.
Relevant answer
Answer
Fellegi and Holt (JASA 1969) rediscovered one of the fundamental theorems of logic programming.  The mathematics of the models drive the implementations.  Winkler (2003) demonstrated theoretically how edit rules could be logically connected with imputation procedures so that there were guarantees that the joint distributions of fields would be preserved in a principled manner.  The mathematics guarantee that the entire set of rules are logically consistent (including rules on the joint distributions that are standard in statistics).  All Fellegi-Holt-based systems assure logical consistency.
A number of papers, such as Galhardas et al. and many papers on variants of symbolic logic, show that rules expressed in such languages can (sometimes) be converted to the types of rules in the Fellegi-Holt model. What the rules conversion does not assure is the logical consistency of the overall system. We found in small situations that it was very straightforward to convert the English-type rules to suitable tables (sometimes in less than four hours). In much larger situations, such general rules-based methods would have been very useful, but we never had the resources to pursue the additional features in our production systems.
In some of our systems, we need to convert hundreds or thousands of rules into table format. The conversion is tedious. Once in table format, it is relatively straightforward (though it may take much computer time) to check the logical consistency.
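On the parsing side of the original question, here is a minimal sketch (my own, with a hypothetical rule format and hypothetical table/column names) of rendering a lay-person-readable range rule as a data-store-native SQL check:
```python
def rule_to_sql(table: str, field: str, low, high) -> str:
    """Render 'every <field> is between <low> and <high>' as a violation-count query."""
    return (
        f"SELECT COUNT(*) AS violations FROM {table} "
        f"WHERE {field} NOT BETWEEN {low} AND {high} OR {field} IS NULL"
    )

# "Every patient's age is between 0 and 120."
print(rule_to_sql("patients", "age", 0, 120))
```
A real system would of course need a proper grammar (e.g. Attempto-style parsing) and, as the answer above stresses, a consistency check across the whole rule set.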
  • asked a question related to Data Quality
Question
3 answers
We assume that a composite service consists of a number of atomic services.
Given the quality of each atomic service, how do we measure the overall quality of the composite service?
For example, we can define that the higher the mean value, the higher the overall quality, and the lower the variance, the higher the overall quality. However, the question is how to aggregate the mean value and the variance. Are there any metrics?
Relevant answer
Answer
I would use a weight for each atomic service, a measure that reflects the size and/or relevance of that service. That is something you have to come up with.
Then simply multiply each weight by the respective DQ factor and use those values to compute the average, standard deviation, etc.
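A minimal sketch of that weighted aggregation (my own illustration; the weights, scores, and the final combined metric are assumptions, not an established standard):
```python
import numpy as np

quality = np.array([0.95, 0.80, 0.99])  # DQ score of each atomic service
weight = np.array([0.5, 0.3, 0.2])      # relevance/size weights, summing to 1

w_mean = np.average(quality, weights=weight)
w_var = np.average((quality - w_mean) ** 2, weights=weight)

# One possible scalar summary: reward a high mean, penalize spread.
composite = w_mean - w_var
print(round(w_mean, 3), round(w_var, 4), round(composite, 3))
```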
  • asked a question related to Data Quality
Question
2 answers
I am familiar with Actor-Network Theory; I applied ANT as a tool to construct a collaboration network, whereas now I want to use ANT with data quality as a metric for choosing the actor network.
Thank you in advance
Salim
Relevant answer
Answer
Hi Mohammed
I have a small list of ANT papers which might be interesting for you. I have also pulled out a much shorter list of those which address quality issues. I hope that helps.
Regards
Bob
enc
  • asked a question related to Data Quality
Question
3 answers
In situations where it is not feasible to quantify a DQ dimension because no data set is involved, is it sufficient to present subjective measures obtained directly from the customer?
To be clearer: a customer's satisfaction that the desired requirement has been met cannot be objectively measured. In this case, is customer satisfaction feedback sufficient for successful approval of the project when presenting to a committee?
Relevant answer
Answer
There are many studies linking subjectively obtained, survey-based measures of customer satisfaction with its outcomes. For a summary, see:
Mittal, Vikas and Frennea, Carly, Customer Satisfaction: A Strategic Review and Guidelines for Managers (2010). MSI Fast Forward Series, Marketing Science Institute, Cambridge, MA, 2010.
  • asked a question related to Data Quality
Question
7 answers
Specifically, I'm interested in how to use ANT in the governance of information systems; I also want to use ANT with data quality.
Relevant answer
Answer
Hi There
It has a huge list of IS models/theories in alphabetical order, and Actor-Network Theory is there. Click on Actor-Network Theory and scroll down to where it says "IS articles that use the theory" - there are plenty, and some are very recent. Enjoy, and good luck.
  • asked a question related to Data Quality
Question
46 answers
I have over 5 years of experience developing predictive models, along with years of experience as a researcher, statistical analyst, and data scientist. One thing I have noticed in the big data sphere and the predictive modeling landscape is that a lot of emphasis is placed on data quality and quantity, on the experience and expertise of the modeler, or on the kind of system being used to build, validate, and test the model and to monitor and assess its quality and performance over time. With this said, I would like to hear what others here on ResearchGate think are some of the challenging tasks in building statistical or predictive models, and what strategies you employed to address those challenges. What tradeoffs did you have to make, and how would you approach a similar situation in the future?
Information provided to this inquiry will be used for personal and professional growth.
Relevant answer
Answer
Hello,
For me, possibly the most challenging is/was/will be to identify a niche within a vast amount of knowledge to be able to introduce meaningful research questions that would be novel and could enhance understanding of mechanisms underlying psychological disturbances (Yes, I am a scientist and a psychologist).
Sounds like a philosophical statement. But it matches your question - very broad and philosophical too.
Regards, Witold
  • asked a question related to Data Quality
Question
3 answers
The aim of data fusion (DF) is basically to increase the quality of data through the combination and integration of multiple data sources/sensors. My research is on assessing this impact of DF on DQ (data quality), hence I would appreciate academic materials to back up your conclusions.
I have been trying to link DF methods to the DQ dimensions they most affect, to no avail.
Relevant answer
Answer
DF improves DQ when (and only when) the different data input streams are to some degree correlated. If not, DF does not make any sense.
Sorry - I cannot give you academic materials because I have none on this topic. The above is from experience and some own work in this area.
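As a hedged illustration of the point (my own sketch, not from the answer above): when two noisy sensors observe the same quantity, inverse-variance fusion improves accuracy, which is one classic DQ dimension.
```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
s1 = true_value + rng.normal(0, 1.0, 10_000)  # sensor 1, sigma = 1.0
s2 = true_value + rng.normal(0, 2.0, 10_000)  # sensor 2, sigma = 2.0

w1, w2 = 1 / 1.0**2, 1 / 2.0**2               # inverse-variance weights
fused = (w1 * s1 + w2 * s2) / (w1 + w2)

# The fused stream's variance is lower than either sensor's on its own.
print(s1.var(), s2.var(), fused.var())
```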
  • asked a question related to Data Quality
Question
1 answer
During data entry, we can correct errors before mining by creating forms through question ordering during question reformulation.
  • asked a question related to Data Quality
Question
7 answers
We are doing surveillance in a scattered geographical area. How can we assess data quality? Should we compute error rates or directly observe how data is being collected?
Relevant answer
Answer
There is a direct correlation between the source of data, the collection of data, and the error rate when it comes to data quality. Personally, I think you should pay more attention to how data is being collected, because this is the most likely source of a data quality breach. The more accurate your data collection, the lower the error rate you will observe.
DD=DCM+ER
ER=DD/DCM  
Where
DD=Data Destination (Constant variable- Always known)
DCM=Data Collection Methods
ER=Error Rate
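Separately from the notation above, one common and concrete way to compute an error rate in surveillance data quality assessments is to re-abstract a sample of records from source documents and compare field by field. A minimal sketch (my own; the record structure and values are hypothetical):
```python
def field_error_rate(db_records, source_records, fields):
    """Fraction of checked fields where the database disagrees with the source document."""
    checked = errors = 0
    for db_rec, src_rec in zip(db_records, source_records):
        for f in fields:
            checked += 1
            if db_rec.get(f) != src_rec.get(f):
                errors += 1
    return errors / checked if checked else 0.0

db = [{"age": 34, "sex": "F"}, {"age": 51, "sex": "M"}]
src = [{"age": 34, "sex": "F"}, {"age": 15, "sex": "M"}]  # one transcription error
print(field_error_rate(db, src, ["age", "sex"]))          # -> 0.25
```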
  • asked a question related to Data Quality
Question
3 answers
We intend to assess the quality of data. Can we use the LQAS strategy for assessing data quality?
Relevant answer
Answer
This is an accepted method of sampling in public health and is equivalent to stratified sampling. The sample sizes may be smaller but still allow the development of hypotheses.
WHO does recommend this form of sampling for large studies such as immunisation coverage surveys.
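For data quality, LQAS can be applied by treating each site (or reporting unit) as a lot, sampling n records, and rejecting the lot if more than d records are defective. A minimal sketch (my own; the 19/3 design and the defect-rate thresholds are assumed example values):
```python
from scipy.stats import binom

n, d = 19, 3  # sample 19 records per lot, tolerate at most 3 defective ones

def lot_acceptable(defects_in_sample: int) -> bool:
    return defects_in_sample <= d

# Misclassification risks implied by this (n, d) choice:
p_accept_bad = binom.cdf(d, n, 0.30)       # accepting a lot with a 30% defect rate
p_reject_good = 1 - binom.cdf(d, n, 0.05)  # rejecting a lot with a 5% defect rate
print(lot_acceptable(2), round(p_accept_bad, 3), round(p_reject_good, 3))
```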
  • asked a question related to Data Quality
Question
3 answers
I would love some pointers to existing work/papers on sparsity in large data sets. What I am looking for are not the (important) usual papers on how to compute given sparsity in large data sets; I am instead thinking about how one might use the fact of sparsity and a non-homogeneous distribution of features and relationships to characterize overall solution spaces/data set spaces into regions of greater interest and less interest. 
Relevant answer
Answer
There are a couple of things going on here that can lead to sparsity.
1.  Incompleteness - this can be dealt with assuming or testing that your data set complies with random sampling.
2.  Censored Data - sparsity could result from the inability (instrumental or otherwise) to measure certain things.  This is often referred to as "Survival Statistics" and you can use that as a search term to find useful information.
3.  An unusual selection function. Selection functions are very important and are generally unknown at the outset. They arise from how the data were gathered and from the (unknown) bias or limitations of that data gathering. All data samples, in the end, are the result of some selection function convolved with the intrinsic distribution; this produces the sample distribution.
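On the original question of using sparsity structure itself to mark regions of greater and lesser interest, here is a minimal sketch (my own, not drawn from the answer above; the matrix is randomly generated for illustration) that profiles per-feature observation density and flags the best-covered features:
```python
import numpy as np
from scipy.sparse import random as sparse_random

X = sparse_random(1000, 50, density=0.05, random_state=0, format="csr")

col_density = X.getnnz(axis=0) / X.shape[0]       # fraction of rows observed per feature
interesting = np.argsort(col_density)[::-1][:10]  # ten best-covered features
print(interesting, col_density[interesting])
```
The same profiling can be done over row blocks or feature pairs to locate dense sub-regions worth modeling first.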
  • asked a question related to Data Quality
Question
1 answer
If yes, could you give me an example? In my opinion, the maturity models for MDM / DQ are in most cases unspecific / generic across all business branches.
Thanks and best regards
M. Gietz
Relevant answer
Answer
Thank you for bringing this matter to our attention! It is still a great help and a real pleasure to read your posts.