Science topic

Data Quality - Science topic

Explore the latest questions and answers in Data Quality, and find Data Quality experts.
Questions related to Data Quality
  • asked a question related to Data Quality
Question
3 answers
In many machine learning projects, especially with messy real-world datasets, shortfalls in a model's performance are usually traced back to either the model's architecture or the quality and structure of the input data. It is important to note, though, that when assessing the two factors, the sources of error may not be easy to delineate. Improvement can come from architectural changes, hyperparameter tuning and optimization heuristics, just as much as it can come from better data preprocessing, relabeling, or reconsidering the features used for representation.
How do you go about this decision? When do you reach the point where you stop refining the model and pivot to concentrate on the dataset? Are there empirically defined learning curves, analytical tools, or other indicators that tell you whether you've reached a "data ceiling" instead of a "model ceiling"? I'd like to hear about frameworks, intuitions, or concrete examples across domains such as vision, language, and sensor data that you have found helpful.
Relevant answer
Answer
A few factors, such as how well the model is doing, how good the data is, and how much important information the data holds, help a researcher decide whether to enhance the model or clean up the data. Usually, if the model continues to underperform after testing several configurations, it is preferable to improve the data by including more examples or greater detail. But if the data is already strong, improving the model itself might be more effective. The best outcomes come from enhancing both over time, depending on what you learn at each stage. Hope this helps.
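A practical diagnostic for the "data ceiling vs. model ceiling" question is the learning curve: train on increasing fractions of the data and watch whether validation performance is still rising. Below is a minimal sketch in R, assuming a data frame dat with a factor outcome column y (all names are illustrative, and the random forest is just a stand-in for whatever model you use):
# Learning-curve sketch: does more data still help, or has performance plateaued?
library(randomForest)
set.seed(1)
idx_test <- sample(nrow(dat), round(0.2 * nrow(dat)))
test  <- dat[idx_test, ]
train <- dat[-idx_test, ]
fracs <- c(0.1, 0.25, 0.5, 0.75, 1.0)
acc <- sapply(fracs, function(f) {
  sub <- train[sample(nrow(train), round(f * nrow(train))), ]
  fit <- randomForest(y ~ ., data = sub)
  mean(predict(fit, test) == test$y)   # held-out accuracy
})
plot(fracs, acc, type = "b", xlab = "Fraction of training data", ylab = "Validation accuracy")
# A curve still rising at 100% suggests a data ceiling (collect or clean more data);
# a flat curve with a persistent train-validation gap points more towards the model or the labels.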
  • asked a question related to Data Quality
Question
6 answers
How suitable are existing sustainable development impact evaluation frameworks for assessing the effectiveness of sustainable international business practices by multinational enterprises in developing economies?
NB: This question evaluates to what extent current sustainability impact evaluation frameworks account for the unique economic, social, and environmental conditions of developing economies.
Relevant answer
Answer
We can measure the impact with statistical methods such as the correlation coefficient between the companies' results (as the independent variable) and the economic development results (as the dependent variable); we can also examine the autocorrelation of each company's results. Quantitative measures are effective in these cases.
  • asked a question related to Data Quality
Question
3 answers
Hi,
I’m using a GLMM for my sleep study because it handles missing data and different observation lengths well. My data is unbalanced since this was a field study, not a controlled lab setting. Shift situations varied within and between subjects, and we had a longer pre-phase (~3 weeks) compared to post (~1 week).
We measured 9 shift situations across two groups (control/intervention) in a pre-post design:
  • Pre-Control: 437, Pre-Intervention: 425
  • Post-Control: 191, Post-Intervention: 210
All situations per time/group stratification were recorded at least 15 times, except for one case with only 5 records.
A chi-squared test showed a significant imbalance (X²(24) = 70.02, p < .001). I calculated the Cluster Size Coefficient of Variation (C.V.) = 0.42, based on this reference: https://api.istex.fr/ark:/67375/WNG-DQ4BGH16-P/fulltext.pdf?auth=ip,fede&sid=google,istex-view .
Does the GLMM (lmer(sleep_variable ~ time * group + shift_situation + (1 | subject))) already account for this imbalance, or do I need to use weighted sleep data to improve data quality?
Relevant answer
Answer
Broadly speaking, GLMMs are fine with imbalance, and short of getting more data there's not much better as an option in most cases.
They are, however, not magic. The big concern is simply lack of information where you have small cell sizes in something like a factorial design, or just sparse information because of the imbalance. This leads to imprecise estimates (as it should - less information implies noisier estimates). In other words, just because the model copes well with imbalance doesn't mean you don't have low statistical power in relation to some research questions. As a rule of thumb, power to detect a difference in means is maximised when there is balance and is limited more by the smaller n in imbalanced samples.
A full Bayesian model with informative priors might help with this but in my view that depends on being able to justify the priors.
GLMMs also have shrinkage - which is one reason why they work well with imbalance. All estimates are shrunk towards the estimates for typical observations - this dampens down some of the noise in estimating with imbalance. In a sense they trade off precision for bias - the estimates are arguably more biased within sample but more precise than in models without shrinkage. They should also generalise better out of sample.
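To make the sparse-cell concern above concrete, it helps to tabulate the cell sizes and fit the stated model in lme4; a minimal sketch, assuming a data frame sleep_dat with the columns named in the question:
# Where are the thin cells? (imbalance shows up here, not as a model error)
library(lme4)
with(sleep_dat, table(time, group, shift_situation))
# The model exactly as written in the question:
fit <- lmer(sleep_variable ~ time * group + shift_situation + (1 | subject), data = sleep_dat)
summary(fit)   # imbalance mainly appears as wider standard errors for the sparse conditions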
  • asked a question related to Data Quality
Question
1 answer
The production industry faces several challenges when implementing AI, including:
1. *Data Quality and Availability*: AI algorithms require high-quality and relevant data to learn and make accurate predictions. However, production data can be noisy, incomplete, or inconsistent, which can affect AI model performance.
2. *Integration with Existing Systems*: AI solutions often require integration with existing production systems, such as ERP, MES, and SCADA systems. This can be a complex and time-consuming process, especially when dealing with legacy systems.
3. *Explainability and Transparency*: AI models can be difficult to interpret, making it challenging to understand why a particular decision was made. This lack of transparency can lead to trust issues and regulatory concerns.
4. *Security and Privacy*: Production data can be sensitive, and AI systems must ensure that data is protected from unauthorized access and breaches.
5. *Scalability and Performance*: AI solutions must be able to handle large volumes of production data and perform complex calculations in real-time, which can be a challenge for many production environments.
6. *Lack of Standardization*: The production industry lacks standardization in data formats, communication protocols, and AI frameworks, making it difficult to develop and deploy AI solutions that can work seamlessly across different systems and environments.
7. *Talent and Skills*: The production industry often lacks the necessary talent and skills to develop, deploy, and maintain AI solutions, which can lead to implementation delays and costs.
8. *Change Management*: Implementing AI solutions often requires significant changes to business processes, organizational structures, and employee roles, which can be difficult to manage and may require significant cultural shifts.
To address these challenges, production companies can:
1. *Develop a clear AI strategy* that aligns with business goals and objectives.
2. *Invest in data quality and management* to ensure that AI algorithms have access to high-quality data.
3. *Collaborate with AI vendors and partners* to develop customized AI solutions that meet specific production needs.
4. *Develop internal AI talent and skills* through training and education programs.
5. *Implement change management processes* to ensure a smooth transition to AI-powered production systems.
6. *Monitor and evaluate AI performance* to ensure that AI solutions are meeting business objectives and identify areas for improvement.
By addressing these challenges and developing effective AI strategies, production companies can unlock the full potential of AI and achieve significant improvements in efficiency, productivity, and innovation.
Relevant answer
Answer
By addressing the challenges in data quality, system integration, security, scalability, and talent development, while fostering a culture of transparency and change management, production companies can unlock the full potential of AI. This holistic approach will enable manufacturers to drive operational efficiencies, improve product quality, and ultimately gain a competitive edge in the market. Friday Ameh
  • asked a question related to Data Quality
Question
1 answer
Sub-Research Questions
1. What are the socio-economic impacts of poor data quality on financial forecasting in developing economies?
• This question aims to explore how data quality issues specifically affect financial forecasting in developing economies, where data infrastructure may be less robust. It addresses the broader socio-economic implications, such as the impact on economic growth, investment decisions, and financial stability.
2. How can emerging technologies, such as artificial intelligence and blockchain, be leveraged to improve data quality in financial forecasting?
• This question focuses on the potential of emerging technologies to enhance data quality. It explores how AI and blockchain can be used to ensure data accuracy, integrity, and reliability in financial forecasting.
Relevant answer
Answer
Data Quality and Financial Forecasting:
Data quality is crucial for accurate and reliable financial forecasting. High-quality data ensures precise predictions, while poor data can lead to flawed decision-making, financial mismanagement, and loss of trust. Accurate data supports budgeting, investment decisions, and risk assessments, whereas poor data risks causing inaccurate forecasts and financial instability.
Sub-Research Questions:
  1. Socio-Economic Impacts in Developing Economies: In developing economies, poor data quality can hinder economic growth, deter investment, and jeopardize financial stability. Misguided policies, inefficient resource allocation, and inadequate risk assessments can result in slower development and negative socio-economic outcomes, especially for marginalized communities.
  2. Emerging Technologies to Improve Data Quality: Emerging technologies like AI and blockchain can enhance data quality in financial forecasting. AI: automates data cleaning, improves predictive accuracy, and uses NLP for insights from unstructured data. Blockchain: ensures data integrity with tamper-proof records, provides transparency, and improves traceability, making financial data more reliable and trustworthy for forecasting.
Together, AI and blockchain can mitigate data quality issues, ensuring more accurate and reliable financial forecasting.
  • asked a question related to Data Quality
Question
1 answer
Hi. I am constructing multidimensional data quality indicators for a low-cost wireless sensor network. Currently, they are tested using synthetic data that reproduce data quality issues such as accuracy, timeliness, completeness and reliability. Are there any testing methods using real-world datasets that can test the robustness of the indicators?
Relevant answer
Answer
The multidimensional data quality indicators that I developed for the low-cost sensors are based on the outcome of my systematic literature review in the attachment.
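One common way to test robustness on real data is to take a real-world sensor dataset, inject controlled corruptions, and check that each indicator degrades roughly in proportion to the injected fault rate. A rough sketch in R, assuming a data frame readings with POSIXct timestamp and numeric value columns (names and indicator functions are your own):
corrupt <- readings
set.seed(42)
drop_idx <- sample(nrow(corrupt), round(0.10 * nrow(corrupt)))
corrupt$value[drop_idx] <- NA                                      # completeness issue: 10% missing
corrupt$value <- corrupt$value + rnorm(nrow(corrupt), 0, 0.5)      # accuracy issue: added noise
late_idx <- sample(nrow(corrupt), round(0.05 * nrow(corrupt)))
corrupt$timestamp[late_idx] <- corrupt$timestamp[late_idx] + 3600  # timeliness issue: 1 h delay
# The indicators computed on `corrupt` should now score worse than on `readings`,
# and by an amount consistent with the injected 10% / 5% corruption rates.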
  • asked a question related to Data Quality
Question
2 answers
I am seeking specific Generative AI Use Cases. It would be more helpful if the use cases are in the context of cybersecurity or data quality realms.
Thank you in advance.
Relevant answer
Answer
For a software engineer, generative AI can assist with code generation. GitHub Copilot and ChatGPT are great tools for speeding up development. Another use case is bug detection and fixing: AI can analyze code to identify and fix bugs. In other words, generative AI can be a personal assistant for software developers.
  • asked a question related to Data Quality
Question
5 answers
In agriculture, AI technologies are increasingly used for tasks like crop monitoring and yield prediction. However, inconsistent data quality can introduce biases, affecting decision-making. It's important to understand the measures being implemented to enhance the reliability of these AI technologies to ensure they provide accurate and fair outcomes.
Relevant answer
Ya Shi He
Great point! XAI is crucial for transparency and trust, especially in agriculture where understanding model predictions can help identify biases or errors caused by poor data quality. Thanks for highlighting that!
  • asked a question related to Data Quality
Question
3 answers
Dear ResearchGate community,
I am currently using the ELEFAN_GA function in TropFishR to analyze monthly length frequency data and estimate growth parameters in fisheries biology. During the optimization process for bin size and moving average, I encountered a puzzling issue: in certain cases, the estimated Linf values (asymptotic length) were smaller than the corresponding Lmax values (maximum observed length). However, it is expected that Linf should be greater than Lmax.
Despite ensuring data quality and adhering to model assumptions, this discrepancy persists. Therefore, I am seeking valuable insights from the research community to identify potential causes behind this issue and explore possible strategies to resolve it effectively.
Relevant answer
Answer
Hello, Ragavan,
Actually, there is no reason why Linf should be greater than Lmax.
Lmax may be above or below Linf (see Schwamborn, 2018).
Schwamborn, R. (2018). How reliable are the Powell–Wetherall plot method and the maximum-length approach? Implications for length-based studies of growth and mortality. Reviews in Fish Biology and Fisheries, 28(3), 587-605.
Within a simple VBGF (von Bertalanffy growth function) growth model, and considering a healthy, pristine population, a large number of individuals will reach adulthood and grow to a size that is close to the mean asymptotic size of the population Linf (i.e., the theoretical mean size at infinite age). Some old individuals may be larger than Linf, and others smaller than Linf. The mean length of the old individuals should be somewhere below Linf. This does not mean that all individuals are smaller than Linf.
Within such a population, the Lmax/Linf ratio will be defined by the intra-population variability in Linf (Schwamborn, 2018), and the mortality/growth ratio (Z/K ratio).
If all individuals were much smaller than Linf, this would indicate a "dwarfed" population (Schwamborn & Schwamborn, 2021, https://panamjas.org/pdf_artigos/PANAMJAS_16(1)_57-78.pdf), e.g., due to growth overfishing.
Don't forget that it is important to estimate Linf and K within a near-unconstrained search space (not constraining or fixing Linf a priori), by using search spaces that are as wide as possible, where the search space limits will not interfere with your fit algorithm.
Also, you should ideally repeat the ELEFAN_GA model fit procedure many, many times, with different “seed” values, to avoid being stuck in local maxima within the response surface. You can also do this in an automatized way, hundreds of times, and obtain 95% confidence intervals for Linf and K by bootstrapping with the "fishboot" package (using the function ELEFAN_GA_boot, Schwamborn et al., 2019)
Schwamborn, R., Mildenberger, T. K., & Taylor, M. H. (2019). Assessing sources of uncertainty in length-based estimates of body growth in populations of fishes and macroinvertebrates with bootstrapped ELEFAN. Ecological Modelling, 393, 37-51.
# To load fishboot (using devtools):
library(devtools)
install_github("rschwamborn/fishboot")
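For illustration, a hedged sketch of the repeated-fit advice with a wide search space, using TropFishR's ELEFAN_GA (the bounds and GA settings below are placeholders; check the argument names against your installed version):
library(TropFishR)
# `lfq` is your length-frequency object; run the fit with several seeds and wide Linf/K bounds.
fits <- lapply(1:20, function(i)
  ELEFAN_GA(lfq, MA = 5,
            low_par = list(Linf = 0.8 * max(lfq$midLengths), K = 0.05, t_anchor = 0),
            up_par  = list(Linf = 1.5 * max(lfq$midLengths), K = 1.5,  t_anchor = 1),
            popSize = 60, maxiter = 50, seed = i))
sapply(fits, function(f) unlist(f$par[c("Linf", "K")]))   # spread of estimates across seeds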
Good luck with your analyses. Don't hesitate to contact me if you need any help.
Best regards,
Ralf
--
************************************************************************
Prof. Dr. Ralf Schwamborn
Depto. de Oceanografia
Universidade Federal de Pernambuco (UFPE)
50670-901 Recife-PE
Brazil
Tel.: +55-(0)81 – 2126-8225, 81-2126-8226
E-mail: ralf.schwamborn(at)ufpe.br, rschwamborn(at)gmx.net
************************************************************************
  • asked a question related to Data Quality
Question
1 answer
Good morning to everyone,
I have a problem with a high RMSEA value in an MGCFA with categorical variables using the WLSMV estimation method in Mplus.
I am testing a model consisting of 4 variables on a 4-point scale. I compare 28 countries.
Results of testing configural invariance are:
Chi-square: 1884.026
Degrees of Freedom: 57
P-Value: 0.000
RMSEA: 0.130
90 Percent C.I. : 0.125 - 0.135
Probability RMSEA <=.05: 0.000
CFI: 0.991
TLI: 0.972
If I do the same analysis but set the variables as continuous, the results are good (RMSEA 0.08; CFI 0.980; TLI 0.939).
Can anyone please suggest how to overcome this problem of an inadequate (poor) RMSEA value?
Thank you very much.
Relevant answer
Answer
Good morning Radka.
Did you find a solution for your problem? I am performing a similar analysis with the WLSMV estimator. When I treat the variables as ordinal, my RMSEA value is greater than .1, even reaching .2 in some models. When I treat them as continuous, the RMSEA value drops to less than .08. I'm not sure what I can do. Thank you!
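While the original analysis is in Mplus, it can be worth cross-checking the ordinal specification in R with lavaan, which also implements WLSMV; a minimal sketch with placeholder variable and group names (v1-v4, country):
library(lavaan)
model <- 'f1 =~ v1 + v2 + v3 + v4'
fit <- cfa(model, data = dat, group = "country",
           ordered = c("v1", "v2", "v3", "v4"), estimator = "WLSMV")
fitMeasures(fit, c("chisq.scaled", "df", "rmsea.scaled", "cfi.scaled", "tli.scaled"))
# If lavaan reproduces the inflated RMSEA, the issue lies with the ordinal model and thresholds
# rather than with a software setting.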
  • asked a question related to Data Quality
Question
3 answers
In the era of big data and artificial intelligence (AI), where aggregated data is used to learn about patterns and for decision-making, quality of input data seems to be of paramount importance. Poor data quality may lead not only to wrong outcomes, which will simply render the application useless, but more importantly to fundamental rights breaches and undermined trust in the public authorities using such applications. In law enforcement as in other sectors the question of how to ensure that data used for the development of big data and AI applications meet quality standards remains.
In law enforcement, as in other sectors, the key element of ensuring quality and reliability of big data and AI apps is the quality of raw material. However, the negative effects of flawed data quality in this context extend far beyond the typical ramifications, since they may lead to wrong and biased decisions producing adverse legal or factual consequences for individuals, such as detention, being a target of infiltration or a subject of investigation or other intrusive measures (e.g., a computer search).
source:
Relevant answer
Answer
EDUARD
I would also strongly suggest looking at the nature of “outliers. “ IME, they may point to
1) Enhanced data collection methods and/or metrics (respectively, improving future efforts, but sometimes yielding remarkable improvements in model validities)
2) Breakthroughs in understanding (pointing to new research directions, e.g., an important genetic polymorphism and an unanticipated mechanism for reducing disease transmission, or an immediate product improvement opportunity)
ALVAH
 Alvah C. Bittner, PhD, CPE
  • asked a question related to Data Quality
Question
2 answers
Since I found that there is a correlation between Timeliness and Semantic Accuracy (I am studying the assessment of linked data quality dimensions, trying to evaluate one dimension - in this case Timeliness - from another dimension, Semantic Accuracy), I presumed that regression analysis is the next step.
- The Semantic Accuracy formula I used is: msemTriple = |G ∧ S| / |G|
msemTriple measures the extent to which the triples in the repository G (the original LOD dataset) and in the gold standard S have the same values.
- The Timeliness formula I used is:
Timeliness(de) = 1 - max{1 - Currency/Volatility, 0}
where:
Currency(de) = (1 - (lastmodificationTime(de) - lastmodificationTime(pe)) / (currentTime - startTime)) * Ratio (the Ratio measures the extent to which the triples in the LOD dataset (in my case Wikidata) and in the gold standard (Wikipedia) have the same values)
and
Volatility(de) = (ExpiryTime(de) - InputTime(de)) / (ExpiryTime(pe) - InputTime(pe))
(de is the entity document of the datum in the linked data dataset and pe is the corresponding entity document in the gold standard).
NB: I worked on Covid-19 statistics per country as a dataset sample, specifically the number of cases, recoveries and deaths.
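If regression is the next step, a minimal sketch in R, assuming a data frame dq with one row per entity document and the two computed scores (column names are placeholders):
fit <- lm(timeliness ~ semantic_accuracy, data = dq)
summary(fit)   # slope and R-squared: how much of Timeliness is explained
plot(dq$semantic_accuracy, dq$timeliness, xlab = "Semantic Accuracy", ylab = "Timeliness")
abline(fit)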
Relevant answer
Answer
  • asked a question related to Data Quality
Question
2 answers
I wonder what testing methods exist for checking data quality before analysing data for a PLS-SEM application. So far, I know about checking for common method bias (CMB), e.g., Common Method Bias with Random Dependent Variables.
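For illustration, one coarse screening check that is often reported alongside PLS-SEM is Harman's single-factor test (here in its PCA variant); a minimal sketch in base R, assuming a data frame items that holds only the numeric indicator columns:
# If the first unrotated component explains the majority of the variance across all items,
# common method bias is usually flagged as a concern.
pc <- prcomp(items, scale. = TRUE)
summary(pc)$importance["Proportion of Variance", 1]   # > ~0.50 is commonly read as a warning sign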
Relevant answer
Answer
Thank you for your answer. Honestly, I cannot fully understand what you are mentioning. Would you please give me some examples or documents? Deborah J Hilton
  • asked a question related to Data Quality
Question
8 answers
Dear ResearchGate community,
I'm currently working with a large dataset of phytoplankton species, and I'm looking for a way to routinely update the taxonomy of these species. Specifically, I'm interested in identifying changes to scientific names, accepted names, synonyms, and higher-level taxonomic classifications.
I've tried using various R packages such as worrms to retrieve taxonomic information from online databases, but I've encountered some challenges with data quality and consistency.
I would appreciate any suggestions or advice on how to efficiently and accurately update the taxonomy of my dataset on a regular basis. Are there any reliable databases or APIs that I should be using, like AlgaeBase? Are there any best practices for handling taxonomic data in R?
Thank you for your help!
Nicolas
Relevant answer
Answer
Hi Hana,
You have several ways to proceed. The first one is to use the worrms package (https://docs.ropensci.org/worrms/articles/worrms.html). This package can be a good alternative. The second one, which is a bit more complex, is to request an API key from the administrator of the Algaebase website (https://www.algaebase.org/api/). Afterwards, you will need to create a function in R that allows you to retrieve the information in JSON format and convert it into a dataframe. By the end of the year, I will post on my GitHub the functions that I have created/used to update my data.
Alternatively, you also have https://docs.ropensci.org/taxize/index.html. You can find a lot of information on this website as well: https://www.marinedatascience.co/data/.
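For illustration, a minimal worrms sketch for batch-matching names against WoRMS (the species names are placeholders, and the returned column names follow the WoRMS record fields, so they may differ slightly between versions):
library(worrms)
my_names <- c("Skeletonema costatum", "Chaetoceros didymus")   # placeholder names
recs <- wm_records_names(name = my_names)                      # one table per queried name
do.call(rbind, lapply(recs, function(x)
  x[, c("scientificname", "status", "valid_name", "valid_AphiaID")]))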
  • asked a question related to Data Quality
Question
1 answer
Good afternoon,
During the data quality stage, if your dataset includes only categorical variables and, as a result, you cannot use imputation methods, at what rate of missing values do you judge the quality of your data to be bad: 5%, 10%, or more? I think it depends on your research project, research question, and the models performed, but if you can share the rate of missing values in your own studies, it would help me get an idea of this percentage.
Thanks for your help,
Ines Francois
Relevant answer
Answer
I presume that you're talking about instances other than ignorable missing values. (An example of this would be a skip logic survey, such as: Q1: Are you married? If not, skip to question #4. Anyone answering "No" to Q1 shouldn't have responded to Q2 or Q3.)
If the sheer amount of missing data doesn't exceed 5%, you likely don't have to worry much about working with only complete cases. Some sources say that up to 10% can be tolerated.
However, it is important to check whether missingness on a variable is in any way related to other characteristics of a case, which could include demographic attributes, and/or the way a case responded to some other item. If it is, then you must be more cautious about how to proceed.
You can impute responses for categorical variables; it's just more easily done for continuous variables.
Finally, do note that a lot of times, cases with "no response" or "refusal to respond" are treated as if they had elected an otherwise viable option (hence, a three-category item could functionally be treated as a four-category item, with missing responses assigned to the fourth category). This is because numerous studies have shown that sometimes, these folks _are_ different in one or more important ways from those that did respond.
Good luck with your work.
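A small sketch of the two checks mentioned above (overall missingness rate, and whether missingness is related to another characteristic), with placeholder column names q5 and sex:
# 1) Missingness rate per variable: is anything above the ~5-10% rule of thumb?
colMeans(is.na(dat))
# 2) Is missingness on one item related to another characteristic?
miss_q5 <- is.na(dat$q5)
chisq.test(table(miss_q5, dat$sex))   # a significant association argues against treating it as MCAR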
  • asked a question related to Data Quality
Question
3 answers
..
Relevant answer
Answer
Dear doctor
I quote the following from the web, hoping it is helpful:
"Traditionally, data engineers write data quality rules using SQL. This manual method works well when there are dozens, or even hundreds of tables, but not when there are ten thousand or more. In a modern, data-driven organization, data engineers can never keep up with demand for data quality scripts.
New data quality automation (DQA) tools replace manual methods with ML models. You can view this product segment as a subset of data observability, which addresses both data quality and data pipeline performance. There are three different approaches to DQA: automated checks, automated rules, and automated monitoring. Each approach has its pros and cons, but collectively they represent the future of data quality management.
The following three vendors embody these approaches, respectively.
  • Ataccama employs the automated checks approach, which uses ML to classify incoming data at the row and column level and automatically apply data quality rules written by data engineers.
  • First Eigen uses the automated rules approach, which uses ML to generate data quality rules, which consist of standard quality checks (nulls, duplicates, etc.) and complex correlations between data columns and values.
  • BigEye uses the automated monitoring approach, which uses ML to detect anomalies in the data rather than apply rules. The tool monitors changes to tables at fixed intervals, triggering alerts if it detects an uncharacteristic shift in the profile of the data."
Dr.Sundus Fadhil Hantoosh
  • asked a question related to Data Quality
Question
5 answers
Hello everyone, I am checking the quality of some RNA-seq data with FASTQC and I am getting results that are not clear to me. Is this kind of result normal?
Relevant answer
Answer
The plot shows that the average quality per base along your 150 bp reads is very high, which is good. This result is fairly normal when the sequencing has been outsourced. Most companies will give you prefiltered fastq files containing only reads with high quality. You can ask your sequencing provider if that was the case, although sometimes you can also find that info in the report they send together with your fastq files.
  • asked a question related to Data Quality
Question
4 answers
How can one do case-control matching to randomly match cases and controls based on specific criteria (such as sociodemographic matching - age, sex, etc.) in order to improve data quality?
Relevant answer
Answer
In case-control studies, you can do it in multiple ways. I assume you already know how to get the controls. Imagine you are performing a secondary analysis and have a dataset with 200 patients: 30 people with SLE (cases) and 170 people without SLE (controls). As we all know, matching 1 case with 1 to 4 controls provides adequate power; don't match more than 4 controls per case.
First, match 1:1, that is, 1 control for 1 case. As sex is almost always binary, you don't have to manipulate this variable. But age is not binary, so you have to determine the age range you would allow. For instance, a 20-year-old man could be matched with 18-25-year-old men in your dataset. Create a categorical variable for age by specifying that range. Also, create a binary variable indicating who are cases and who are controls. Then, export these 60 people from the dataset and perform your analysis. If you want 1:2 matching, the procedure is the same (export 90 people), but whatever you do, prespecify the hypothesis, sample size, and analysis plan at the beginning. Best of luck!
You can also see propensity score matching or inverse proportional treatment weighting but that procedure is a little complex and you need somebody with advanced statistical analysis knowledge to do that.
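For the matching step itself, the MatchIt package automates nearest-neighbour 1:k matching; a hedged sketch with placeholder names (dat with a 0/1 case indicator, age and sex):
library(MatchIt)
m <- matchit(case ~ age + sex, data = dat, method = "nearest", ratio = 2)   # 1:2 matching
summary(m)                  # check covariate balance after matching
matched <- match.data(m)    # export the matched subset for analysis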
  • asked a question related to Data Quality
Question
3 answers
Hello everybody,
I am currently writing my final thesis about data quality, in particular about consistency. Therefore I am looking for a labeled IoT time-series dataset for consistency detection. Does anybody know where I can find such a dataset?
Or does anybody know where I can get a labeled IoT-timeseries dataset for anomaly detection?
Thank you for your help!
Relevant answer
Answer
Hello,
For the general case of time series anomaly detection, several benchmarks have been recently proposed. These two benchmarks contain labeled time series from different domains (for instance, room occupancy detection from temperature, CO2, light, and humidity [1] or accelerometer sensors of a wearable assistant for Parkinson's disease patients [2]).
You may find the links to the benchmarks below. Good luck with your thesis!
[1] Luis M. Candanedo and Véronique Feldheim. 2016. Accurate occupancy detection of an office room from light, temperature, humidity, and CO2 measurements using statistical learning models. Energy and Buildings 112 (2016), 28–39. https://doi.org/10.1016/j.enbuild.2015.11.071
[2] Marc Bächlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M. Hausdorff, Nir Giladi, and Gerhard Tröster. 2010. Wearable Assistant for Parkinson's Disease Patients With the Freezing of Gait Symptom. IEEE Transactions on Information Technology in Biomedicine 14, 2 (2010), 436–446. https://doi.org/10.1109/TITB.2009.2036165
  • asked a question related to Data Quality
Question
1 answer
Hello, I'm trying to calculate the results for a product system by selecting the following options:
  • Allocation method - None;
  • Impact assessment method - ReCiPe Midpoint (I) / ReCiPe 2016 Endpoint (I);
  • Calculation type - Quick results / Analysis;
  • Include cost calculation and Assess data quality.
Well, the results are always a list of zeros for every item in the LCI. I've already tried to do the following actions to solve the problem, however I didn't have any success:
  • Increased the maximal memory to 5000 MB;
  • Validated the database (it returned back with zero errors);
  • Opened the SQL editor and executed the query: select p.ref_id, p.name from tbl_processes p inner join tbl_exchanges e on p.id = e.f_owner where e.id = 7484497 (got the reference ID and the name of the process where the exchange occurred and searched for it, opened the process and didn't find any error message with more details or a warning popup).
The openLCA version I'm working on is 1.11.0. Thank you very much for all the help. Best regards, Beatriz Teixeira
Relevant answer
Answer
You might try deleting the current product system and making new flows, processes, and a new product system. It seems like a mistake was made in a previous step.
  • asked a question related to Data Quality
Question
3 answers
Answers from different disciplines are welcome.
Relevant answer
Answer
Dear Rami Alkhudary,
You may want to review the data below:
Factual Accuracy and Trust in Information: The Role of Expertise
_____
How Do You Know If Information Is Accurate? How To Evaluate Information Sources
_____
What Should I Trust? Individual Differences in Attitudes to Conflicting Information and Misinformation on COVID-19
  • asked a question related to Data Quality
Question
1 answer
I am currently working with GANs, specifically CGANs (Conditional Generative Adversarial Networks), for synthetic signal generation. To improve the generated data quality, i.e., to increase the similarity of the synthetic data to the original data, I have already analyzed and observed improvements from several hyperparameter tuning combinations for the discriminator and generator, such as modified momentum, iterations and learning rate. On top of that, batch normalization and manipulating the number of layers brought additional improvement. My question is: what other parameters should one look into, excluding those of general neural networks?
Relevant answer
Answer
Hi,
Quantitative techniques for evaluating GAN generator models are listed below.
  • Average Log-likelihood.
  • Coverage Metric.
  • Inception Score (IS)
  • Modified Inception Score (m-IS)
  • Mode Score.
  • AM Score.
  • Frechet Inception Distance (FID)
  • Maximum Mean Discrepancy (MMD)
for more info:
Best wishes..
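For quick reference, the most widely reported of these, the Fréchet Inception Distance (FID), compares the mean and covariance of real and generated samples in an embedding space:
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*(Sigma_r*Sigma_g)^(1/2))
where (mu_r, Sigma_r) and (mu_g, Sigma_g) are the mean and covariance of the embedded real and generated samples; lower is better. For 1-D signals from a CGAN, the same formula is sometimes applied to embeddings from a domain-specific feature extractor rather than the Inception network.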
  • asked a question related to Data Quality
Question
1 answer
In the era of information and reasoning, we are shown many pieces of scientific information, either in print or online, globally. Despite this appreciable access to information, the originality, novelty, and quality of information are often substandard. For example, a large number of studies done in the developing world are either published in reputable journals or left on the shelf. However, implementation of these research findings is scarce.
This could be due to data quality or to the quality and quantity of the research team involved. The issues that could affect the quality of research in developing countries include, but are not limited to:
· Availability of limited resources to support research projects
· Inadequate time devoted to research projects because people who teach at the university level in developing countries are rarely full-time professors and usually have several jobs.
· The theoretical nature of research methodology in the curriculum, so students become professionals without the practical knowledge of how to do research.
· Limited access to journals, search engines, and databases and high subscription cost that is beyond the reach of the budgets of both individual professionals and university libraries.
· Weak ethical review committee to verify ethical treatment of human subjects.
· Rationing research funds across several colleges and departments, which leads to limited competition and an increased chance of doing weak research
· Weak institutional structure and lack of empowerment to research staff
· Poor data management systems and lack of databases
· Availability of poor research guidelines and poor composition of the research team (i.e. failure to involve all relevant expertise in developing proposals and conducting analysis and interpretation of findings)
In the face of the above challenges, could using real-world health data be a solution to data quality problems? If so, what changes are possible when using real-world health data in developing countries?
Relevant answer
Answer
Developing countries can make use of a lot of health research conducted in developed countries, for this research is scientifically sound, while research in developing countries is often short on data because the universities there do not pay enough to support excellent research. Regards.
  • asked a question related to Data Quality
Question
14 answers
How to maintain data quality in qualitative research? How to ensure quality in qualitative data collection as well as data analysis?
Relevant answer
Dear Prof. Dr. Devaraj Acharya, several methods could be used to ensure quality in qualitative research analysis, such as Guba and Lincoln's concepts for defining and investigating quality in qualitative research; see the following RG link. Kindly visit.
Kind Regards,
  • asked a question related to Data Quality
Question
7 answers
Hello, I am trying to get a metagenomic analysis done and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core). Does anyone have any experience with this company, regarding data quality or reliability?
Please let me know,
Thank you,
Relevant answer
Answer
Hi Alba.
I had experience with Novogene, both for whole-genome and metagenomics.
The sequencing is very good, and I had no problems at all in terms of quality.
I don't recommend subscribing to their bioinformatics analysis, since the analyses performed are pretty basic and can be done easily by the researcher, and many times you need more complex analyses that are not included in the price.
Let me know if you need more information.
Ricardo
  • asked a question related to Data Quality
Question
3 answers
What metrics do people use for quantifying the quality of time-series data obtained from sensors, such as speed, acceleration, relative velocity, or lidar point clouds? Also, how do you define the quality of such time-series data?
#timeseries #timeseriesdata #datanalysis #ts #quality #dataquality #metric #sensors
Relevant answer
Then in your case, I would do a comparative analysis of the variables. For example, I would build a scatterplot between variables on the reference data, and I would also build the same scatterplot between variables in the data under study.
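A minimal version of that comparative check in R, assuming reference and study data frames ref and study with placeholder columns speed and accel:
par(mfrow = c(1, 2))
plot(ref$speed,   ref$accel,   main = "Reference",   xlab = "speed", ylab = "acceleration")
plot(study$speed, study$accel, main = "Under study", xlab = "speed", ylab = "acceleration")
cor(ref$speed, ref$accel, use = "complete.obs")
cor(study$speed, study$accel, use = "complete.obs")   # a large discrepancy flags a possible quality issue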
  • asked a question related to Data Quality
Question
3 answers
Dear Han,
After comparing annual flux ET data with CMFD precipitation, it seems that ET at almost all six sites is even greater than precipitation. Therefore, I wonder whether it is possible that runoff or groundwater played a crucial role at these sites. Otherwise, it may be a flux data quality problem.
Please refer to the attachment for details
Cheers,
Ziqi
Relevant answer
Answer
I think you need to analyze the hydrology of the study area. At the same time, I think there may be mistakes in the data.
  • asked a question related to Data Quality
Question
4 answers
Dear community,
We are working with attention check items in several questionnaires to improve the data quality (compare ). We are also measuring several constructs (such as embodiment, immersion, presence, motivation, self-efficacy etc.) which established questionnaires recommend measuring with many items (sometimes >30). This length makes it infeasible given participants' limited attention and time. Thus, we have shortened some scales. I would like to justify why this is acceptable given the higher data quality due to the attention check items. Unfortunately, I could not identify any literature that indicates this. Are you aware of anything in this direction? Please also feel free to point out any literature regarding shortening scales or the trade-off of long scales and time.
Thank you!
Relevant answer
Answer
This paper and its reference list should help:
Marsh, H. W., Martin, A. J., & Jackson, S. (2010). Introducing a short version of the physical self description questionnaire: new strategies, short-form evaluative criteria, and applications of factor analyses. Journal of Sport and Exercise Psychology, 32(4), 438-482.
  • asked a question related to Data Quality
Question
3 answers
TaqMan qPCR gene expression assays using TaqMan Fast Advanced Master Mix usually require 20 µl reaction volumes. I wanted to scale down reaction volumes to 10 or 5 µl to save on reagents, but I was wondering if this is possible without sacrificing data quality.
Relevant answer
Answer
Yes, you can make a 10 µl reaction mixture for TaqMan! However, when setting up the PCR run you need to make sure that you have changed the total volume to 10 µl instead of 20 µl.
Best of luck
  • asked a question related to Data Quality
Question
4 answers
Hello,
Looking for studies about data quality improvement considering a fixed model. What are the options in the case of limited data (<10,000 samples)?
Regards
Relevant answer
Answer
Hi Johnson Masinde, thank you for your answer. I requested the file to get more details, but it seems to answer the question only partially. I'm more interested in data improvement methods for the case of limited datasets.
  • asked a question related to Data Quality
Question
3 answers
I have found reports mentioning the limits of both, but apart from answers on scope (EF is broader, CF is more specific, and they are also quite interrelated), I can't find a comparison of the quality of the data itself.
Relevant answer
Answer
The ecological footprint is better, I think, because it is more comprehensive and includes the carbon footprint.
  • asked a question related to Data Quality
Question
6 answers
Is there any dataset of smart-meter (energy) data which is labeled with data quality flags?
Thanks
Aslam
Relevant answer
Answer
Collection of real time and crisp data and its validation.
  • asked a question related to Data Quality
Question
5 answers
Hi!
I'm working on a research project and we have several kinds of hydrologic datasets (e.g., river discharge, precipitation, temperature, etc.).
We need to be sure about the quality of our datasets.
Are there any tests (preferably recent ones) to check the quality of such datasets?
Thanks
Relevant answer
Answer
I have a few ideas. Try to find out who installed the equipment and collected the data, and who interpreted and edited the data for final data set entry. What were their methods and quality control? Do you have just digital data, or examples of strip charts, punch tapes, site visits with comments, etc.? Who worked up the data, and how did they deal with missing records? What equipment was used, and was it installed correctly for research? Are there documents such as a data library, publications, pictures? Can you visit the sites of data collection and look over any remnants such as staff gauges, benchmarks of cross sections or bridge sites, historical aerial photos covering the area, and streamflow records used for stage-discharge curves and their adjustments over time? Consider the location of rainfall sites relative to the measured watershed and the topography. I think you would find, in discussing this topic with US Geological Survey professionals and research hydrologists or others involved with this work, that a substantial amount of time goes into developing a quality data set. Someone should have documented the specifics, but given that sometimes this is not done, or personnel change and may clean out and dump files or records, even reports, much can unfortunately be lost with time. Some research-grade stations like the USFS Coweeta Hydrologic Experiment Station, or agencies like the USGS, take much effort in cataloging, filing, logging and storing past records, notes, etc. for posterity.
Also, plot the time series data. Do the data from the various stations seem to correlate with each other? Are the readings frequent, e.g., many times per day (a good sign), or infrequent, e.g., daily readings (not as good quality)? Do the flow hydrographs coincide with rainfall events? Do temperatures change with day and night, rainfall events, etc.? Are there other streamflow stations in the vicinity whose records correlate well, taking into account some time differences in storms and response?
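One classical screening tool for the cross-station consistency mentioned above is the double-mass curve; a small sketch in R, assuming a data frame p of concurrent precipitation totals with one (placeholder-named) column per station:
# Double-mass curve: a break in slope suggests an inhomogeneity
# (gauge move, equipment change, or a data error) at the station under test.
target    <- cumsum(p$station_A)
reference <- cumsum(rowMeans(p[, c("station_B", "station_C", "station_D")]))
plot(reference, target, type = "l",
     xlab = "Cumulative mean of reference stations", ylab = "Cumulative station A")
abline(0, max(target) / max(reference), lty = 2)   # overall slope, for visual comparison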
  • asked a question related to Data Quality
Question
6 answers
I used a Zetasizer Nano ZS to measure my polystyrene microparticles, and the data quality report suggests the zeta potential distribution is poor. I've checked the SFR spectral quality data; all 6 replicates for the same sample are below 1. According to the technote, a poor zeta potential distribution could originate from improper sample concentration, high-conductivity samples, or too few sub runs. I've checked the derived count rate, which is 13,000-18,000 kcps with the attenuator set to 7. The sub run number is 22 for all samples. Since I saw a blackening effect on the electrode after measurement, which suggests electrode degradation, I suspect the conductivity is relatively high. However, the dispersant composition only includes 50 mM Tris buffer with some trehalose and glycerol. I'm not sure whether monomodal analysis will be required in this case and whether more sub runs will be helpful. Your answers will be highly appreciated.
Relevant answer
Answer
Hi Juelu Wang , you can see the conductivity of your measurements on the Zeta Report or also listed in the Zeta workspace. Yes, it appears that you have enough scattering intensity signal. If your overall mean zeta potential is very close to zero then it may be difficult to improve the data quality. The diffusion barrier method may improve. Here are a few comments that may be useful:
  • asked a question related to Data Quality
Question
6 answers
My research team and I collected a batch of data this month (~150 workers) from MTurk. Despite having many quality checks embedded in the survey (e.g., multiple attention checks, reverse-coded items), we still feel that the data are suspicious. We ran a similar study one year ago, and one of our measures assesses sexual assault perpetration rates. We used the same measure in our current study, and the perpetration rates are unusually high this time. Is anyone else having trouble finding quality participant responses on MTurk? Does anyone have suggestions for how we could target better participants? Are there any forums or blog posts that we should be aware of that would help us better understand what is going on? Any information would help and be greatly appreciated!
Thanks in advance!
Relevant answer
Answer
Unfortunately, I found another recent study on the quality decrease of MTurk data: An MTurk Crisis? Shifts in Data Quality and the Impact on Study Results by Michael Chmielewski, Sarah C. Kucker.
  • asked a question related to Data Quality
Question
5 answers
Kindly suggest methods for removing outliers from a data set so as to improve data quality.
Relevant answer
Answer
You can apply some basic clustering techniques, e.g., k-means, DBSCAN, and so on, for outlier detection and removal.
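Two quick options in R are the univariate IQR rule and density-based clustering with DBSCAN (x is an assumed numeric vector, X an assumed numeric matrix, and eps/minPts must be tuned per dataset):
# Univariate IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr_flag <- x < q[1] - 1.5 * IQR(x, na.rm = TRUE) | x > q[2] + 1.5 * IQR(x, na.rm = TRUE)
# Multivariate option: DBSCAN labels low-density points as noise (cluster 0).
library(dbscan)
db <- dbscan(scale(X), eps = 0.5, minPts = 5)
dbscan_flag <- db$cluster == 0
# Inspect the flagged points before deleting them; outliers are sometimes the interesting cases.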
  • asked a question related to Data Quality
Question
16 answers
For a text classification task, if data quantity is low but data quality is not, we could use data augmentation methods for improvement.
But my situation is the opposite: data quantity is not low, but data quality is low (noise in the labels, i.e., low training-label accuracy).
The way I obtain the low-quality data is by unsupervised or rule-based methods. In detail, I deal with a multi-label classification task. First, I crawl web pages such as wiki pages and use regex-based rules to mark the labels. The model input is the wiki title and the model output is the rule-matched labels from the wiki content.
Relevant answer
Answer
Low-quality data may be noisy and, as you have written, such data could be reprocessed; but if the data are irrelevant, out of context, and do not reflect the problem and requirements, then the data will be futile. Unsupervised or rule-based methods can help by building an understanding of the new problem and of the rules that support decisions.
  • asked a question related to Data Quality
Question
7 answers
Hello, I am trying to get an RNA-Seq run with a lot of samples done and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core). Does anyone have any experience with this company, regarding data quality or reliability?
Please let me know,
Thank you,
Relevant answer
Answer
Excellent quality. I have run RNA-seq, BS-seq and genome sequencing (our own libraries and Novogene-built libraries) through their facility and always get fantastic data.
  • asked a question related to Data Quality
Question
3 answers
Good-quality data is essential in clinical practice and research activities. There are various attempts to define data quality, which are heterogeneous and domain-specific. I am looking for current, published data quality evaluation frameworks specific to data from electronic health records.
  • asked a question related to Data Quality
Question
5 answers
To perform data quality assessment in the data pre-processing phase (in a Big Data context), should data profiling be performed before data sampling (i.e., on the whole data set), or is it acceptable to profile only a subset of the data?
If we consider the second approach, how is sampling done without having information about the data (at least some level of profiling)?
Relevant answer
Answer
Hadi Fadlallah, yes. That should decrease computational expense and allow an investigation of a subset instead of the whole set. That is similar to a data science process where a small dataset is analyzed and afterwards the methods are applied to the big data set.
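A tiny illustration of that sample-first workflow in R (big_df and the 1% fraction are placeholders):
# Profile a random sample first; re-run the same profile on the full table only where needed.
set.seed(1)
smp <- big_df[sample(nrow(big_df), round(0.01 * nrow(big_df))), ]
profile <- function(d) list(
  n        = nrow(d),
  na_rate  = colMeans(is.na(d)),
  n_unique = sapply(d, function(col) length(unique(col)))
)
profile(smp)       # cheap first pass on the subset
# profile(big_df)  # full pass, if the sample looks suspicious or for the critical checks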
  • asked a question related to Data Quality
Question
4 answers
Given the data set, I want to identify the faulty records/data-point values in the original data and then try to rectify them. The data set is a mixture of numerical and categorical variables (200 variables in total) with 2 million records.
I have tried frequent pattern mining to achieve this, which gives rules for variables and values in the data. (It works well but takes time.)
I wanted to understand if something similar to this can be achieved by deep learning capabilities with some more insights.
Relevant answer
Answer
Sorry I'm late to the party, but if you know which datum is faulty you should be able to solve the problem. If you don't know, how do you decide? Which came first, the chicken or the egg?
David Booth
BTW, outliers often get a bad rep but are actually useful. How do you tell the outliers from the approx. 2.5% of data that lie in a tail? See the attached.
  • asked a question related to Data Quality
Question
5 answers
We often describe the 3 Bs of data quality issues, namely Broken, Bad, and Background. "Broken data" means most data are collected at different times by different people; sometimes the data history has missing data sets. "Bad data" means the data have outliers, which might be caused by noise, a wrong collection setup, or degraded/broken sensors, etc. "Background of data" means the collected data lack working-environment information. For example, with jet engine data but no weather, wind speed, or air density data, it will be difficult to analyze fuel consumption issues. We also need a closed-loop data system which allows users to quickly assess whether the data are useful and usable. If not, users can further improve the data collection system to avoid collecting useless data.
Relevant answer
Answer
This is like saying a road grader is poor because it doesn't do well on a concrete pavement. It is like saying the fish tastes red: red is not a property of fish tasting, and machine learning is not a form of data. WORDS HAVE MEANINGS. Thanks to Prof. Arnold Insel for the fish-tasting example. David Booth
  • asked a question related to Data Quality
Question
14 answers
Data quality impacts the accuracy and meaning of machine learning for industrial applications. Many industry practitioners do not know how to find or use the right data. There are two levels of data: visible vs. invisible. Most of the visible data come from problem areas or are based on our experiences. General questions for the visible data are: First, how do we find the useful data? Second, how do we evaluate which data are usable? Third, which data are most critical? As for the invisible data, very often people use either an ad-hoc or trial-and-error approach to find them, but the work often cannot be reproduced by others. We need a systematic approach to address the data quality issues in AI for industrial applications. We welcome people to share their research experiences and thoughts.
Relevant answer
Answer
Collecting and filtering of huge amounts of data in a distributed system is a pretty challenging task I faced.
In the early stages, our team trained a basic ML model that had low accuracy, and the reason was bad data. So, the most crucial part is EDA, or exploratory data analysis, to investigate trends and extract some knowledge from the data. Our team discovered which fields are more applicable than others and how we can use them to increase accuracy.
Also, it turned out that these factors were unevenly distributed; for instance, Factor 1 had 99% presence in the data, whereas Factor 2 had only 1%, and the task was to predict Factor 2. So, we collected the data appropriately again, iterated through all the ML processes and managed to get much better accuracy (more than 90%).
Consequently, a dataset for machine learning / AI/data science ought to be carefully selected and processed. Its collection is more engineering than a research task but still has a lot of importance.
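The imbalance described above is easy to catch early with a class-distribution check, and one simple remedy is up-sampling the minority class for training only; a sketch in R, assuming a data frame dat with a factor column factor2 as the target:
prop.table(table(dat$factor2))              # 99:1 splits are easy to miss without this
library(caret)
train_bal <- upSample(x = dat[, setdiff(names(dat), "factor2")],
                      y = dat$factor2, yname = "factor2")
table(train_bal$factor2)                    # balanced for training; evaluate on the original distribution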
  • asked a question related to Data Quality
Question
6 answers
Dear community,
I would be very grateful if someone could advise me on a package in R / RStudio to analyse the ddCt values from a qPCR. Ideally, the package would have some example data and tools for the whole analysis pipeline (import of data, quality control, analysis and visualization).
I checked a few but was so far not very satisfied. Thanks for the help!
Relevant answer
Answer
Dear Ruslan,
I think it is worth checking out EasyqpcR (https://www.bioconductor.org/packages/release/bioc/html/EasyqpcR.html). It is a very comprehensive and well documented package and has been used in a variety of recent publications. A major advantage is that it can be run via a GUI (using gWidgets; see: https://bioc.ism.ac.jp/packages/release/bioc/vignettes/EasyqpcR/inst/doc/vignette_EasyqpcR.pdf), which makes it very easy to use.
If you are looking for a more advanced tool to analyze your qPCR data which can cope with raw intensity values and perform melting curve analysis etc. I recommend the qpcR package (https://academic.oup.com/bioinformatics/article/24/13/1549/238435) which can do quite a lot of cool stuff.
If you are running a high through put experimental test design maybe check out this tool: HTqPCR (https://academic.oup.com/bioinformatics/article/25/24/3325/235116)
To facilitate qPCR data import there is another nice tool available called ReadqPCR (https://www.bioconductor.org/packages/release/bioc/html/ReadqPCR.html).
For more options check out the comprehensive review from 2014
about qPCR analysis tools which compares among others, 9 R packages. Maybe you will find something more interesting ;) Good luck!
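Independent of the package choice, the core ddCt arithmetic is small enough to check by hand in base R; a minimal sketch of the Livak (2^-ddCt) method, assuming a data frame ct with columns group ("control"/"treated"), ct_target and ct_reference:
ct$dct <- ct$ct_target - ct$ct_reference    # normalise to the reference gene
ddct <- with(ct, mean(dct[group == "treated"]) - mean(dct[group == "control"]))
fold_change <- 2^(-ddct)
fold_change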
  • asked a question related to Data Quality
Question
1 answer
I am comparing datasets of different quality (uncertainty) by assessing model performance of species distribution models.
Would it be correct to use cross-validation in this case?
Since training and testing data contain the same level of uncertainty, I would expect model performance to be inflated in case of bad quality, and I doubt that the difference in model performance between two different datasets will represent the difference in data quality correctly, unless both are validated with the same validation set.
I am aware that validation with external structured data is always the best option in this case, but suppose that this is not available.
Kind regards,
Camille
Relevant answer
Answer
  • asked a question related to Data Quality
Question
4 answers
Dear all,
I would like to invite you to participate in a research project, about the development of a data governance framework for smart cities. The aim of this research project is to analyse the data requirements of smart cities and to propose a suitable data governance framework that will facilitate decision making, improve operational efficiency as well as ensure data quality and regulatory compliance. Once completed, the framework will be applied to NEOM (www.NEOM.com), a greenfield project involving the design and build of a mega smart city from the ground up, where cutting edge technology will form the backbone of the city’s infrastructure.
To participate in the survey, please click on the following link.
Your support and input will be very much appreciated.
Yours sincerely,
Topé
Relevant answer
Answer
Thank you all for the support!!
  • asked a question related to Data Quality
Question
1 answer
At our core, we are concerned with data quality as well as the quality of our statistical analyses.
  • What kind of processes have you implemented, just for yourself or within a research team, to check for transparency and reproducibility of your results?
  • Have you any formal or informal peer-reviewing experience with regard to your statistical analysis?
  • What kind of info is mandatory in your syntax files?
Thanks!
Relevant answer
Answer
In some instances we use other software packages to perform the statistical analysis based on the syntax developed, and we report this in the write-up for data quality and assurance purposes.
  • asked a question related to Data Quality
Question
3 answers
Hi, we are taking some DLS measurements in a Zetasizer Nano ZS for starch nanoparticle size. We know that we could measure in the Nano ZS90, but we don't have it. We have seen some works in the literature that claim to have measured starch nanoparticles with the same Zetasizer. We know that it is difficult to measure starch nanoparticles due to the dry nature of the samples and the difficulty of dispersing them in water. The main problem is that there is aggregation in the sample and the data quality is poor.
Is there any suggestion for the sample preparation, or an adjustment that should be made in the Zetasizer that we are unaware of?
Also, we are aware of the fact that the particle size should also be confirmed with an SEM measurement.
Relevant answer
Answer
Panagiotis Loukopoulos If you're starting with a dry powder, you have a fused collection of sub- and post-micron aggregates and agglomerates. Have you measured the SSA of your starch? If so, what is it? The behavior of the material will reflect the bulk size in terms of properties such as flowability, dusting tendency, filter blockage etc. Look at Figure 2 in the attached. Nowhere do you see free, independent, discrete particles < 100 nm. Thus your measurements should be taken with laser diffraction and not DLS. Many starches are also in the 20 micron region for mean size and thus, again, not applicable to DLS. Take a look at this recent webinar:
May 28th, 2019 Dispersion and nanotechnology
  • asked a question related to Data Quality
Question
5 answers
I have a slag sample that contains a high amount of iron. I conducted XRD with a Cu tube on the powdered sample and got a very high background (because of iron fluorescence). To improve the data quality I then ran it with a Co tube, but the signal-to-noise ratio is very low. I have attached the .raw files and screenshots of the data from both the Cu and Co tubes. I know of only one option, i.e. increasing the time per step. Are there any other settings that can improve the signal-to-noise ratio? Please let me know if any additional information is required.
Relevant answer
Answer
Your Co tube case apparently suffers from low photon flux at the detector (compared to the Cu tube case). Did you use the same sample?
It seems to me that you might have got an experimental issue here.
For example the sample height adjustment needs to be improved.
Please play around with the height position of the sample with respect to the x-ray beam (rotation axis of your set up).
Good luck
  • asked a question related to Data Quality
Question
5 answers
While developing a questionnaire to measure several personality traits in a somewhat unconventional way, I now seem to be facing a dilemma due to the size of my item pool. The questionnaire contains 240 items, theoretically deduced from 24 scales. Although 240 items isn't a "large item pool" per se, the processing time per item averages ~25 seconds. This yields an overall processing time of over 1.5 hours, which is way too much even for the bravest participants!
In short, this results in a presumably common dilemma: what aspects of the data from my item analysis sample do I have to jeopardize?
  • Splitting the questionnaire into parallel tests will reduce processing time, but hinder factor analyses.
  • Splitting the questionnaire into within-subject parallel tests over time will require unfeasible sample sizes due to a) drop-out rates and b) eventual noise generated by possibly low stability over time.
  • An average processing time of over 30 minutes will tire participants and jeopardize data quality in general.
  • Randomizing the item order and tolerating the >1.5 hours of processing time will again require an unfeasible sample size, due to lower item-intercorrelations.
I'm aware that this probably has to be tackled by conducting multiple studies, but that doesn't solve most of the described problems.
This must be a very common practical obstacle and I am curious to know how other social scientists tackle it. Maybe there even is some best practice advice?
Many thanks!
Relevant answer
Answer
Sounds like you've created an instrument which is sound, as it is based on theory, but this version should only be the pilot version rather than the final instrument. As you see, there are too many items, which will affect the quality of the data collected. Have you conducted any sort of pilot study with factor analysis to test how those items relate? You'll probably find some items are redundant, and possibly even some scales. Use EFA to explore how the items load; then, as you delete items, the number of items in each scale will decrease, and you can then delete scales on the basis of insufficient items.
I find David De Vaus' work on survey design and validation very useful:
  • asked a question related to Data Quality
Question
14 answers
Data quality is crucial in data science projects. How can we improve data quality before it goes into analysis?
Relevant answer
Answer
Sampling is a basic part of data science. In statistical quality assurance a few empirical formulae are available, of which the variance is the most important.
  • asked a question related to Data Quality
Question
14 answers
I'd like you to participate in a simple experiment. I'd like to limit this to simple linear and multiple linear regression with continuous data. Do you have such a regression application with real data? Some cases with several independent variables might be helpful here. If you have done a statistical test for heteroscedasticity, please ignore that, regardless of result. We are looking at a more measurable degree of heteroscedasticity for this experiment.
To be sure we have reasonably linear regression, the usual graphical residual analysis would show no pattern, except that heteroscedasticity may already show with e on the y-axis, and the fitted value on the x-axis, in a scatterplot.
///////////////////////////////
Here is the experiment:
1) Please make a scatterplot with the absolute values of the estimated residuals, the |e|, on the y-axis, and the corresponding fitted value (that is, the predicted y value, say y*),  in each case, on the x-axis.  Then please run an OLS regression through those points.  (In excel, you could use a "trend line.")  Is the slope positive?  
A zero slope indicates homoscedasticity for the original regression, but for one example, this would not really tell us anything.  If there are many examples, results would be more meaningful. 
2) If you did a second scatterplot, and in each case put the absolute value of the estimated residual, divided by the square root of the fitted value, that is  |e|/sqrt(y*), on the y-axis, and still have the fitted value, y*, on the x-axis, then a trend line through those points with a positive slope would indicate a coefficient of heteroscedasticity, gamma, of more than 0.5, where we have y = y* + e, and e = (random factor of estimated residual)(y*^gamma).  Is the slope of this trend line positive in your case? 
If so then we have estimated gamma > 0.5.  (Note, as a point of reference, that we have gamma = 0.5 in the case of the classical ratio estimator.) 
I'd also like to know your original equation please, what the dependent and independent variables represent, the sample size, whether or not there were substantial data quality issues, and though it is a very tenuous measure, the r- or adjusted R-square values for the original linear or multiple linear regression equation, respectively.  I want to see whether or not a pattern I expect will emerge.
If some people will participate in this experiment, then we can discuss the results here. 
Real data only please. 
Thank you!
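In case it helps participants, here is a minimal Python sketch of steps 1) and 2) as an alternative to the Excel trend line; the function and variable names are placeholders of mine, and it assumes positive fitted values so the square root is defined:
import numpy as np
import statsmodels.api as sm

def heteroscedasticity_slopes(y, X):
    # Fit the original OLS regression of y on X (with intercept)
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    y_hat = np.asarray(fit.fittedvalues)
    abs_e = np.abs(np.asarray(fit.resid))
    # Step 1: slope of the trend line through (y_hat, |e|)
    slope1 = np.polyfit(y_hat, abs_e, 1)[0]
    # Step 2: slope of the trend line through (y_hat, |e|/sqrt(y_hat))
    slope2 = np.polyfit(y_hat, abs_e / np.sqrt(y_hat), 1)[0]
    return slope1, slope2   # slope1 > 0 suggests heteroscedasticity; slope2 > 0 suggests gamma > 0.5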
Relevant answer
Answer
Thanks @David -
I take it that it is OK for me to use the data in Appendix 1? Would you happen to have an Excel file of those data?
Thanks again - Jim
  • asked a question related to Data Quality
Question
3 answers
Dear colleagues,
I've been working about the qualities of political participation for a while (theoretically as well as empirically).
In one of our latest projects we found that political participation can have negative impacts on trust in institutions and self-efficacy under certain circumstances. Does anyone have similar data or studies? Especially interesting would be data from countries in transition or young democracies. Thank you.
Relevant answer
Answer
Almost anything political can have negative connotations for many people.
  • asked a question related to Data Quality
Question
4 answers
The study by Josko and Ferreira (2017) explained a use for Data Visualization (DV) in quality assessment paradigms, which they call a "Data Quality Assessment process" (DQAp):
  • They highlight that the problem with using DV in this manner is not in the value of what it can provide visually, but the complexity and knowledge required. 
  • They indicate the need for DV tools to be contextually aware of what is considered "Quality" vs "Defect", therefore requiring such methods to be constructed based on specific requirements, which will not be possible for all sources of data.
What are your thoughts regarding the use of Data Visualization tools as a DQAp? Let's discuss!
Relevant answer
Answer
Thank you for your detailed answer. May I assume that you agree, but that more is needed for data quality assessment?
  • asked a question related to Data Quality
Question
3 answers
I'm working on a Big Data ingestion layer that integrates and preprocesses remote sensing Big Data. I would like to know if there are other similar platforms or works so that I can compare my results. The comparison will be in terms of execution time and data quality.
Relevant answer
Answer
In addition to the sources given by Kilian Vos, there is a Big Data platform created by FAO to standardise large maps produced from remote sensing:
  • asked a question related to Data Quality
Question
1 answer
I need to know the required quality for different kinds of water demand in a building, such as drinking, showering, etc. A quality index for domestic water and the required amount for every application would be very helpful. What is the index that I should search for, and is there any reference for that?
Relevant answer
Answer
I think you should refer to standard number 1053. There you can find the quality requirements for drinking water.
  • asked a question related to Data Quality
Question
1 answer
I wonder if anyone can help with references to the health status of professional Tea tasters, given that Fluoride is absorbed directly through the oral cavity?
I am looking for data on the quality of their teeth and any oral cancer, plus wider potential impacts on the rest of the tasters' bodies.
Is the development of electronic "tongues" related to health issues as well as seeking objectivity?
Relevant answer
Answer
Tea tasting is the process in which a trained taster determines the quality of a particular tea. Due to climatic conditions, topography, manufacturing process, and different clones of the Camellia sinensis plant (tea), the final product may have vastly differing flavours and appearance
  • asked a question related to Data Quality
Question
4 answers
Dear colleagues,
I'm currently researching supplier sustainability assessment and am looking to find any literature on data/response validation
Companies (but also individuals) are often asked to report on sustainability practices and progress. These responses are subjected to varying degrees of validation; ranging from no further checks to complete validation of every response.
Do you have some recommendations (also from non-SSCM channels) or examples of cases / findings where the respondents' data quality was improved through validation?
Many thanks in advance!
Iain Fraser
Relevant answer
Answer
The ISIMIP project has a lot data on sustainability:
  • asked a question related to Data Quality
Question
2 answers
Hello,
I've completed sequencing my library and am starting my analyses. I'm new to using mothur, and I would appreciate assistance with evaluating my data's quality. I've included a subset from one of my samples containing the output fasta, fastq, and qual files. The subset contains the same three individual organisms across all files.
I understand the fasta files contain DNA sequences for later identification. It's the output from the fastq and qual files that is giving me the most confusion. For example, what do the numbers signify in the qual file?
Please help me understand what the fastq and qual files are saying about the fasta file.
Thank you,
Gary
Relevant answer
Answer
* Fastq is the sequence + quality information. This file is basically provided by the sequencing company. It is used to control the quality of the reads going into the downstream pipeline. The fastq file is converted to fasta after adapter trimming and quality assessment.
* Fasta files are the sequence files, which have each sequence attached to a specific identifier. These are generated after the quality filtering, and the file size is smaller than the fastq.
* Qual files are the fastq minus the fasta: they contain only the quality information and are generated by splitting the fastq file.
If you are using a pipeline which filters out good-quality reads (at your threshold), you don't have to think about what the fasta and qual files contain. Fasta files will become comprehensive after dereplication and clustering the reads into OTUs. Therefore, just follow the protocol. Meanwhile, please go through the mothur manual to understand how the process works. I suggest you do this before you start the actual analysis.
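To make the quality numbers concrete: in a standard Illumina fastq file each quality character encodes a Phred score as its ASCII code minus 33, and the numbers you see in a .qual file are those Phred scores written out directly. A minimal sketch, assuming the common Phred+33 encoding and an example quality string of my own:
# Decode the quality line of one fastq record (Phred+33 encoding assumed)
qual_line = "IIIIHHGF#"                       # example quality string, not from your data
phred_scores = [ord(c) - 33 for c in qual_line]
print(phred_scores)                           # [40, 40, 40, 40, 39, 39, 38, 37, 2]
# Phred Q = -10*log10(P_error), so Q40 means a 1-in-10,000 chance the base call is wrong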
  • asked a question related to Data Quality
Question
10 answers
Hi everyone,
I am pretty new to SEM and I am seeking clarification regarding model fit through AMOS.
While performing the analysis I obtained good model fit index values except for RMSEA and CMIN/DF.
For further comprehension of the model: I used Maximum Likelihood estimation; the model has 4 exogenous (C, D, E, F) and 2 endogenous (A, B) variables; the total number of items (responses provided on a Likert scale) in the questionnaire is 36; the item scores were averaged for each variable.
These are other details:
N= 346
CMIN/DF=6.9
NFI=0.98
CFI=0.98
SRMR=0.04
GFI=0.99
RMSEA=0.13
The standardised parameter for the model are:
A <-- C* 0.11
A <-- D 0.08
A <-- E*** 0.36
A < -- F*** 0.22
B <-- A*** 0.46
B <-- C 0.05
B <-- F 0.09
Significance of Correlations:
*** p < 0.001
** p < 0.010
* p < 0.050
The model explains 28% of the variance in B and 38% in A.
Can anyone please suggest how to overcome this problem of the inadequate (poor) RMSEA value?
Can it be a problem related to data quality? I also read in Kenny et al. (2015) that in the case of low d.f. (in my case d.f. = 2) RMSEA is more likely to be over 0.05.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research, 44(3), 486-507.
Relevant answer
Answer
CMIN/df is not really used in these large sample studies, so we can safely ignore it. This leaves the outlier of the RMSEA. In my experience, simple models (1 or 2 factors and a low number of items) tend to follow the pattern in your results.
The main difference of RMSEA compared to the CFI (and other measures) is that RMSEA is a measure of absolute fit and does not have any corrections based on model complexity. It appears you may have a very simple model (2 df), although you don't specify the exact model here, so that assumption may be wrong.
Now, the SRMR is also an absolute measure of fit, but you have a good SRMR. However, according to David A. Kenny (http://davidakenny.net/cm/fit.htm), the SRMR is biased in favor of low df, which you certainly have.
What does this mean? Well, overall you have a simple model that captures a good amount of the variance considering how simple it is, but there's still a large absolute error. Do you need to alter your model? Well, maybe. If there are clear theoretical reasons to do so, you should. However, considering all fit metrics, it is probably "acceptable," but not "good." So I wouldn't change the model unless there is a good theoretical rationale. Then you'll want to provide a good explanation of the fit metrics and any biases in any article or paper you write based on this model.
So is it a problem? Maybe. Depends on how complex your model really is and if there is a theoretical rationale to update your model.
Is it a problem of data quality? Maybe, but probably not. Most SEM techniques can adjust for variance, missing data and so on. It depends on the estimator you use and the degree of these issues. You don't say what estimator you are using or whether the data are continuous, categorical or interval.
Is it because of the low df? Probably, inasmuch as you probably have a simple model and there are some inherent biases in some of the measures.
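For reference, RMSEA is computed in most software as sqrt(max(chi-square - df, 0) / (df * (N - 1))), so with df = 2 every unit of excess chi-square inflates the index far more than it would in a model with many degrees of freedom. A quick check, assuming CMIN/DF = 6.9 exactly, reproduces the reported value:
import math
chi2, df, N = 6.9 * 2, 2, 346                          # chi-square = CMIN/DF * df
rmsea = math.sqrt(max(chi2 - df, 0) / (df * (N - 1)))
print(round(rmsea, 3))                                 # 0.131, matching the reported 0.13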
  • asked a question related to Data Quality
Question
3 answers
I am trying to develop a machine learning algorithm for data quality. The goal is: given a data set, I should be able to identify the "bad" records in it.
I tried a one-class SVM, but I am a little confused. I assume that I need to train the model with "good" data only, but if I have to do this then I would need to know which records are good, and that is not the goal.
If I train the model with both good and bad classes of records, then the one-class SVM does not give good accuracy. Is this a good approach?
Relevant answer
Answer
No worries. You're welcome.
If the target labels are in your data, then you have a classification task (not clustering).
At the very early stage, make sure that your dataset is prepared well. Then, try many classification algorithms, where each of them needs to be evaluated with k-fold cross-validation, for instance.
You sometimes need to tweak the classifier based on the nature of the data, such as by selecting the right kernel.
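If you do decide to stay with the one-class formulation (no labels), a minimal scikit-learn sketch could look like the following; fitting on the full data set under the assumption that bad records are a small minority, and the nu value, are illustrative choices, not the only sensible ones:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def flag_bad_records(X, nu=0.05):
    # Fit a one-class SVM on the whole (mostly good) data set and
    # return a boolean mask of records flagged as anomalous ("bad").
    Xs = StandardScaler().fit_transform(X)
    model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(Xs)
    return model.predict(Xs) == -1               # -1 = outlier, +1 = inlier
This treats data quality as anomaly detection: you do not need labelled "good" records up front, only the assumption that bad records are rare.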
Cheers,
Dr. Samer Sarsam
  • asked a question related to Data Quality
Question
4 answers
I have heard people suggest that I track the dominant eye when doing single-eye eye-tracking, because the "dominant eye will give more accurate data". However, in normal subjects the fixation locations of the two eyes should be quite consistent. Is the data quality significantly different between dominant-eye tracking and non-dominant-eye tracking?
Relevant answer
Answer
No, unfortunately. I am still working on it. However, take a look at the study attached. Conversely, they consider a fixation accurate when the dominant eye is located on the target. Also, what eye tracker are you using? I am using the Tobii X3-120, and in their manual they recommend using the dominant eye for better precision. I am finding in my study that the dominant eye is more accurate too.
Paterson, K. B., Jordan, T. R., & Kurtev, S. (2009). Binocular fixation disparity in single word displays. Journal of Experimental Psychology: Human Perception and Performance, 35(6), 1961-1968. doi:10.1037/a0016889
  • asked a question related to Data Quality
Question
4 answers
Hello all,
I want to create a composite index that will measure the well-being of EU countries. My data set is a panel (15 years, 28 countries and 15 indicators). I want to create the index based on the principal components technique, but I am not sure what I should do beforehand.
What I was thinking of applying is:
-> removing outliers: I do not think it is necessary to remove them for this analysis. Is this right?
-> checking the seasonality of the data
-> checking the stationarity of the data with a unit root test: some variables need to be standardized. Should I standardize them by taking logs? If I use the differencing technique, I will lose some rows.
One note here: SPSS has an option to standardize variables, but when I take them into EViews to run the unit root test, I see that they still need to be standardized (my guess is that the SPSS standardization means normalization).
-> checking Cronbach's alpha values (they should be over 0.6 to continue applying PCA)
-> PCA
Thank you!
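What I have in mind for the PCA step itself (after whatever outlier, seasonality and stationarity treatment turns out to be needed) is sketched below; weighting the retained components by their explained variance is one option I am considering, not necessarily the right one:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_composite_index(df, n_components=3):
    # df: rows = country-year observations, columns = the 15 indicators
    Z = StandardScaler().fit_transform(df)        # z-score standardization
    pca = PCA(n_components=n_components).fit(Z)
    scores = pca.transform(Z)
    w = pca.explained_variance_ratio_             # weight components by variance explained
    return pd.Series(scores @ (w / w.sum()), index=df.index, name="composite_index")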
Relevant answer
Answer
The index will be used for classification and for prediction (probably with KNN).
  • asked a question related to Data Quality
Question
4 answers
Hi! Can anyone share some practical experience of bacterial WGS on HiSeq 2500 technology (average depth of 200X coverage)?
I would mainly like to know about: 1. data quality, 2. average read length, 3. error rate, 4. advantages over MiSeq.
Thanks.
Relevant answer
Answer
agree with @ Ajit kumar Roy
regards
  • asked a question related to Data Quality
Question
3 answers
I am looking for data analysis tools that can be added to a new database (a legacy system migration, without the database from the legacy system) and that take structured data (pre-determined format, assumed correct) as input.
Relevant answer
Answer
There are many data analytic tools (usually commercial) that claim they can provide information about the quality of your files (and some also claim that they can subsequently provide files where the errors are 'corrected'). These methods are called 'profiling' tools. I am not aware of any that work in a minimally effective manner. Difficult errors include determining 'duplicates' in files using quasi-identifying information such as name, address, date-of-birth, etc. Two records may be duplicates (represent the same person or business) even when the quasi-identifying information has representational or typographical errors. Any quantitative information in two records representing the same entity may have slight (or major) differences. If the records have missing values associated with the data that an individual wishes to analyze, then the missing values should be filled in with a principled method that preserves joint distributions (e.g., the Little and Rubin book on missing data, 2002). The 'corrected' data may also need to satisfy edit constraints (such as a child under 16 cannot be married).
https://sites.google.com/site/dinaworkshop2015/invited-speakers
  • asked a question related to Data Quality
Question
3 answers
How can I check climatic data quality aside from comparison with other stations' data or checking for outliers?
  • asked a question related to Data Quality
Question
6 answers
We have a wearable eye-tracker in our lab (SMI ETG-2w), and I hope to start some experiments using it. I notice that many desktop eye-trackers provide a sampling rate of 500-1000 Hz and allow 9-point calibration before the experiment, whereas ours only has a sampling rate of 60 Hz (though it can be upgraded to 120 Hz if we pay some extra money) and only allows 3-point calibration.
We hope to do some serious neuroscience & psychophysics experiments, analyzing saccades, fixations, and pupil size, and the subjects will sit in a lab. No outdoor experiments are currently planned. Now I have some doubt as to whether our eye-tracker can provide enough precision and accuracy, as in pilot runs when we show dots at random locations on the screen and let our subjects fixate on them, our eye-tracker could not reliably give the correct fixation locations for some dots.
Do wearable eye-trackers always provide worse data than desktop eye-trackers? I hope someone with experience of both kinds of eye-trackers can help me understand what level of data quality I can expect at best.
Relevant answer
Answer
Hey Zhou,
I would add that it critically depends on what you mean by "serious neuroscience & psychophysics experiments". In the linked question, I have collected some related questions on ET use.
In general, if you don't need subject mobility, more restraints are better (head/chin rest, bite bar, etc.) and will yield better results (in terms of accuracy and precision).
In addition, stationary ETs are currently still faster by a large margin. The fastest VOG-based head-mounted ET is around 300 Hz, whereas for stationary, head-fixed VOG-ETs you can go up to 2000 Hz. Since head-mounted ETs use a (low frame rate) scene camera to record the participant's environment, many systems don't see the need to go to higher sampling rates.
For fixation analyses, which is more common in psychology & related fields, low sampling rates are not a problem.
But as soon as you go for saccades, sampling rate becomes a critical factor:
  • Small saccades, or even worse, microsaccades, have very short durations (<30 ms) which you simply cannot record with low-sampling-rate systems (see Johannes' example).
  • Peak velocities of all saccades except the largest ones get severely distorted at sampling rates below 500Hz. Upsampling can get you down to 250 Hz or even 120 Hz (see Mack2017). With some elaborate bulk of signal processing procedures you might get acceptable results even at 60 Hz (see Wierts2008).
  • Onset latencies, and thus durations, also have increased jitter with lower sampling rates due to onsets between samples (see Andersson2010).
Finally, the calibration makes a difference (see Nystroem2013).
TL;DR: Provide more info on what you plan to do. If you analyze saccade peak velocities, get a stationary ET.
Hope that helps, Greetings, David
  • asked a question related to Data Quality
Question
11 answers
I'm new to NGS analysis. There are many QC (quality control) software tools for NGS data; which one is best? I'm using https://usegalaxy.org, so it would be better if the tool is available on this web server. Or is there a better program for QC to run on my laptop? Is the computation too heavy for a laptop (Core i3 CPU and 4 GB RAM)?
Relevant answer
Answer
There are many tools, actually. You may find my book chapter useful. Although the title says it's basically about RNA-Seq data, you will get an idea of many tools that apply to NGS data QC. Below is the link to my chapter.
  • asked a question related to Data Quality
Question
3 answers
I've used REDCap for several years, but we're initiating data collection on a new project that includes 10 parent survey instruments and 13 forms into which research assistants are entering standardized assessment data. I'm trying to figure out if there is a way to restrict double data entry to just the assessment data (not the surveys) and if there are recommendations about user permissions / data access groups. Any advice from REDCap users with experience with double entry would be greatly appreciated!
Relevant answer
Answer
The REDCap administrator is able to provide access to double entry of data under user rights for your project. You can further define whether one or two users will capture the data. At our university it is fairly easy to get the REDCap administrator to make changes to a project.
All the best with your REDCap project.
Ref:
  • asked a question related to Data Quality
Question
5 answers
Fellow researchers,
I am looking for an article (review?) that investigates if quick/timely/near real-time reporting of health care quality data for benchmarking / quality improvement purposes results in higher acceptance of such data by providers.
I want to make the argument that if you want to use such data for quality management you need to have it quickly, i.e. not wait for months or years before reports are published. The reviews that I found emphasize all sorts of contextual factors for the success of quality improvement interventions in healthcare, but not timeliness of reporting. Can anybody help? Many thanks!
Relevant answer
Dear Christoph Kowalski, it is not exactly what you are asking for (i.e., a review or study reporting on increased acceptance of or satisfaction with the reporting system), but 'timeliness' and 'punctuality' in the provision of analytical data are considered quality criteria for any monitoring system (even more so in the case of clinical performance analysis systems).
As an example, the Eurostat Quality Assurance Framework establishes both 'timeliness' and 'punctuality in the delivery of metrics' as its 13th principle (page 30, http://ec.europa.eu/eurostat/documents/64157/4372717/Eurostat-QAF-V1-2.pdf), defining 'timeliness' as 'the availability of the information needed to take action at the appropriate moment of decision making', meaning that the information does not necessarily have to come from a precise period, but has to be ready and available to the professional (health or otherwise) to facilitate an informed decision. In clinical settings, it is established depending on contextual needs. One of the best examples you could use to argue your point is the analogy of the laboratory test in emergency care as an information system that allows timely decision making based on current data on the patient, thus promoting action to change an urgent situation. This example can be extended to other health information systems already mentioned in the other answers above.
Similarly, 'punctuality' in delivering those metrics according to a predefined delivery plan is also stated as a data quality principle, as it sets the expectation of the user of the information system regarding data availability within acceptable margins, facilitating the use of the information provided.
Both 'timeliness' and 'punctuality' are quality dimensions that you should take into account when trying to create an information system to be useful in the clinical context. 
  • asked a question related to Data Quality
Question
2 answers
The basic concern is whether or not the existence of a regulated environment improves data quality and reproducibility.
Relevant answer
Answer
I don't know. First I'd want to see the data, and then see what makes sense... However, since I haven't seen anything... no worries.
thanks for replying,
best regards,
joe
  • asked a question related to Data Quality
Question
3 answers
Which machine learning techniques are used for SAR data quality assessment?
Relevant answer
Answer
  • asked a question related to Data Quality
Question
3 answers
Background: I'm using an online survey company to launch a questionnaire nationally. The survey is to identify factors affecting online shopping intention, so there will be a lot of scale questions, as well as respondents' demographic information. I'm asking the company to get 50 responses before carrying out the full survey.
Question: Would you suggest some quick stat tests using the pilot data, to check if the responses are of high quality? (whether the profile is diversified / whether they answered the questions carefully / etc)
Many thanks for your help!
Li
Relevant answer
Answer
Would you suggest some quick stat tests using the pilot data, to check if the responses are of high quality? (whether the profile is diversified / whether they answered the questions carefully / etc)
You should always conduct a pilot study / test before going for "big bang" data collection. The purposes of the pilot study include the following (not an exhaustive list):
  1. checking the reliability of your survey questionnaire, especially for new questions you developed on your own to measure certain constructs / variables, e.g. Cronbach's alpha reliability / composite reliability scores (a quick computation is sketched after this list).
  2. evaluating whether your future respondents can fully understand your survey questionnaire; if not, issues can be rectified before going for "big bang" data collection.
  3. giving you an initial glimpse of the test results after initial data analyses based on the hypotheses formulated, so that you can adjust your research model, the operationalization of your constructs / variables, hypotheses, etc. as appropriate.
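For point 1, a quick Cronbach's alpha check on the pilot responses takes only a few lines of Python, assuming a DataFrame in which each column is one item of a single scale and each row is one respondent:
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # items: one column per item of a single scale, one row per respondent
    items = items.dropna()
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)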
  • asked a question related to Data Quality
Question
3 answers
I am looking for information on existing research on methodologies and tools intended to collect data samples representing instances of a novel, ill-structured subject domain. This task is, in some sense, the inverse of Big Data processing: I have to collect, as far as possible, data instances of a novel domain with which I am not very familiar and for which I am able to formulate only a basic set of ontology concepts. I need this data set (be it Big Data of any type) to afterwards design an empirical model of the dependencies peculiar to the domain attributes, using a Big Data processing technology (in fact, the machine learning technology I have) to mine the collected data set. This is a kind of ontology-driven data search and collection problem. Could someone help me?
Relevant answer
Answer
Dear prof. Gorodetsky,
I'd suggest you analyze the existing literature on your "ill-structured" domain by means of topic modeling techniques. By inferring relationships across topics you can build the dependency model you are after. In the attached paper you can find a methodology we designed to perform such an analysis.
Regards,
Andrea De Mauro
  • asked a question related to Data Quality
Question
6 answers
I have a particle sample which aggregates to form bigger, micrometer-sized clusters as I heat it up. I want to use dynamic light scattering (DLS) to determine the size of the aggregates during the heating process. However, given the polydispersity of the aggregates, the DLS instrument keeps telling me that the data quality is too poor because the sample is too polydisperse. I am wondering: is the Z-average size I get still reliable? Can I rely on the size distribution from the instrument, or do I need to use other methods to derive the size distribution of the polydisperse aggregates? Thanks very much!
Relevant answer
Answer
I am afraid that a simple Z-sizer is not an appropriate instrument for such measurements. To be confident in the results, the measurements should be done at 2 or 3 scattering angles, and it should be checked that the correlation-function processing program reproduces the locations of the size distribution peaks. We should also note that when particles are large (and their number is small), at different times the scattering volume can contain different particles. Therefore, at each scattering angle a few measurements are needed. So for DLS measurements of a complex size distribution you must use a Malvern DLS sizer with the option of changing the scattering angle, or an analogous Nanotrac sizer, or a Photocor FCN sizer with a good program for expanding the correlation function into exponentials. For Photocor the program DynaLS is sufficient; for other sizers compatible software is needed. Look for a DLS installation of this class.
  • asked a question related to Data Quality
Question
9 answers
My data will be the quality audit reports of 45 HEIs in the Sultanate of Oman.
How can I validate/confirm the proposed model/framework?
Thanks in advance.
Relevant answer
Answer
Yes, I think SEM should help. But a model should enumerate endogenous and exogenous variables with well-grounded hypotheses and should not have either type-I or type-II errors.
But it would be necessary to validate it with case-based inputs.
  • asked a question related to Data Quality
Question
4 answers
I am looking to create a system that proactively monitors data stores for data quality, where the data quality rules are expressed in a language understandable by a lay-person (e.g, Attempto). The language constructs could then be parsed into the equivalent data store-native language (e.g., SQL, NoSQL, etc.).
If you know of any research in this area, I would be most appreciative.
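To make the idea concrete, the kind of translation I have in mind is sketched below; the rule pattern, table/column names and SQL template are purely illustrative, and a real system would use a proper controlled-language grammar (such as Attempto's) plus a catalogue mapping rule vocabulary to physical schema names rather than a regular expression:
import re

# Illustrative only: one completeness-rule pattern of the form
#   "Every <table> must have a <column>"
RULE = re.compile(r"Every (\w+) must have a (\w+)", re.IGNORECASE)

def rule_to_sql(rule: str) -> str:
    # Translate one constrained-English rule into a SQL probe
    # that counts the violating rows.
    table, column = RULE.match(rule).groups()
    return f"SELECT COUNT(*) AS violations FROM {table} WHERE {column} IS NULL"

print(rule_to_sql("Every invoice must have a customer_id"))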
Relevant answer
Answer
Fellegi and Holt (JASA 1969) rediscovered one of the fundamental theorems of logic programming.  The mathematics of the models drive the implementations.  Winkler (2003) demonstrated theoretically how edit rules could be logically connected with imputation procedures so that there were guarantees that the joint distributions of fields would be preserved in a principled manner.  The mathematics guarantee that the entire set of rules are logically consistent (including rules on the joint distributions that are standard in statistics).  All Fellegi-Holt-based systems assure logical consistency.
A number of papers, such as Galhardas et al., and many papers on variants of symbolic logic show that rules expressed in such languages can (sometimes) be converted into the types of rules used in the Fellegi-Holt model. What the rule conversions do not assure is the logical consistency of the resulting systems. We found in small situations that it was very straightforward to convert the English-type rules to suitable tables (sometimes in less than four hours). In much larger situations, such general rules-based methods would have been very useful. We never had the resources to pursue the additional features in our production systems.
In some of our systems, we need to convert 100s or 1000s of rules into table format.  The conversion is tedious.  Once in the table-format, it is relatively straightforward (but may take much computer time) to check the logical consistency.
  • asked a question related to Data Quality
Question
3 answers
We assume that a composite service consists of a number of atomic services.
Given the quality of each atomic service, how do we measure the overall quality of the composite service?
For example, we can define: the higher the mean value, the higher the overall quality; and the lower the variance, the higher the overall quality. However, the question is how to aggregate the mean value and the variance. Are there any metrics?
Relevant answer
Answer
I would use the weight of each atomic service, which is a measure of the size and/or relevance of the service. That's something you have to come up with.
Then simply multiply each weight factor by the respective DQ factor and use the weighted values to compute the average, standard deviation, etc.
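As a concrete version of that suggestion, a weighted mean and weighted standard deviation over the atomic services could be computed as below; the weights representing size/relevance are assumed to be given, and the example numbers are purely illustrative:
import numpy as np

def aggregate_quality(q, w):
    # q: per-atomic-service quality scores; w: their relevance weights
    q, w = np.asarray(q, float), np.asarray(w, float)
    mean = np.average(q, weights=w)
    std = np.sqrt(np.average((q - mean) ** 2, weights=w))
    return mean, std

print(aggregate_quality([0.9, 0.7, 0.95], [3, 1, 2]))   # illustrative scores and weights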
  • asked a question related to Data Quality
Question
2 answers
I am familiar with Actor-Network Theory; I applied ANT as a tool to construct network collaboration, and now I want to use ANT with data quality as a metric to choose the actor network.
Thank you in advance
Salim
Relevant answer
Answer
Hi Mohammed
I have a small list of ANT papers which might be interesting for you. I also pulled out a much shorter list of those which address quality issues. I hope that helps.
Regards
Bob
enc
  • asked a question related to Data Quality
Question
3 answers
In a situation where it is not feasible to quantify a DQ dimension due to the non-involvement of a data set, is it sufficient to show subjective measures asked directly of the customer?
To be clearer: a customer's satisfaction that the desired requirement has been met cannot be objectively measured. In this case, is customer satisfaction feedback sufficient for the successful approval of the project when presenting to a committee?
Relevant answer
Answer
There are many studies linking subjectively obtained, survey-based measures of customer satisfaction with its outcomes. For a summary, see:
Mittal, Vikas and Frennea, Carly, Customer Satisfaction: A Strategic Review and Guidelines for Managers (2010). MSI Fast Forward Series, Marketing Science Institute, Cambridge, MA, 2010.
  • asked a question related to Data Quality
Question
7 answers
Specifically, I'm interested in how to use ANT in the governance of information systems; I also want to use ANT with data quality.
Relevant answer
Answer
Hi There
It has a huge list of IS models/theories in alphabetical order, and Actor-Network Theory is there. Click on Actor-Network Theory and scroll down to where it says "IS articles that use the theory"; there are plenty, and some are very recent. Enjoy and good luck.
  • asked a question related to Data Quality
Question
46 answers
I have over 5 years of experience developing predictive models, along with years of experience as a researcher, statistical analyst and data scientist. One thing I have experienced within the big data sphere and the predictive modeling landscape is that a lot of emphasis is placed on data quality and quantity, the experience and expertise of the modeler, and the kind of system that is used to build, validate and test the model and to continue to monitor and assess its quality and performance over time. With this said, I would like to hear what others here on ResearchGate think are some of the challenging tasks in building statistical or predictive models, and what strategies you employed to help address those challenges. What were some of the tradeoffs you had to make, and how would you approach a similar situation in the future?
Information provided to this inquiry will be used for personal and professional growth.
Relevant answer
Answer
Hello,
For me, possibly the most challenging is/was/will be to identify a niche within a vast amount of knowledge to be able to introduce meaningful research questions that would be novel and could enhance understanding of mechanisms underlying psychological disturbances (Yes, I am a scientist and a psychologist).
Sounds like a philosophical statement. But it matches your question - very broad and philosophical too.
Regards, Witold
  • asked a question related to Data Quality
Question
3 answers
The aim of data fusion (DF) is basically to increase the quality of data through the combination and integration of multiple data sources/sensors. My research is on assessing this impact of DF on DQ (data quality), hence I would appreciate academic materials to back up your conclusions.
I have been trying to link DF methods to the DQ dimensions that are most impacted, to no avail.
Relevant answer
Answer
DF improves DQ when (and only when) the different data input streams are to some degree correlated. If not, DF does not make any sense.
Sorry - I cannot give you academic materials because I have none on this topic. The above is from experience and some own work in this area.
  • asked a question related to Data Quality
Question
7 answers
We are doing surveillance in a scattered geographical area. How can we assess data quality? Should we compute error rates or directly observe how data is being collected?
Relevant answer
Answer
There is a direct correlation between the source of data, the collection of data and the error rate when it comes to data quality. Personally, I think you should pay more attention to how data is being collected, because this is the most likely source of data quality breaches. The more accurate your data collection, the lower the error rate you will observe.
DD=DCM+ER
ER=DD/DCM  
Where
DD=Data Destination (Constant variable- Always known)
DCM=Data Collection Methods
ER=Error Rate
  • asked a question related to Data Quality
Question
3 answers
We intend to assess data quality. Can we use the LQAS strategy for the assessment of data quality?
Relevant answer
Answer
This is an accepted method of sampling in public health and is equivalent to stratified sampling. The sample sizes may be smaller, but it allows the development of hypotheses.
WHO does recommend this form of sampling for large studies such as immunisation coverage.
  • asked a question related to Data Quality
Question
3 answers
I would love some pointers to existing work/papers on sparsity in large data sets. What I am looking for are not the (important) usual papers on how to compute given sparsity in large data sets; I am instead thinking about how one might use the fact of sparsity and a non-homogeneous distribution of features and relationships to characterize overall solution spaces/data set spaces into regions of greater interest and less interest. 
Relevant answer
Answer
There are a couple of things going on here that can lead to sparsity.
1.  Incompleteness - this can be dealt with assuming or testing that your data set complies with random sampling.
2.  Censored Data - sparsity could result from the inability (instrumental or otherwise) to measure certain things.  This is often referred to as "Survival Statistics" and you can use that as a search term to find useful information.
  3. An unusual selection function. Selection functions are very important and are generally unknown at the outset. They arise depending upon how the data was gathered and the (unknown) bias or limitations of that data gathering. All data samples, in the end, are the result of some selection function convolved with the intrinsic distribution. This produces the sample distribution.
  • asked a question related to Data Quality
Question
1 answer
If yes, could you give me an example? In my opinion, the maturity models for MDM / DQ are, in most cases, unspecific / generic across all business branches.
Thanks and best regards
M. Gietz
Relevant answer
Answer
Thank you for bringing this matter to our attention! It is still a great help and a real pleasure to read your posts.
  • asked a question related to Data Quality
Question
5 answers
My current research project is about data quality in child injury prevention. In the course of my study, I predominantly make use of qualitative methods.
I'm currently drafting the interview questions for expert interviews, asking about the perceptions and expectations of stakeholders working in child injury prevention regarding current data systems in terms of quality (utility). I wondered if anyone has any great interview questions for this type of interview which they are willing to share?
Thanks for your help.
Regards, Nicola
Relevant answer
Answer
I hope that you can find someone who can share this kind of specific information, because as you probably already know, most qualitative researchers do not include the content of their interview questions in their publications. I suspect, however, that you will have to do what almost everyone else in qualitative research has done: write your own questions and hope for the best.
As general resources on qualitative interviewing (including writing questions), I would recommend:
Spradley, The Ethnographic Interview.
Rubin & Rubin, Qualitative Interviewing: The Art of Hearing Data.
  • asked a question related to Data Quality
Question
10 answers
I manage a research participant registry. I have just come on board, and the data from participant surveys has been entered by volunteers. I have checked the first 100 surveys for errors and have found around a 60% error rate (around 60% of the surveys have at least one entry error). I plan to double-enter all of the current surveys at 100%. However, outside of more extensive volunteer training, I am looking for measures to ensure data integrity for future surveys.
Relevant answer
Answer
One error on 60% of surveys does not constitute a 60% error rate -- the rate would depend on the number of responses entered. Actually, only 1 error on 60 out of 100 surveys doesn't sound unusual. All manual data entry is subject to error. The best way is to double enter, as you are doing. If you could load the survey onto telephone survey software, you would reduce your error rate and lessen the number of hours spent on data entry.
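On the double-entry point, if both passes end up as CSV exports, the mismatch report can be automated; a minimal pandas sketch, where the file names and the survey_id column are placeholders for your own export:
import pandas as pd

# Placeholder file names / ID column for illustration
entry1 = pd.read_csv("entry_pass_1.csv").set_index("survey_id").sort_index()
entry2 = pd.read_csv("entry_pass_2.csv").set_index("survey_id").sort_index()

# Cell-by-cell comparison; keeps only the rows/fields where the two passes disagree
# (DataFrame.compare requires both frames to share identical index and columns)
mismatches = entry1.compare(entry2)
mismatches.to_csv("discrepancies_to_resolve.csv")
print(f"{len(mismatches)} surveys with at least one discrepancy")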
  • asked a question related to Data Quality
Question
7 answers
We have performed group ICA in two groups of patients (BOLD fMRI data). One group is controls, one is with very severe developmental brain abnormalities (mixed group). In controls, ICA revealed 6 components (mostly bilateral), while in the diseased group, we get 30-40 components on the group level, and the components are small and focal. Data preprocessing is the same, data quality is also the same (although the brains themselves might show some anatomical heterogeneity as well).
I am very curious how this excessive number of ICs can be interpreted. We may assume that the diseased group has impaired brain functioning, or even no brain functioning (e.g. due to neural migration disorders), in the respective areas.
Thank you.
András
Relevant answer
Answer
There is very little that the number of components found could tell you about brain functioning or connectivity, since many of the components could be related to noise, artefacts or be vascular components. You should take some considerations into account. Did you perform group ICA with temporal concatenation? Did you use automatic estimation of the number of ICs?
If you really want to compare the integrity of the resting state networks in a group of patients vs. controls, your approach is not adequate. I suggest that you go for an entire-sample group analysis, and after identification of the real resting state networks you apply a two-group differences analysis. Take structural differences into account. In the end, the differences found would reflect differences in network integrity/pattern.
  • asked a question related to Data Quality
Question
8 answers
I have two groups of microarrays for comparison (control and treatment). The treated group has 3 arrays. The control group has only 2 arrays because the 3rd array's data quality is not good. How should I proceed with the analysis of this dataset? Should I still include the bad-quality array in the comparison or exclude it from the analysis? How does this influence my results?
Relevant answer
Answer
I would use an empirical Bayes method; the package limma allows you to do that.
You will have to fit a linear model with lmFit(), then run the empirical Bayes analysis with eBayes(). Your p-values will be in the $p.value attribute,
i.e.
library(limma)
# data: your expression matrix (genes in rows; columns ordered as the 3 treated arrays, then the 2 control arrays; no NAs)
n_treat <- 3; n_ctrl <- 2
myLabels <- cbind(Grp1 = 1, Grp2vs1 = c(rep(1, n_treat), rep(0, n_ctrl)))   # design: intercept + treatment-vs-control column
fit <- lmFit(data, myLabels)
fit <- eBayes(fit)
p_vals <- fit$p.value[, 2]   # p-values for the treatment-vs-control coefficient
  • asked a question related to Data Quality
Question
5 answers
As we develop assays that are more high content and have a large amount of data, I'm wondering what methods people are using to QC the data. Is it a spot check or a thorough review of all data points? Thank you for your thoughts.
Relevant answer
Answer
The first question is what kind of data sets these are and what format they are in. I think these aspects are important to know before answering your question correctly.
  • asked a question related to Data Quality
Question
10 answers
I have a climate database with daily weather data from weather stations in a region. Most of the data are real measurements from the weather stations' equipment. However, some data (a few or many days of the year) are not from measurement; they are instead estimated based on mathematical functions of the neighbouring stations' data. If I use the whole data set to produce a surface through GIS interpolation, I will be estimating attributes based on previous estimations. This will incur accumulated errors that are difficult to calculate. I would appreciate suggestions of references covering this issue in the literature, and your informed opinion.
Relevant answer
Answer
Dear Simone,
If possible, I suggest a leave-one-out cross-validation approach, in the sense that you calculate the interpolated grid with all data first, then remove single data points and interpolate again. The difference will provide a simple measure of the effect of individual data points. Please keep in mind that the interpolation algorithm and the parameters used will have a strong influence on the results.
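A minimal sketch of that leave-one-out idea for station data, using inverse-distance weighting as a stand-in interpolator (your own GIS algorithm and parameters would replace it), might look like this:
import numpy as np

def idw(x, y, z, xi, yi, power=2):
    # Inverse-distance-weighted estimate at (xi, yi) from stations at (x, y) with values z
    d = np.hypot(x - xi, y - yi)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return np.sum(w * z) / np.sum(w)

def loocv_errors(x, y, z):
    # x, y, z: NumPy arrays of station coordinates and values for one day/variable.
    # Leave each station out in turn and predict it from the remaining ones.
    errs = []
    for i in range(len(z)):
        keep = np.arange(len(z)) != i
        errs.append(idw(x[keep], y[keep], z[keep], x[i], y[i]) - z[i])
    return np.asarray(errs)   # e.g. summarise as RMSE, overall or per station
Stations whose values were themselves estimated from neighbours should tend to show suspiciously small leave-one-out errors, which might be one way to quantify the double-estimation effect you describe.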
I hope this helps a bit.
Regards,
David
  • asked a question related to Data Quality
Question
10 answers
I finished my PhD recently and would like to collaborate with anyone who may be working on this to cross-check our findings related to metadata quality, metadata training and metadata design. If you are interested, or if you have any resources, please drop a line.
Relevant answer
Answer
@Efstratios, yes indeed, metadata quality should come into the discussion since high quality metadata and metadata that are being revised and preserved, also allow the resources they describe to be accessible and reusable for a long period of time. I am also really interested to take a look at the costs involved. I already took a swing at measuring some costs involved in the metadata enrichment process in my thesis (http://www.slideshare.net/nikospala/metadata-quality-issues-in-learning-repositories) and this is also something that would be of interest. What you can also see in the attached presentation is an approach that was applied in three different federations of repositories to enhance metadata quality. It may also be of relevance to your project. If you do develop a repository for the Tate's collection that is mentioned, I would be interested to see how metadata are being developed in that context. I would also be interested to see the tools or documentation that you plan to develop which will help in the preservation of data. I also worked on some simple tools so it would be interesting to see how this can be taken to the next step. I see no Greek partner there, so I am guessing no events will take place in Greece, but in any case, I would be interested to keep in touch if there's anything that would be in line with my research! Thank you so much for the answer!
  • asked a question related to Data Quality
Question
13 answers
There is a lot of open data around, but for many purposes, especially in an enterprise context, indicators and automatic measurements to determine data quality are a must-have.
Relevant answer
Answer
True, I am currently working on a survey on data quality assessment methodologies, dimensions and metrics (www.semantic-web-journal.net/content/quality-assessment-methodologies-linked-open-data, under review).
In the meanwhile, you could take a look at http://stats.lod2.eu/, which gives a high level overview of "quality" of data sets.
  • asked a question related to Data Quality
Question
4 answers
I would like to be able to clearly distinguish between multiple voices. Has anyone had experience with the Microcone?
Relevant answer
Answer
We trialled the Microcone for a different application - it seemed useful for multiparty conversation and was reasonable at separating voices. We ended up going with a different set-up because we had different aims. Otherwise, I've had some success recording large groups using a single portable audio recorder, but it can be difficult to distinguish voices unless the voices are already known to you. Hope that helps!
  • asked a question related to Data Quality
Question
13 answers
Many projects use metadata. They are the backbone of many data warehouse systems. There is a major drawback with metadata, though: the notion of metadata quality is neither agreed on nor even clearly laid out.
Bruce and Hillmann said 10 years ago 'Like pornography, metadata quality is difficult to define. We know it when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter. For this reason, among others, few outside the library community have written about defining metadata quality. Still less has been said about enforcing quality in ways that do not require unacceptable levels of human effort.'
What can we do about it? And if we could not agree on a definition of metadata quality, would it not be a failed concept?
Relevant answer
Answer
Michael, I'll answer your question from a different perspective to Arjun & would like to separate the issues that arise in your provocative title "Is metadata a failed concept?" from the commentary around "metadata quality".
I have participated for many years in international standards development efforts relating to metadata schemas (in organisations like IEEE LTSC, IMS Global, DCMI & SC36). All these organisations have responded -- & are still responding -- to the consequences of the digital revolution in terms of information management & discovery. Successful standards are usually judged as being "fit for purpose" & rarely can add value beyond that. There's plenty of debate out there as to the usefulness of metadata standards or profiles of them but there's also plenty of success stories too. But another point I'd like to make is that "metadata" is not just a term that describes metadata schemas. There's an enormous amount of metadata that gets used in virtually every web service you can think of -- whether it is in the form of XML, RDF, RSS, etc; "date posted" information; or whether it's just a tag cloud or all the other data that gets collected as "analytics". Any content or data that can somewhow be expressed in terms of "who, what, when, & where" while also relating to some other content or data is essentially metadata. So, i don't think "metadata" is a failed concept -- to the contrary, it is as you indicate the "backbone of many warehouse systems" as well as what enables so much systems interoperability on the web.
The issue of metadata quality is of a totally different order for me. If you're managing systems or curating content that requires high quality metadata then it's best to make sure you're drawing on expert input. In many other situations the metadata quality may be of questionable quality or value -- but it's a consequence of how social media works. in time, I'd expect it to improve.
  • asked a question related to Data Quality
Question
2 answers
Health Care Quality Measurement may involve making measurement tradeoffs. The attached presentation on AMI quality measurement describes some of these tradeoffs and some approaches to handling them. What has your experience been in dealing with these kinds of measurement challenges?
Relevant answer
Answer
We struggle with this a lot in developing quality indicators, especially because we most often rely on hospital administrative claims data, which lack much of the clinical detail that ideally would be nice to include in measurement. So already, in using claims data, we've made a trade-off, sacrificing the clinical detail found in the medical record for the decreased time and cost of using readily available claims data. As we develop indicators, we think about the intended purpose (quality improvement, public reporting, pay for performance, research) and how various trade-offs in sensitivity and specificity will impact use of the measure for these various purposes. Sometimes internal validity is more important than generalizability, so more precision is needed at the cost of more stringent criteria and smaller samples. Other times it's more important that a measure be generalizable, and interpretable, even when we know some specificity will be lost. We constantly emphasize thinking about the aggregate rate (for an area, hospital, health plan, etc.) rather than individual cases. And above all, we keep in mind this principle: don't let the perfect become the enemy of the good.
  • asked a question related to Data Quality
Question
2 answers
In 50 years’ experience in chemical analysis I have had many problems applying newly published, supposedly superior analytical methods to “real world” samples. Should peer reviewers and editors ask for a more thorough validation and analysis of CRMs?
Relevant answer
Answer
It depends on the purpose of the published method and on the journal. If this "new" method is applicable to some samples (e.g. lab samples), it is useful that it has been published.
But if this method has been published with the purpose of being used for real-world (natural) samples, I agree with you that this should be shown in the paper and not only on CRMs.