Data Quality - Science topic
Explore the latest questions and answers in Data Quality, and find Data Quality experts.
Questions related to Data Quality
Dear ResearchGate community,
I'm currently working with a large dataset of phytoplankton species, and I'm looking for a way to routinely update the taxonomy of these species. Specifically, I'm interested in identifying changes to scientific names, accepted names, synonyms, and higher-level taxonomic classifications.
I've tried using various R packages such as worrms to retrieve taxonomic information from online databases, but I've encountered some challenges with data quality and consistency.
I would appreciate any suggestions or advice on how to efficiently and accurately update the taxonomy of my dataset on a regular basis. Are there any reliable databases or APIs that I should be using, like AlgaeBase? Are there any best practices for handling taxonomic data in R?
Thank you for your help!
Nicolas
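As a possible complement to worrms, here is a minimal Python sketch of a routine refresh against the WoRMS REST service (the same backend that worrms wraps). The endpoint, parameter, and field names are written from memory and should be verified against the current WoRMS REST documentation before use; the species names are only examples.

```python
# Minimal sketch: refresh accepted names for a species list via the WoRMS REST API.
# Endpoint, parameter, and field names are assumptions; verify against the WoRMS docs.
import requests

BASE = "https://www.marinespecies.org/rest"

def refresh_taxon(name):
    """Return (matched name, status, accepted name) for one scientific name, or None."""
    r = requests.get(f"{BASE}/AphiaRecordsByMatchNames",
                     params={"scientificnames[]": name})   # fuzzy match to Aphia records
    r.raise_for_status()
    matches = r.json()
    if not matches or not matches[0]:
        return None
    rec = matches[0][0]                                     # best match for this query name
    return rec.get("scientificname"), rec.get("status"), rec.get("valid_name")

species = ["Skeletonema costatum", "Chaetoceros socialis"]  # illustrative names
for sp in species:
    print(sp, "->", refresh_taxon(sp))
```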
Hello everyone, I am checking the quality of some RNA-seq data with FASTQC and I am getting results that are not clear to me. Is this kind of result normal?
How can one do case-control matching to randomly match cases and controls based on specific criteria (such as sociodemographic matching - age, sex, etc.) in order to improve data quality?
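As one concrete illustration, here is a minimal Python/pandas sketch of 1:1 exact matching on sex and an age band; the column names, band width, and 1:1 ratio are assumptions to adapt to the actual matching criteria.

```python
# Minimal sketch: 1:1 exact matching of controls to cases on sex and a 5-year age band.
# Column names ("case", "sex", "age") and the band width are illustrative assumptions.
import pandas as pd

def match_case_control(df, k_years=5, seed=42):
    df = df.copy()
    df["age_band"] = (df["age"] // k_years) * k_years       # e.g. ages 30-34 -> band 30
    cases = df[df["case"] == 1]
    controls = df[df["case"] == 0].sample(frac=1, random_state=seed)  # shuffle controls
    used, pairs = set(), []
    for _, case in cases.iterrows():
        pool = controls[(controls["sex"] == case["sex"]) &
                        (controls["age_band"] == case["age_band"]) &
                        (~controls.index.isin(used))]
        if not pool.empty:
            ctrl_idx = pool.index[0]
            used.add(ctrl_idx)
            pairs.append((case.name, ctrl_idx))              # (case index, control index)
    return pairs

# Usage: pairs = match_case_control(my_dataframe)
```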
Hello everybody,
I am currently writing my final thesis about data quality, in particular about consistency. Therefore I am looking for a labeled IoT time-series dataset for consistency detection. Does anybody know where I can find such a dataset?
Or does anybody know where I can get a labeled IoT-timeseries dataset for anomaly detection?
Thank you for your help!
Hello,
I'm trying to calculate the results for a product system by selecting the following options:
- Allocation method - None;
- Impact assessment method - ReCiPe Midpoint (I) / ReCiPe 2016 Endpoint (I);
- Calculation type - Quick results / Analysis;
- Include cost calculation and Assess data quality.
Well, the results are always a list of zeros for every item in the LCI.
I've already tried the following actions to solve the problem, but without success:
- Increased the maximal memory to 5000 MB;
- Validated the database (it returned with zero errors);
- Opened the SQL editor and executed the query: select p.ref_id, p.name from tbl_processes p inner join tbl_exchanges e on p.id = e.f_owner where e.id = 7484497 (got the reference ID and the name of the process where the exchange occurred, searched for it, opened the process, and didn't find any error message with more details or a warning popup).
The openLCA version I'm working on is 1.11.0.
Thank you very much for all the help.
Best regards,
Beatriz Teixeira
Answers from different disciplines are welcome.
I am currently working with GANs, or to be specific CGANs (Conditional Generative Adversarial Networks), for synthetic signal generation. To improve the quality of the generated data, i.e., to increase the similarity of the synthetic data to the original data, I have already analyzed and observed improvements from several hyperparameter-tuning combinations for the discriminator and generator, such as modified momentum, iteration count, and learning rate. On top of that, batch normalization and manipulating the number of layers brought additional improvement. My question is: beyond the parameters of general neural networks, which other parameters are a must to look into?
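For concreteness, here is a minimal PyTorch sketch of a conditional discriminator that exposes the knobs mentioned above (learning rate, Adam momentum terms, batch normalization, number of layers) as arguments; the architecture and values are purely illustrative, not a recommendation.

```python
# Minimal sketch of a label-conditioned discriminator with the tuning knobs from the
# question exposed as arguments. Layer sizes and hyperparameter values are illustrative.
import torch
import torch.nn as nn

def build_discriminator(signal_len, n_classes, hidden=256, n_layers=3, use_bn=True):
    layers, in_dim = [], signal_len + n_classes       # signal concatenated with one-hot label
    for _ in range(n_layers):
        layers.append(nn.Linear(in_dim, hidden))
        if use_bn:
            layers.append(nn.BatchNorm1d(hidden))      # toggle batch normalization to compare
        layers.append(nn.LeakyReLU(0.2))
        in_dim = hidden
    layers.append(nn.Linear(in_dim, 1))                # real/fake score
    return nn.Sequential(*layers)

D = build_discriminator(signal_len=128, n_classes=4)
# Learning rate and the Adam momentum terms (betas) are the other knobs from the question.
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

x = torch.randn(16, 128)                               # a fake batch of signals
y = torch.nn.functional.one_hot(torch.randint(0, 4, (16,)), 4).float()
score = D(torch.cat([x, y], dim=1))                    # shape: (16, 1)
```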
In the era of information and reasoning, we are shown a great deal of scientific information, in print or online, globally. Despite this appreciable access to information, the originality, novelty, and quality of information are often substandard. For example, much of the research done in the developing world is either published in reputable journals or left on the shelf; however, implementation of these research findings is scarce.
This could be due to data quality or to the quality and quantity of the research team involved. The issues that could affect the quality of research in developing countries include, but are not limited to:
· Availability of limited resources to support research projects
· Inadequate time devoted to research projects because people who teach at the university level in developing countries are rarely full-time professors and usually have several jobs.
· The theoretical nature of research methodology in the curriculum, so students become professionals without the practical knowledge of how to do research.
· Limited access to journals, search engines, and databases and high subscription cost that is beyond the reach of the budgets of both individual professionals and university libraries.
· Weak ethical review committee to verify ethical treatment of human subjects.
· Rationing research funds across several colleges and departments, which leads to limited competition and an increased chance of doing weak research
· Weak institutional structure and lack of empowerment to research staff
· Poor data management systems and lack of databases
· Poor research guidelines and poor composition of the research team (i.e. failure to involve all relevant expertise in developing proposals and in conducting analysis and interpretation of findings)
In the face of the above challenges, could using real-world health data be a solution to data quality problems? If so, what changes would be possible using real-world health data in developing countries?
How to maintain data quality in qualitative research? How to ensure quality in qualitative data collection as well as data analysis?
Hello, I am trying to get a metagenomic analysis done and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core's). Does anyone have any experience with this company regarding data quality or reliability?
Please let me know,
Thank you,
What metrics do people use for quantifying the quality of time-series data obtained from sensors, such as speed, acceleration, relative velocity, or lidar point clouds? Also, how do you define the quality of such time-series data?
#timeseries #timeseriesdata #datanalysis #ts #quality #dataquality #metric #sensors
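There is no single standard set, but as a starting point, here is a small Python sketch of a few illustrative per-channel metrics (missing-sample fraction, timestamp jitter, dropout share, spike share); the names and thresholds are assumptions, not an established metric suite.

```python
# Illustrative, non-standard quality metrics for one timestamped sensor channel.
import numpy as np

def timeseries_quality(t, x, nominal_dt):
    """t: timestamps (s); x: values (may contain NaN); nominal_dt: expected sample interval (s)."""
    t, x = np.asarray(t, dtype=float), np.asarray(x, dtype=float)
    missing_frac = np.mean(np.isnan(x))                              # completeness
    dt = np.diff(t)
    jitter = np.std(dt) / nominal_dt if len(dt) else np.nan          # timing regularity
    gap_frac = np.mean(dt > 1.5 * nominal_dt) if len(dt) else np.nan # dropouts
    good = x[~np.isnan(x)]
    z = (good - np.mean(good)) / (np.std(good) + 1e-12)
    spike_frac = np.mean(np.abs(z) > 5)                              # crude spike/outlier share
    return {"missing_frac": missing_frac, "dt_jitter": jitter,
            "gap_frac": gap_frac, "spike_frac": spike_frac}

# Example: a speed channel nominally sampled at 10 Hz, with some dropped samples.
t = np.arange(0, 10, 0.1)
x = np.random.normal(20, 1, t.size); x[::50] = np.nan
print(timeseries_quality(t, x, nominal_dt=0.1))
```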
Dear Han,
After comparing annual flux ET data with CMFD precipitation, it seems that ET at almost all six sites is even greater than precipitation. Therefore, I wonder whether it is possible that runoff or groundwater played a crucial role at these sites. Otherwise, it may be a flux data quality problem.
Please refer to the attachment for details
Cheers,
Ziqi
Dear community,
We are working with attention check items in several questionnaires to improve the data quality (compare ). We are also measuring several constructs (such as embodiment, immersion, presence, motivation, self-efficacy etc.) which established questionnaires recommend measuring with many items (sometimes >30). This length makes it infeasible given participants' limited attention and time. Thus, we have shortened some scales. I would like to justify why this is acceptable given the higher data quality due to the attention check items. Unfortunately, I could not identify any literature that indicates this. Are you aware of anything in this direction? Please also feel free to point out any literature regarding shortening scales or the trade-off of long scales and time.
Thank you!
Usually, TaqMan qPCR gene expression assays using TaqMan Fast Advanced Master Mix require 20 µl reaction volumes. I wanted to scale down reaction volumes to 10 or 5 µl to save on reagents, but I was wondering if this is possible without sacrificing data quality.
Hello,
Looking for studies about data quality improvement for a fixed model. What are the options in the case of limited data (<10,000 samples)?
Regards
I have found reports mentioning the limits of both, but apart from answers on scope (EF is broader, CF is more specific, and they are also quite interrelated), I can't find a comparison of the quality of the data itself.
Is there any dataset of smart-meter (energy) data that is labeled with data quality flags?
Thanks
Aslam
Hi!
I'm working on a research project and we have several hydrologic datasets (e.g., river discharge, precipitation, temperature, etc.).
We need to be sure about the quality of our datasets.
Are there any tests (preferably recent ones) to verify the quality of these datasets?
Thanks
I used a Zetasizer Nano ZS to measure my polystyrene microparticles, and the data quality report suggests the zeta potential distribution is poor. I've checked the SFR spectral quality data; all 6 of my replicates for the same sample are below 1. According to the technote, a poor zeta potential distribution could originate from improper sample concentration, high-conductivity samples, or too few sub-runs. I've checked the derived count rate, which is 13,000-18,000 kcps with an attenuator setting of 7. The sub-run number is 22 for all samples. I saw a blackening effect on the electrode after measurement, which suggests electrode degradation; therefore, I suspect the conductivity is relatively high. However, the dispersant composition only includes 50 mM Tris buffer with some trehalose and glycerol. I'm not sure whether monomodal mode is required in this case and whether more sub-runs will be helpful. Your answers will be highly appreciated.
My research team and I collected a batch of data this month (~150 workers) from MTurk. Despite having many quality checks embedded in the survey (e.g., multiple attention checks, reverse-coded items), we still feel that the data are suspicious. We ran a similar study one year ago, and one of our measures assesses sexual assault perpetration rates. We used the same measure in our current study, and the perpetration rates are unusually high this time. Is anyone else having trouble finding quality participant responses on MTurk? Does anyone have suggestions for how we could target better participants? Are there any forums or blog posts that we should be aware of that will help us better understand what is going on? Any information would help and be greatly appreciated!
Thanks in advance!
Kindly suggest means of removing outliers from a data set so as to improve data quality.
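One common, simple approach is the interquartile-range (IQR) rule; here is a minimal pandas sketch, where the 1.5×IQR factor is the usual but adjustable convention and the column name is hypothetical.

```python
# Minimal sketch: drop rows whose value falls outside the 1.5*IQR fences.
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Usage (hypothetical column name):
# clean = remove_outliers_iqr(raw, "measurement")
```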
For a text classification task, if data quantity is low but data quality is not, we can use data augmentation methods for improvement.
But my situation is the opposite: data quantity is not low, while data quality is low (noise in the labels, i.e., low training-label accuracy).
The low-quality data come from unsupervised or rule-based methods. In detail, I deal with a multi-label classification task: first I crawl web pages such as wiki articles and use regex-based rules to mark the labels. The model input is the wiki title and the model output is the rule-matched labels from the wiki content.
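Below is a stripped-down Python sketch of that regex-based rule labeling (the patterns and label names are invented for illustration); it also shows why such labels end up noisy.

```python
# Toy version of rule-based multi-label annotation from page content.
# Patterns and label names are illustrative only.
import re

RULES = {
    "sports":  re.compile(r"\b(football|olympic|tournament)\b", re.I),
    "finance": re.compile(r"\b(stock|bank|inflation)\b", re.I),
}

def weak_labels(page_text):
    """Return the set of labels whose pattern matches the page content."""
    return {label for label, pattern in RULES.items() if pattern.search(page_text)}

# The model input would be the page title; the rule-matched labels become the (noisy) targets.
print(weak_labels("The central bank raised rates; football fans protested inflation."))
# Both labels fire here, even though one mention is only incidental, hence the label noise.
```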
Hello, I am trying to get RNA-Seq done for a lot of samples and found Novogene, whose prices are pretty cheap (almost 1/3 of our university core's). Does anyone have any experience with this company regarding data quality or reliability?
Please let me know,
Thank you,
Good-quality data is essential in clinical practice and research activities. There are various attempts to define data quality, which are heterogeneous and domain-specific. I am looking for current, published data quality evaluation frameworks specific to data from electronic health records.
To perform data quality assessment in the data pre-processing phase (in a Big Data context), should data profiling be performed before data sampling (i.e., on the whole data set), or is it acceptable to profile only a subset of the data?
If we consider the second approach, how is sampling done without having information about the data (even some level of profiling)?
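One pragmatic compromise (an assumption on my part, not an established rule) is to profile a random sample first and then check how stable that profile is against the full data or a second independent sample. A minimal pandas sketch, with the profile metrics chosen for illustration:

```python
# Minimal sketch: compare a basic profile computed on a sample vs. the full data set.
import pandas as pd

def basic_profile(df):
    return pd.DataFrame({
        "null_frac": df.isna().mean(),
        "n_unique":  df.nunique(),
        "dtype":     df.dtypes.astype(str),
    })

def profile_stability(df, frac=0.01, seed=0):
    sample = df.sample(frac=frac, random_state=seed)
    full_p, samp_p = basic_profile(df), basic_profile(sample)
    # Large gaps between the two suggest the sample is not representative enough.
    return (full_p["null_frac"] - samp_p["null_frac"]).abs().sort_values(ascending=False)

# Usage: profile_stability(big_frame, frac=0.01)
```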
Given a data set, I want to identify the faulty records/data-point values in the original data and then try to rectify them. The data set is a mixture of numerical and categorical variables (200 variables in total) with 2 million records.
I have tried frequent pattern mining to achieve this, which gives rules for the variables and values in the data (it works well but takes time).
I wanted to understand whether something similar can be achieved with deep learning, ideally with some additional insight.
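One deep-learning option that is sometimes used for this kind of problem, offered here as a suggestion rather than as your existing approach, is an autoencoder scored by reconstruction error: records that reconstruct poorly are flagged as candidates for faulty values. A minimal Keras/scikit-learn sketch, where the column lists, layer sizes, and threshold are placeholders; the sparse_output argument assumes scikit-learn 1.2 or newer (older versions use sparse).

```python
# Sketch: flag suspect records in mixed numeric/categorical data via autoencoder
# reconstruction error. Column names, layer sizes, and the threshold are placeholders.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from tensorflow import keras

def flag_suspect_records(df, numeric_cols, categorical_cols, quantile=0.99):
    pre = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_cols),
    ])
    X = pre.fit_transform(df)

    dim = X.shape[1]
    ae = keras.Sequential([
        keras.Input(shape=(dim,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(16, activation="relu"),   # bottleneck
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(dim),
    ])
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(X, X, epochs=10, batch_size=256, verbose=0)

    errors = np.mean((X - ae.predict(X, verbose=0)) ** 2, axis=1)
    return df.index[errors > np.quantile(errors, quantile)]   # indices of suspect records

# Usage (hypothetical columns): suspects = flag_suspect_records(data, ["age", "amount"], ["country", "status"])
```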
We often describe the 3 Bs of data quality issues, namely Broken, Bad, and Background. "Broken data" means data are collected at different times by different people; sometimes the data history has missing data sets. "Bad data" means the data have outliers, which might be caused by noise, a wrong collection setup, degraded or broken sensors, etc. "Background of data" means the collected data lack working-environment information; for example, with jet engine data that omit weather, wind speed, and air density, it will be difficult to analyze fuel consumption issues. We also need a closed-loop data system that allows users to quickly assess whether the data are useful and usable. If not, users can further improve the data collection system to avoid collecting useless data.
Data quality impacts the accuracy and meaningfulness of machine learning for industrial applications. Many industry practitioners do not know how to find or use the right data. There are two levels of data: visible vs. invisible. Most of the visible data come from problem areas or are based on our experience. General questions for the visible data are: first, how do we find the useful data? Second, how do we evaluate which data are usable? Third, which data are most critical? As for the invisible data, very often people use an ad hoc or trial-and-error approach to find them, but the work often cannot be reproduced by others. We need a systematic approach to address data quality issues in AI for industrial applications. We welcome people to share their research experiences and thoughts.
Dear community,
I would be very grateful if someone could advise me on a package in R / RStudio to analyse the ddCt values from a qPCR. Ideally, the package would have some example data and tools for the whole pipeline of analysis (import of data, quality control, analysis, and visualization).
I checked a few but was not very satisfied so far. Thanks for the help!
I am comparing datasets of different quality (uncertainty) by assessing model performance of species distribution models.
Would it be correct to use cross-validation in this case?
Since training and testing data contain the same level of uncertainty, I would expect model performance to be inflated in case of bad quality, and I doubt that the difference in model performance between two different datasets will represent the difference in data quality correctly, unless both are validated with the same validation set.
I am aware that validation with external structured data is always the best option in this case, but suppose that this is not available.
Kind regards,
Camille
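Your own reasoning, i.e., scoring both models on the same evaluation set, can be sketched directly. Below is a minimal scikit-learn example with a placeholder classifier and metric rather than SDM-specific tools; it only illustrates the shared-evaluation-set idea.

```python
# Sketch: train one model per training dataset and score both on the same held-out
# evaluation set; the classifier and metric are placeholders, not SDM-specific tools.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def compare_training_datasets(datasets, X_eval, y_eval, seed=0):
    """datasets: dict name -> (X_train, y_train). Returns each model's AUC on the shared eval set."""
    scores = {}
    for name, (X, y) in datasets.items():
        model = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
        scores[name] = roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])
    return scores

# Usage: compare_training_datasets({"high_quality": (Xa, ya), "low_quality": (Xb, yb)},
#                                  X_shared_eval, y_shared_eval)
```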
Dear all,
I would like to invite you to participate in a research project, about the development of a data governance framework for smart cities. The aim of this research project is to analyse the data requirements of smart cities and to propose a suitable data governance framework that will facilitate decision making, improve operational efficiency as well as ensure data quality and regulatory compliance. Once completed, the framework will be applied to NEOM (www.NEOM.com), a greenfield project involving the design and build of a mega smart city from the ground up, where cutting edge technology will form the backbone of the city’s infrastructure.
To participate in the survey, please click on the following link.
Your support and input will be very much appreciated.
Yours sincerely,
Topé
At our core, we are concerned with data quality as well as the quality of our statistical analyses.
- What kind of processes have you implemented, just for yourself or within a research team, to check for transparency and reproducibility of your results?
- Have you any formal or informal peer-reviewing experience with regard to your statistical analysis?
- What kind of info is mandatory in your syntax files?
Thanks!
Hi, we are taking some DLS measurements in a Zetasizer Nano ZS for starch nanoparticle size. We know that we could measure in the Nano ZS90, but we don't have it. We have seen some works in the literature that claim to have measured starch nanoparticles with the same Zetasizer. We know that it is difficult to measure starch nanoparticles due to the dry nature of the samples and the difficulty of dispersing them in water. The main problem is that there is aggregation in the sample and the data quality is poor.
Is there any suggestion for sample preparation, or an adjustment that should be made in the Zetasizer that we are unaware of?
We are also aware that the particle size should be confirmed with an SEM measurement.
I have got a slag sample that contains a high amount of iron. I conducted XRD with a Cu tube on the powdered sample and got a very high background (because of iron fluorescence). Then, to improve the data quality, I had it done with a Co tube, but the signal-to-noise ratio is very low. I have attached the .raw files and screenshots of the data from both the Cu and Co tubes. I know of only one option, i.e., increasing the time per step. Are there any other settings that can improve the signal-to-noise ratio? Please let me know if any additional information is required.
While developing a questionnaire to measure several personality traits in a somewhat unconventional way, I now seem to be facing a dilemma due to the size of my item pool. The questionnaire contains 240 items, theoretically deduced from 24 scales. Although 240 items isn't a "large item pool" per se, the processing time averages ~25 seconds per item. This yields an overall processing time of more than 1.5 hours, way too much even for the bravest participants!
In short, this results in a presumably common dilemma: which aspects of the data from my item analysis sample do I have to jeopardize?
- Splitting the questionnaire into parallel tests will reduce processing time, but hinder factor analyses.
- Splitting the questionnaire into within-subject parallel tests over time will require unfeasible sample sizes due to a) drop-out rates and b) eventual noise generated by possibly low stability over time.
- An average processing time of over 30 minutes will tire participants and jeopardize data quality in general.
- Randomizing the item order and tolerating the >1.5 hours of processing time will again require an unfeasible sample size, due to lower item-intercorrelations.
I'm aware that this probably has to be tackled by conducting multiple studies, but that doesn't solve most of the described problems.
This must be a very common practical obstacle and I am curious to know how other social scientists tackle it. Maybe there even is some best-practice advice?
Many thanks!
Data quality is crucial in data science projects. How can we improve data quality before the data go into analysis?
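As a minimal illustration, here is a pandas sketch of a few routine pre-analysis checks (duplicates, missing key values, impossible ranges); the column names and valid ranges are hypothetical and would come from your own domain knowledge.

```python
# Illustrative pre-analysis cleaning steps; column names and valid ranges are hypothetical.
import pandas as pd

def basic_clean(df):
    report = {"rows_in": len(df)}
    df = df.drop_duplicates()                               # exact duplicate rows
    df = df.dropna(subset=["id"])                           # records missing the key field
    df["age"] = pd.to_numeric(df["age"], errors="coerce")   # coerce bad entries to NaN
    df = df[df["age"].between(0, 120)]                      # domain range check
    report["rows_out"] = len(df)
    return df, report

# Usage: clean, report = basic_clean(raw)
```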
I'd like you to participate in a simple experiment. I'd like to limit this to simple linear and multiple linear regression with continuous data. Do you have such a regression application with real data? Some cases with several independent variables might be helpful here. If you have done a statistical test for heteroscedasticity, please ignore that, regardless of result. We are looking at a more measurable degree of heteroscedasticity for this experiment.
To be sure we have reasonably linear regression, the usual graphical residual analysis would show no pattern, except that heteroscedasticity may already show with e on the y-axis, and the fitted value on the x-axis, in a scatterplot.
///////////////////////////////
Here is the experiment (a short code sketch of both steps is appended at the end of this post):
1) Please make a scatterplot with the absolute values of the estimated residuals, the |e|, on the y-axis, and the corresponding fitted value (that is, the predicted y value, say y*), in each case, on the x-axis. Then please run an OLS regression through those points. (In Excel, you could use a "trend line.") Is the slope positive?
A zero slope indicates homoscedasticity for the original regression, but for one example, this would not really tell us anything. If there are many examples, results would be more meaningful.
2) If you did a second scatterplot, and in each case put the absolute value of the estimated residual, divided by the square root of the fitted value, that is |e|/sqrt(y*), on the y-axis, and still have the fitted value, y*, on the x-axis, then a trend line through those points with a positive slope would indicate a coefficient of heteroscedasticity, gamma, of more than 0.5, where we have y = y* + e, and e = (random factor of estimated residual)(y*^gamma). Is the slope of this trend line positive in your case?
If so then we have estimated gamma > 0.5. (Note, as a point of reference, that we have gamma = 0.5 in the case of the classical ratio estimator.)
I'd also like to know your original equation please, what the dependent and independent variables represent, the sample size, whether or not there were substantial data quality issues, and though it is a very tenuous measure, the r- or adjusted R-square values for the original linear or multiple linear regression equation, respectively. I want to see whether or not a pattern I expect will emerge.
If some people will participate in this experiment, then we can discuss the results here.
Real data only please.
Thank you!
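For anyone who wants to try this quickly, here is a minimal Python/statsmodels sketch of steps 1 and 2, assuming your y and X are already loaded and that the fitted values are positive (so that sqrt(y*) is defined, as in the ratio-estimator setting):

```python
# Sketch of the two trend-line checks described in steps 1) and 2) above, using OLS.
# Assumes y and X are loaded and the fitted values y* are positive.
import numpy as np
import statsmodels.api as sm

def trend_slope(yv, xv):
    """Slope of an OLS trend line of yv on xv (with intercept)."""
    res = sm.OLS(yv, sm.add_constant(np.asarray(xv))).fit()
    return float(np.asarray(res.params)[1])

def heteroscedasticity_checks(y, X):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    y_star = np.asarray(fit.fittedvalues)
    abs_e = np.abs(np.asarray(fit.resid))
    slope1 = trend_slope(abs_e, y_star)                     # step 1: |e| vs y*
    slope2 = trend_slope(abs_e / np.sqrt(y_star), y_star)   # step 2: |e|/sqrt(y*) vs y*
    return slope1, slope2

# Usage: s1, s2 = heteroscedasticity_checks(y, X)
# s1 > 0 suggests heteroscedasticity; s2 > 0 suggests gamma > 0.5.
```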
Dear colleagues,
I've been working on the qualities of political participation for a while (theoretically as well as empirically).
In one of our latest projects we found that political participation can have negative impacts on trust in institutions and self-efficacy under certain circumstances. Does anyone have similar data or studies? Especially interesting would be data from countries in transition or young democracies. Thank you.
The study by Josko and Ferreira (2017) explained a use for Data Visualization (DV) in quality assessment paradigms, which they call a "Data Quality Assessment process" (DQAp):
- They highlight that the problem with using DV in this manner is not in the value of what it can provide visually, but in the complexity and knowledge required.
- They indicate the need for DV tools to be contextually aware of what is considered "Quality" vs. "Defect", therefore requiring such methods to be constructed based on specific requirements, which will not be possible for all sources of data.
What are your thoughts on the use of Data Visualization tools as part of a DQAp? Let's discuss!
I need to know the required quality for different kinds of water demand in a building, like drinking, showering, etc. A quality index for domestic water and the required amount for every application would be very helpful. What index should I search for, and is there any reference for that?
I wonder if anyone can help with references on the health status of professional tea tasters, given that fluoride is absorbed directly through the oral cavity?
I am looking for data on the quality of their teeth and any oral cancer, plus wider potential impacts on the rest of the tasters' bodies.
Is the development of electronic "tongues" related to health issues as well as seeking objectivity?
Dear colleagues,
I'm currently researching supplier sustainability assessment and am looking for any literature on data/response validation.
Companies (but also individuals) are often asked to report on sustainability practices and progress. These responses are subjected to varying degrees of validation, ranging from no further checks to complete validation of every response.
Do you have some recommendations (also from non-SSCM channels) or examples of cases / findings where the respondents' data quality was improved through validation?
Many thanks in advance!
Iain Fraser
Hello,
I've completed sequencing my library and am starting my analyses. I'm new to using mothur, and I request assistance with evaluating my data's quality. I've included a subset from one of my samples containing the output fasta, fastq, and qual files. The subset contains the same three individual organisms in all files.
I understand the fasta files contain the DNA sequences for later identification. It's the output in the fastq and qual files that is giving me the most confusion. For example, what do the numbers in the qual file signify?
Please help me understand what the fastq and qual files are saying about the fasta file.
Thank you,
Gary
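The numbers in the qual file are Phred quality scores (Q = -10·log10 of the estimated base-call error probability), and the fastq file carries the same scores encoded as single characters on the fourth line of each record. Here is a small Python sketch that decodes them, assuming the standard Phred+33 (Sanger/Illumina 1.8+) offset; the file name is a placeholder.

```python
# Decode fastq quality strings into Phred scores (the numbers seen in a .qual file).
# Assumes the common Phred+33 offset; older Illumina data may use Phred+64.

def read_fastq(path):
    """Yield (header, sequence, phred_scores) for each 4-line fastq record."""
    with open(path) as fh:
        while True:
            header = fh.readline().strip()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()                               # the '+' separator line
            qual_chars = fh.readline().strip()
            phred = [ord(c) - 33 for c in qual_chars]   # e.g. 'I' -> 40, '#' -> 2
            yield header, seq, phred

for header, seq, phred in read_fastq("your_sample.fastq"):   # placeholder file name
    # Higher scores mean lower error probability: Q30 is about 1 error in 1000 calls.
    print(header, "mean Q =", sum(phred) / len(phred))
    break
```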
Hi everyone,
I am pretty new to SEM and I am seeking clarification regarding model fit in AMOS.
While performing the analysis I obtained good model fit indices except for RMSEA and CMIN/DF.
For a further comprehension of the model: I used Maximum Likelihood Estimation; the model has 4 exogenous (C, D, E, F) and 2 endogenous (A, B) variables; the total number of items (responses provided on a Likert scale) in the questionnaire is 36; the item scores were averaged for each variable.
These are other details:
N= 346
CMIN/DF=6.9
NFI=0.98
CFI=0.98
SRMR=0.04
GFI=0.99
RMSEA=0.13
The standardised parameters for the model are:
A <-- C* 0.11
A <-- D 0.08
A <-- E*** 0.36
A <-- F*** 0.22
B <-- A*** 0.46
B <-- C 0.05
B <-- F 0.09
Significance of Correlations:
*** p < 0.001
** p < 0.010
* p < 0.050
The model explains 28% of the variance in B and 38% in A.
Can anyone please suggest how to overcome this problem of an inadequate (poor) RMSEA value?
Can it be a problem related to data quality? I also read in Kenny et al. (2015) that in the case of low d.f. (in my case d.f. = 2) RMSEA is more likely to be over 0.05.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research, 44(3), 486-507.
I am trying to develop a machine learning algorithm for data quality. The goal is: given a data set, I should be able to identify the "bad" records in it.
I tried a one-class SVM, but I am a little confused. I assume that I need to train the model with "good" data only, but then I would need to know in advance which records are good, and that is not the goal.
If I train the model with both good and bad classes of records, the one-class SVM does not give good accuracy. Is this a good approach?
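For reference, here is a minimal scikit-learn sketch of the usual one-class SVM workflow: fit on data assumed to be mostly good, then score everything, with nu roughly capping the fraction flagged. Whether this fits a setting with no known-good subset is exactly the open question, so treat it only as an illustration.

```python
# Minimal one-class SVM sketch: train on (mostly) good records, flag the rest.
# nu and gamma are illustrative; features are scaled first, as SVMs are scale-sensitive.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

def fit_and_flag(X_train_mostly_good, X_all, nu=0.05):
    model = make_pipeline(StandardScaler(),
                          OneClassSVM(kernel="rbf", gamma="scale", nu=nu))
    model.fit(X_train_mostly_good)           # unsupervised: no labels used
    pred = model.predict(X_all)              # +1 = inlier ("good"), -1 = outlier ("bad")
    return pred == -1                        # boolean mask of suspected bad records

# Usage: bad_mask = fit_and_flag(X_reference, X_full)
```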
I have heard people suggest tracking the dominant eye when doing single-eye eye-tracking, because the "dominant eye will give more accurate data". However, in normal subjects the fixation locations of the two eyes should be quite consistent. Is data quality significantly different between dominant-eye tracking and non-dominant-eye tracking?
Hello all,
I want to create a composite index that will measure the well-being of EU countries. My data set is a panel (15 years, 28 countries, and 15 indicators). I want to create the index based on the principal components technique, but I am not sure what I should do beforehand.
What I was thinking of applying is (a code sketch of the PCA step follows at the end of this post):
-> removing outliers: I do not think it is necessary to remove them for this analysis. Is this right?
-> check the seasonality of the data
-> check the stationarity of the data with a unit root test: some variables need to be standardized. Should I standardize them by taking logs? If I use the differencing technique, I will lose some rows.
One note here: SPSS has the option to standardize variables, but when I take them into EViews to check the unit root test, I see that they still need to be standardized (my guess is that the SPSS standardization means normalization).
-> check Cronbach's alpha values (they should be over 0.6 to continue applying PCA)
-> PCA
Thank you!
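For the PCA step mentioned in the list above, here is a minimal, software-agnostic illustration in Python/scikit-learn; the data layout is assumed to be country-year rows by indicator columns, and taking the first component as the index is only one common convention.

```python
# Sketch: z-standardize the indicators, run PCA, and take first-component scores
# as a provisional composite index. Rows = country-year observations, columns = indicators.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def composite_index(X):
    Z = StandardScaler().fit_transform(X)        # full standardization, not just normalization
    pca = PCA()
    scores = pca.fit_transform(Z)
    print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
    return scores[:, 0]                          # first principal component as the index

# Usage: index = composite_index(panel_matrix)   # panel_matrix shape: (28*15, 15) in your case
```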
Hi! Can anyone share some practical experience with bacterial WGS on the HiSeq 2500 (average depth of 200X coverage)?
I would like to know mainly: 1. data quality, 2. average read length, 3. error rate, 4. advantages over MiSeq.
Thanks.
I am looking for data analysis tools that can be added to a new database (a legacy-system migration, without the database from the legacy system) and that take structured data (pre-determined format, assumed correct) as input.
How can I check climatic data quality aside from comparison with other stations' data or checking for outliers?
We have a wearable eye-tracker in our lab (SMI ETG-2w), and I hope to start some experiments using it. I notice that many desktop eye-trackers provide a sampling rate of 500-1000 Hz and allow 9-point calibration before the experiment, whereas ours only has a sampling rate of 60 Hz (though it can be upgraded to 120 Hz if we pay some extra money) and only allows 3-point calibration.
We hope to do some serious neuroscience & psychophysics experiments, analyzing saccades, fixations, and pupil size, and the subjects will sit in a lab. No outdoor experiments are currently planned. Now I have some doubt about whether our eye-tracker can provide enough precision & accuracy, as in pilot runs, when we show dots at random locations on the screen and let our subjects fixate on them, our eye-tracker could not reliably give the correct fixation locations for some dots.
Do wearable eye-trackers always provide worse data than desktop eye-trackers? I hope someone with experience of both kinds of eye-trackers can help me know what level of data quality I can expect at best.
I'm new to NGS analysis. There are many QC (quality control) software tools for NGS data; which one is best? I'm using https://usegalaxy.org, so it would be better if the tool were available on this web server. Or is there a better QC program to run on my laptop? Is the computation too heavy for a laptop (Core i3 CPU and 4 GB RAM)?
I've used REDCap for several years, but we're initiating data collection on a new project that includes 10 parent survey instruments and 13 forms into which research assistants are entering standardized assessment data. I'm trying to figure out whether there is a way to restrict double data entry to just the assessment data (not the surveys), and whether there are recommendations about user permissions / data access groups. Any advice from REDCap users with experience with double entry would be greatly appreciated!
Fellow researchers,
I am looking for an article (review?) that investigates if quick/timely/near real-time reporting of health care quality data for benchmarking / quality improvement purposes results in higher acceptance of such data by providers.
I want to make the argument that if you want to use such data for quality management, you need to have it quickly, i.e., not wait months or years before reports are published. Reviews that I found emphasize all sorts of contextual factors for the success of quality improvement interventions in healthcare, but not timeliness of reporting. Can anybody help? Many thanks!
The basic concern is whether or not the existence of a regulated environment improves data quality and reproducibility.
Which machine learning techniques are used for SAR data quality assessment?
Background: I'm using an online survey company to launch a questionnaire nationally. The survey is to identify factors affecting online shopping intention, so there will be a lot of scale questions, as well as respondents' demographic information. I'm asking the company to get 50 responses before carrying out the full survey.
Question: Would you suggest some quick stat tests using the pilot data, to check if the responses are of high quality? (whether the profile is diversified / whether they answered the questions carefully / etc)
Many thanks for your help!
Li
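A few quick, purely descriptive screens can be scripted; here is a pandas sketch with hypothetical column names and judgment-call thresholds, covering straight-lining on the scale items, unusually fast completion times, and the spread of one demographic variable.

```python
# Quick pilot-data screens; column names ("item_1".."item_20", "duration_sec", "age_group")
# are hypothetical and the thresholds are judgment calls, not formal tests.
import pandas as pd

def pilot_checks(df, scale_items, duration_col="duration_sec", demo_col="age_group"):
    checks = {}
    # Straight-lining: identical answers across all scale items (zero variance per row).
    checks["straight_liners"] = int((df[scale_items].std(axis=1) == 0).sum())
    # Speeders: completion time far below the median.
    median_t = df[duration_col].median()
    checks["speeders"] = int((df[duration_col] < 0.4 * median_t).sum())
    # Profile diversity: distribution of a demographic variable.
    checks["demo_distribution"] = df[demo_col].value_counts(normalize=True).round(2).to_dict()
    return checks

# Usage: pilot_checks(pilot_df, [f"item_{i}" for i in range(1, 21)])
```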
I am looking for information on existing research on methodology and tools intended to collect data samples representing instances of a novel, ill-structured subject domain. This task is, in some sense, the inverse of Big Data processing: I have to collect, as far as possible, data instances of a novel domain with which I am not well familiar and for which I am able to formulate only a basic set of ontology concepts. I need this data set (let it be Big Data of any type) in order to afterwards design an empirical model of the dependencies peculiar to the domain attributes, using a Big Data processing technology (in fact, the machine learning technology I have) to mine the collected data set. This is a kind of ontology-driven data search and collection problem. Could someone help me?
I have a particle sample which will aggregate to form bigger, micrometer-sized clusters as I heat it up. I want to use dynamic light scattering (DLS) to determine the size of the aggregates during the heating process. However, given the polydispersity of the aggregates, the DLS instrument always tells me that the "data quality is too poor because the sample is too polydisperse". I am wondering whether the Z-average size I get is still reliable. Can I rely on the size distribution from the instrument, or do I need to use other methods to derive the size distribution of the polydisperse aggregates? Thanks very much!
My data will be the quality audit reports of 45 HEIs in the Sultanate of Oman.
How can I validate/confirm the proposed model/framework?
Thanks in advance.
I am looking to create a system that proactively monitors data stores for data quality, where the data quality rules are expressed in a language understandable by a lay person (e.g., Attempto). The language constructs could then be parsed into the equivalent data-store-native language (e.g., SQL, NoSQL, etc.).
If you know of any research in this area, I would be most appreciative.
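As a toy illustration of the parsing idea described above, here is a Python sketch that maps one controlled-English rule template onto a SQL probe. A real controlled language such as Attempto ACE would need a proper parser, and the table, column, and rule wording here are invented, so treat this purely as a sketch of the template-matching idea.

```python
# Toy sketch: translate one lay-readable rule template into a SQL data-quality probe.
# Real controlled languages need a proper parser; this is only a template match.
import re

RULE_PATTERN = re.compile(r"every (\w+) in (\w+) must not be empty", re.I)

def rule_to_sql(rule_text):
    m = RULE_PATTERN.match(rule_text.strip())
    if not m:
        raise ValueError(f"unsupported rule: {rule_text!r}")
    column, table = m.group(1), m.group(2)
    # Count violations: rows where the required column is NULL or blank.
    return (f"SELECT COUNT(*) AS violations FROM {table} "
            f"WHERE {column} IS NULL OR {column} = ''")

print(rule_to_sql("Every email in customers must not be empty"))
# -> SELECT COUNT(*) AS violations FROM customers WHERE email IS NULL OR email = ''
```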
We assume that a composite service consists of a number of atomic services.
Given the quality of each atomic service, how do we measure the overall quality of the composite service?
For example, we can define that the higher the mean value, the higher the overall quality, and the lower the variance, the higher the overall quality. However, the question is how to aggregate the mean value and the variance. Are there any metrics?
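One simple aggregation, offered as an illustration rather than an established metric, is a penalized score such as the mean minus a multiple of the standard deviation, which rewards a high mean and a low variance in a single number; the weight k is a design choice.

```python
# Sketch: collapse per-atomic-service quality observations into one composite score that
# rewards a high mean and penalizes high variance. The weight k is a design choice.
import numpy as np

def composite_quality(per_service_samples, k=1.0):
    """per_service_samples: list of arrays, one array of quality observations per atomic service."""
    all_q = np.concatenate([np.asarray(s, dtype=float) for s in per_service_samples])
    return float(np.mean(all_q) - k * np.std(all_q))    # higher is better

print(composite_quality([[0.90, 0.95, 0.92], [0.70, 0.72]], k=1.0))
```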
I am familiar with Actor-Network Theory (ANT); I have applied ANT as a tool to construct network collaboration, and I now want to use ANT with data quality as a metric to choose the actor network.
Thank you in advance
Salim
In a situation where it is not feasible to quantify a DQ dimension because no data set is involved, does it suffice to show subjective measures obtained directly from the customer?
To be clearer: a customer's satisfaction that the desired requirement has been met cannot be objectively measured. In this case, is customer satisfaction feedback sufficient for successful approval of the project when presenting to a committee?
Specifically, I'm interested in how to use ANT in the governance of information systems; I also want to use ANT together with data quality.
I have over 5 years of experience developing predictive models, along with years of experience as a researcher, statistical analyst, and data scientist. One thing I have observed within the big data sphere as well as the predictive modeling landscape is that a lot of emphasis is placed on data quality and quantity, the experience and expertise of the modeler, and the kind of system used to build, validate, and test the model and to monitor and assess its quality and performance over time. With this said, I would like to hear what others here on ResearchGate think are some of the challenging tasks in building statistical or predictive models, and what strategies you employed to address those challenges. What were some of the trade-offs you had to make, and how would you approach a similar situation in the future?
Information provided to this inquiry will be used for personal and professional growth.
The aim of data fusion (DF) is basically to increase the quality of data through the combination and integration of multiple data sources/sensors. My research is on assessing this impact of DF on data quality (DQ), hence I would appreciate academic materials to back up your conclusions.
I have been trying to link DF methods to the DQ dimensions that are most impacted, to no avail.
During data entry, we can correct errors before mining by creating forms through question ordering during question reformulation.
We are doing surveillance in a scattered geographical area. How can we assess data quality? Should we compute error rates or directly observe how data is being collected?
We intend to assess the quality of data. Can we use the LQAS strategy for assessment of data quality?
I would love some pointers to existing work/papers on sparsity in large data sets. What I am looking for are not the (important) usual papers on how to compute given sparsity in large data sets; I am instead thinking about how one might use the fact of sparsity and a non-homogeneous distribution of features and relationships to characterize overall solution spaces/data set spaces into regions of greater interest and less interest.
If yes, could you give me an example? In my opinion, the maturity models for MDM/DQ are in most cases unspecific/generic across all business branches.
Thanks and best regards
M. Gietz