# Data Curation - Science topic

Questions related to Data Curation
• asked a question related to Data Curation
Question
Specifically, I would prefer the index from Databanks International's Cross-National Time-Series (CNTS) Data Archive.
Thank you!
• asked a question related to Data Curation
Question
For a binary stellar system, does the stellar mass column of the NASA Exoplanet Archive data mean the average mass of both host stars, or something else? If not, how can I find the mass and other parameters for both stars?
Where there is data, there is usually some spreadsheet or paper that explains it (I speak from experience: I once spent hours looking for such a paper, but it did exist).
Now, without having looked at that documentation, I would guess it is the *sum of the masses*: the mass of a single isolated star is hard to measure directly, but the sum of the gravitating masses in a binary can be measured via Kepler's third law (T² ∝ d³, where the constant of proportionality contains the sum of the masses).
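As a rough illustration of the Kepler argument (it says nothing about what the archive column actually contains), Kepler's third law gives the total mass of a binary directly when convenient units are used. The sketch below assumes the orbital period is given in years and the semi-major axis of the relative orbit in AU, so that the sum of the masses comes out in solar masses; the function name is made up for this example.

```python
def total_mass_solar(period_years: float, semi_major_axis_au: float) -> float:
    """Sum of the two masses (in solar masses) from Kepler's third law.

    In units of years, AU and solar masses the law reads
    P^2 = a^3 / (M1 + M2), so M1 + M2 = a^3 / P^2.
    """
    if period_years <= 0 or semi_major_axis_au <= 0:
        raise ValueError("period and semi-major axis must be positive")
    return semi_major_axis_au ** 3 / period_years ** 2


# Sanity check: a 1 AU orbit with a 1-year period implies about 1 solar mass in total.
earth_sun = total_mass_solar(1.0, 1.0)
```

Note that the orbit alone constrains only the sum; splitting it into two individual masses needs extra information such as the mass ratio.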
Hope that this helps
• asked a question related to Data Curation
Question
I am currently doing my dissertation and I am trying to check the reliability of sub-scales I have newly created. The items come from multiple measures, as my data is from a data archive (so they were collected in different studies). I have re-coded them all onto the same Likert scale but keep getting the error message "too few cases analysis not run". All missing values are coded the same (999-888) but, from the help online, I think it is because the missing values are scattered over numerous analysis variables. Is there any way round this? Any advice would be appreciated.
Coefficient alpha is based on correlations, so you may be running into a problem with missing data, because only "complete" cases (those with valid values on every item) are used. You should also check whether your problem is structural, in the sense that you are including items from two (or more) different surveys in the same analysis: it is impossible to calculate correlations between variables drawn from different data sets, since no respondent has values on both.
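To make the "complete cases" point concrete, here is a minimal sketch of coefficient alpha with listwise deletion. It assumes that 888 and 999 are the missing-value codes (one reading of the "999-888" in the question) and that each row is one respondent; the helper names are invented for the example and this is not the routine your statistics package uses internally.

```python
from statistics import variance

MISSING = {888, 999}  # assumed missing-value codes


def complete_cases(rows):
    """Keep only respondents with a valid answer on every item (listwise deletion)."""
    return [r for r in rows if all(v is not None and v not in MISSING for v in r)]


def cronbach_alpha(rows):
    """Coefficient alpha from complete cases; rows are respondents, columns are items."""
    rows = complete_cases(rows)
    k = len(rows[0]) if rows else 0
    if len(rows) < 2 or k < 2:
        raise ValueError("too few complete cases to compute alpha")
    item_vars = [variance(col) for col in zip(*rows)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Two items taken from two different studies: nobody answered both,
# so not a single complete case is left.
disjoint = [[4, 999], [5, 999], [999, 2], [999, 3]]
n_complete = len(complete_cases(disjoint))
```

Counting the complete cases per sub-scale before running the reliability analysis is a quick way to see whether the problem is scattered missingness or a structural one.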
• asked a question related to Data Curation
Question
I'm not really a "stats guy", but while working on my doctoral thesis and final paper of my PhD project, I became aware that statistical analysis cannot be performed on compositional data (such as tephra oxide concentrations) in a mathematically sound way due to the constant-sum constraint - yet the average tephrochronologist, when doing a PCA or similar, uses the raw or normalised compositional data anyway (at least in the majority of cases, as far as I've seen in my reading). I've learned that log-ratio transformations are "the" solution to the constant-sum constraint and have seen that this has indeed been done in at least a handful of tephra studies (and that it's becoming more routinely applied for e.g. XRF data), but there's not really any widespread acknowledging of this issue and even less any commonly accepted procedure as to the data curation with log-ratio transformations.
I commented on this in my thesis, and during my defence I was encouraged to publish something about this, aimed specifically at the tephra community. So, I've started working on this "something", but I need to hear some opinions on the specifics of the log-ratio transformation part of what could be a formal suggestion for a common tephra compositional data curation procedure. What I want to discuss is what kind of log-ratio transformation should be used if we were to agree upon one particular way of doing it - which is something that would be preferable in order to allow straight-forward comparisons of results between different studies, kind of in the same way that biplots of particular elements have become more or less routine to display in tephra papers.
A log-ratio transformation is basically lrt(x) = log10(x/_) where the divisor of the fraction (i.e. "_") can vary. I've seen single element oxides being used as the divisor, but without real motivation as to why that particular oxide was selected. In my thesis I argued for a centred lrt, where the divisor is the geometric mean of all oxides, and it makes sense to use something like that because it may be less ambiguous than just selecting one oxide without explaining why. However, it has the issue that no oxide concentration can be 0 (because then the divisor will be 0, and you can't divide by 0) and in many cases at least one oxide will be below EPMA detection levels and register as 0 - particularly P2O5 but also MnO and MgO in some analyses. Furthermore, P2O5 is not always reported (especially in older publications) so using a centered lrt would probably require omitting P2O5 (and then still not work with some few analyses where MnO or MgO were 0).
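For readers who want to see the centred variant spelled out, here is a minimal sketch. It follows the log10 convention used in this post (natural logarithms are also common), uses made-up oxide values, and simply refuses to run when any part is zero, which is exactly the limitation described above.

```python
import math


def clr(composition):
    """Centred log-ratio: log10 of each part divided by the geometric mean of all parts."""
    values = list(composition.values())
    if any(v <= 0 for v in values):
        raise ValueError("clr is undefined when any part is zero or negative")
    # log10 of the geometric mean equals the arithmetic mean of the log10 values
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return {name: math.log10(v) - mean_log for name, v in composition.items()}


# Made-up oxide concentrations (wt%), for illustration only.
shard = {"SiO2": 72.0, "Al2O3": 13.0, "FeO": 2.5, "CaO": 1.5, "Na2O": 4.0, "K2O": 3.0}
transformed = clr(shard)
```

A useful check is that the transformed values of one analysis always sum to zero.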
My main question then is, what log-ratio transformation would YOU suggest to be the norm, and why? Maybe using a single oxide is the better option, but if so, which one? To my understanding, which variation of the transformation is used doesn't really matter as long as the same one is used for all data. So let's just decide how to do it! I would really appreciate your input and suggestions. Thanks in advance!
PS. I'm not here to argue for or against the necessity of log-ratio transformations (mathematically speaking, it is required if statistical analyses are to be applied to EPMA data), although the floor is open and thoughts on any aspect of this issue are welcome. DM me if you're interested in more details about the paper I'm preparing, too. :)
For the higher-abundance, lower-variation oxide, I might suggest Al2O3. Lower-abundance oxides or elements generally have lower analytical precision, and their precision can vary greatly from lab to lab. Al2O3 also tends to be less affected by beam damage or glass alteration, both of which make the alkalis a poor choice.
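A single-oxide divisor as suggested here could look like the sketch below. The log10 convention is taken from the question; the function name and the choice to leave out parts that are zero or negative (rather than substitute a value) are assumptions made only for this illustration.

```python
import math


def alr(composition, divisor="Al2O3"):
    """Additive log-ratio: log10(part / divisor) for every part except the divisor."""
    d = composition[divisor]
    if d <= 0:
        raise ValueError("the divisor oxide must be positive")
    return {
        name: math.log10(v / d)
        for name, v in composition.items()
        if name != divisor and v > 0  # zero or negative parts are left out, not imputed
    }


# Made-up values (wt%): a zero P2O5 does not stop the other ratios from being computed.
ratios = alr({"SiO2": 130.0, "Al2O3": 13.0, "K2O": 1.3, "P2O5": 0.0})
```

Compared with the centred version, a zero in a minor oxide only costs that one ratio instead of invalidating the whole analysis.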
On the problem of zeros and unreported data, this is a substantial data reporting issue and a misuse of detection limits. When the glass composition of a tephra is analyzed, there are almost always multiple analyses, and these are used to produce population summary statistics like means and standard deviations. This and the inherent randomness in X-ray counting are important to keep in mind when considering what to do when specific analyzed values on specific individual shards fall near or below detection limits. Consider what would happen to the mean and standard deviation of a population of analyses if one threw out half of a set of analyzed values because they fell below the detection limit and kept only those above. The result would completely misrepresent the nature of the distribution. I argued in an open review at Geochronology that all of the analyzed values must be reported always, and even negative concentrations must be reported! Again, this is necessary to represent the nature of the statistical distribution, to not skew the mean toward higher concentrations, and to provide an appropriate representation of the precision. More explanation here: https://gchron.copernicus.org/preprints/gchron-2020-34/gchron-2020-34-RC3.pdf
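The effect of discarding below-detection values can be seen in a small simulation. All numbers here are hypothetical (true concentration, noise level and detection limit are chosen only to make the point), and normally distributed noise is an assumption standing in for X-ray counting statistics.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the illustration is repeatable

TRUE_CONCENTRATION = 0.05  # hypothetical true value (wt%)
NOISE_SD = 0.05            # hypothetical analytical noise (wt%)
DETECTION_LIMIT = 0.05     # hypothetical detection limit (wt%)

# Simulated analyses of many shards; noise can push individual values below zero.
analyses = [random.gauss(TRUE_CONCENTRATION, NOISE_SD) for _ in range(1000)]

# Mean of everything that was measured, including negative values.
mean_all = mean(analyses)

# Mean after throwing away every value below the detection limit.
mean_censored = mean([v for v in analyses if v >= DETECTION_LIMIT])
```

The mean of all values stays close to the true concentration, while the censored mean is necessarily pushed above it, because only the upper part of the distribution survives.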
• asked a question related to Data Curation
Question
Distinguished peers,
I've had some problems finding a public repository of COVID-19 data that 1) were always reliable, 2) were automatically up-to-date, 3) provided a .csv file (or similar) through a friendly interface. Do you know any you'd like to share?
Sebastián
New Model for the COVID-19 Reported Cases and Deaths of Ghana in Accelerated Spread and Prediction of the Delayed Phase
• asked a question related to Data Curation
Question
COVID-ARC is a data archive that stores multimodal (i.e., demographic information, clinical outcome reports, imaging scans) and longitudinal data related to COVID-19 and provides various statistical and analytic tools for researchers. This archive provides access to data along with user-friendly tools for researchers to perform analyses to better understand COVID-19 and encourage collaboration on this research.
Well done, you deserve it. Make it smooth and easy to read.
• asked a question related to Data Curation
Question
Hi all,
I need software to investigate a specific list of genes related to cancer. I want to investigate specific biological processes such as angiogenesis, hypoxia, or migration.
N.B. I am not a bioinformatician, so I just need easy-to-use software to get an idea about my genes.
Thank you for helping.
BioTuring Browser can be helpful. It is an interactive platform for accessing and reanalyzing published single-cell RNA sequencing data.
• asked a question related to Data Curation
Question
With Covid19, we have seen digital transformation happening so much faster: everyone able to do this was suddenly working from home, using the internet and the available digital access and platforms. Education was also put on hold unless it moved online: online-school, online-education got a tremendous stimulus.
Now, what is fueling the data economy, what is the "new green oil" of this digitally transformed world? It's DATA, and it's data fairly priced for the stakeholders, starting with identified data owners.
Please see here a link to some books on the subject, reviewing the rationale for data use in every aspect of life, business, markets, society, and looking at the creation of Data Market Places, as well as diving into the detailed equations of how to price and how to make it happen in economic terms, as Data Microeconomics:
Exactly. However, more planning is needed from now on. Thank you so much for your enlightening and innovative thoughts.
• asked a question related to Data Curation
Question
In your specific discipline, have you ever irretrievably lost data from an ongoing research project? How did you handle it? Thanks in advance.
(Also, this is my story. A couple of years ago, in a study that included the collection, preservation, identification and weighing of soil invertebrates, an unfortunate event in the laboratory led to the loss of the notebook containing the weight notes for one of 10 sets of collected organisms, which belonged to the control group; the preserved organisms were lost as well. I'm tagging this with an entomology label, so in case you're familiar with this topic: would you consider trying some method of reconstructing the weight data, or is there just nothing to be done? There is no way to recover the notebooks or the preserved organisms.)
Thank you.
I have not lost the data of any ongoing research project, because in parallel I work at home and do independent archiving. Experiments have been suspended, which is a waste of time.
• asked a question related to Data Curation
Question
We are working on a large number of building-related time series data sets that display various degrees of 'randomness'. When plotted, some display recognisable diurnal or seasonal patterns that can be correlated with building operational regimes or the weather (e.g. heating energy consumption with distinct signatures at hourly and seasonal intervals). However, some appear to be completely random (e.g. lift data, which contains a lot of random noise).
Does anyone know if an established method exists that can be deployed on these data sets and provide some 'quantification' or 'ranking' of how much stochasticity exists in each data set?
No, there is nothing precisely like that.
"Random" is what we cannot explain or predict (for whatever reason; it does not matter whether no explanation is possible or we are just not aware of one).
A statistical model uses some predictors (known to us, like the time of day, the weather conditions, the day of the year, etc.) and makes a prediction of the response (the energy consumption): the response value we should expect, given the corresponding values of the predictors. You can see the model as a mathematical formula of the predictor values. The formula contains parameters that make the model flexible and adjustable to observed data (think of the intercept and slope of a regression line, or the frequency and amplitude of a sinusoidal wave).
The deviations of the observations from these expected values are called residuals. They are not explained by the model and are thus considered "random". This randomness is handled mathematically by a probability distribution: we don't say that a particular residual will be this or that large; instead we give a probability distribution (more correctly, we give the probability distribution of the response, conditional on the predictors). Using this probability model allows us to find the probability of the observed data (what is called the likelihood) given any combination of chosen values of the model parameters. Usually, we "fit" these parameters to maximize this likelihood (giving the maximum likelihood estimates).
Thus, given a fitted model (on a given set of observations), we have a (maximized) likelihood, which depends on the data, on the functional model and on the probability model.
This can be used to compare different models. One might just see which of the models has the largest (maximized) likelihood. There are a few practical problems, because models with more parameters can reach higher likelihoods just because they are more flexible, not because they are more "correct". One tries to account for this by penalizing model flexibility. This leads to different information criteria (AIC, BIC, DIC and the like, which differ in the way the penalties are calculated).
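As a minimal sketch of how such a comparison could look in practice, the functions below compute the maximized log-likelihood under the assumption of independent, normally distributed residuals with a common variance, and from it the standard AIC and BIC formulas. The residual values and parameter counts are invented for illustration; they are not taken from any real building data.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood assuming independent normal residuals with common variance."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # maximum-likelihood variance estimate
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik


flat_residuals = [3.0, -2.5, 2.8, -3.1, 2.9, -3.0]      # hypothetical: constant-only model
seasonal_residuals = [0.4, -0.3, 0.2, -0.5, 0.3, -0.1]  # hypothetical: model with a daily cycle

# Parameter counts include the residual variance (assumed: 1 + 1 and 3 + 1).
aic_flat = aic(gaussian_log_likelihood(flat_residuals), n_params=2)
aic_seasonal = aic(gaussian_log_likelihood(seasonal_residuals), n_params=4)
```

Here the model with the daily cycle gets the lower AIC despite its larger penalty, because its residuals are so much smaller; a data set for which no structured model beats the constant-only model would be the "more random" one in this sense.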
So, after that long post: you may look at such information criteria to compare different models. The limitation remains that the models are all compared only on the data that was used to fit them, without any guarantee that they will behave similarly on new data. So if you have enough data, it might be wise to fit the models using only a subset of the available data and then check how well these models predict the rest of the data. It does not really matter how you quantify this; I would plot the prediction errors of the models side by side in a boxplot or a scatterplot.
• asked a question related to Data Curation
Question
Hi everyone, I'm looking for software that helps me clean data from bibliographic databases. I don't know how to program, so I really need a tool that is easy to use (maybe with a little bit of programming). I have used VantagePoint before, but I no longer have access to it :( Help me please
Hi Ana, you should try the bibliometrix package in R ( https://bibliometrix.org/ ). Even if you don't know anything about programming, there is a web-based application (biblioshiny). With this package, you can analyze whatever you want by dragging and dropping the items you want to investigate.
• asked a question related to Data Curation
Question
I am performing an experiment using force sensors and IMUs to gather readings of orientation from a certain motion performed by the human body.
The sensor does not know when that particular muscle is active or what it is doing. There are no video cameras to record the action. I have to design an experiment that syncs the sensor signals with the activity performed by the body. Suppose a person is standing and takes an object that has the IMU (inertial measurement unit), then sits down and stands up, and repeats this a couple of times: how do I identify from the IMU readings whether the person is sitting or standing at a particular instant in the output signal?
1. Is there a statistical way of conducting this?
2. Or is there a way in which I can make the person sit and take individual readings, then stand and take individual readings, then make the person sit and stand a couple of times and match the signal against the individual activity readings?
3. Or give a time gap of 5 seconds or so between each activity and identify it that way? (How sensitive to error is this approach?)
Note: this is not exactly what I am doing, but consider hypothetically that I can take readings individually for each activity alone.
I'm not sure what your aim is with these readings, but why don't you try pressure sensors?
• asked a question related to Data Curation
Question
I am currently working on collecting and curating data on how the various disciplines utilize games-based educational techniques in their curricula in higher ed. Do you have any experience, or know of any other experience, with utilizing games (digital, table-top, role-playing, simulation, or other) in your or a colleague's teaching? If so, how? Thanks!
This can be done by incorporating interesting stories that encourage learning through exploration and stimulate learners' motivation to do research. In addition, the games can contain scientific questions that must be answered in order to move from one stage to the next, so that the learner has to search for the answer to the question.
• asked a question related to Data Curation
Question
I have been recently studying proprietary voter files and data. While I know that voter files are for the most part public (paid or free), I am confused as to how companies match this data to other data.
For example, the voter files (public) never reveal who you voted for, or your behavioral attributes, and so on. So how do companies that sell this "enhanced" data match every "John Smith" to other data? How can they say that they have a profile on every voter? Wouldn't that require huge data collection? Or are there models that simply do that job for them?
Hi Melissa. I don'. Do you mean private opinion surveys ? They just ask who the interviewee voted for or who they intend to vote for. If I misunderstood your question and it doesn't have anything to do with my answer, I'm really sorry.
• asked a question related to Data Curation
Question
We had about 60 short stories rated on valence and arousal.
We suspect that a few scorers have not read the short stories and have marked the possible answers more randomly or according to a pattern. My goal now is to find these bad scorers and remove them from the record. I want to be very careful and leave the scorers in the dataset in case of doubt.
I have chosen the following two criteria for exclusion: 1. Deviation from a range of expected values. 2. Deviation from the expected distribution of all rating values.
Criterion 1: If, in more than 6 ratings, a scorer deviates by more than one standard deviation from the ratings averaged across all scorers, the ratings of that scorer will be removed from the dataset. Given the assumed probability per rating that a scorer randomly answers outside the expected range, the probability of this happening across 6 short stories is less than one percent.
Criterion 2: The distribution of all ratings across all scorers resulted in an equal distribution. If you allow each scorer an average deviation of 4 points from the distribution of the rating values of all scorers for each score, all scorers with an average deviation of more than 4 points will drop out of the data set.
If either or both of these criteria apply to a scorer, the scorer's ratings are removed from the record.
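Criterion 1 as described could be sketched as follows. The data layout (a mapping from scorer to one rating per story, all in the same story order) and the function name are assumptions for the example, and the story mean and standard deviation here include the scorer being checked.

```python
from statistics import mean, stdev


def flag_scorers(ratings, max_outlying=6):
    """Return scorers with more than `max_outlying` ratings over one SD from the story mean.

    `ratings` maps scorer -> list of ratings, one per story, in the same story order.
    """
    per_story = list(zip(*ratings.values()))
    centre = [mean(col) for col in per_story]
    spread = [stdev(col) for col in per_story]
    flagged = []
    for scorer, values in ratings.items():
        # Count the stories on which this scorer is more than one SD off.
        outlying = sum(abs(v - m) > s for v, m, s in zip(values, centre, spread))
        if outlying > max_outlying:
            flagged.append(scorer)
    return flagged


ratings = {
    "a": [5] * 8, "b": [5] * 8, "c": [5] * 8, "d": [5] * 8,
    "e": [1] * 8,  # hypothetical scorer who always ticks the lowest value
}
flagged = flag_scorers(ratings)
```

Running the rule on the real data with a few different thresholds is a cheap way to see how sensitive the exclusion list is to the choice of "more than 6".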
• Is this legitimate?
• Are there better practices?
Yours sincerely
Hello David
In fact, I compared the data with and without the 8 excluded cases.
The differences are rather small, and all correlations for valence and for arousal between the whole sample and the 'reduced' sample are >.99 for all rating groups.
Best regards
Egon
• asked a question related to Data Curation
Question
I have been working on a project to collate species occurrence data from unpublished student theses into an integrated database (currently published in GBIF), and I am still working on a systematic protocol for data validation. Expert review is quite subjective, and I have found many studies reporting that "expert" estimates were not always more consistent than those of amateurs, students, or even public enthusiasts (feel free to message me for the papers I have collected on this), so my team is still struggling to find a way forward. Our current method is simply to evaluate the scientific names independently against taxonomic checklists and to validate the geographic distribution of each species against the available published literature. We occasionally ask experts, but as we are working on many understudied taxa and geographical areas, there are not many around.
I suppose it all depends on your study species. For the most part, I think experts in most fields are able to identify the species they're most knowledgeable about with relatively high accuracy, given they have enough information in the photo and geographic location to do so.
Problems usually arise when someone gets a bit overzealous and identifies something to the species level given minimal information, going off an educated guess about the species most likely to be in the area.
It also depends on what the question for your study is. If you're doing an SDM for a species, you could always thin the records to about 100 and then self-verify (if you're confident in your abilities to do so). You could see if the species occurrence data has any corresponding NCBI molecular data and use DNA to verify species.
If you're using a dataset of 1000+ records (or some other number where it isn't feasible to self-verify each record) from iNaturalist, you could query the data for records with >3 verified ID agreements and no "maverick" or disputed IDs. The likelihood of obtaining false positives should decrease with user agreement on a species identification.
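The agreement filter described above could be applied to downloaded records along these lines. The field names (`agreements`, `disagreements`, `has_maverick_id`) are hypothetical placeholders, not the actual iNaturalist export columns, so they would need to be mapped to whatever your download contains.

```python
def reliable_records(records, min_agreements=4):
    """Keep records with more than 3 agreeing IDs and no disputed or maverick IDs.

    The field names are hypothetical, not the iNaturalist export schema.
    """
    return [
        r for r in records
        if r["agreements"] >= min_agreements
        and r["disagreements"] == 0
        and not r["has_maverick_id"]
    ]


# Made-up records for illustration.
records = [
    {"id": 1, "agreements": 5, "disagreements": 0, "has_maverick_id": False},
    {"id": 2, "agreements": 5, "disagreements": 1, "has_maverick_id": False},
    {"id": 3, "agreements": 2, "disagreements": 0, "has_maverick_id": False},
    {"id": 4, "agreements": 6, "disagreements": 0, "has_maverick_id": True},
]
kept_ids = [r["id"] for r in reliable_records(records)]
```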
• asked a question related to Data Curation
Question
I have several microarray gene expression datasets, but they use different naming systems for the probes. What are some simple ways to convert all the probes to the same type (e.g. Affymetrix)? Thanks.
Hey there,
Here you can easily convert different naming.
Hope this helps.
• asked a question related to Data Curation
Question
Dear colleague,
I am the leader of the research project B-DATA Big Data Assemblages: Techniques and Actors. B-DATA aims to study data assemblages inside research centres and data infrastructures which produce and use open data and Big Data. The three main case studies are: the Consortium of European Social Science Data Archives (CESSDA, Norway); the Italian Statistical Office (ISTAT, Italy); and the Web Science Institute of the University of Southampton (UK).
During my research I came across some of the topics that have been addressed by your project. I was wondering whether you would be interested in making a comparison with what is happening here in Italy at ISTAT. I have full access to this field.
This is just a first contact. If you are interested, we can get in touch to assess the feasibility of this.
I attach an open access article that shows some of the findings of B-DATA, which may help you to understand what I am researching.
I am looking forward to hearing from you.
Yours sincerely, Biagio Aragona.
Assistant Professor of Sociology
University of Naples Federico II
Yes, I hope to obtain a scholarship to study for a PhD.
• asked a question related to Data Curation
Question
I am trying to aggregate a large data set by two identification variables (i.e., id1 and id2). My aggregate command in R drops the entire row if it is missing a value in any of the columns. For instance, even though row #1 has been observed on all but one column (missing on only 1 of 20 columns), row #1 is dropped during the aggregation process. I was wondering if there is a way to aggregate (get means or sums of) the rows by the two identification variables without dropping an entire row for missing values in 1 or 2 columns.
Here is my current R code for aggregating:
dfagg1<- data.frame(aggregate(.~id1+id2, data=dfsub,
FUN=function(x) c(mn=mean(x), n=length(x), na.rm=TRUE)))
Thank you!
You can simply use the subset command to select the part of the full data set that meets certain conditions.
dfagg1 <- subset(dfsub, !is.na(id1) | !is.na(id2))
This creates a data set that excludes those rows for which id1 and id2 are both missing.
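Note that subsetting on the identifiers does not change how missing values in the other columns are treated. The idea the question asks for, averaging each column over whatever values are present within each (id1, id2) group, is sketched below in plain Python as a language-neutral illustration, with None standing in for NA and invented column names. In R, one commonly used route to the same result is to pass na.rm = TRUE to mean itself and na.action = na.pass to aggregate.

```python
from collections import defaultdict


def group_means(rows, keys=("id1", "id2")):
    """Mean of every non-key column per group, skipping None (missing) values."""
    collected = defaultdict(lambda: defaultdict(list))
    for row in rows:
        group = tuple(row[k] for k in keys)
        for column, value in row.items():
            if column not in keys and value is not None:
                collected[group][column].append(value)
    return {
        group: {col: sum(vals) / len(vals) for col, vals in columns.items()}
        for group, columns in collected.items()
    }


# The first row is missing y, but its x value still contributes to the group mean.
rows = [
    {"id1": 1, "id2": "a", "x": 2.0, "y": None},
    {"id1": 1, "id2": "a", "x": 4.0, "y": 10.0},
    {"id1": 2, "id2": "b", "x": 1.0, "y": 3.0},
]
means = group_means(rows)
```

In this sketch a column that is missing for every row of a group is simply absent from that group's result.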
• asked a question related to Data Curation
Question
Most data curation workflows use the absence of carbon atoms as the criterion for identifying inorganic compounds. Historically, despite the presence of carbon atoms, several classes of compounds, such as carbides, carbonates, cyanides, etc., are considered inorganic.
There is also a grey area, such as tetrahalogenated compounds (e.g. CCl4) or sulfides, which are considered organic or inorganic depending on the definition.
I was thinking about a SMARTS pattern to define all possible exceptions. Does this make sense? Has anybody already tried it?
Nowadays almost all software can compute descriptors for inorganic compounds. For this reason, is it still true that you need to filter out inorganic compounds for QSAR modeling?
SMARTS is how I'd address the problem, especially since you seem to want a custom list of what counts as inorganic. That makes it very unlikely you'll find a pre-made filter that will give you the output that you want. That being said, I'd probably start by looking for absence of carbon, then do a second pass looking for the presence of metals. There are a lot of metals, but it's probably simpler to type those in than to try and imagine every possible organo-metallic ligand under the sun. If there is carbon and no metal, chances are pretty good you have an organic compound.
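The two-pass logic can be sketched without any SMARTS at all, working on the set of element symbols of each compound. This assumes the element symbols have already been extracted with a cheminformatics toolkit; the metal list below is deliberately partial and only illustrative, and the function name is made up.

```python
# Partial, illustrative metal list; a real filter would need the full set.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Pt", "Hg", "Pb"}


def looks_organic(elements):
    """Two-pass heuristic: carbon must be present, and no metal may be present."""
    elements = set(elements)
    if "C" not in elements:
        return False  # first pass: no carbon, treat as inorganic
    return not (elements & METALS)  # second pass: any metal, treat as inorganic


examples = {
    "ethanol": ["C", "H", "O"],
    "sodium chloride": ["Na", "Cl"],
    "calcium carbonate": ["Ca", "C", "O"],
    "carbon tetrachloride": ["C", "Cl"],  # grey-area case from the question
}
calls = {name: looks_organic(elems) for name, elems in examples.items()}
```

The heuristic still calls CCl4 organic and would call the sodium salt of an organic acid inorganic, which is where explicit SMARTS exceptions or salt stripping would have to come in.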
As for docking algorithms (or, more specifically, scoring algorithms), it's true that they mostly handle inorganic compounds with some degree of success these days. But the exact quality of the output will depend upon the input: some inorganics will likely be handled better than others, just like some organic compounds are more reliably modeled by any given algorithm than others. Inorganics have the further problem that there are far fewer inorganic/protein co-crystal structures out there, so it's harder to benchmark an algorithm based on that sort of data.
At the end of the day, I suppose the big question is: what do you want? What sort of output are you looking for? If you're just doing high throughput virtual screening, I'd almost say to skip the filter at the start - see what your output looks like, and just mentally classify results as more or less reliable based on what the molecules look like, and whether they are organic or inorganic.
• asked a question related to Data Curation
Question
Hi everyone,
I'm a researcher who prepares content for workshops. Here are some lists I compiled to help me find concept-demonstrating resources
"8ACTIVITIES" to "b BRAINSTORMING WITH OTHERS"
********
Types of demonstrations
I'm looking for resources (videos, pictures, games, activities, articles, etc.) that explain or illustrate a concept in an engaging and/or simple way, especially to a lay audience. It could be a straightforward explanation, like a video explaining the concept "Correlation does not equal causation",
where the original intent of the video is to explain this concept. Or you can be creative: the illustration does not need to have been made originally to explain the concept. The analogy can be made independently of the original intention of the resource. For example, if the concept is flexibility in your career, I can use a video of a person demonstrating physical flexibility, even though the original intent of the video is not about career flexibility.
I attached a document showing two tables of examples on resources and demonstrations.
• asked a question related to Data Curation
Question
I would appreciate suggestions on finding numbers (statistics, figures) about a topic you have in mind when you do not know whether the numbers are available. For example, the number of hours people spend deliberating about which career to choose vs the number of hours they spend doing that career. Or the number of hours people spend deliberating about which career to choose vs the number of hours they deliberate over much less important decisions, like what clothes they should buy. How do I find numbers related to career deliberation? Both tips about the general "number finding" issue and the specific "career deliberation" issue are appreciated.
Email response from Paul Bradshaw (Birmingham uni) of onlinejournalismblog.com: "Hi Hashem, this is quite a broad question and so the answer is hard to give. There are many ways to find statistics/figures – for example using advanced search operators like site:ac.uk or filetype:xlsx will help you narrow your search, or specialist search engines like Zanran (which you mention) which focus on tables and charts. You can also try a two-step approach by first searching for organisations which are likely to collect that information (governmental, charities, academic, survey companies) and then approaching them. More broadly you can look at a system and identify which organisations collect which information within that system (e.g. regulators do inspections, using gov.uk etc.). In the specific example of career deliberation you may be best trying the two-step approach. It may be that recruitment companies have commissioned research of this type before, or you can access polling companies to find out if they've done anything on this."
Thanks,
• asked a question related to Data Curation
Question
For the same compound-protein interaction, using different in vitro assays may result in different activities. The XC50 may differ by more than 100-fold. For example, for functional assays, we sometimes use 100% human serum or BS. We also use HSP buffer sometimes. All these data are curated in databases such as ChEMBL.
If we use all these data to train a machine learning model, the data will be too noisy and the models will almost certainly be poor. However, papers focusing on method development never try to address this problem. What can we do?
Two distinct problems need to be addressed here. First, an inevitable experimental error (sometimes very significant) occurs when measuring a specific biological effect. In this case, it would be highly recommended to analyze as many data sources as possible to select the most reliable one, or to combine multiple data sets by averaging data for the same ligand or otherwise. Second, different assays may measure different phenomena, e.g., binding to a receptor vs. an antagonistic effect on the same receptor. Here, mixing the data would not make any sense. Two different QSAR models should be developed, for binding and antagonism respectively.
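Both points can be combined in a small curation step: group the measurements by ligand and by assay type, and average only within a group. In the sketch below the record fields are invented for the example (they are not the ChEMBL schema), and averaging on the negative log10 scale is an assumption; the answer itself only says "averaging".

```python
import math
from collections import defaultdict


def mean_pxc50(measurements):
    """Average -log10(XC50 in mol/L) per (ligand, assay type); types are never mixed."""
    groups = defaultdict(list)
    for m in measurements:
        groups[(m["ligand"], m["assay_type"])].append(-math.log10(m["xc50_molar"]))
    return {key: sum(v) / len(v) for key, v in groups.items()}


# Made-up measurements: two binding values that differ 100-fold, one functional value.
measurements = [
    {"ligand": "L1", "assay_type": "binding", "xc50_molar": 1e-6},
    {"ligand": "L1", "assay_type": "binding", "xc50_molar": 1e-8},
    {"ligand": "L1", "assay_type": "functional", "xc50_molar": 1e-5},
]
curated = mean_pxc50(measurements)
```

Each assay type then yields its own training set, one per QSAR model, instead of a single pooled and noisy one.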
• asked a question related to Data Curation
Question
We tested a potential anti-cancer treatment in BALB/c mice injected with 4T1 cells (25 animals). We collected blood initially and at D7, D14, D21 and D28; we also collected the supernatant of spleen cells that were placed in the incubator for 24 hr at the time of euthanasia. Very little is known about the mechanism of this treatment.
We looked at the expression of 25 cytokines using 25-plex assay; I now need to analyze the data with very little knowledge in immunology, and a lot of variation between samples.
Do you know of a good tool that can integrate data considering multiple cytokines to identify mechanism/ pathway? Or do you have any recommendations/advice?
Thank you!
Thank you Shen-An. I am in the process of educating myself, reading published literature. Tedious, but necessary!
• asked a question related to Data Curation
Question
I would like to publish one huge table as a supplementary resource to a forthcoming paper. The R package 'plotly' enables the creation of interactive tables which can be searched and filtered. See one good example from a personal website at:
I would like to publish such a table in a data repository, linked to the main paper manuscript. Usually such repositories take uploaded files. I'd like to find a science data repository which would allow the table to be posted directly, for any readers wishing to peruse the raw data.
Please, would anyone know of such an outlet or be able to recommend an elegant solution?