Data Curation - Science topic
Explore the latest questions and answers in Data Curation, and find Data Curation experts.
Questions related to Data Curation
Specifically, I would prefer the index from Databanks International's Cross-National Time-Series (CNTS) Data Archive.
Thank you!
For a binary stellar system, does the stellar mass column of the NASA Exoplanet Archive data give the average mass of both host stars, or something else? If not, how can I find the mass and other parameters for both stars?
I am currently doing my dissertation and I am trying to check the reliability of sub-scales I have newly created. The items come from multiple measures, as my data are from a data archive (so they were collected in different studies). I have re-coded them all onto the same Likert scale but keep getting the error message "too few cases, analysis not run". All missing values are coded the same way (999-888), but from the help online I suspect the problem is that the missing values are scattered across numerous analysis variables. Is there any way around this? Any advice would be appreciated.
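If a little R is an option, here is a minimal sketch of an alternative route (object names are hypothetical): recode the missing-value codes to NA first, then compute Cronbach's alpha with the psych package, which works from the item correlations rather than requiring complete cases.
library(psych)
items <- my_subscale_items        # hypothetical data frame of item responses
items[items >= 888] <- NA         # treat the 888/999 codes as missing
alpha(items)                      # Cronbach's alpha for the sub-scale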
I'm not really a "stats guy", but while working on my doctoral thesis and final paper of my PhD project, I became aware that statistical analysis cannot be performed on compositional data (such as tephra oxide concentrations) in a mathematically sound way due to the constant-sum constraint - yet the average tephrochronologist, when doing a PCA or similar, uses the raw or normalised compositional data anyway (at least in the majority of cases, as far as I've seen in my reading). I've learned that log-ratio transformations are "the" solution to the constant-sum constraint and have seen that this has indeed been done in at least a handful of tephra studies (and that it's becoming more routinely applied for e.g. XRF data), but there's not really any widespread acknowledging of this issue and even less any commonly accepted procedure as to the data curation with log-ratio transformations.
I commented on this in my thesis, and during my defence I was encouraged to publish something about this, aimed specifically at the tephra community. So I've started working on this "something", but I need to hear some opinions on the specifics of the log-ratio transformation part of what could become a formal suggestion for a common tephra compositional data curation procedure. What I want to discuss is which kind of log-ratio transformation should be used if we were to agree upon one particular way of doing it - something that would be preferable in order to allow straightforward comparisons of results between different studies, much as biplots of particular elements have become more or less routine in tephra papers.
A log-ratio transformation is basically lrt(x) = log10(x/d), where the divisor d can vary. I've seen single element oxides used as the divisor, but without any real motivation as to why that particular oxide was selected. In my thesis I argued for a centred lrt, where the divisor is the geometric mean of all oxides, and something like that makes sense because it may be less ambiguous than just selecting one oxide without explaining why. However, it has the issue that no oxide concentration can be 0 (because then the divisor will be 0, and you can't divide by 0), and in many cases at least one oxide will be below EPMA detection limits and register as 0 - particularly P2O5, but also MnO and MgO in some analyses. Furthermore, P2O5 is not always reported (especially in older publications), so using a centred lrt would probably require omitting P2O5 (and would still fail for the few analyses where MnO or MgO were 0).
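For concreteness, a minimal sketch of the centred lrt in base R, using a simple replacement of zeros with half the detection limit - a common, though debatable, workaround for the zero problem described above (the detection-limit value and the example analysis are made up):
clr <- function(x, detection_limit = 0.01) {
  x[x == 0] <- detection_limit / 2   # replace below-detection zeros before taking logs
  log10(x / exp(mean(log(x))))       # divide each oxide by the geometric mean
}
oxides <- c(SiO2 = 72.1, Al2O3 = 13.5, FeO = 2.3, MnO = 0, P2O5 = 0.1)
clr(oxides)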
My main question then is, what log-ratio transformation would YOU suggest to be the norm, and why? Maybe using a single oxide is the better option, but if so, which one? To my understanding, which variation of the transformation is used doesn't really matter as long as the same one is used for all data. So let's just decide how to do it! I would really appreciate your input and suggestions. Thanks in advance!
PS. I'm not here to argue for or against the necessity of log-ratio transformations (mathematically speaking, it is required if statistical analyses are to be applied to EPMA data), although the floor is open and thoughts on any aspect of this issue are welcome. DM me if you're interested in more details about the paper I'm preparing, too. :)
Distinguished peers,
I've had some problems finding a public repository of COVID-19 data that is 1) consistently reliable, 2) automatically kept up to date, and 3) provides a .csv file (or similar) through a friendly interface. Do you know of any you'd like to share?
Looking forward to reading your suggestions,
Sebastián
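One public example that has met these criteria is the Our World in Data dataset; a minimal sketch in R of the kind of workflow in question (please check that the URL is still maintained):
covid <- read.csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
head(covid[, c("location", "date", "new_cases")])   # daily-updated, plain .csv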
COVID-ARC is a data archive that stores multimodal (e.g., demographic information, clinical outcome reports, imaging scans) and longitudinal data related to COVID-19. It provides access to the data along with user-friendly statistical and analytic tools, helping researchers better understand COVID-19 and encouraging collaboration on this research.
Hi all,
I need software to investigate a specific list of genes related to cancer. I want to investigate specific biological processes such as angiogenesis, hypoxia or migration.
Can someone please suggest a way to do this?
N.B. I am not a bioinformatician, so I just need easy software to get an idea about my genes.
Thank you for helping.
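If a little R is acceptable, a minimal sketch using Bioconductor's clusterProfiler for GO enrichment (the gene symbols below are placeholders); web tools such as Enrichr or DAVID do the same thing without any coding:
library(clusterProfiler)
library(org.Hs.eg.db)
genes <- c("VEGFA", "HIF1A", "MMP9")   # your cancer gene list (hypothetical)
ego <- enrichGO(gene = genes, OrgDb = org.Hs.eg.db,
                keyType = "SYMBOL", ont = "BP")   # enriched biological processes
head(as.data.frame(ego))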
With Covid-19, we have seen digital transformation happen much faster: everyone able to do so was suddenly working from home, using the internet and the available digital access and platforms. Education was also put on hold unless it moved online: online schooling and online education got a tremendous stimulus.
Now, what is fueling the data economy? What is the "new green oil" of this digitally transformed world? It's DATA, and it's data fairly priced for the stakeholders, starting with identified data owners.
Please see here a link to some books on the subject, reviewing the rationale for data use in every aspect of life, business, markets and society, looking at the creation of Data Market Places, and diving into the detailed equations of how to price data and how to make it happen in economic terms, as Data Microeconomics:
Given your specific discipline: have you ever irretrievably lost data from an ongoing research project? How did you handle it? Thanks in advance.
(Also, this is my story. A couple of years ago, in a study that included collection, preservation, identification and weighing of soil invertebrates, after an unfortunate event in the laboratory, the notebook containing the weight notes for one of 10 sets of collected organisms, which belonged to the control group, was lost, as were the preserved organisms. I'm tagging this with an entomology label, so in case you're familiar with this topic: would you consider trying some method of reconstructing the weight data, or is there just nothing to do? There is no way to recover the notebooks or the preserved organisms.)
Thank you.
We are working on a large number of building-related time series data sets that display various degrees of 'randomness'. When plotted, some display recognisable diurnal or seasonal patterns that can be correlated to building operational regimes or the weather (e.g. heating energy consumption with distinct signatures at hourly and seasonal intervals). However, some appear to be completely random (e.g. lift data that contains a lot of random noise).
Does anyone know if an established method exists that can be deployed on these data sets and provide some 'quantification' or 'ranking' of how much stochasticity exists in each data set?
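One established option is permutation entropy (Bandt & Pompe, 2002), which scores a series between 0 (perfectly regular) and 1 (indistinguishable from noise) based on the frequencies of ordinal patterns. A minimal base-R sketch with made-up data:
permutation_entropy <- function(x, m = 3) {
  n <- length(x) - m + 1
  patterns <- sapply(seq_len(n), function(i)
    paste(order(x[i:(i + m - 1)]), collapse = ""))  # ordinal pattern of each window
  p <- table(patterns) / n                          # relative pattern frequencies
  -sum(p * log(p)) / log(factorial(m))              # normalised Shannon entropy
}
hours <- 1:500
heating <- sin(2 * pi * hours / 24) + rnorm(500, sd = 0.1)  # diurnal-looking series
lift <- rnorm(500)                                          # pure noise
permutation_entropy(heating)   # low: strong recurring pattern
permutation_entropy(lift)      # close to 1: mostly stochastic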
Hi everyone, I'm looking for software that helps me clean data from bibliographic databases. I don't know how to program, so I really need a tool that's easy to use (maybe with a little bit of programming). I already used VantagePoint but no longer have access to it :( Help me please.
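Point-and-click tools like OpenRefine cover this kind of task; in the meantime, here is a tiny base-R sketch of one common cleaning step, collapsing records whose titles differ only in case or punctuation (the column name is hypothetical):
titles_norm <- tolower(gsub("[[:punct:][:space:]]+", " ", refs$title))  # normalise titles
refs_clean <- refs[!duplicated(titles_norm), ]                          # keep first of each duplicate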
I am performing an experiment using force sensors and IMUs to gather readings of orientation from a certain motion performed by the human body.
The sensor itself does not know when, or what, the particular muscle is doing. There are no video cameras to record the action. I have to design an experiment that syncs these sensor signals with the activity performed by the body. Suppose a person is standing, then takes an object that has the IMU (Inertial Measurement Unit) attached, then sits down and stands up and repeats this a couple of times: how do I identify from the IMU readings whether they are sitting or standing at a particular instant in the output signal?
1. Is there a statistical way of conducting this? (A simple signal-based sketch follows after the note below.)
2. Or is there a way in which I can make the person sit and take individual readings, then stand and take individual activity readings, then make the person sit and stand a couple of times and match the signal against the individual activity readings?
3. Or give a time gap of 5 seconds or so between each activity and identify it that way? (How error-sensitive is this approach?)
NOTE: this is not exactly what I am doing, but consider, hypothetically, that I can take readings individually for each activity alone.
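As a starting point for option 1, a minimal sketch in R, assuming the IMU readings can be reduced to a pitch/orientation angle over time (the column name, sampling rate and threshold are all placeholders to be calibrated per subject):
classify_posture <- function(pitch_deg, fs = 50, thresh = 45) {
  k <- fs                                               # ~1-second smoothing window
  smoothed <- stats::filter(pitch_deg, rep(1 / k, k))   # moving-average filter
  ifelse(as.numeric(smoothed) > thresh, "sitting", "standing")  # per-sample label
}
labels <- classify_posture(imu_data$pitch_deg)          # hypothetical data frame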
I am currently working on collecting and curating data on how the various disciplines utilize games-based educational techniques in their curricula in higher ed. Do you have any experience, or know of any other experience, with utilizing games (digital, table-top, role-playing, simulation, or other) in your or a colleague's teaching? If so, how? Thanks!
I have recently been studying proprietary voter files and data. While I know that voter files are for the most part public (paid or free), I am confused as to how companies match this data to other data.
For example, the voter files (public) never reveal who you voted for, your behavioral attributes, and so on. So how do companies that sell this "enhanced" data match every "John Smith" to other data? How can they say that they have a profile on every voter? Wouldn't that require huge data collection, or are there models that simply do that job for them?
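For intuition, the core of it is record linkage. A base-R sketch of the deterministic variant, joining on a normalised composite key (all column names are hypothetical; commercial vendors layer probabilistic and fuzzy matching on top of this):
normalize_key <- function(first, last, dob, zip)
  paste(tolower(trimws(first)), tolower(trimws(last)), dob, substr(zip, 1, 5), sep = "|")
voter_file$key <- with(voter_file, normalize_key(first_name, last_name, dob, zip))
consumer_db$key <- with(consumer_db, normalize_key(first_name, last_name, dob, zip))
enhanced <- merge(voter_file, consumer_db, by = "key")   # one matched profile per key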
We had about 60 short stories rated on valence and arousal.
We suspect that a few scorers did not read the short stories and instead marked the possible answers randomly or according to a pattern. My goal now is to find these bad scorers and remove them from the record. I want to be very careful and leave scorers in the dataset in cases of doubt.
I have chosen the following two criteria for exclusion: 1. Deviation from a range of expected values. 2. Deviation from the expected distribution of all rating values.
Criterion 1: If, in more than 6 ratings, a scorer deviates by more than one standard deviation from the averaged ratings across all scorers, the scorer's ratings are removed from the dataset. If we assume a fixed probability per rating that a scorer randomly answers outside the expected range, then the probability of this happening by chance in more than 6 short stories is less than one percent.
Criterion 2: The ratings across all scorers were, overall, roughly uniformly distributed. If each scorer is allowed an average deviation of 4 points from the distribution of rating values of all scorers, then all scorers with more than an average deviation of 4 points drop out of the dataset.
If either or both of these criteria apply to a scorer, the scorer's ratings are removed from the record.
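For reference, a minimal sketch in R of Criterion 1, assuming a matrix with one row per scorer and one column per short story:
flag_scorers <- function(ratings, max_deviations = 6) {
  story_mean <- colMeans(ratings, na.rm = TRUE)
  story_sd <- apply(ratings, 2, sd, na.rm = TRUE)
  deviates <- abs(sweep(ratings, 2, story_mean)) >
    matrix(story_sd, nrow(ratings), ncol(ratings), byrow = TRUE)  # rating > 1 SD from story mean
  rowSums(deviates, na.rm = TRUE) > max_deviations                # flag habitual deviators
}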
- Is this legitimate?
- Are there better practices?
Thank you very much for your answers.
Yours sincerely
I have been working on a project to collate species occurrence data derived from unpublished student theses into an integrated database (currently published in GBIF), and I am still working on a systematic protocol for data validation. Expert review is quite subjective, and I have found many reports that "expert" estimation was not always more consistent than that of amateurs, students, or even public enthusiasts (feel free to message me for the papers I collected on this), so my team is still struggling to find a way. Our current method is to independently evaluate the scientific names against taxonomic checklists, while the geographic distributions are validated against available published literature mentioning the geographic distribution of each species. We occasionally ask experts, but as we are working on many understudied taxa and geographical areas, there are not many around.
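Since the database is already on GBIF, part of the name checking can be scripted against the GBIF backbone taxonomy with the rgbif package. A minimal sketch (the species name is a placeholder):
library(rgbif)
hit <- name_backbone(name = "Puma concolor")   # query one name against the backbone
hit$scientificName                             # matched backbone name
hit$matchType                                  # EXACT, FUZZY, HIGHERRANK or NONE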
I have several microarray gene expression datasets, but they use different naming systems for the probes. What are some simple ways to convert all the probes to the same type (e.g. Affymetrix)? Thanks.
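One scripted route is Bioconductor's biomaRt: map each platform's probe IDs to a shared identifier (e.g. gene symbols) and merge the datasets on that. A minimal sketch, where the attribute names are examples discoverable via listAttributes():
library(biomaRt)
mart <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
map <- getBM(attributes = c("affy_hg_u133_plus_2", "hgnc_symbol"),
             filters = "affy_hg_u133_plus_2",
             values = probe_ids,   # your vector of probe IDs (hypothetical)
             mart = mart)          # probe-to-symbol lookup table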
Dear colleague,
I am the leader of the research project B-DATA - Big Data Assemblages: Techniques and Actors. B-DATA aims to study data assemblages inside research centres and data infrastructures which produce and use open data and Big Data. The three main case studies are: the Consortium of European Social Science Data Archives (CESSDA, Norway); the Italian Statistical Office (ISTAT, Italy); and the Web Science Institute of the University of Southampton (UK).
During my research I came across some of the topics that have been addressed by your project. I was wondering whether you would be interested in making a comparison with what is happening here in Italy at ISTAT. I have full access to this field.
This is just a first contact. If you are interested, we can be in touch to assess the feasibility of this.
I attach an open access article that shows some of the findings of B-DATA, which may help you understand what I am researching.
I am looking forward to hearing from you.
Yours sincerely, Biagio Aragona.
Assistant Professor of Sociology
University of Naples Federico II
I am trying to aggregate a large data set by two identification variables (i.e., id1 and id2). My aggregate command in R drops the entire row if a value is missing in any of the columns. For instance, even though row #1 has been observed on all but one column (missing on only 1 of 20 columns), row #1 is dropped during the aggregation process. I was wondering if there is a way to aggregate (get means or sums of) the rows by the two identification variables without dropping the entire row for missing values in 1 or 2 columns.
Here is my current R code for aggregating:
dfagg1<- data.frame(aggregate(.~id1+id2, data=dfsub,
FUN=function(x) c(mn=mean(x), n=length(x), na.rm=TRUE)))
Thank you!
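Two things seem to be going on here: aggregate's default na.action = na.omit drops every incomplete row, and na.rm = TRUE above is returned as an element of the result vector instead of being passed to mean(). A sketch of the corrected call, keeping the names from the question:
dfagg1 <- data.frame(aggregate(. ~ id1 + id2, data = dfsub,
                               FUN = function(x) c(mn = mean(x, na.rm = TRUE),
                                                   n = sum(!is.na(x))),
                               na.action = na.pass))   # keep rows that have NAs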
Most data curation workflows use the absence of a carbon atom as the discriminator. Historically, despite the presence of a carbon atom, several compounds, such as carbides, carbonates, cyanides etc., are considered inorganic.
There is also a grey area, such as tetrahalogenated compounds (e.g. CCl4) or sulfides, which are considered organic or inorganic depending on the definition.
I was thinking about a SMARTS pattern to define all possible exceptions. Does that make sense? Has anybody already tried it?
Nowadays almost all software can compute descriptors for inorganic compounds. For this reason, is it still true that you need to filter out inorganic compounds for QSAR modeling?
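A sketch of the idea with the rcdk package, combining the carbon test with an explicit exception pattern (the SMARTS strings are illustrative, not a vetted exception set):
library(rcdk)
smiles <- c("CCO", "ClC(Cl)(Cl)Cl", "[Na+].[Cl-]")   # ethanol, CCl4, NaCl
mols <- parse.smiles(smiles)
has_carbon <- sapply(mols, function(m) matches("[#6]", m))
perhalo <- sapply(mols, function(m) matches("[CX4](Cl)(Cl)(Cl)Cl", m))  # CCl4-like exception
is_organic <- has_carbon & !perhalo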
Hi everyone,
I'm a researcher who prepares content for workshops. Here are some lists I compiled to help me find concept-demonstrating resources, ranging from "ACTIVITIES" to "BRAINSTORMING WITH OTHERS".
Request: It would be helpful for me to add to this collection. Please suggest in the comments section any other resources I could add.
********
Types of demonstrations
I'm looking for resources (videos, pictures, games, activities, articles etc.) that would explain or illustrate a concept in an engaging and/or simple way, especially to a lay audience. It could be a straightforward explanation, like a video explaining the concept "correlation does not equal causation",
where the original intent of the video is to explain this concept. Or you can be creative; the illustration does not need to have been originally made to explain the concept. The analogy can be made independently of the original intention of the resource. For example, if the concept is flexibility in your career, I can find a video of a person demonstrating physical flexibility even though the original intent of the video is not about career flexibility.
Please put your suggestion in the comments section below.
I attached a document showing two tables of examples on resources and demonstrations.
Please also give suggestions about finding numbers (statistics, figures) about a certain topic you have in mind when you are not sure whether the numbers are available. For example, the number of hours people spend deliberating about which career to choose vs the number of hours spent doing that career. Or the number of hours people spend deliberating about which career to choose vs how long they deliberate over much, much less important decisions, like what clothes to buy. How do I find numbers related to career deliberation? Tips on both the general "number finding" issue and the specific "career deliberation" issue are appreciated.
For the same compound-protein interaction, different in vitro assays may yield different activities; the XC50 can differ by more than 100 times. For example, for functional assays we sometimes use 100% human serum or BS, and sometimes HSP buffer. All these data are curated in databases such as ChEMBL.
If we use all these data to train a machine learning model, they will be too noisy and the models will certainly be poor. However, papers focusing on method development never try to address this problem. What can we do?
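One pragmatic mitigation before training, sketched in R with dplyr (column names are hypothetical): aggregate replicate measurements per compound-target pair and discard pairs whose assays disagree too much.
library(dplyr)
curated <- chembl_data %>%
  group_by(compound_id, target_id) %>%
  summarise(n = n(),
            spread = max(pXC50) - min(pXC50),   # disagreement across assays
            pXC50_med = median(pXC50),
            .groups = "drop") %>%
  filter(n == 1 | spread <= 1)                  # drop pairs differing by >1 log unit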
We tested a potential anti-cancer treatment in BALB/c mice injected with 4T1 cells (25 animals). We collected blood initially and at D7, D14, D21 and D28; we also collected the supernatant of spleen cells incubated for 24 hr at the time of euthanasia. Very little is known about the mechanism of this treatment.
We looked at the expression of 25 cytokines using a 25-plex assay; I now need to analyze the data with very little knowledge of immunology, and a lot of variation between samples.
Do you know of a good tool that can integrate data considering multiple cytokines to identify a mechanism/pathway? Or do you have any recommendations/advice?
Thank you!
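As a first pass before any pathway tool, a minimal sketch in base R (object names are hypothetical): z-score each cytokine and cluster to see which of the 25 co-vary across groups and time points.
mat <- scale(cytokine_matrix)       # samples x 25 cytokines, z-scored per cytokine
heatmap(t(mat), scale = "none")     # clustered heatmap of cytokines vs samples
pca <- prcomp(mat)
summary(pca)                        # variance explained by each component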
I would like to publish one huge table as a Supplementary resource to a forthcoming paper. The R package 'plotly' enables the creation of interactive tables which can be searched and filtered. See one good example from a personal website at:
I would like to publish such a table in a data repository, linked to the main paper manuscript. Usually such repositories take uploaded files. I'd like to find a science data repository that would host the table directly, for any readers wishing to peruse the raw data.
Please, would anyone know of such an outlet or be able to recommend an elegant solution?
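One workable compromise, sketched below: save the plotly table as a single self-contained HTML file with htmlwidgets, then deposit that file in a general-purpose repository (e.g. Zenodo or Figshare); readers can download and open it locally even if the repository does not render it in-browser. The table object name is a placeholder.
library(plotly)
library(htmlwidgets)
tbl <- plot_ly(type = "table",
               header = list(values = names(big_table)),             # big_table: your data frame
               cells = list(values = unname(lapply(big_table, c))))
saveWidget(tbl, "supplementary_table.html", selfcontained = TRUE)    # one portable file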
I am working with herbarium specimens and want to determine their thickness without causing any damage to the specimens.