Science topic
Dataset - Science topic
Explore the latest questions and answers in Dataset, and find Dataset experts.
Questions related to Dataset
I have a dataset for a between-group analysis. There are missing values in the control group data, so I ran multiple imputation on the control group and obtained 5 imputed datasets for it.
When I run an independent-samples t test, do I use the pooled results of the control group data together with the original intervention group data?
Thanks.
I tried opening it with different programs (Notepad, Excel, ...) but could not. This is my first time using climate datasets, and I am confused about how to access the data.
Where can I get a free standard 3D face dataset that is available for download for research purposes?
Assalamu alaikum,
What is the benefit, in terms of citations for the researcher and the journal, of including code for reproducibility in a paper and testing it on some of the largest available datasets?
Kind regards,
Osman
Good morning,
Can anyone suggest a dataset presenting historic reference evapotranspiration in the different provinces?
Thanks a lot !
Hello, I am currently conducting a moderated mediation analysis in AMOS and want to mean-centre my IV and moderator. To calculate the mean, do I use the original dataset, or the dataset from which I removed some items with low factor loadings?
Thank you.
skin disease image dataset
How can I make Google Colab Pro run faster with an image dataset? I am already loading it from Google Drive, but it is too slow.
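One common workaround is to keep the dataset as a single archive on Drive, copy it once to the Colab VM's local disk, and read images from there, since per-file reads over the Drive mount are slow. A sketch of the pattern, demonstrated on throwaway files; on Colab the source would be something like `/content/drive/MyDrive/dataset.zip` and the target `/content/data` (both placeholder paths):

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def stage_locally(archive_path: str, local_dir: str) -> Path:
    """Copy a dataset archive to fast local storage and extract it there."""
    local = Path(local_dir)
    local.mkdir(parents=True, exist_ok=True)
    local_copy = local / Path(archive_path).name
    shutil.copy(archive_path, local_copy)   # one big sequential copy
    with zipfile.ZipFile(local_copy) as zf:
        zf.extractall(local)                # extract on the local disk
    return local

# Demo with a throwaway zip standing in for the Drive archive.
tmp = Path(tempfile.mkdtemp())
(tmp / "img_0.txt").write_text("fake image bytes")
archive = tmp / "dataset.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.write(tmp / "img_0.txt", "img_0.txt")

out = stage_locally(str(archive), str(tmp / "local"))
print(sorted(p.name for p in out.iterdir()))
```

The one-time copy pays for itself because the training loop then reads from the VM's SSD instead of making thousands of small network requests to Drive.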
medical image analysis problems with datasets
I've done RNA-seq analysis on a dataset downloaded from GEO, looking at immune gene expression in asthmatic, COPD and normal epithelial lung cells. I am trying to do a t-test for my statistical analysis, but I need to group my data into Asthmatic, Healthy and COPD samples/cells, as R does not show which samples belong to which group.
I have done 1:5 case-matching in my study, so my dataset has 100 participants in the intervention group and 500 in the control group.
When I run an independent-samples t test, do I compare the 100 intervention participants against all 500 controls?
When I present participant characteristics, do I present the 100 intervention participants against all 500 controls?
Thanks.
Hello! I am putting together a dataset of benthic macroinvertebrate monitoring / count data from estuaries and coastlines along the North American Coast (Canada and US). I know of larger datasets like the NCCOS National Benthic Inventory and EMAP, but I was wondering if anyone knew of other regional datasets. It would be preferable if the data were collected using Young-modified Van Veen grab samplers along with information on water quality and sediment quality, but any dataset recommendations will be greatly appreciated!
CycleGAN performs well on unpaired datasets, and attention mechanisms have become a hot topic in recent years, so can we combine attention with CycleGAN?
Is there such a project? Papers and code would be preferable; thanks.
Hello!
I would like to get the average curve from several curves on a plot. Is there a way to do this in Excel?
(Background: I have drawn three curves for three different x-y datasets. However, each has a different number of XY points and a different length, so I can't simply take the average across the rows. Any solutions?)
See the picture
Thanks!
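Outside Excel, one common approach when the curves have different x-points is to resample each curve onto a shared grid by linear interpolation and then average pointwise. A minimal sketch with three made-up curves (the data here is illustrative only):

```python
import numpy as np

# Three hypothetical x-y curves of different lengths.
curves = [
    (np.array([0.0, 1.0, 2.0, 3.0]),           np.array([0.0, 1.0, 2.0, 3.0])),
    (np.array([0.0, 1.5, 3.0]),                np.array([0.0, 1.5, 3.0])),
    (np.array([0.0, 0.5, 1.0, 2.0, 3.0]),      np.array([0.0, 0.5, 1.0, 2.0, 3.0])),
]

# Common grid restricted to the x-range shared by all curves.
lo = max(x.min() for x, _ in curves)
hi = min(x.max() for x, _ in curves)
grid = np.linspace(lo, hi, 50)

# Interpolate every curve onto the grid, then average pointwise.
resampled = np.vstack([np.interp(grid, x, y) for x, y in curves])
mean_curve = resampled.mean(axis=0)
print(mean_curve[:3])
```

Restricting the grid to the overlapping x-range avoids extrapolating curves beyond their measured span.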
I want to use an image dataset (stored on my personal computer) in Google Colab. Please help.
Hello everyone!
I have a doubt regarding the forecast dataset. In this dataset, forecasts at different lead hours are given with the first day of the month as the initialization day. How do I find the lead-hour forecast data for the 2nd day of the month?
Do I have to purchase this, or are there other ways to get the data? Are there any alternative datasets?
Dear all,
I tried to download some 3D reflectivity datasets over CONUS. Earthdata Search has a NEXRAD mosaic, but it is only available for 3 months in 2020, and the NEXRAD radar data are station data. The only 3D (mosaic) product I found is the National Reflectivity Mosaic, but it is not available for download at https://www.ncei.noaa.gov/maps/radar/ . Does anyone know how to download this data, or where to acquire a 3D reflectivity dataset?
Best regards,
Haochen
Suppose I have a dataset f(x). I want to fit it with the function
g(f1,f2,f3,x) = a*x + b*f1(x) + c*f2(x) + d*f3(x), where f1(x), f2(x) and f3(x) are three other datasets. Can anyone tell me how to fit f(x) with g(f1,f2,f3,x)? I tried Origin's method (https://www.originlab.com/doc/Tutorials/Fitting-Datasets), but it didn't work very well. Are there any other suggestions? Thanks for your help.
Basically, I have a great interest in brain-signal processing and analysis. It would be great if anyone could point me to an open-access EEG or fNIRS dataset of hand movement or human gait.
I'm working on a supervised classification task with seven classes. The problem is that the dataset is very large and hugely imbalanced: the number of data points in the major class is 100 times that of the minor class.
First, I randomly subsampled the dataset into a smaller balanced dataset, and the highest accuracy I could obtain after tuning hyperparameters was around 90%.
Then I trained the tuned models on the whole dataset (70% of the data for training and 30% for testing), and surprisingly, the accuracy of the models reached 95% or higher.
My question is: which procedure is correct? Training on the subsampled dataset and testing on the large one, or training and testing on the whole dataset?
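One caveat worth checking before comparing those two numbers: on a 100:1 imbalance, plain accuracy on the full dataset is dominated by the majority class, so 95% there is not directly comparable to 90% on a balanced subsample. A hedged scikit-learn sketch on synthetic two-class data (all data made up) contrasting plain and balanced accuracy, with `class_weight` as an alternative to subsampling:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy imbalanced problem: class 0 outnumbers class 1 by 100:1.
n_major, n_minor = 10_000, 100
X = np.vstack([rng.normal(0, 1, (n_major, 2)),
               rng.normal(2, 1, (n_minor, 2))])
y = np.array([0] * n_major + [1] * n_minor)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for weights in (None, "balanced"):
    clf = LogisticRegression(class_weight=weights).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(weights,
          "accuracy:", round(accuracy_score(y_te, pred), 3),
          "balanced accuracy:", round(balanced_accuracy_score(y_te, pred), 3))
```

Reporting balanced accuracy (or per-class precision/recall) on the held-out imbalanced test set makes the two training regimes comparable on equal terms.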
A bottom-up stepwise regression on a 140-variable, standardized dataset (all features have mean 0 and stddev 1) selected 10 variables as the best predictors for a certain target.
The stepwise regression first selected the predictor with the highest adjusted R² (R2adj), then added the predictor that increased R2adj the most, and so on, until R2adj started to decrease (this happened after 10 added variables). All selected predictors had to have p<0.05, or they were discarded. Hence, this stepwise regression implicitly ranked the 10 selected predictors, out of 140, from most important to least important in terms of R2adj.
I expected that the absolute values of the regression coefficients of the selected predictors would also decrease, together with the decline in added R2adj. However, this turned out not to be the case. For instance, the most important predictor (in terms of R2adj) did not have the highest absolute regression coefficient compared to the other 9 selected predictors. Remember that all predictors are standardized.
What could be the reason for this?
For example, suppose we collected data (Dataset.CSV) with 7 million records and want to take a sample of just 1 million records.
What are the first step, second step, and so on, if the dataset needs the following steps
(labeling / numeric encoding / normalization / balancing / sampling 1 million records / cleansing)?
Additional question: is it okay to balance only the normal records and not the attack records, or is there a problem with that?
When carrying out scientific research, is it better to use a public dataset in the field, or a self-made dataset that we collected ourselves?
I am interested in the study of features that can determine gender and age from short speech samples (1 to 9 seconds). The audios are from a public set (Mozilla Common Voice dataset), where the duration and quality are variable.
I am close to submitting a paper for publication in an Economics Journal. My paper is based on my empirical cross-country analysis with a sample of 126 countries. This analysis includes around 20 variables averaged over the 10-year sample period. These variables come from multiple databases such as the World Bank's World Development Indicators and the IMF's International Financial Statistics. To create the dataset used in my analysis I simply downloaded each respective database into excel, removed the countries that are not in my sample, averaged each variable for the sample period, and then copy and pasted these variables into a column in my dataset. Is this an appropriate way to source and format data for academic research?
Pearson test can be used to find out the correlation between two continuous data sets.
Spearman test can be used to find out the correlation between two ordinal data sets.
Do we have any test to find the correlation between ordinal data (mean scores from a Likert-scale dataset) and continuous data (academic performance in terms of exam scores)? If not, how can we do that?
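Spearman's rank correlation only requires that both variables can be ranked, so it is commonly used for exactly this ordinal-versus-continuous pairing. A minimal SciPy sketch on made-up scores (all numbers below are illustrative):

```python
from scipy.stats import spearmanr

# Hypothetical paired observations: mean Likert score (1-5 scale)
# and exam score (0-100) for ten students.
likert = [2.0, 3.5, 1.5, 4.0, 3.0, 4.5, 2.5, 5.0, 1.0, 3.5]
exam   = [55,  68,  49,  80,  66,  85,  60,  90,  45,  70]

# Spearman correlates the ranks, so mixed measurement levels are fine.
rho, p_value = spearmanr(likert, exam)
print(round(rho, 3), round(p_value, 5))
```

Since only ranks enter the computation, ties in the Likert scores are handled automatically (via average ranks), and no normality assumption is needed for either variable.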
Descriptive research, such as research describing a large dataset (e.g. a travel behaviour dataset) via preliminary analysis.
Hello there,
I am searching for freely available pixel-based datasets (derived from satellites, or blended products like the CRU or IRI Data Library) with a resolution finer than 500 m (preferably finer than 100 m). It would be nice if you could name some!
Thank you so much for your attention and participation.
In a machine learning-based approach, can a dataset used to check the accuracy of model predictions be part of the training set?
So, with some difficulty, I have been able to do wavelet analysis on the time-series datasets I have. The thing is, all these datasets can be combined into a year-long dataset, with some gaps as big as a month.
One solution would be to interpolate the data, but given that my data has a 10-minute sampling rate, a one-month gap makes that impractical.
For discontinuous (unevenly spaced) data, the Lomb-Scargle periodogram is used instead of the FFT. If someone can suggest a similar trick for wavelet analysis, it would be highly appreciated.
I would like to know why SBERT takes less time than BERT on a large text dataset.
Where can I find a Twitter dataset for Preliminary Flu Outbreak Prediction Using Twitter Posts Classification?
Dataset: http://dx.doi.org/10.21227/781w-ef42
This dataset includes CSV files that contain the tweet IDs. The tweets have been collected by the model deployed here at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets, using filters: language “en”, and keywords “corona”, "coronavirus", "covid", "covid19" and variants of "sarscov2".
As per the Twitter Developer Policy, it is not possible for me to provide information other than the Tweet IDs (this dataset has been completely re-designed on March 20, 2020, to comply with data sharing policies set by Twitter). Note: This dataset should be solely used for non-commercial research purposes. A new list of tweet IDs will be added to this dataset every day. Bookmark the dataset page for further updates.
Dataset status as of May 24, 2020: 116,962,112 Global Tweets (EN)
I am looking for a dataset where the users and the vehicles are both in motion. If the dataset contains any social information, that would help me a lot.
Collaborative Filtering or Content-based Recommender Systems.
I would like to simulate a disaster environment. Can anyone provide me with a dataset of urban/sub-urban uneven terrain, please?
Is there any website from which I can download free datasets related to biomedical image processing, or other text-related data, to build a deep-learning model?
Hello,
I am studying Computer Science and I am currently working on my Bachelor thesis. For that, I am looking for suitable datasets. My goal is to apply Process Mining to these datasets to identify and analyze interesting processes. However, the problem is that these datasets need to be in a certain format to be suitable for Process Mining. The data needs to have a Case Id, Activity, and Timestamp column. In other words, the data needs to be activity-based so that processes with different activity sequences can be found.
I wanted to ask if someone has any idea where I could find such datasets? I'd be most interested in datasets in sectors such as energy, waste management, public work (but other input would be helpful as well). So far I mainly could find the datasets from previous years' BPI challenges.
Here is a short page with more information about Process Mining and the desired format (including a brief example):
Any feedback would be highly appreciated.
Thanks in advance,
Louis
I am training a Beta-VAE on the BDD100K driving dataset. My hyperparameters: Adam optimizer, 0.0001 learning rate, latent dimension 16; the loss function is a reconstruction loss (MSE) plus a KLD loss multiplied by the Beta factor. After some training, the model seems to have learned something, but the exact same model's performance is completely different on different samples. Can anyone give me a hint on how to understand what is going on? Thanks! Attached are examples of the same model generating different results.
For a dataset like
1.BCI Motor imagery EEG signals(example: BCI competition IV),
2.SEED dataset
which Python library is best suited for processing and feature-extraction tasks?
I need a challenging security dataset; I mean, I want the baseline accuracy to be low so that I can improve it using ML techniques. I tried several datasets, but they already give high accuracy without any enhancement.
I am looking for resting-state (eyes closed) EEG datasets for any kind of psychiatric disorder. These can include, but are not limited to:
- Alcohol use disorder
- Acute stress disorder
- Addictive disorder
- Anxiety disorder
- Behavioral addiction disorder
- Schizophrenia
- Post traumatic stress disorder
- Depressive disorder
- Bipolar disorder
etc.
I would prefer datasets containing raw EEG data, e.g. EDF files. If anyone can assist, I would really appreciate it. Thank you in advance.
I am working on classifying sentiments for tweets dataset, in an unsupervised manner. I have used TextBlob Polarity, AFINN and Vader Sentiment Analyser for the Sentiment Classification. Among these, I have got relatively better results with Vader. However, the results are still not good enough in terms of accuracy. Vader gave an accuracy of around 50%.
Is there any way to improve the accuracy of Vader or, is there any other pre-trained model that can be used to provide a better classification?
Any help would be highly appreciated.
Thank You.
Dear Ones,
We are in the process of data analysis for our research study on HYPOGLYCAEMIA using SPSS version 23. One of the challenges we faced after completing data entry is that the variable "time of interview", which was coded as NUMERIC during tool design (i.e. variable definition in SPSS), could not be transformed to a time because it is read as a number. Well, that was our MISTAKE!
Attempted solution: we changed the variable TIME OF INTERVIEW back to DATE/TIME format, specifying the time as hh:mm in Variable View, expecting SPSS to read it as a (24-hour) TIME variable, but to no avail.
At present, it records a completely different time (actually shifting each entry forward by 6 hours) instead of the original.
E.g. for entry number 1, instead of reading 0930 hrs as intended, it currently reads 1530 hrs. It does so for all other entries.
We also tried converting the same variable back to DATE/TIME using HH:MM:SS, but ended up with a new problem: instead of, say, 09:30 hrs, it now reads 00:15.
How can we correct our mistake without jeopardising our dataset for this variable?
I have the DEAP EEG dataset in .dat format. I can view the complete data for each participant, but I want to store the data in a CSV file. Can you please help me with this?
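Assuming the standard preprocessed DEAP layout (each .dat file is a pickled dict with 'data' of shape trials × channels × samples and 'labels' of shape trials × 4), one sketch flattens each trial to a CSV row. The demo below runs on a tiny synthetic file with the same structure, since the real files cannot be shipped here:

```python
import pickle
import numpy as np
import pandas as pd

def deap_to_csv(dat_path: str, csv_path: str) -> None:
    """Flatten a DEAP-style .dat file to one-trial-per-row CSV.

    Assumes the preprocessed DEAP layout: a pickled dict with
    'data' (trials, channels, samples) and 'labels' (trials, 4).
    """
    with open(dat_path, "rb") as fh:
        subject = pickle.load(fh, encoding="latin1")
    data, labels = subject["data"], subject["labels"]
    trials, channels, samples = data.shape
    df = pd.DataFrame(data.reshape(trials, channels * samples))
    for i, name in enumerate(["valence", "arousal", "dominance", "liking"]):
        df[name] = labels[:, i]
    df.to_csv(csv_path, index=False)

# Demo on a tiny synthetic file mimicking the DEAP structure.
fake = {"data": np.zeros((2, 3, 4)), "labels": np.ones((2, 4))}
with open("s01_fake.dat", "wb") as fh:
    pickle.dump(fake, fh)
deap_to_csv("s01_fake.dat", "s01_fake.csv")
print(pd.read_csv("s01_fake.csv").shape)
```

For the real files the per-row width is large (channels × samples per trial), so storing per-channel CSVs, or a compressed format like Parquet, may be more practical.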
I want datasets of blood and bone cancer. I want to process them in Python using artificial neural networks.
Please, can anyone help me download the "Columbia MVSO Image Sentiment Dataset"? I tried the link mentioned in the paper, but it's not working!
I need Python code for the MoleculeNet benchmark datasets, to find the graph embedding for each dataset.
Please let me know the name or URL of any comprehensive Bangla corpus data for SA or ER.
Hello,
I would like to compute the MAF of each SNP in my large data set. Is there a quick way to do so in R or in some module in bash?
Thanks,
Giulia
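For completeness, here is a language-agnostic sketch of the computation itself (shown in Python; the 0/1/2 coding of the genotype matrix as copies of the alternate allele is an assumption, and real pipelines would read VCF/PLINK files rather than a bare matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genotype matrix: rows = individuals, columns = SNPs,
# entries = number of alternate-allele copies (0, 1, or 2).
genotypes = rng.integers(0, 3, size=(100, 5))

alt_freq = genotypes.mean(axis=0) / 2.0      # alternate-allele frequency per SNP
maf = np.minimum(alt_freq, 1.0 - alt_freq)   # minor allele frequency per SNP
print(np.round(maf, 3))
```

The same one-liner structure carries over to R (column means of the genotype matrix divided by 2, then `pmin(p, 1 - p)`), and vectorizing over columns keeps it fast even for very large SNP panels.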
Hello all, I have a sequence alignment of ~2000 sequences, which is likely more than necessary. If I remove sequences manually or with some software program, I'm sure I can reduce the number of gaps, but this will of course reduce the size of the alignment (and may introduce some bias/subjectivity). Is it better to keep the larger dataset at the expense of more gap characters? Is there a rough criterion for the amount of gaps an alignment can contain for reconstruction? Thanks very much.
So I have calculated the accuracy of my model using both the training and testing datasets, and I found that the testing accuracy is higher than the training accuracy. Is this normal? And how do I interpret this condition?
I want to train a CNN to segment Ground Glass Opacities (GGO) in Lungs CT-scans.
I would need a dataset with CT scans and corresponding masks indicating for every voxel if it is GGO or not (i.e. the ground truth for the segmentation).
Do you know any dataset like that?
Many thanks for your help!!
I am applying multiple regression analysis to my datasets for prediction purposes. To calculate the relative contribution of each predictor, I want to know the most suitable method.
Can we simulate an IoT kind of network using the NetSim 5G library? I would also want to model different kinds of attacks and generate data set to train an ML classifier.
Hello!
I need help-
I have a dataset with around 35-40% missing values.
I work with SPSS, what can I do?
I am looking at change in technology anxiety over time in older adults.
Thanks in advance!
Kind regards,
Jessica
Non-Invasive Skin Cancer Diagnosis Using Hyperspectral Imaging
I need to do PCA or another classification method on a spectroscopic (LIBS) dataset in MATLAB. How can I get the scripts?
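In MATLAB the usual route is the Statistics and Machine Learning Toolbox `pca()` function. As a language-neutral sketch of the same pipeline (standardize, then project onto the leading components), here is a scikit-learn version on synthetic LIBS-like spectra; all data below is made up:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical spectra: 60 samples x 500 wavelength bins, two classes
# that differ in a couple of artificial "emission lines".
spectra = rng.normal(0, 0.1, (60, 500))
spectra[:30, 100] += 2.0   # class A line
spectra[30:, 300] += 2.0   # class B line
labels = np.array([0] * 30 + [1] * 30)

# Standardize per wavelength bin, then reduce to 2 principal components.
X = StandardScaler().fit_transform(spectra)
scores = PCA(n_components=2).fit_transform(X)
print(scores.shape)
```

The resulting score matrix is what you would plot (PC1 vs PC2) or feed into a classifier such as LDA; the MATLAB equivalent returns the same scores as the second output of `pca()`.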
I have a nonlinear dataset of continuous data points consisting of 141 rows and 5 columns (4 independent variables and 1 output).
Which machine learning algorithms should I choose to get a good start?
I want to look at country of origin by state, to establish communication lines with groups with specific language skills. (Analytics practice.)
I am planning to create a predictive model. However, my approach uses a dataset that is not similar to those of prior works, so I can't apply my model to the previous datasets, and those models will not work with my dataset. How can I validate my research in such a case?
The dataset is cheddar, which you can find in the 'faraway' package.
The response variable Y is 'taste', and the predictor X is H2S.
I used this code to calculate the MSE:
test.lm <- lm(taste ~ H2S, data = cheddar)
mean(test.lm$residuals^2)
The result was 109.538; however, the correct value of the MSE is 10.83^2.
Hello all, I have an hourly dataset whose mean value is around 30. I tuned the LSTM model to an RMSE of 1.7. Please let me know if that is acceptable, or whether it should be tuned further.
I plan to use it in a machine learning class and I want the students to be motivated. Ideally it will be an image dataset.
Hi,
I have two datasets: one consists of make/age/fuel, and the other, a much larger one, includes make/age/fuel/engine size. I need to find the engine size for the cars in the first dataset from the second one. What is the fastest way to do this in R or Excel?
thank you
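In R this is a join, e.g. `merge(first, second, by = c("make", "age", "fuel"), all.x = TRUE)` (or dplyr's `left_join`); in Excel, XLOOKUP on a concatenated key. The same idea sketched in Python/pandas on made-up rows, in case that is an option:

```python
import pandas as pd

# First dataset: cars missing engine size (hypothetical values).
cars = pd.DataFrame({
    "make": ["ford", "bmw", "ford"],
    "age":  [3, 5, 3],
    "fuel": ["petrol", "diesel", "diesel"],
})

# Second, larger dataset that also carries engine size.
reference = pd.DataFrame({
    "make": ["ford", "ford", "bmw", "audi"],
    "age":  [3, 3, 5, 2],
    "fuel": ["petrol", "diesel", "diesel", "petrol"],
    "engine_size": [1.6, 2.0, 3.0, 1.8],
})

# Left join on the shared keys; cars without a match get NaN engine_size.
merged = cars.merge(reference, on=["make", "age", "fuel"], how="left")
print(merged)
```

One thing to check first: if make/age/fuel does not uniquely identify a car in the larger dataset, the join will duplicate rows, so deduplicate the reference table (or pick a rule, e.g. the most common engine size per key) beforehand.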
I have developed a multiple regression model, and my sample of responses was 164. Now I want to validate the model using a new dataset. Is there any rule for the sample size I should use? One colleague suggested using 20% of the sample that was used to develop the model. Please, I need advice.
Hello,
if both analyses have the same name, please suggest software for performing this analysis and what type of dataset is needed.
Thank you
Devanand Maurya
I have one thousand frames extracted from a video taken at a particular location. From that dataset, I need to detect the blurred images and separate them out. How can I do that?
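One simple, commonly used heuristic is the variance of the Laplacian: blurred frames have little high-frequency content, so the Laplacian response is flat and its variance is low. A sketch with SciPy on a synthetic frame (with OpenCV the usual analogue is `cv2.Laplacian(img, cv2.CV_64F).var()`); the threshold you would use in practice has to be picked empirically on a few labelled frames:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def blur_score(gray: np.ndarray) -> float:
    """Variance of the Laplacian; low values indicate a blurry frame."""
    return float(laplace(gray.astype(float)).var())

rng = np.random.default_rng(0)
sharp = rng.random((64, 64))               # stand-in for a sharp grayscale frame
blurry = gaussian_filter(sharp, sigma=3)   # the same frame, artificially blurred

# A sharp frame scores much higher than its blurred counterpart.
print(blur_score(sharp) > blur_score(blurry))
```

For a thousand frames you would compute the score for each file, sort, and move everything below the chosen threshold into a separate folder.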
I have an analyzed RNA-seq dataset. The analysis, including differential gene expression, clustering and enrichment analysis, has been done. I am aware that the bioinformatic part is done and most of the analysis as well. Could someone please guide me on how to extract the biological relevance from the dataset? What should be the starting point for working with these data? Should I start by looking at the differentially expressed genes in the different comparisons, or start from the cluster analysis and look for genes of interest?
I am working on a school project and am having trouble finding data to reference. I know I’ve seen similar studies before. Thanks so much!!!
Hi
Does anyone know how I can compare two datasets of different data types, where the data types do not share the same parameters?
What is the best way to link the two datasets? Thanks.
I need a dataset related to Vehicular ad hoc Networks for Reflection-based DDoS attacks.
Any idea or suggestion is welcomed.
Thank you!
I have 79 final questionnaire line items, and I want to cluster them into distinct latent variables. Kindly guide me on how to do this. Thanks in anticipation.
I am using a panel dataset (N=73; T=9). Dataset Timeframe: 2010-2018
In the GMM estimate on the total dataset, the AR(1) and AR(2) values are fine.
But to investigate the impact of the European crisis, I had to split the data (5 Years during and immediately after the crisis, and the subsequent 4 years). But when GMM is run on the second set of data, (2015-2018), in one of the models, AR(1) and AR(2) values were not generated.
Is the result still usable? What are the potential problems of using this specific result?
Time predicted by a predictive maintenance algorithm.
Hello,
I need to understand this type of analysis; please suggest the name of suitable software and the dataset type that would help me perform it.
Thank you
Devanand Maurya
Looking for a dataset that could be useful for our project.
Hi All,
I'm carrying out an Ordinal Regression on my dataset. I have continuous predictors (about 12) and an ordinal response variable. When looking into Ordinal Regression in SPSS they have two different procedures to carry this out: PLUM and GENLIN. It is said that GENLIN is better because it is quicker and easier to carry out than PLUM. I wonder if GENLIN has other advantages?
Many thanks!
Laura
I am developing an ML model working with medical specialties.
My dataset contains different specialties (general surgery, urology, etc.). First I built a model on all my data with the specialties mixed, and then I applied the same ML algorithm at the specialty level. I achieve better evaluation metrics for all the specialty models except one. Can someone help me understand the reason?
Hello everyone!
I have an RNA-seq dataset for two groups of mouse samples: knockout and wild type. I have the normalized quantification values for all datasets. Please guide me on how to perform PCA on the normalized values. I am not a bioinformatician, so kindly suggest non-coding methods.
Thanks in advance!
Hello everyone! I am searching for a suitable dataset for a nail-tracking application. I found one, but I want the images to be more variable. If you have one, please respond.
Hi everyone,
I have panel dataset of 4 periods and 29 countries. Which method/technique can be suitable for my dataset?
Thanks for your answer in advance.
I am working on an object recognition model that detects whether the person in a picture is wearing any type of headwear (hat/cap/helmet/scarf/raincoat, etc.). I am unable to find any large publicly available datasets of this kind. As of now, I am resorting to writing scripts that scrape images of people wearing hats/caps from the web, using the Bing and Google image downloader APIs. Please let me know of any publicly available datasets of this kind. Thank you.
Hello everyone
I am a student working on a project about the prediction of the privacy policy applied to textual posts on facebook. The objective is to predict for a post of a specific user if they would share it with the public, their friends or some more specific audience.
In terms of previous works on the subject, I found some articles that do this:
Paper 3 : http://www.l3s.de/~herder/research/papers/2015/analyzing_and_predicting_privacy_settings.pdf
Paper 4 : https://www.scitepress.org/papers/2016/56897/56897.pdf (Section 4.2)
The first and second papers (and to a lesser extent, the fourth one too) use a model that is trained only on the data of the user, which gives them a model specific to that user, as opposed to the methodology in the third paper. The first one uses only 20 posts, and for the second the precise number is unknown (maybe more than 60).
My first question is: isn't that too small a dataset to train a text classification model? The tf-idf vectors used would have high dimensions, and the number of words that appear in multiple posts would be small.
I tried to replicate their results with some data collected thanks to some friends (I asked 7 people to label 20 different posts each), and any model with tf-idf seems to give pretty bad results (they just act like dummy classifiers and predict the majority class).
I tried adding a small number of features next to the tf-idf vector, like the length of the post or its positivity/negativity/objectivity score obtained with a sentiment analysis tool, but it doesn't seem to affect the model at all.
The first paper got a high accuracy with only 20 posts per user (maybe because the majority class had a high ratio of more than 70%?), while the fourth one couldn't get past 65% with a much bigger dataset.
My second question is: am I missing something? What do you think of the feasibility of the approach (using a very small dataset), and how could the results be improved?
I have two datasets (.edf) of EEG recordings: one for healthy people and one for depressed people.
Each recording has 20 channels. So far I have opened the data in MATLAB with edfread() as a timetable.
How can I add white noise to that timetable?
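In MATLAB, one route is to add scaled `randn` noise to the timetable's `Variables` property, something like `tt.Variables = tt.Variables + sigma*randn(size(tt.Variables))`. The underlying idea, sketched in Python on a synthetic 20-channel recording with a target signal-to-noise ratio (both the signal and the 10 dB target are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one EEG recording: 20 channels x 1000 samples.
eeg = np.sin(np.linspace(0, 8 * np.pi, 1000))[None, :].repeat(20, axis=0)

snr_db = 10.0                                     # target SNR in decibels
signal_power = np.mean(eeg ** 2)
noise_power = signal_power / (10 ** (snr_db / 10))

# White Gaussian noise with the variance implied by the target SNR.
noise = rng.normal(0.0, np.sqrt(noise_power), eeg.shape)
noisy = eeg + noise
print(noisy.shape)
```

Fixing the noise variance via a target SNR (rather than picking sigma by eye) makes the corruption level reproducible across the healthy and depressed datasets.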
Dear All,
I am looking for some partial discharge (PD) datasets to download. I would appreciate it if you could mention some sources from which I can download PD datasets.
Regards,
Anis
Hello Friends,
I am applying ML algorithms (DT, RF, ANN, SVM, KNN, etc.) in Python to my dataset, which has continuous features and target variables. For example, with DecisionTreeRegressor I get an R² of 0.977. However, I am interested in deploying classification metrics such as the confusion matrix and accuracy score, so I converted the continuous target values into categorical ones. Now, when I apply DecisionTreeClassifier, I get an accuracy score of 1.0, which I think indicates overfitting. I then applied normality checks and correlation techniques (Spearman), but the accuracy remains the same.
My question is: am I right to convert numeric data into categorical data?
Secondly, if both a regressor and a classifier are used on the same dataset, will the accuracy change?
I need your valuable suggestions, please.
For details, please see the attached files.
Thanks for your time.
We are building an Arabic speech emotion dataset with 508 recorded persons, and every person recorded ten exact phrases divided into five emotions. The WAV files are noise-free and will be converted to MFCC and LDD features.
The validation process is in progress manually by a team of neuro-linguistics and psychologists. The dataset will be a public free access dataset.
What is the process of publishing this dataset?
What is the best journal to publish in?
I am looking for a publicly available video/image dataset from surveillance cameras for detecting whether violence, such as a fight, has occurred. A surveillance dataset containing such a class would also be helpful. I have collected the UCF-Crime and NTU-CCTV datasets; I want to know how to get more datasets like these, or how to collect such videos myself.
Hello seniors, I hope you are doing well.
Recently I've read some very good research articles in which the datasets were taken from V-Dem, Polity and Freedom House. Although they shared links to the supplementary datasets and briefly described how they analyzed them in SPSS or R, I couldn't understand or replicate the findings, perhaps because I am not very good at quantitative data analysis.
So I want to know how I could more easily understand the analysis of datasets like V-Dem. Is there a good online course, lecture, conference video, or book?
Article links
Any help would be appreciated.
Thanks in anticipation.
Date Price Volume Turnover
1/1/22 10 12 120
1/1/22 11 10 110
1/1/22 13 20 260
1/1/22 12 15 180
1/1/22 10 13 130
1/1/22 9 9 81
Once I sort turnover in ascending order for each day, I have 81, 110, 120, 130, 180, 260. Now I need to create categorical variable 1 for 81 and 110, variable 2 for 120 and 130, and variable 3 for 180 and 260. My dataset spans many years, and for each day there are thousands of transactions.
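The grouping described above, per-day turnover terciles, can be sketched with pandas using `groupby` plus `qcut`, which bins each day's transactions into three equal-sized quantile groups. The frame below reproduces the example rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Date":     ["1/1/22"] * 6,
    "Price":    [10, 11, 13, 12, 10, 9],
    "Volume":   [12, 10, 20, 15, 13, 9],
    "Turnover": [120, 110, 260, 180, 130, 81],
})

# Within each day, cut Turnover into three equal-sized groups labelled 1-3.
df["TurnoverCat"] = df.groupby("Date")["Turnover"].transform(
    lambda s: pd.qcut(s, 3, labels=False) + 1)

print(df.sort_values("Turnover")[["Turnover", "TurnoverCat"]])
```

With thousands of transactions per day the same two lines scale unchanged; if heavy ties make `qcut` complain about duplicate bin edges, ranking first (`pd.qcut(s.rank(method="first"), 3, labels=False) + 1`) forces exactly equal-count groups.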
I am looking for Nitrogen, Potassium, Phosphorus, Organic carbon stock , pH value, Added Nutrients features in soil dataset.
Kindly suggest a relevant data source.
Hi
I have a western blot dataset for 3 experimental and 3 control samples, for one housekeeping protein (actin) and three target proteins. Kindly guide me in detail on what type of statistics I should apply, and how to normalize the data.
Thanks in Advance!
This database contains 494,414 face images of 10,575 actors from IMDb. The face images include random variations in pose, illumination, facial expression, and resolution.
I need to remove the background and keep the frontal pose of each face image.
I want to extract data from graphs. Can anybody suggest a good tool for this kind of data collection for dataset preparation?
Dear Researchers,
Greetings!
I need an IFDB dataset for my research work. Can anyone help in this regard? I'm not able to download it directly from the website.
Let me know if anyone can help.
Thanks in advance.