Science topic

Dataset - Science topic

Explore the latest questions and answers in Dataset, and find Dataset experts.
Questions related to Dataset
  • asked a question related to Dataset
Question
7 answers
I have a dataset for a between-group analysis. There are missing values in the control group data, so I performed multiple imputation on the control group and obtained 5 imputed datasets for it.
When I run an independent samples t test, do I use the pooled results of the control group data together with the original intervention group data?
Thanks.
  • asked a question related to Dataset
Question
4 answers
I tried opening it with different computer programs (Notepad, Excel, ...) but could not. This is my first time using climate datasets, and I am genuinely confused about how to access the data.
Relevant answer
Answer
The JRA-55 data I have is most probably in GRIB format. The file has no extension, but the JRA-55 documentation mentions that the data is in GRIB format.
  • asked a question related to Dataset
Question
6 answers
Where can I get a free, standard 3D face dataset that is available for download for research purposes?
Relevant answer
Answer
BU-3DFE
  • asked a question related to Dataset
Question
2 answers
Al-Salamo Alikom;
What is the benefit, in terms of the number of citations for the researcher and the journal, of adding code to a paper to make the research reproducible, tested on some of the most widely used available datasets?
Kind regards,
Osman
Relevant answer
Answer
According to the No Free Lunch theorem, there is no algorithm that dominates on all datasets. Thus, making the code available in a research paper, so that the work is reproducible by other researchers and in other studies, will increase the number of citations of the original paper and the standing of the journal.
  • asked a question related to Dataset
Question
4 answers
Good morning,
Can anyone suggest a dataset presenting historic reference evapotranspiration in the different provinces?
Thanks a lot !
Relevant answer
  • asked a question related to Dataset
Question
4 answers
Hello, I am currently conducting a moderated mediation analysis in AMOS and want to mean-centre my IV and moderator. To calculate the mean, do I use the original dataset or the dataset from which I have removed some items with low factor loadings?
Thank you.
Relevant answer
Answer
Thank you all very much.
  • asked a question related to Dataset
Question
12 answers
skin disease image dataset
Relevant answer
Answer
  • asked a question related to Dataset
Question
2 answers
How can I make Google Colab Pro run faster with an image dataset? I am already using it with Google Drive, but it is too slow.
Relevant answer
Answer
Ajay Krishan Gairola Are you getting slower performance when you train the model or when you load the data? Perhaps you should use the TensorFlow 2.x tf.data pipeline, which is specifically optimized for such scenarios.
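For illustration, a minimal tf.data sketch (assuming a recent TensorFlow and that the images have first been copied from Drive to the Colab VM's local disk, e.g. under a hypothetical /content/images folder, since per-file reads from Drive are the usual bottleneck):
import tensorflow as tf
# Build a dataset from a local directory; copying the data from Drive to the
# VM's disk once is far faster than streaming each file from Drive.
ds = tf.keras.utils.image_dataset_from_directory(
    "/content/images",  # hypothetical local path
    image_size=(224, 224),
    batch_size=32,
)
# Cache decoded images and prefetch batches so the GPU is not starved.
ds = ds.cache().prefetch(tf.data.AUTOTUNE)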
  • asked a question related to Dataset
Question
5 answers
medical image analysis problems with datasets
Relevant answer
Answer
Computer Vision. Discovery Radiomics. Evolutionary Deep Intelligence. Image Segmentation/Classification
General health and scientific research
  1. NLM's MedPix. A free online Medical Image Database with over 59,000 indexed and curated images from over 12,000 patients.
  2. The Cancer Imaging Archive (TCIA) ...
  3. Re3Data. ...
  4. V7 COVID-19 X-Ray dataset. ...
  5. COVID-19 image dataset. ...
  6. COVID-19 CT scans. ...
  7. CT Medical Images. ...
  8. Deep Lesion.
GOOD LUCK
  • asked a question related to Dataset
Question
2 answers
I've done an RNA-seq analysis on a dataset downloaded from GEO, looking at immune gene expression in asthmatic, COPD, and normal epithelial lung cells. I am trying to do a t-test for my statistical analysis, but I need to group my data into asthmatic, healthy, and COPD samples/cells, as it doesn't show in R which samples belong to which group. How can I group them?
Relevant answer
Answer
Hi,
You need to compare the disease versus control samples when doing the statistical test. I didn't fully get whether the problem is that you lack the information on which sample belongs to which category, or whether it is a coding problem. As for the former, datasets usually have a metadata file in which the sample names you find in the gene expression table are present, together with the treatment information. If it is a coding problem, you can index the sample names to divide the data into the Asthmatic, Healthy, and COPD groups.
  • asked a question related to Dataset
Question
2 answers
I have done 1:5 case-matching in my study, so my dataset has 100 intervention-group and 500 control-group participants.
When I run independent samples t test, do I use the 100 intervention group vs all 500 control group?
When I present participant characteristics, do I use 100 intervention group vs all 500 control group?
Thanks.
Relevant answer
Answer
David Schmidt Thanks a lot :)
  • asked a question related to Dataset
Question
6 answers
Hello! I am putting together a dataset of benthic macroinvertebrate monitoring / count data from estuaries and coastlines along the North American Coast (Canada and US). I know of larger datasets like the NCCOS National Benthic Inventory and EMAP, but I was wondering if anyone knew of other regional datasets. It would be preferable if the data were collected using Young-modified Van Veen grab samplers along with information on water quality and sediment quality, but any dataset recommendations will be greatly appreciated!
Relevant answer
Answer
Marcus W. Beck Yes, I am still on the hunt for more data! Thank you for sharing the links to the Tampa Bay monitoring data, I did not know about them. I will take a look at them!
  • asked a question related to Dataset
Question
2 answers
cycleGAN performs well on unpaired datasets, and the attention mechanism has become a hot topic in recent years, so can we combine attention and cycleGAN?
Is there such a project? Papers and code are preferably available, thanks.
Relevant answer
Answer
Dear Chen Yijia ,
The application of deep learning in the field of drug discovery brings the development and expansion of molecular generative models along with new challenges in this field. One of challenges in de novo molecular generation is how to produce new reasonable molecules with desired pharmacological, physical, and chemical properties.
Regards,
Shafagat
  • asked a question related to Dataset
Question
4 answers
Hello!
I would like to get the average curve from several curves on a plot. Is there a way to do this in Excel?
(Background information: I have drawn three different curves for three different x-y data sets. However, each of these has a different number of XY points and a different length, so I can't simply take the average across the rows. Any solutions?) See the picture.
Thanks!
Relevant answer
The following Python code can average multiple curves and plot them.
I've shared it on Github and it's available from the following link:
  • asked a question related to Dataset
Question
8 answers
I want to use image dataset (that is stored in my personal computer) in google colab. Please help.
Relevant answer
Answer
Hi, you can use a dataset in Colab by using Google Drive.
  • asked a question related to Dataset
Question
1 answer
Hello everyone!
I have a doubt regarding the forecast data set. In this data set, the forecast for different lead hours is given with the first day of the month as the initialization day. How do I find the lead-hour forecast data for the 2nd day of the month?
Do I have to purchase this, or are there other methods to get the data? Are there any alternate data sets?
Relevant answer
Answer
Why are dates 2 to 30 or 31 deactivated? How can I get the lead-hour forecast information for different dates of the month?
  • asked a question related to Dataset
Question
2 answers
Dear all,
I tried to download some 3D reflectivity datasets over CONUS. The Earthdata Search has NEXRAD mosaic but is only available for 3 months in 2020. The radar data for NEXRAD is also station data. The only one I found that is 3D (mosaic) is the National Reflectivity Mosaic data but it is not available for download at https://www.ncei.noaa.gov/maps/radar/ . Does anyone know how to download this data or where to acquire the 3D reflectivity dataset?
Best regards,
Haochen
Relevant answer
Answer
@Shafagat
Dear Shafagat,
Thank you for your reply. I am wondering whether you know of any more recent data? Something like this, but not station data.
Best regards,
Haochen
  • asked a question related to Dataset
Question
7 answers
Suppose I have a dataset f(x). I want to fit this dataset with a fitting function
g(f1,f2,f3,x) = a*x +b*f1(x)+c*f2(x)+d*f3(x). Here, f1(x), f2(x) and f3(x) are three different datasets. Can anyone tell me how to fit f(x) with g(f1,f2,f3,x)? I tried to fit in Origin using this method: https://www.originlab.com/doc/Tutorials/Fitting-Datasets, but it didn't work very well. Is there any other suggestion? Thanks for your help
Relevant answer
Answer
I'm not sure if I understood correctly but I would use mathematica. One can simply make an interpolation for f1(x), f2(x) and f3(x):
(*Data to be fitted, coefficients like 0.005, 0.12 etc. to be found from the fitting*)
fxData=Table[{x,RandomReal[{0.997,1.003}] (0.005 x+0.12 Sqrt[x]+0.18 Exp[-x]-0.015 x^1.2)},{x,0,10,0.01}];
(*other data,for fxData,f1xData and so on one can use Import[],here are just random lists as an example*)
f1xData=Table[{x,Sqrt[x]},{x,0,10,0.01}];
f2xData=Table[{x,Exp[-x]},{x,0,10,0.01}];
f3xData=Table[{x,x^1.2},{x,0,10,0.01}];
(*Interpolation of f1x,f2x,f3x*)
f1x=Interpolation[f1xData];
f2x=Interpolation[f2xData];
f3x=Interpolation[f3xData];
(*fitting function*)
g[x_,a_,b_,c_,d_]:=a x+b f1x[x]+c f2x[x]-d f3x[x]
(*fitting*)
nlm=NonlinearModelFit[fxData,g[x,a,b,c,d],{a,b,c,d},x];
(*plot and parameter table*)
Show[{ListPlot[fxData,PlotRange->All],Plot[nlm[x],{x,0,10},PlotStyle->Red]}]
nlm["ParameterTable"]
Copy and paste should work :)
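For anyone without Mathematica, a rough Python equivalent of the same idea, using scipy.optimize.curve_fit with interpolated datasets (the synthetic arrays below mirror the Mathematica example and are only stand-ins for your real data):
import numpy as np
from scipy.interpolate import interp1d
from scipy.optimize import curve_fit
# Synthetic stand-ins for the four datasets; replace with your own arrays
x = np.linspace(0.01, 10, 500)
f1, f2, f3 = np.sqrt(x), np.exp(-x), x**1.2
f = 0.005 * x + 0.12 * f1 + 0.18 * f2 - 0.015 * f3
# Interpolate the three auxiliary datasets so they can be evaluated anywhere
f1i = interp1d(x, f1, fill_value="extrapolate")
f2i = interp1d(x, f2, fill_value="extrapolate")
f3i = interp1d(x, f3, fill_value="extrapolate")
def g(xv, a, b, c, d):
    # fitting function: a*x + b*f1(x) + c*f2(x) + d*f3(x)
    return a * xv + b * f1i(xv) + c * f2i(xv) + d * f3i(xv)
popt, pcov = curve_fit(g, x, f)
print(popt)  # should recover roughly [0.005, 0.12, 0.18, -0.015]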
  • asked a question related to Dataset
Question
5 answers
Basically I have a great interest in brain signal processing and analysis. It would be great if anyone can help to find out an open access EEG or fNIRS dataset of hand movement or human gait.
Relevant answer
Answer
Check with the Georgia Tech Behavioral Medicine department. I know they were doing studies on individuals' gaits.
  • asked a question related to Dataset
Question
10 answers
I'm working on a supervised Classification task with seven classes. The problem is that the dataset is very large and hugely imbalanced, with the number of data points for the major class being 100 times the minor class.
First, I tried to subsample the dataset into a smaller balanced dataset randomly, and the highest accuracy I could obtain after tuning hyperparameters was around 90%.
Then I decided to train the tuned models over the whole dataset (70% of data for training and 30% for test), and surprisingly, the accuracy of the models reached 95% or higher.
My question now is, which procedure is the correct one: training on the subsampled dataset and testing on the large dataset, or training and testing on the whole dataset?
Relevant answer
Answer
The imbalanced data is a common problem especially in Medicine.
I suggest you need to perform the following:
1- Split your entire data into train and test sets. The usual ratio is 80% for the train set and 20% for the test set. Make sure you stratify your split, meaning that your train and test sets have the same imbalance ratio as your original data. As an example, you can use the sklearn library, which splits your data and also performs the stratification, as below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)
2- Use up-sampling or down-sampling techniques to balance your data ONLY on your train set. DO NOT perform this on your test set! Your test set should remain intact! (A sketch follows after this list.)
3- Fit your ML model on your train set which is already balanced from the previous step.
4- Evaluate your model accuracy on your test set which is already imbalanced!
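A minimal Python sketch of steps 1 and 2, assuming X and y are NumPy arrays (sklearn.utils.resample is used here for the up-sampling; libraries like imbalanced-learn offer more sophisticated options):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
X = np.random.randn(1000, 5)            # placeholder features
y = np.random.randint(0, 7, size=1000)  # placeholder labels for 7 classes
# Step 1: stratified split so both sets keep the original class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y)
# Step 2: up-sample every class in the TRAINING set to the majority size
classes, counts = np.unique(y_train, return_counts=True)
n_max = counts.max()
X_parts, y_parts = [], []
for c in classes:
    Xc = X_train[y_train == c]
    X_parts.append(resample(Xc, replace=True, n_samples=n_max, random_state=42))
    y_parts.append(np.full(n_max, c))
X_train_bal = np.vstack(X_parts)
y_train_bal = np.concatenate(y_parts)
# X_test / y_test stay untouched and keep the original imbalance.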
  • asked a question related to Dataset
Question
4 answers
A bottom-up stepwise regression on a 140-variable, standardized dataset (all features have mean 0 and stddev 1) selected 10 variables as the best predictors for a certain target.
The stepwise regression first selected the predictor variable with the highest adjusted R2 (R2adj), then added the second predictor variable which increased R2adj the most, and so on, until R2adj started to decrease again (this happened after 10 added variables). All selected predictors needed to have p<0.05, or they were discarded. Hence, this stepwise regression implicitly ranked the 10 selected predictor variables, out of 140, from most important to less important in terms of R2adj.
I expected that the absolute values of the regression coefficients of the selected predictors would also decrease, together with the decline in added R2adj. This, however, turned out not to be the case. For instance, the most important explanatory predictor (in terms of R2adj) did not have the highest absolute regression coefficient when compared to the other 9 selected predictors. Remember that all predictors are standardized.
What could be the reason for this?
Relevant answer
Answer
Hello Jan,
If your goal was to identify some "optimal" subset of IVs to best explain differences on an outcome of importance to you, stepwise models are among the least dependable means to accomplish this goal. In point of fact, you are in no way guaranteed to identify the best ensemble of IVs (for a given proportion of the total available IVs) via step methods. As well, they suffer from a number of other technical issues.
There are far better options available. Have a look at this list: https://scikit-learn.org/stable/modules/linear_model.html (and note the adaptive lasso option, among others).
Finally, with respect to your question: Why do the relative sizes of regression coefficients change as you add IVs to a model?
Because of overlap/redundancy among IVs (when excessive, this is referred to as multicollinearity). Later-entered variables can supplant earlier-entered variables, even though they might not have been as potent an individual IV as the first IV entered. If all IVs were genuinely independent of each other, then sequence of entry into a model would not change the explanatory power of a given IV as the model grew in number of predictors.
Good luck with your work.
  • asked a question related to Dataset
Question
2 answers
For example, suppose we collected data (Dataset.csv) with 7 million records and want to take a sample of just 1 million records.
What are the first step, second step, and so on, if the dataset needs the following steps:
(labeling \ numeric conversion \ normalization \ balancing \ sampling 1 million records \ cleansing)?
Additional question: is it okay if we balance only the normal records and not the attack records, or is there a problem with that?
Relevant answer
Answer
To randomly select rows (packets) using a Pandas DataFrame:
a)
df = df.sample()  # randomly select a single row
b)
df = df.sample(n=500)  # randomly select a specified number of rows; for example, to select 500 random rows, set n=500
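If the 1-million-record sample should also preserve the class proportions of the full 7-million-record file, a hedged sketch (assuming a label column named "label", which is hypothetical):
import pandas as pd
df = pd.read_csv("Dataset.csv")  # the 7-million-record file
# Take the same fraction from every class so the sample keeps the
# original label proportions (requires pandas >= 1.1 for groupby.sample)
frac = 1_000_000 / len(df)
sample = df.groupby("label", group_keys=False).sample(frac=frac, random_state=42)
sample.to_csv("Dataset_1M.csv", index=False)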
  • asked a question related to Dataset
Question
7 answers
When carrying out scientific research, is it better to use a public data set from the field or a self-made data set that we collect ourselves?
Relevant answer
Answer
That can hardly be said in such abstract terms. It depends on how good the public data set is and how good your own is. The best thing is to use both and compare the results.
  • asked a question related to Dataset
Question
2 answers
I am interested in the study of features that can determine gender and age from short speech samples (1 to 9 seconds). The audio is from a public set (the Mozilla Common Voice dataset), where the duration and the quality are variable.
Relevant answer
Answer
1. F0 would be one indicator of the influence of gender on voice.
2. Many parameters have been evaluated to answer this question.
3. By acoustic analysis, I assume you mean pitch and intensity analysis. The parameters may differ depending on the analysis technique one uses.
4. Commonly, age-wise and gender-wise norms are established for F0, harmonic-to-noise ratio, jitter, and shimmer, to name a few. We use such norms routinely in clinics when trying to identify abnormalities in voice.
5. Cepstral peak prominence (CPP) and related parameters are often used in cepstral analysis.
6. A routine search on PubMed should give you lots of articles on this.
  • asked a question related to Dataset
Question
3 answers
I am close to submitting a paper for publication in an Economics Journal. My paper is based on my empirical cross-country analysis with a sample of 126 countries. This analysis includes around 20 variables averaged over the 10-year sample period. These variables come from multiple databases such as the World Bank's World Development Indicators and the IMF's International Financial Statistics. To create the dataset used in my analysis I simply downloaded each respective database into excel, removed the countries that are not in my sample, averaged each variable for the sample period, and then copy and pasted these variables into a column in my dataset. Is this an appropriate way to source and format data for academic research?
Relevant answer
Answer
No!
  • asked a question related to Dataset
Question
4 answers
Pearson test can be used to find out the correlation between two continuous data sets.
Spearman test can be used to find out the correlation between two ordinal data sets.
Do we have any test to find out the correlation between ordinal data (the mean score of a Likert-scale data set) and continuous data (academic performance in terms of exam scores)? If not, how can we do that?
Relevant answer
Answer
If you're using the mean of multiple Likert-scale items, you're already treating the data as continuous (i.e., interval), in which case you can go ahead with Pearson correlation.
An alternative option is to use the median or mode of the Likert-scale items instead and conduct an ordinal regression.
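Either coefficient is one line with SciPy; a minimal sketch with placeholder arrays standing in for the Likert-item means and the exam scores:
import numpy as np
from scipy.stats import pearsonr, spearmanr
scores = np.array([3.2, 4.1, 2.8, 4.6, 3.9])  # placeholder Likert-scale means
exam = np.array([62, 75, 58, 88, 71])         # placeholder exam scores
r, p = pearsonr(scores, exam)       # treats the Likert means as interval data
rho, p_s = spearmanr(scores, exam)  # rank-based, if you prefer an ordinal treatment
print(r, rho)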
  • asked a question related to Dataset
Question
6 answers
Descriptive research, such as research describing a large dataset (for example, a travel behaviour dataset) via preliminary analysis.
Relevant answer
Answer
  • asked a question related to Dataset
Question
3 answers
Hello there,
I am searching for freely available pixel-based datasets (derived from satellites, or mixed products like the CRU or IRI data library) with a resolution of less than 500 m (preferably less than 100 m). It would be nice if you could name some!
Thank you so much for your attention and participation.
Relevant answer
Answer
Dear Sakib,
There are literally thousands of freely available data sets worldwide. What exactly do you need? No instrument is perfect and each data set has its own advantages and drawbacks. You should select those that are most appropriate for your purposes, and in particular determine your accuracy requirements, as they will imply close looks at the calibration issues as well as considerations regarding post-processing. Once you have clearly identified the parameters you need, the spatial and temporal extents and resolutions required, and the minimum accuracy needed, then you can search for the best inputs for your purpose.
By the way, NASA does offer a wide range of data sets but it is not the only source of information: the European Space Agency (ESA), as well as national space agencies of Japan, China, France, UK or Brazil (and many others) also have worthwhile offerings. You will find useful links to those data sources by searching the web.
Best regards, Michel.
  • asked a question related to Dataset
Question
10 answers
In a machine learning-based approach, can a dataset used to check the accuracy of model prediction be part of the training set?
Relevant answer
Answer
A dataset in machine learning is, quite simply, a collection of data pieces that can be treated by a computer as a single unit for analytic and prediction purposes. This means that the data collected should be made uniform and understandable for a machine that doesn't see data the same way as humans do.
Most data can be categorized into 4 basic types from a Machine Learning perspective: numerical data, categorical data, time-series data, and text.
training set—a subset to train a model. test set—a subset to test the trained model.
The “training” data set is the general term for the samples used to create the model, while the “test” or “validation” data set is used to qualify performance. Perhaps traditionally the dataset used to evaluate the final model performance is called the “test set”.
  • asked a question related to Dataset
Question
4 answers
So, with some difficulty, I have been able to do wavelet analysis for the time series datasets that I have. The thing is, all these data sets can be combined to form a year-long dataset, with some gaps as big as a month.
A solution to this is to interpolate the data. But considering that my data has a sampling rate of 10 minutes, a one-month gap would not allow me to pursue this solution.
For discontinuous (unevenly spaced) data, the Lomb-Scargle periodogram is used instead of the FFT. If someone can suggest a similar workaround for wavelet analysis, it would be highly appreciated.
Relevant answer
Answer
Dear Ranjan Kumar Sahu,
If you have sufficient data points, I would suggest you analyze the periods before and after the gap separately.
  • asked a question related to Dataset
Question
2 answers
I would like to know why SBERT takes less time than BERT on a large text data set.
Relevant answer
Answer
A followup question: Doesn't BERT rely on context or embeddings of nearby words to produce embedding of a word?
  • asked a question related to Dataset
Question
4 answers
Where can I find a Twitter dataset for Preliminary Flu Outbreak Prediction Using Twitter Posts Classification?
Relevant answer
Answer
The following is a list of some datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
5 answers
This dataset includes CSV files that contain the tweet IDs. The tweets have been collected by the model deployed here at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets, using filters: language “en”, and keywords “corona”, "coronavirus", "covid", "covid19" and variants of "sarscov2".
As per the Twitter Developer Policy, it is not possible for me to provide information other than the Tweet IDs (this dataset has been completely re-designed on March 20, 2020, to comply with data sharing policies set by Twitter). Note: This dataset should be solely used for non-commercial research purposes. A new list of tweet IDs will be added to this dataset every day. Bookmark the dataset page for further updates.
Dataset status as of May 24, 2020: 116,962,112 Global Tweets (EN) 
Relevant answer
Answer
The following is a list of some datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
6 answers
I am looking for a dataset where the users along with the vehicles are in motion. If the data set contains any social information, that would help me a lot.
Relevant answer
Answer
The following is a list of some more datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
1 answer
Collaborating Filtering or Content-based Recommender Systems.
Relevant answer
Answer
The following is a list of some datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
1 answer
I would like to simulate a disaster environment. Can any one provide me a dataset for urban/sub-urban uneven terrain please.
Relevant answer
Answer
The following is a list of some more datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
1 answer
Is there any website form where i can dowload the free data sets related to the biomedical image processing or some other text related data to build but the Deep lerning model.
Relevant answer
Answer
The following is a list of some datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
5 answers
Hello,
I am studying Computer Science and I am currently working on my Bachelor thesis. For that, I am looking for suitable datasets. My goal is to apply Process Mining to these datasets to identify and analyze interesting processes. However, the problem is that these datasets need to be in a certain format to be suitable for Process Mining. The data needs to have a Case Id, Activity, and Timestamp column. In other words, the data needs to be activity-based so that processes with different activity sequences can be found.
I wanted to ask if someone has any idea where I could find such datasets? I'd be most interested in datasets in sectors such as energy, waste management, public work (but other input would be helpful as well). So far I mainly could find the datasets from previous years' BPI challenges.
Here is a short page with more information about Process Mining and the desired format (including a brief example):
Any feedback would be highly appreciated.
Thanks in advance,
Louis
Relevant answer
Answer
The following is a list of some more datasets which might be helpful for your work.
Dataset #1
MonkeyPox2022Tweets: The First Public Twitter Dataset on the 2022 MonkeyPox Outbreak - The dataset consists of a total of 255,363 Tweet IDs of the same number of tweets about monkeypox that were posted on Twitter from 7th May 2022 (date when the first case of this outbreak was detected) to 23rd July 2022. Link to the dataset - https://doi.org/10.5281/zenodo.6898178
Dataset #2
Twitter Conversations about the COVID-19 Omicron Variant - It presents a total of 522,886 Tweet IDs of the same number of tweets about the SARS-CoV-2 Omicron Variant posted on Twitter since the first detected case of this variant on November 24, 2021. Link to the dataset - https://doi.org/10.5281/zenodo.6893676
Dataset #3
Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave - The dataset comprises a total of 52,984 Tweet IDs of the same number of tweets about online learning posted since the first detected case of the Omicron variant. Link to the dataset - https://doi.org/10.5281/zenodo.6837118
  • asked a question related to Dataset
Question
2 answers
I am training a Beta-VAE using the BDD-100k driving dataset. Here are my hyperparameters: Adam optimizer, learning rate 0.0001, latent dimension 16; the loss function is the reconstruction loss (MSE) plus the KLD loss multiplied by the Beta factor. After a while of training, the model seems to have learned something, but with different samples the exact same model's performance is completely different. Can anyone give me a hint on how to understand what is going on? Thanks! Here are examples of the same model generating different results.
Relevant answer
Answer
color and clarity
  • asked a question related to Dataset
Question
4 answers
For a dataset like
1. BCI motor imagery EEG signals (example: BCI Competition IV),
2. SEED dataset,
which Python library is best suited for processing and feature extraction tasks?
Relevant answer
Answer
Check this link.
You can also use dfa, numpy, scipy, and sklearn.
I used those libraries in my EEG signal filtering and classification project.
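As a small illustration of the SciPy route, a band-pass filter of the kind typically applied to motor-imagery EEG; the 8-30 Hz band and the 250 Hz sampling rate are assumptions to adjust to the dataset at hand:
import numpy as np
from scipy.signal import butter, filtfilt
fs = 250.0  # sampling rate in Hz (assumed; check the dataset documentation)
# 4th-order Butterworth band-pass over the mu/beta band used in motor imagery
b, a = butter(4, [8.0, 30.0], btype="bandpass", fs=fs)
eeg = np.random.randn(32, 2500)          # placeholder (channels, samples) array
filtered = filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering along time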
  • asked a question related to Dataset
Question
13 answers
I need a security dataset with challenges; I mean, I want the baseline accuracy to be low so that I can improve it using ML techniques. I tried several datasets, but they already give high accuracy without any enhancement.
Relevant answer
Answer
Mohamed Amine Ferrag, Othmane Friha, Djallel Hamouda, Leandros Maglaras, Helge Janicke, "Edge-IIoTset: A New Comprehensive Realistic Cyber Security Dataset of IoT and IIoT Applications for Centralized and Federated Learning", IEEE Access, April 2022, DOI: 10.1109/ACCESS.2022.3165809
  • asked a question related to Dataset
Question
2 answers
I am looking for resting-state (eyes closed) EEG datasets for any kind of psychiatric disorder. These can include, but are not limited to:
  • Alcohol use disorder
  • Acute stress disorder
  • Addictive disorder
  • Anxiety disorder
  • Behavioral addiction disorder
  • Schizophrenia
  • Post traumatic stress disorder
  • Depressive disorder
  • Bipolar disorder
etc.
I would prefer it if the datasets contain raw EEG data, e.g. EDF files. If anyone can assist, I would really appreciate that. Thank you in advance.
Relevant answer
Answer
  • asked a question related to Dataset
Question
1 answer
I am working on classifying sentiments for a tweets dataset in an unsupervised manner. I have used TextBlob polarity, AFINN, and the VADER sentiment analyser for the sentiment classification. Among these, I have got relatively better results with VADER. However, the results are still not good enough in terms of accuracy: VADER gave an accuracy of around 50%.
Is there any way to improve the accuracy of VADER, or is there any other pre-trained model that can be used to get a better classification?
Any help would be highly appreciated.
Thank You.
Relevant answer
Answer
NVIVO
Also the text analysis engine in the IBM Modeler suite (similar to the one in PALANTIR) but much less expensive.
Richard E. Gilder, RN MS
Bioinformatics Scientist
The Gilder Company
TTUHSC Adjunct Faculty
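Another option sometimes tried for tweets, for what it's worth, is a pretrained transformer via the Hugging Face pipeline API (a sketch; the default checkpoint is a general-purpose sentiment model, so a Twitter-specific one may do better):
from transformers import pipeline
# Downloads a default DistilBERT sentiment model on first call
clf = pipeline("sentiment-analysis")
print(clf(["I love this!", "This is awful."]))
# -> [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]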
  • asked a question related to Dataset
Question
2 answers
Dear Ones,
We are in the process of data analysis for our research study on HYPOGLYCAEMIA using SPSS version 23. One of the challenges we faced after completing data entry is that the variable "time of interview", which was coded as NUMERIC during tool design (i.e., variable definition in SPSS), could NOT be transformed to a time because it is read as NUMERIC. Well, that was our MISTAKE!
Solution: we decided to change the variable TIME OF INTERVIEW back to DATE/TIME format, specifying the time as hh:mm, in "variable view" inside SPSS, thinking that it would then be read as a TIME variable (in 24-hour format), but to no avail.
At present, it records a completely different time (actually shifting the time forward by about 6 hours for each entry) instead of the original time.
E.g., for entry number 1, instead of reading 0930 hrs as intended, it currently reads 1530 hrs. It does so for all other entries.
We also tried to convert the same variable "time of interview" back to DATE/TIME using HH:MM:SS, but we ended up with a new problem:
at present, instead of, say, 09:30 hrs, it reads 00:15.
How can we correct our mistake without jeopardising our dataset for the named variable?
Relevant answer
Answer
@shahab : thanks. I normally use SAS myself. It is a friend's research that she just incorporated me. She uses SPSS!
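For what it's worth, the arithmetic of the reported shifts suggests the numeric codes are being reinterpreted as raw time units: SPSS stores times internally as seconds, so an entry like 930 (meant as 09:30) displays as 15:30 if treated as 930 minutes, and as roughly 00:15 if treated as 930 seconds, which matches both symptoms described above. If that is the cause, converting the hhmm-coded numbers to seconds before assigning a time format (e.g. with a COMPUTE along the lines of TRUNC(x/100)*3600 + MOD(x,100)*60) should recover the original times. A quick sanity check of that conversion in Python:
def hhmm_to_seconds(code):
    # Convert an hhmm-coded numeric (e.g. 930 for 09:30) to seconds
    hours, minutes = divmod(int(code), 100)
    return hours * 3600 + minutes * 60
print(hhmm_to_seconds(930))  # 34200 seconds, i.e. 09:30:00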
  • asked a question related to Dataset
Question
9 answers
I have the EEG DEAP dataset in .dat format; by the usual process I can see the complete data for each candidate, but I want to store those data in a CSV file. Can you please help me with this?
Relevant answer
Answer
Sorry for replica, there were errors in the webpage..
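For what it's worth, the preprocessed Python version of DEAP is commonly described as a pickled dict with 'data' and 'labels' arrays; assuming that layout, a sketch for dumping one participant file to CSV:
import pickle
import pandas as pd
# DEAP's preprocessed .dat files are Python 2 pickles; latin1 avoids decode errors
with open("s01.dat", "rb") as fh:
    subject = pickle.load(fh, encoding="latin1")
data = subject["data"]      # reported shape: (40 trials, 40 channels, 8064 samples)
labels = subject["labels"]  # reported shape: (40 trials, 4 ratings)
# Flatten trials x channels into rows so the 3-D array fits a 2-D CSV
rows = data.reshape(data.shape[0] * data.shape[1], data.shape[2])
pd.DataFrame(rows).to_csv("s01_data.csv", index=False)
pd.DataFrame(labels).to_csv("s01_labels.csv", index=False)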
  • asked a question related to Dataset
Question
3 answers
I want data sets of blood and bone cancer. I want to analyze them in Python using artificial neural networks.
  • asked a question related to Dataset
Question
3 answers
Please, can anyone help me download the dataset "Columbia MVSO Image Sentiment Dataset"? I tried to use the link mentioned in the paper, but it's not working!
Relevant answer
  • asked a question related to Dataset
Question
1 answer
I need Python code for the MoleculeNet benchmark datasets, to find the graph embedding for each dataset.
Relevant answer
Answer
Souvik Panda A graph embedding gives each item (typically each node) in the network a fixed-length vector representation. These embeddings are a lower-dimensional representation of the graph that preserves its topology.
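As a toy illustration of the idea (not a MoleculeNet-specific pipeline), a spectral embedding of a graph's nodes from the eigenvectors of its adjacency matrix, using networkx and NumPy; the graph and the embedding size are placeholders:
import networkx as nx
import numpy as np
G = nx.karate_club_graph()  # stand-in graph; replace with your molecular graph
# Adjacency spectral embedding: top-k eigenpairs of the (symmetric) adjacency matrix
A = nx.to_numpy_array(G)
vals, vecs = np.linalg.eigh(A)  # eigenvalues in ascending order
k = 8
embedding = vecs[:, -k:] * np.sqrt(np.abs(vals[-k:]))  # one k-dim vector per node
print(embedding.shape)  # (34, 8) for the karate club graph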
  • asked a question related to Dataset
Question
7 answers
Please let me know the name or URL of any comprehensive Bangla corpus data for SA or ER.
  • asked a question related to Dataset
Question
2 answers
Hello,
I would like to compute the MAF of each SNP in my large data set. Is there a quick way to do so in R or with some module in bash?
Thanks,
Giulia
  • asked a question related to Dataset
Question
3 answers
Hello all, I have a sequence alignment of ~2000 sequences, which is likely more than is necessary. If I begin to remove sequences manually or using some software program, I'm sure I can reduce the number of gaps, but this will of course reduce the size of the alignment (and may introduce some amount of bias/subjectivity). Is it better to keep the larger dataset at the expense of greater gap content? Is there a rough criterion for the minimum amount of gaps an alignment should contain for reconstruction? Thanks very much.
Relevant answer
There are a lot of things to consider. You have ~2000 sequences, but: are these sequences orthologs? Are there paralogs? Do you know the species tree for the organisms from which these sequences were obtained? Is the taxonomic sampling of sequences balanced across the species tree? What is the mean sequence length? What is the MSA length? What is your ASR approach (maximum likelihood, Bayesian, or parsimony)? Are those gap positions due to true insertions and deletions (indels), or are some of them caused by incomplete sequences? If they are indels, have you considered performing an MSA using a statistical model for indels (the Poisson Indel Process) as implemented in the ProPIP MSA software? Have you compared different MSA programs? Are these gaps present in important, conserved sites, or are they only in variable sites?
I believe that the approach to dealing with the trade-off will vary depending on the answer to each of the above questions.
For instance, if you will use maximum-likelihood or Bayesian approaches for ASR, you should be careful with gap positions because the stochastic substitution models used in these approaches do not account for indels. If there are many incomplete sequences, I would remove them as much as possible. If the taxonomic sampling is biased toward one specific clade, and if many sequences of this clade are gappy, I would prune this clade.
I think that perhaps the closest thing to a general rule of thumb is to worry about the quality of your taxonomic sampling rather than the quantity.
Best wishes,
Pedro
  • asked a question related to Dataset
Question
5 answers
So I have calculated the accuracy of my model's predictions using both the training and testing datasets, and I found that it has higher testing accuracy than training accuracy. Is this a normal situation, and how do I interpret it?
Relevant answer
Answer
Hello Irfan Ripat,
as you already guessed, it is not a normal situation to obtain better accuracy on test data than on training data. However, there are some things you can check to help you figure out this situation. The two most common ones are:
  1. Too-strong regularization: as you may know, regularization terms are used to prevent the model from fitting the training data too closely (they give the model something to do other than only fitting the training data). In this case, since you've strongly limited the capacity of your model during training but use the model's full power at test time, you may obtain such accuracies.
  2. Incorrectly split data: there is a high chance that, while splitting the data into train and test sets, the same sample appears in both the train and test data (due to either a bad split or the existence of duplicated data).
Besides the above plausible reasons, I searched for similar issues and found the following links useful:
I hope it helps you!
  • asked a question related to Dataset
Question
3 answers
I want to train a CNN to segment Ground Glass Opacities (GGO) in Lungs CT-scans.
I would need a dataset with CT scans and corresponding masks indicating for every voxel if it is GGO or not (i.e. the ground truth for the segmentation).
Do you know any dataset like that?
Many thanks for your help!!
Relevant answer
  • asked a question related to Dataset
Question
5 answers
I am applying multiple regression analysis to my datasets for prediction purposes. I would like to know the most suitable method for calculating the relative contribution of each predictor.
Relevant answer
Answer
The relative weight analysis addresses the multicollinearity problems and helps calculate each predictor's importance rank.
  • asked a question related to Dataset
Question
7 answers
Can we simulate an IoT-style network using the NetSim 5G library? I would also like to model different kinds of attacks and generate a data set to train an ML classifier.
  • asked a question related to Dataset
Question
4 answers
Hello!
I need help-
I have a data set with around 35-40% missing.
I work with SPSS, what can I do?
I am looking at change in technology anxiety over time in older adults.
Thanks in advance!
Kind regards,
Jessica
Relevant answer
Answer
Under the assumptions of missing at random (MAR) or missing completely at random (MCAR) data, you can use multiple imputation. Alternatively, many other software programs offer full information maximum likelihood estimation which can be applied to many common statistical procedures such as regression and ANOVA and relies on the same assumptions as does multiple imputation.
  • asked a question related to Dataset
Question
2 answers
Non-Invasive Skin Cancer Diagnosis Using Hyperspectral Imaging
Relevant answer
Answer
Thank you very much for your help, but I am looking for multispectral or hyperspectral skin images, whereas the ISIC images are RGB. Thanks again.
  • asked a question related to Dataset
Question
2 answers
I need to do PCA or a classification method on a spectroscopic (LIBS) dataset in MATLAB. How can I get the scripts?
Relevant answer
Answer
Maybe try it on Github.
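If Python is an acceptable fallback while you hunt for MATLAB scripts, PCA on a spectra matrix takes a few lines with scikit-learn; the random matrix below is only a placeholder for your LIBS spectra:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
spectra = np.random.rand(50, 2048)  # placeholder: 50 spectra x 2048 wavelength bins
X = StandardScaler().fit_transform(spectra)  # centre/scale each wavelength channel
pca = PCA(n_components=3)
scores = pca.fit_transform(X)                # per-spectrum scores, e.g. for plotting
print(pca.explained_variance_ratio_)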
  • asked a question related to Dataset
Question
3 answers
I have a nonlinear data set of continuous data points, which consists of 141 rows and 5 columns (4 independent variables and 1 output).
Which machine learning algorithms can I choose to get a good start?
Relevant answer
Answer
You can start with linear regression, and then introduce the nonlinearity in the variables, not the coefficients, for example by using PolynomialFeatures from the sklearn library in Python.
In other words, given 4 variables X1, X2, X3, X4, you can start with first-degree LR => Y ~ X1 + X2 + X3 + X4; then move on to second degree without and with interactions => Y ~ X1 + X2 + X3 + X4 + X1*X1 + X1*X2 + X1*X3 + X1*X4 + ... + X4*X4; then move on to introducing nonlinear kernels => Y ~ exp(X1) + log(X2) + (1/X3) + ..., something like this.
Once all of these fail to capture the variation in the Y data (which can be determined to an extent from the residual plot), move on to other nonlinear techniques like PCR/PLS, SVR, etc.
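A minimal scikit-learn sketch of the first two stages; the arrays are placeholders shaped like the 141 x 4 problem described above:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
X = np.random.rand(141, 4)  # placeholder for the 4 independent variables
y = np.random.rand(141)     # placeholder for the output
# Degree-1 baseline: Y ~ X1 + X2 + X3 + X4
lin = LinearRegression().fit(X, y)
# Degree 2 with interactions: adds X1^2, X1*X2, ..., X4^2 automatically
quad = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(lin.score(X, y), quad.score(X, y))  # R^2 of each fit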
  • asked a question related to Dataset
Question
3 answers
I want to look at country of origin by state, in order to establish communication lines with groups with specific language skills. This is for analytics practice.
Relevant answer
Answer
I don't use Python for this; instead, I'll suggest a really simple tool.
Try Orange. It's simple, as it doesn't need coding, and it's extremely flexible.
  • asked a question related to Dataset
Question
2 answers
I am planning to create a predictive model. However, my approach uses a dataset which is not similar to those of prior works, so I can't apply my model to the previous datasets, and those models will not work with my dataset. How can I validate my research in such cases?
Relevant answer
Answer
You will have to develop a new model based on the new variables and only then test it.
  • asked a question related to Dataset
Question
1 answer
The dataset is cheddar, which you can find in the 'faraway' package.
The response variable Y is 'taste', and X is H2S.
I used this code to calculate the MSE:
m=lm(taste~H2S, data=cheddar)
test.lm <- lm(taste~H2S, data=cheddar)
mean(test.lm$residuals^2)
The result was 109.538; however, the expected value of the MSE is 10.83^2.
Relevant answer
Answer
Perhaps check which denominator each approach uses. mean(test.lm$residuals^2) divides the residual sum of squares by n = 30, while the value you expect (10.83^2 ≈ 117.3) is the squared residual standard error, which divides by the residual degrees of freedom, n - 2 = 28; indeed 109.538 * 30/28 ≈ 117.4. In R, sigma(test.lm)^2 or sum(test.lm$residuals^2)/df.residual(test.lm) reproduces it.
  • asked a question related to Dataset
Question
4 answers
Hello all, I have an hourly dataset whose mean value is around 30. I was able to tune an LSTM model to RMSE = 1.7. Please let me know if this is acceptable or whether it should be tuned further.
Relevant answer
Answer
The mean of the dataset is around 32, and my RMSE is 1.70; is it acceptable then, @Medhat Sir?
  • asked a question related to Dataset
Question
6 answers
I plan to use it in a machine learning class and I want the students to be motivated. Ideally it will be an image dataset.
Relevant answer
Answer
Compound databases of natural products have a major impact on drug discovery projects and other areas of research. The number of databases in the public domain with compounds with natural origins is increasing. Several countries, Brazil, France, Panama and, recently, Vietnam, have initiatives in place to construct and maintain compound databases that are representative of their diversity.
Regards,
Shafagat
  • asked a question related to Dataset
Question
1 answer
Hi,
I have two data sets: one consists of make/age/fuel, and the other includes make/age/fuel/engine size but is a much larger dataset. I need to find the engine size for the cars in the first data set from the second one. What is the faster way, in R or Excel?
thank you
Relevant answer
Answer
Hello Samaneh,
In R, there are usually multiple ways to accomplish a given task. For your query, consider using the match(a, b) command.
Below is sample R code implementing this with two demonstration data.frames. All variables are treated as strings here, so if your variables include other types, be sure to modify the code accordingly. Any unmatched cases in the original data.frame will have "NA" values for the appended variable.
Good luck with your work.
# pick matching cases across two data.frames, add a variable from
# second data.frame to first data.frame
# for simplicity, all vectors are treated as string variables
# so be sure to modify code if numeric, boolean, or other variable types are used
# sample data.frame to be augmented
df <- data.frame(make=c('Chrysler', 'Fiat', 'Ford', 'Ford', 'Volkswagen'),
year=c('2001', '2010', '2016', '2019', '2020'),
mileage=c('23.1', '33.4', '26.4', '15.5', '28.0'))
# sample of "larger" data.frame from which new variable (displacement) will come
df2 <- data.frame(make=c('Audi', 'Chrysler', 'Dodge', 'Fiat', 'Ford', 'Volkswagen', 'Volvo'),
year=c('2015', '2001', '2007', '2018', '2019', '2020', '2021'),
mileage=c('24.7', '23.1', '18.0', '33.4', '15.5', '28.0', '26.5'),
displacement=c('2.8', '3.6', '5.4', '1.6', '5.0', '2.5', '2.5'))
# show original data.frame
df
# show data.frame to be searched for the displacement variable
df2
# add new vector to original dataframe, for cases which match
# note that all unmatched instances in original data.frame will have "NA" values
df$litres = df2$displacement[match(paste(df$make, df$year, df$mileage), paste(df2$make, df2$year, df2$mileage))]
# show revised original data.frame
df
  • asked a question related to Dataset
Question
3 answers
I have developed a multiple regression model, and the number of responses was 164. Now I want to validate the model using a new data set. Is there any rule for the sample size that I can use? One colleague suggested I use 20% of the sample used to develop my model. Please, I need advice.
Relevant answer
Answer
With smaller samples (depending on what you have as a population), you can try to validate your model with additional data collection, especially when you are modelling social phenomena that are highly fragile and thus exhibit unexpected variances. I would not suggest a fixed sample size here, but advise that you continue expanding your validation sample until the change in goodness of fit in response to new data is insignificant. Of course, this assumes a very good sampling design and strategy.
  • asked a question related to Dataset
Question
4 answers
Please share
Relevant answer
Answer
Thanks
  • asked a question related to Dataset
Question
2 answers
Hello,
If both analyses have the same name, please suggest the name of software for performing this analysis and which type of dataset is needed.
Thank you
Devanand Maurya
  • asked a question related to Dataset
Question
3 answers
I have one thousand frames converted from a video taken at a particular location. From that dataset, I need to detect the blurred images and segregate them. How can I do that?
Relevant answer
Answer
If you don't have any reference image in your dataset, then it is problematic; you are right, you can't use that function. What about applying an FFT to the images? You can look at the distribution of low and high frequencies. A low amount of high frequencies can indicate that the image is blurry, but you would need to estimate the right threshold for that "low amount". Maybe this approach is worth a try :)
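A rough NumPy sketch of that FFT idea; the 60-pixel radius and any decision threshold are assumptions to tune on your own frames:
import numpy as np
def high_freq_energy(gray, radius=60):
    # Mean log-magnitude outside a low-frequency disc; low values suggest blur
    f = np.fft.fftshift(np.fft.fft2(gray))
    mag = np.log1p(np.abs(f))
    h, w = gray.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 > radius ** 2  # keep high frequencies
    return mag[mask].mean()
gray = np.random.rand(240, 320)  # placeholder for a grayscale frame
print(high_freq_energy(gray))
# Flag a frame as blurred when its score falls below a threshold estimated
# from a handful of known-sharp frames.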
  • asked a question related to Dataset
Question
1 answer
I have an analyzed RNA-seq data set. The analysis part, including differential gene expression, clustering analysis, and enrichment analysis, has been done. I am aware that the bioinformatics part is done, and most of the analysis part is also done. Could someone please guide me on how to extract the biological relevance from the data set? What should be the starting point for working with these data? Should I start by looking at the differentially expressed genes in the different comparisons, or start from the cluster analysis and try to look for the genes there?
Relevant answer
Answer
In recent years, RNA sequencing (in short RNA-Seq) has become a very widely used technology to analyze the continuously changing cellular transcriptome, i.e. the set of all RNA molecules in one cell or a population of cells. One of the most common aims of RNA-Seq is the profiling of gene expression by identifying genes or molecular pathways that are differentially expressed (DE) between two or more biological conditions. This tutorial demonstrates a computational workflow for the detection of DE genes and pathways from RNA-Seq data by providing a complete analysis of an RNA-Seq experiment profiling Drosophila cells after the depletion of a regulatory gene.
Regards,
Shafagat
  • asked a question related to Dataset
Question
3 answers
I am working on a school project and am having trouble finding data to reference. I know I’ve seen similar studies before. Thanks so much!!!
Relevant answer
Answer
Hi, I don't have a data set, but an interesting question. Try looking at relevant papers and contacting those researchers, they may share data with you if you collect your own.
An example might be:
Renzoni, A., Pirrea, A., Novello, F., Lepri, A., Cammarata, P., Tarantino, C., ... & Perra, A. (2018). The tattooed population in Italy: a national survey on demography, characteristics and perception of health risks. Annali dell'Istituto superiore di sanita, 54(2), 126-136.
Good luck with your research!
  • asked a question related to Dataset
Question
9 answers
Hi
Who knows how I can compare two datasets (with different data types)? The data types do not have the same parameters.
What is the best way to link both datasets? Thanks.
Relevant answer
Answer
You use the terms "compare" and "correlate". Those are two very different activities (one is useful for answering questions like "which is bigger", while the other is useful for answering questions like "does variable A increase as variable B does?"). Without using either term (or any statistical term at all, hopefully), what is it that you're trying to find out?
  • asked a question related to Dataset
Question
3 answers
I need a dataset related to vehicular ad hoc networks for reflection-based DDoS attacks.
Any idea or suggestion is welcome.
Thank you!
Relevant answer
Answer
Dear Samara Mayhoub,
You may find some useful info below:
DDoS Attack Detection in SDN-based VANET Architectures
_____
A Multivariant Stream Analysis Approach to Detect and Mitigate DDoS Attacks in Vehicular Ad Hoc Networks
_____
naveenrj98/Security_Attacks_VANET
_____
  • asked a question related to Dataset
Question
3 answers
I have 79 final line items from a questionnaire, and now I want to cluster the line items into distinct latent variables. Kindly guide me on how I can cluster the line items. Thanks in anticipation.
Relevant answer
Answer
The number of observations does not really matter. To regroup your lines into meaningful groups (or 'patterns', which might be seen as distinct variables in a way), you need a metric: a way to measure how two lines differ from one another. For instance, you may consider a score taking into account all the columns of a line; in general, one often uses the Euclidean distance, but it will not be the best in your case. Anyway, the idea remains to find a way to automatically compare two lines: this is where your knowledge of Business and Economics comes in.
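One concrete way to implement such a metric, sketched in Python: treat 1 - |correlation| between items as the distance and cluster the items hierarchically (the DataFrame below is a placeholder for your respondents-by-items table):
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
df = pd.DataFrame(np.random.rand(200, 79),
                  columns=[f"item{i}" for i in range(1, 80)])  # placeholder responses
corr = df.corr().abs()
dist = 1.0 - corr  # highly correlated items end up close together
condensed = squareform(dist.values, checks=False)
Z = linkage(condensed, method="average")
groups = fcluster(Z, t=5, criterion="maxclust")  # e.g. force 5 item groups
print(dict(zip(df.columns, groups)))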
  • asked a question related to Dataset
Question
9 answers
I am using a panel dataset (N=73; T=9). Dataset Timeframe: 2010-2018
In the GMM estimate on the total dataset, the AR(1) and AR(2) values are fine.
But to investigate the impact of the European crisis, I had to split the data (5 Years during and immediately after the crisis, and the subsequent 4 years). But when GMM is run on the second set of data, (2015-2018), in one of the models, AR(1) and AR(2) values were not generated.
Is the result still usable? What are the potential problems of using this specific result?
  • asked a question related to Dataset
Question
9 answers
I welcome Answers and opinions.
Relevant answer
Answer
Citing it under References is enough, but if you want to publish any data table you need to get permission from the publisher.
  • asked a question related to Dataset
Question
2 answers
Time predicted from a predictive maintenance algorithm.
Relevant answer
Answer
Hello there! What kind of predictive maintenance algorithm are you using in your application?
  • asked a question related to Dataset
Question
6 answers
Hello,
I need to understand this type of analysis; please suggest the name of software and the dataset type that would help me carry out this analysis.
Thank you
Devanand Maurya
Relevant answer
Answer
I can see why that article is no help, because it does not provide a citation for its method. As far as I can tell, "correspondence factor analysis" is simply correspondence analysis.
Why don't you try reposting your question and include correspondence analysis as a search term?
  • asked a question related to Dataset
Question
3 answers
I am looking for a dataset that could be useful for our project.
Relevant answer
Answer
You can try to get similar datasets from UCI machine learning repository or from Kaggle.
  • asked a question related to Dataset
Question
5 answers
Hi All,
I'm carrying out an Ordinal Regression on my dataset. I have continuous predictors (about 12) and an ordinal response variable. When looking into Ordinal Regression in SPSS they have two different procedures to carry this out: PLUM and GENLIN. It is said that GENLIN is better because it is quicker and easier to carry out than PLUM. I wonder if GENLIN has other advantages?
Many thanks!
Laura
  • asked a question related to Dataset
Question
3 answers
I am developing an ML model working with medical specialties.
My dataset contains different specialties (general surgery, urology, etc.). First I built a model with all my data with the specialties mixed, and then I applied the same ML algorithm at the specialty level. I achieve better evaluation metrics for all the specialty models except one. Can someone help me understand the reason?
Relevant answer
Answer
Dear Alice,
as far as I know, there is usually more than one image from each patient in medical datasets. Given that, when splitting your dataset or working with a small subset of the whole dataset, you should make sure that:
  • No two images (scans) of the same patient appear in the train and test sets simultaneously
  • [Softly] the ratio of classes is kept
So in your case, there is a chance that the subset you chose and trained your model on does not contain enough samples of that specific specialty.
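As a concrete illustration, here is a minimal scikit-learn sketch of a patient-level split; the arrays and the patient_id layout are made up, and StratifiedGroupKFold can additionally keep the class ratio roughly constant:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# placeholder data: 100 images from 20 patients, 5 images each
X = np.random.rand(100, 16)
y = np.random.randint(0, 2, size=100)
patient_ids = np.repeat(np.arange(20), 5)

# splitting on groups guarantees no patient appears in both sets
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# sanity check: train and test share no patient
assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])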
I hope it helps you!
  • asked a question related to Dataset
Question
4 answers
Hello everyone!
I have an RNA-seq dataset for two groups of mouse samples, knockout and wild type. I have the normalized values (quants) for all datasets. Please guide me on how to perform PCA on the normalized values. I am not a bioinformatician, so kindly suggest non-coding methods.
Thanks in advance!
Relevant answer
Answer
Hi! I recommend using LatchBio for RNA seq data. I've used it several times, and it is super easy to use since it has a non-coding interface.
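If a small amount of code turns out to be acceptable after all, here is a minimal scikit-learn sketch of PCA on a normalized expression matrix; the random matrix below merely stands in for your quant values, with one row per sample:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# placeholder matrix standing in for normalized counts:
# rows = samples (KO and WT mice), columns = genes
rng = np.random.default_rng(0)
expr = rng.random((6, 2000))

# standardize each gene, then project samples onto the first two PCs
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(expr))
print(pcs)  # one (PC1, PC2) coordinate per sample, ready for a scatter plot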
Hope that helps!
  • asked a question related to Dataset
Question
3 answers
Hello everyone! I am in search of a suitable dataset for a nail tracking application. I found one, but I want the images to be more variable. If you have one, please respond to my request.
Relevant answer
Answer
Look at the link; it may be useful.
Regards,
Shafagat
  • asked a question related to Dataset
Question
6 answers
Hi everyone,
I have a panel dataset of 4 periods and 29 countries. Which method/technique would be suitable for my dataset?
Thanks for your answer in advance.
Relevant answer
Answer
Kelvyn Jones I really appreciate you taking the time. You've broadened my horizons, sir. I am going to start working out these details. Thank you so much, again.
  • asked a question related to Dataset
Question
1 answer
I am working on an object recognition model that detects whether the person in a picture is wearing any type of headwear (hat/cap/helmet/scarf/raincoat, etc.). I am unable to find any large publicly available datasets of this kind. For now I am resorting to writing scripts that scrape images of people wearing hats/caps etc. from the web, using the Bing image downloader API and the Google image downloader API. Please let me know of any publicly available datasets of this kind. Thank you.
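For reference, a minimal sketch of the scraping step described above using the bing-image-downloader package (pip install bing-image-downloader); the queries, limits, and output directory are all illustrative:

from bing_image_downloader import downloader

# one query per headwear type; downloads land in headwear_dataset/<query>/
for query in ["person wearing hat", "person wearing helmet",
              "person wearing scarf", "person without headwear"]:
    downloader.download(
        query,
        limit=100,               # images to fetch per query
        output_dir="headwear_dataset",
        adult_filter_off=False,  # keep the safe-search filter on
        timeout=30,
    )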
Relevant answer
Answer
Check this Dataset, could be useful:
  • asked a question related to Dataset
Question
8 answers
Hello everyone
I am a student working on a project about predicting the privacy setting applied to textual posts on Facebook. The objective is to predict, for a post by a specific user, whether they would share it with the public, with their friends, or with some more specific audience.
In terms of previous work on the subject, I found some articles that do this:
The first and second papers (and, to a lesser extent, the fourth one too) use a model that is trained only on the data of that user, which gives them a model specific to that user, as opposed to the methodology in the third paper. The first one uses only 20 posts, and for the second the precise number is unknown (maybe more than 60).
My first question is: isn't this too small a dataset to train a text classification model? The tf-idf vectors used would have high dimensionality, and the number of words that appear in multiple posts would be small.
I tried to replicate their results with some data collected thanks to some friends (I asked 7 people to label 20 of their posts each), and any model with tf-idf seems to give pretty bad results (the models just act like dummy classifiers and predict the majority class).
I tried adding a small number of features next to the tf-idf vector, such as the length of the post or its positivity/negativity/objectivity score obtained with a sentiment analysis tool, but it doesn't seem to affect the model at all (a sketch of this setup follows below).
The first paper got high accuracy with only 20 posts per user (could this be because the majority class had a high ratio, above 70%?), while the fourth one couldn't get past 65% with a much bigger dataset.
My second question is: am I missing something? What do you think of the feasibility of the approach (using a very small dataset), and how could the results be improved?
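For reference, a minimal scikit-learn sketch of the setup described above, combining tf-idf with a side feature; the toy posts and column names are invented, and class_weight="balanced" is one common counter to always-predicting-the-majority-class behaviour:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# toy posts with one side feature (length) and the chosen audience as the label
df = pd.DataFrame({
    "text": ["going out with friends tonight", "new job announcement!",
             "family dinner photos", "check out my latest blog post"],
    "length": [30, 21, 20, 29],
    "audience": ["friends", "public", "friends", "public"],
})

# tf-idf on the text column; pass the numeric feature through unchanged
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("num", "passthrough", ["length"]),
])

clf = Pipeline([
    ("features", features),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(df[["text", "length"]], df["audience"])
print(clf.predict(df[["text", "length"]]))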
Relevant answer
Answer
An interesting related work that uses BERT.
I know what you did on Venmo
  • asked a question related to Dataset
Question
4 answers
I have two datasets (.edf) of EEG recordings, one for healthy people and one for depressive people.
Each recording has 20 channels. So far I have opened the data in MATLAB with edfread() as a timetable.
How can I add white noise to that timetable?
Relevant answer
Answer
Artificial Intelligence's answer would work, but it might not generate what you need. Have you considered including noise that is true to the body, like bodily functions, sounds, and surrounding RF? Eleonora Adelina Dănilă
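For what it's worth, here is a minimal NumPy sketch of adding Gaussian white noise at a chosen SNR, assuming the 20-channel recording has been exported from the timetable to an array (MATLAB's randn can be used the same way):

import numpy as np

# placeholder standing in for an EEG recording: (n_samples, n_channels)
rng = np.random.default_rng(1)
eeg = rng.standard_normal((5000, 20))

# scale the noise so each channel ends up at the target SNR (in dB)
target_snr_db = 10.0
signal_power = np.mean(eeg ** 2, axis=0)                 # per-channel power
noise_power = signal_power / (10 ** (target_snr_db / 10))
noise = rng.standard_normal(eeg.shape) * np.sqrt(noise_power)

noisy_eeg = eeg + noise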
  • asked a question related to Dataset
Question
29 answers
Dear All,
I am looking for some partial discharge (PD) datasets to download. I would appreciate it if you could mention some data sources from which I can download PD datasets.
Regards,
Anis
Relevant answer
Please share with me: soronilameitei@gmail.com
  • asked a question related to Dataset
Question
6 answers
Hello Friends,
I am applying ML algorithms (DT, RF, ANN, SVM, KNN, etc.) in Python to my dataset, which has features and target variables as continuous data. For example, when I use DecisionTreeRegressor I get an r_2 score of 0.977. However, I'm interested in deploying classification metrics like the confusion matrix, accuracy score, etc. For this, I converted the continuous target values into categorical ones. Now when I apply DecisionTreeClassifier, I get an accuracy score of 1.0, which I think indicates overfitting. I then applied normality checks and correlation techniques (Spearman), but the accuracy remains the same.
My question is: am I right to convert numeric data into categorical data?
Secondly, if both a regressor and a classifier are used on the same dataset, will the accuracy change?
I need your valuable suggestions, please.
For details, please see the attached files.
Thanks for your time.
Relevant answer
Answer
I think there are two misconceptions here.
1) There is no reason to expect similar accuracies for regression and classification on the same data set. Turning a regression problem into a classification problem is tricky, and essentially pointless.
2) r_2 on its own is not a valid index of the quality of a regression model. Imagine a model that systematically gives a prediction equal to 10 times the observation. The squared correlation r_2 between predictions and observations will be equal to 1, although the model is obviously very poor. For regression, the most useful quality index is the root mean squared error, computed on a test set, i.e. on data that have never been used for designing the model, neither for training nor for model selection.
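As an illustration of the test-set RMSE point, here is a minimal scikit-learn sketch; the synthetic data merely stands in for the real features and target:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic continuous data standing in for the real dataset
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = X @ np.array([1.5, -2.0, 0.7, 0.0, 3.1]) + rng.normal(0, 0.1, 500)

# hold out a test set that plays no part in training or model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"test RMSE: {rmse:.3f}")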
  • asked a question related to Dataset
Question
2 answers
We are building an Arabic speech emotion dataset with 508 recorded speakers; each speaker recorded the same ten phrases, divided across five emotions. The WAV files are noise-free and will be converted to MFCC and LDD features.
The validation process is in progress, carried out manually by a team of neurolinguists and psychologists. The dataset will be publicly available free of charge.
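As an aside, for the MFCC step a minimal librosa sketch might look like this (the file path and the n_mfcc value are illustrative):

import librosa

# load one recording at its native sampling rate (path is hypothetical)
y, sr = librosa.load("recordings/speaker001_phrase01.wav", sr=None)

# 13 MFCCs per frame is a common starting point for speech emotion work
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)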
What is the process of publishing this dataset?
What is the best journal to publish in?
Relevant answer
Answer
Ángel Carrión-Tavárez thank you for your reply.
  • asked a question related to Dataset
Question
1 answer
I am looking for a publicly available video/image dataset from surveillance cameras that can be used to detect whether violence, such as a fight, has occurred. A surveillance-camera dataset containing such a class would also be helpful. I have collected the UCF-Crime and NTU-CCTV datasets; I want to know how to get more datasets like these, or how to collect such videos myself.
  • asked a question related to Dataset
Question
4 answers
Hello seniors, I hope you are doing well.
Recently I've read some very good research articles in which the datasets were taken from V-Dem, Polity, and Freedom House. Although the authors shared links to the supplementary datasets and briefly described how they analyzed them in SPSS or R, I couldn't understand and replicate their findings. It may be because I am not very good at quantitative data analysis.
So I want to know how I could more easily get to grips with analyzing datasets like V-Dem. Is there a good online course, lecture series, conference video, or book?
Article links
Any help would be appreciated.
Thanks in anticipation.
Relevant answer
Answer
You can find online courses for learning R on the edX and Coursera platforms.
Thanks ~PB
  • asked a question related to Dataset
Question
1 answer
Date     Price   Volume   Turnover
1/1/22      10       12        120
1/1/22      11       10        110
1/1/22      13       20        260
1/1/22      12       15        180
1/1/22      10       13        130
1/1/22       9        9         81
Once I sort turnover in ascending order for each day, I have 81, 110, 120, 130, 180, 260. Now I need to create categorical variable 1 for 81 and 110, categorical variable 2 for 120 and 130, and categorical variable 3 for 180 and 260. My dataset contains many years of data, and for each day there are thousands of transactions.
Relevant answer
Answer
I've been working with your example data.
Here is the code to add the new variable:
# df is your data frame (avoid calling it data.frame, which masks the base function)
# count the number of columns so the new variable goes in the next one
ncolData <- ncol(df)
for (i in 1:nrow(df)) {
  # df[i, 4] is the Turnover column
  value <- as.character(df[i, 4])
  # write the category into the new column df[i, ncolData + 1]:
  # 81 and 110 -> 1, 120 and 130 -> 2, 180 and 260 -> 3
  switch(value,
    "81"  = { df[i, ncolData + 1] <- 1 },
    "110" = { df[i, ncolData + 1] <- 1 },
    "120" = { df[i, ncolData + 1] <- 2 },
    "130" = { df[i, ncolData + 1] <- 2 },
    "180" = { df[i, ncolData + 1] <- 3 },
    "260" = { df[i, ncolData + 1] <- 3 },
    { print("nothing") }
  )
}
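Since the real dataset has thousands of transactions per day, hardcoding each turnover value will not scale. If Python is an option, here is a pandas sketch that ranks within each date instead (column names taken from the example):

import pandas as pd

# the example data; the same code works with thousands of rows per date
df = pd.DataFrame({
    "Date": ["1/1/22"] * 6,
    "Price": [10, 11, 13, 12, 10, 9],
    "Volume": [12, 10, 20, 15, 13, 9],
    "Turnover": [120, 110, 260, 180, 130, 81],
})

# split each day's turnovers into three equal-sized groups (1 = lowest third)
df["Category"] = (
    df.groupby("Date")["Turnover"]
      .transform(lambda s: pd.qcut(s, 3, labels=[1, 2, 3]))
)
print(df.sort_values(["Date", "Turnover"]))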
Regards,
  • asked a question related to Dataset
Question
8 answers
I am looking for nitrogen, potassium, phosphorus, organic carbon stock, pH value, and added-nutrients features in a soil dataset.
Kindly suggest relevant data sources.
Relevant answer
Answer
@Manoj, you may get it from the Soil & Land Use Survey of India (Govt. of India). You can also download data from the FAO Soils Portal, GloSIS Global (Beta), the European digital archive on soil maps (EuDASM), the International Council for Science (ICSU) World Data System, and the GEOSS (Global Earth Observation System of Systems) portal.
  • asked a question related to Dataset
Question
1 answer
Hi
I have a western blot dataset for three experimental and three control samples, covering one housekeeping protein (actin) and three target proteins. Kindly guide me in detail on what type of statistics I should apply, and how to normalize the data.
Thanks in Advance!
Relevant answer
Answer
Hi Nisha,
In order to normalize the data, you will first have to scan the blots (as image files) and then use software like ImageJ (a free tool) to automatically detect and measure signal intensity and band size.
The normalization process can be done by following this procedure:
Now, the statistical analysis can be done using ANOVA. You can find excellent guides anywhere on the internet, both for using ImageJ and for performing the ANOVA test.
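For illustration, a minimal Python sketch of the normalization and test; the band intensities are made-up numbers standing in for ImageJ measurements, and with only two groups the ANOVA is equivalent to a t test:

import numpy as np
from scipy.stats import f_oneway

# hypothetical ImageJ band intensities: 3 control and 3 experimental lanes
target_ctrl = np.array([1520.0, 1485.0, 1610.0])
actin_ctrl = np.array([2010.0, 1995.0, 2100.0])
target_exp = np.array([2380.0, 2455.0, 2290.0])
actin_exp = np.array([2005.0, 1980.0, 2060.0])

# normalize each lane's target signal by its own actin (loading control)
norm_ctrl = target_ctrl / actin_ctrl
norm_exp = target_exp / actin_exp

# one-way ANOVA on the normalized values; repeat per target protein
stat, p = f_oneway(norm_ctrl, norm_exp)
print(f"F = {stat:.2f}, p = {p:.4f}")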
Hope this helps you
  • asked a question related to Dataset
Question
2 answers
This database contains 494,414 face images of 10,575 actors from IMDb. The face images comprise random pose variations, illumination, facial expressions, and resolutions.
I need to remove the background and keep the frontal pose of each face image.
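As one possible starting point, a minimal OpenCV sketch that crops the face region (a rough proxy for removing the background); the bundled Haar cascade only detects roughly frontal faces, so it also filters for frontal pose. The file paths are illustrative:

import cv2

# load the bundled frontal-face Haar cascade (ships with opencv-python)
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("casia_webface/actor_0001/001.jpg")  # hypothetical path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectMultiScale returns (x, y, w, h) boxes; keep only the first face
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
if len(faces) > 0:
    x, y, w, h = faces[0]
    cv2.imwrite("cropped/001.jpg", img[y:y + h, x:x + w])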
  • asked a question related to Dataset
Question
1 answer
I need a dataset for my research.
Relevant answer
Answer
This may be helpful.
High precision automated face localization in thermal images: Oral cancer dataset as test case
DOI:10.1117/12.2254236
  • asked a question related to Dataset
Question
1 answer
Natural Language Processing
Relevant answer
Answer
Sketch Engine is quite robust.
  • asked a question related to Dataset
Question
2 answers
Dear Researchers,
Greetings!
I need an IFDB dataset for my research work. Can anyone help in this regard? I'm not able to download it directly from the website.
Let me know if anyone can help.
Thanks in advance.
Relevant answer
Answer