
Data Science - Science topic

Data science combines the power of computer science and its applications, modeling, statistics, engineering, economics, and analytics. Whereas a traditional data analyst may look only at data from a single source, such as a single measurement result, data scientists will most likely explore and examine data from multiple disparate sources. According to IBM, "the data scientist will sift through all incoming data with the goal of discovering a previously hidden insight, which in turn can provide a competitive advantage or address a pressing business problem. A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data." Data science has grown in importance with big data and will be used to extract value from the cloud for businesses across domains.
Questions related to Data Science
  • asked a question related to Data Science
Question
1 answer
I'm curious to understand how possible it is to bridge the gap between data science and spatial ecotoxicology in developing countries. A dearth of information and a lack of data are major setbacks in developing countries. It will be interesting to understand fully the intricacies involved and the way forward. Thanks.
Relevant answer
Answer
Here's a possible example for you. Best wishes David Booth
  • asked a question related to Data Science
Question
1 answer
If I have multiple scenarios with multiple variables changing, and I want to conduct a full factorial analysis, how do I graphically show the results?
Relevant answer
Answer
Hello Niteesh,
Can you be a bit more specific about how many IVs/factors you wish to evaluate, the number of levels of each, the number of scenarios (will this be a repeated measures factor or represent multiple outcome variables?), and so on? As well, what attributes of the results do you wish your presentation to highlight?
As given, your query is just too terse to offer more specific recommendations. I'd wager that if you elaborated, following the lines above, you'd get more focused suggestions for how to proceed.
Good luck with your work.
  • asked a question related to Data Science
Question
5 answers
Hello,
I have a 2x2 mixed-design ANOVA. I checked the assumptions before running the ANOVA, removed one or two outliers, and re-ran the analysis. After excluding the outliers I still got significant results with large effect sizes (partial eta squared), so I preferred to report the results without outliers. I also have between-group differences, which I will report too. But I have a problem with the group interaction and some within-subject results. Group interaction: my groups are nLOW = 10 and nHIGH = 11. The results showed F(1,19) = 4.18, p = .056, partial eta squared = 0.17, and another F(1,19) = 3.72, p = .069, partial eta squared = 0.16. Within-subject: F(1,19) = 4.18, p = .055, partial eta squared = 0.18, and I have four more within-subject results like these. My question: (i) I wanted to report the non-significant results, because I read that p is only a probability, while the effect size tells us how much variance the effect or interaction accounts for. I also read about choosing an effect size to suit the sample size: it was said that with a small sample (perhaps like mine, a 2x2 design with nLOW = 10, nHIGH = 11, nTOTAL = 21) you should use omega squared. But if I use omega squared for my results, all the large effect sizes are gone; they decrease to medium.
After I read Dr. Lakens's review (10.3389/fpsyg.2013.00863), I thought generalized eta squared might be appropriate, because after computing omega squared and generalized eta squared I noticed they look similar (generalized eta squared: 0.047; omega squared: 0.037). But if those are my effect sizes, I thought there is no need to report the non-significant results I mentioned above, because I have also lost the effect sizes, right? The same holds for some within-subject results. If I compute the effect size as omega squared or generalized eta squared, almost all the large partial eta squared values decrease to medium or low. What should I do about it? Should I report both partial eta squared and generalized eta squared, which are very different for my data (partial eta squared: 0.241; generalized eta squared: 0.047; omega squared: 0.037)? Thank you in advance for your time.
Berkant Erman
Relevant answer
Answer
Well you have a pretty good interaction here. This is what drives your analysis for a factorial design. Ignore all the stuff about eta squared and effect size for the time being. That comes at the very end of the analysis. The attached search is for a simple form of your design and gives a complete analysis of one. Take a look at this and see if it doesn't guide you through the analysis. At that point come back with the effect size questions and we can help with that too. If you have questions in the meantime please ask. Best wishes David Booth
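As a side note on the effect sizes discussed in the question: partial eta squared can be recovered directly from a reported F statistic and its degrees of freedom, which makes it easy to double-check reported values. A minimal sketch in Python (the function name here is just illustrative):

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F test: F*df1 / (F*df1 + df2)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# The within-subject effect reported above, F(1, 19) = 4.18:
eta_ws = partial_eta_squared(4.18, 1, 19)   # about 0.18, matching the question
eta_ix = partial_eta_squared(3.72, 1, 19)   # about 0.16, matching the question
```

This only recovers partial eta squared; omega squared and generalized eta squared need the full sums-of-squares table, which is why they cannot be checked from the F values alone.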
  • asked a question related to Data Science
Question
3 answers
I really need to advance in Python for data science, and I need the best books. Can anybody here recommend a list of books? Their cost does not matter; intellectual cultivation pays for itself.
Relevant answer
Answer
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Written by: Wes McKinney
This should be a good start.
  • asked a question related to Data Science
Question
6 answers
1. We generally expect p < 0.05. However, in a two-way analysis of variance (two-way mixed, etc.) it is quite possible to see a partial eta squared (as an effect size) greater than 0.14 together with p > 0.05.
In addition, we can also use other effect sizes, such as eta squared, omega squared, etc. Here is my question:
- - - How often do you encounter such a situation [p > 0.05 and ES > 0.14 (large)], and what importance do you place on such results in the discussion section of a paper?
2. Due to the complexity of studies in sports sciences (professional athletes, muscle biopsies, etc.), important articles are still published with only a few participants (sometimes in Q1-Q2 journals). But some journals insist on asking for the effect size of the study as well (these are also Q1-Q2 journals).
- - - What do you think about the effect sizes of studies in sports sciences?
Your thoughts will be very valuable to me. Thank you in advance.
Relevant answer
Answer
I agree with David Eugene Booth . The best thing you can do is be as transparent as possible. Unfortunately, it really depends on the editor and reviewers you get. If they are learned in statistics, they will understand the issues and limitations of p-values.
Out of curiosity, what is the p-value?
  • asked a question related to Data Science
Question
9 answers
Do you think R and Python should be taught together in a master's level social data science course, or should the course pick one and focus on it? Let's assume students have some background with Excel and have minimal training in R or Python.
Relevant answer
Answer
If you're passionate about the statistical calculation and data visualization portions of data analysis, R could be a good fit for you. If, on the other hand, you're interested in working with big data, artificial intelligence, and deep learning algorithms, Python would be the better fit.
Python, a modern programming language with a well-established ecosystem, is particularly useful for all data related work in the social sciences. The language and its external libraries can be used for web-scraping, textual analysis, machine learning, the presentation of data in web pages and data analysis.
  • asked a question related to Data Science
Question
8 answers
I'm looking for datasets containing coherent sets of tweets related to Covid-19 (for example, collected within a certain time period according to certain keywords or hashtags), labeled according to whether they contain fake or real news, or according to whether they contain pro-vax or anti-vax information. Ideally, the dataset would also contain a column showing the textual content of each tweet, a column showing the date, and columns showing 1) the username/id of the author; 2) the usernames/ids of the people who retweeted the tweet.
Do you know any dataset with these features?
Relevant answer
Answer
@Luigi Arminio,
The attached screenshot may be useful. Thanks ~PB
  • asked a question related to Data Science
Question
4 answers
I'm a computer engineering student specializing in software engineering and data science,
and soon I have to write the proposal for my master's thesis. I would appreciate it a lot if experts in the field would lend me their knowledge and insight to help me choose a problem to work on, or tell me where to search and read in order to identify a problem for my thesis.
Thanks a lot in advance.
Relevant answer
Answer
My opinion about the thesis topic:
Social stigma and Obesity and its consequence on academic performance among [select population group] in [define your settings]
  • asked a question related to Data Science
Question
9 answers
Hi Team, I am working on a problem where I need to classify products into categories like Bad, Good, Better, and Best. I am planning to use data related to product price, customer reviews, product features, etc.
Please share your ideas or any reference material related to this.
I have data available on product, price, features, reviews, and website visits. I also have all the data science tools and techniques needed to make use of the data.
Thank you!
Relevant answer
Answer
Feature selection is one of the primary considerations in classification. You may also consider using the RapidMiner software for classification tasks.
  • asked a question related to Data Science
Question
3 answers
I am looking for upcoming conferences on Data Science, Data Analytics, and Information Systems. Any recommendations?
  • asked a question related to Data Science
Question
7 answers
Modern politics is characterized by many aspects that were not associated with traditional politics. Big data is one of them. Data mining is being done by political parties, which seek help from data scientists to find patterns that identify voter behavior. The question is: in what ways is big data being used by modern political parties and leaders?
Relevant answer
Answer
Big Data platforms allow government agencies to access large volumes of information that are essential for their daily operations. With real-time access, governments can identify areas that require attention, make better and more timely judgments about how to proceed, and enact the necessary changes.
  • asked a question related to Data Science
Question
4 answers
I have data from 400 individual renal stone patients; the data concern parameters from blood biopsies and 24-hour urine analyses. I have run a random forest using R; however, the result was not statistically significant. What can I do with this data? Is there another statistical tool to analyze it? Thank you so much.
PS: I have already tried a logistic regression; it showed no correlation.
Relevant answer
Answer
There are many more learning approaches you could try. You may want to have a look at packages caret (https://topepo.github.io/caret/) or mlr3 (https://mlr3.mlr-org.com/)
Best Matthias
  • asked a question related to Data Science
Question
4 answers
What are the latest research trends in Data Science ?
Relevant answer
Answer
Data filtering in IoT sensing environments.
  • asked a question related to Data Science
Question
3 answers
Dear Researchers,
Greetings for the day !
My name is Shard, and I completed a B.S. in Marine Engineering from Birla Institute of Technology, Pilani, Rajasthan, in 2006, and then an M.S. in Information Technology (Application Development) from Sunderland University, United Kingdom, in 2010. I also did a second master's degree, an M.Sc. in Mathematics, from Shoolini University in 2021.
I'm a Research Scholar in the Yogananda School of AI, Computer and Data Sciences, currently pursuing a Ph.D. degree in Mathematics at Shoolini University, Solan, Himachal Pradesh, India. I have also been working as an Assistant Professor at Shoolini University since 2014.
My research interests include technology adoption using mathematical models and statistical tools like SPSS, as well as waste management and energy.
I'm actively looking for research collaborations with national/international collaborators so as to contribute to the scientific community.
Warmly,
Shard
+91-8219639808
Relevant answer
Answer
Dear Mr. Shard!
May I kindly suggest that you sign up for Elsevier Researcher Academy, a free-of-charge portal with a wide range of resources and contacts:
In addition, you can find useful research community networks on LinkedIn:
Yours sincerely, Bulcsu Szekely
  • asked a question related to Data Science
Question
5 answers
I have been studying data analysis software and am currently trying to expand my knowledge of R programming. I am interested in projects related to R programming, and I can provide a free service to anyone who needs help with data analysis, specifically R programming and financial data analysis. Thanks
Relevant answer
Answer
Thank you all for your responses. I appreciate the information you have provided.
  • asked a question related to Data Science
Question
11 answers
I have two equations, and in each equation I have a coefficient of interest.
Let's say:
eq. 1: Y = bX + 2cX² + 4
eq. 2: Y = aX² + 2dX + 12
Given that the values of a and b change over time, I am aiming to record the values of a in a list A and the values of b in another list B.
From their behaviour I want to draw conclusions about the strength of these coefficients.
But I am a bit confused about how to draw such conclusions, and about the most representative way to monitor how the behaviour of a and b changes over time.
Or is it better to monitor the increase or decrease of a coefficient by summing the differences of its recorded values over time?
I have more coefficients to be monitored, and they may or may not be of value; my aim is to build a meaningful classification that can categorize coefficients as useful or not.
Relevant answer
Answer
Please explore correlation and regression analysis. These may be helpful.
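To make the monitoring concrete: one simple approach, sketched below in Python with made-up coefficient values, is to record each coefficient at every time step and then summarize its trajectory. Note that summing successive differences of the recorded values telescopes to just (last value minus first value), so it only captures net drift, not variability:

```python
import numpy as np

# Hypothetical recorded values of two coefficients over five time steps
A = np.array([0.90, 1.10, 1.00, 0.95, 1.05])    # large and stable
B = np.array([0.10, -0.20, 0.30, -0.10, 0.00])  # small and noisy around zero

def summarize(values):
    """Mean level, spread, and net drift of a coefficient's trajectory."""
    return {
        "mean": float(np.mean(values)),
        "std": float(np.std(values)),
        # summing successive differences telescopes to last - first
        "drift": float(values[-1] - values[0]),
    }

def is_useful(values, ratio=2.0):
    """Flag a coefficient as useful when its mean level clearly dominates
    its fluctuations (the 2.0 ratio is an arbitrary illustrative choice)."""
    s = summarize(values)
    return abs(s["mean"]) > ratio * s["std"]
```

Here `summarize`, `is_useful`, and the threshold are purely illustrative. A signal-to-noise criterion like this is only one option; correlating each coefficient series against an outcome of interest, as suggested above, is another.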
  • asked a question related to Data Science
Question
4 answers
  • What is the best metric for model selection?
  • Is accuracy derived from cross-validation a good metric?
  • Does the model selected in the model-selection process, based on these metrics, necessarily lead to better results?
Relevant answer
Answer
Hasan -
I like to suggest comparing performances of alternative models - perhaps picked based on subject matter knowledge or some other method, but not forward or backward selection which can leave out good possibilities - by using a "graphical residual analysis," which you can research. Such a comparison, on a single scatterplot, would be with regard to one specific sample, so you do then need to avoid overfitting to that particular sample, and that is where "cross-validation" is important.
Some comparison statistics may automatically assume homoscedasticity. That is not a good blanket assumption. See https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
Best wishes - Jim
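For anyone unfamiliar with the idea above, a graphical residual analysis simply means computing each candidate model's residuals and plotting them against the fitted values or a predictor, looking for structure. A minimal Python sketch on invented synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)  # truly linear relation

# Fit a straight line and compute the residuals; for a well-specified model
# they should scatter around zero with no visible pattern against x
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# In practice one would now plot residuals vs. x (e.g. with matplotlib)
# for each candidate model on the same axes and compare the patterns.
```

A funnel shape in such a plot is the visual signature of the heteroscedasticity mentioned in the linked note.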
  • asked a question related to Data Science
Question
7 answers
Dear Researchers,
My team and I are working on machine learning, data science, association rule mining, and Simpson's Paradox.
I am looking for potential collaborators in these areas. If you are interested to collaborate or discuss our work, you can email me at rahul.gla@gmail.com
Thank you.
Best Wishes,
Relevant answer
Answer
Prof. Rahul Sharma: It is my honor to work with you.
  • asked a question related to Data Science
Question
9 answers
I'm using Mutual Information (MI) for Multivariate Time Series data as a feature selection method.
MI is non-negative (MI >= 0), where 0 indicates that the two variables are strictly independent, and anything above 0 means the variables share a useful amount of information.
After computing the MI between 8 features and the target variable I got the following values:
[0.19, 0.23, 0.34, 0.19, 0.19, 0.12, 0.21, 0.071]
and when computing MI between some of the Input features, and another input feature to search for redundancy, I got the following values:
[4.0056916 , 1.58167938, 1.20578024, 1.0273157 , 0.93991675,0.9711158 ]
The values are not bounded between -1 and 1 or between 0 and 1, so it is not easy to compare them and draw conclusions.
My question is:
Is there a way to set a threshold value (t for example) where if MI >= t, then the variables are sharing a good amount of information and if MI < t they are not? How to know if the MI is high enough?
Relevant answer
Answer
Thank you Shafagat Mahmudova Adnan Zahid Remal Al-Gounmeein for your valuable responses, I will review these articles and papers.
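One practical workaround for the scaling problem discussed above is to normalize the MI estimate by the smaller of the two marginal entropies, which bounds the score in [0, 1] so a threshold becomes easier to interpret. A NumPy-only sketch with histogram-based estimates (the helper names are made up here; scikit-learn offers `normalized_mutual_info_score` for discrete labels):

```python
import numpy as np

def entropy(v, bins=16):
    """Histogram estimate of the entropy (in nats) of a 1-D sample."""
    p, _ = np.histogram(v, bins=bins)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y, bins=16):
    """Histogram estimate of MI (in nats) between two 1-D samples."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

def normalized_mi(x, y, bins=16):
    """Rescale MI into [0, 1] by the smaller marginal entropy."""
    h = min(entropy(x, bins), entropy(y, bins))
    return mutual_information(x, y, bins) / h if h > 0 else 0.0
```

With a bounded score, a cutoff like `normalized_mi >= 0.3` becomes a meaningful statement (the features share at least 30% of the available information), though any specific threshold remains a judgment call.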
  • asked a question related to Data Science
Question
8 answers
Hey guys, I'm working on a new project where I need to transfer Facebook Ads campaign data into Tableau or Microsoft Power BI for visualization, and this job should run automatically on a daily, weekly, or monthly schedule. I'm planning to use Python to build a data pipeline for this. Do you have any suggestions, resources I can read, or similar projects I can get inspired by? Thank you.
Relevant answer
Answer
To create an ETL pipeline using batch processing, you must first:
1. Construct reference data: create a dataset that defines the range of permissible values your data may contain.
2. Extract data from the various sources: correct data extraction is the foundation for the success of the subsequent ETL steps.
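To make the batch idea concrete, here is a minimal extract/transform/load sketch in Python using only the standard library. The file layout, column names (`campaign`, `date`, `clicks`, `spend`), and the cost-per-click metric are all hypothetical stand-ins for whatever the Facebook Ads export actually contains:

```python
import csv

def extract(path):
    """Read raw campaign rows from a CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Keep only the fields the dashboard needs and derive cost per click."""
    out = []
    for r in rows:
        clicks = int(r["clicks"])
        spend = float(r["spend"])
        out.append({
            "campaign": r["campaign"],
            "date": r["date"],
            "cpc": round(spend / clicks, 4) if clicks else 0.0,
        })
    return out

def load(rows, path):
    """Write the cleaned rows to a CSV that Tableau / Power BI can ingest."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["campaign", "date", "cpc"])
        w.writeheader()
        w.writerows(rows)
```

Scheduling the daily/weekly run is then a separate concern (cron, Airflow, or similar).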
  • asked a question related to Data Science
Question
22 answers
Data science is a growing field of technology in the present context. There have been notable applications of data science in electronic engineering, nanotechnology, mechanical engineering, and artificial intelligence. What future scope is available for data science in civil engineering, in the fields of structural analysis, structural design, geotechnical engineering, hydrological engineering, environmental engineering, and sustainable engineering?
Relevant answer
  • asked a question related to Data Science
Question
14 answers
In recent years, data science has emerged as a promising interdisciplinary subject and helped understand and analyze actual phenomena with data in multiple areas. The availability and interpretation of large data sets, a vital tool for many businesses and companies, has changed business models and led to the creation of new data-driven businesses.
In agriculture, including crop improvement programs, both short- and long-term experiments are conducted, and big data is generated. However, deep data mining, meaningful interpretation, and deeper extraction of knowledge and learning from these data sets are more often missing. Is the application of data science also vital in agriculture, including crop improvement, for understanding and analyzing actual phenomena and extracting deeper knowledge?
Relevant answer
Answer
Dear Dr Jaydip Datta I acknowledge your assertion. Thank you for your interest.
  • asked a question related to Data Science
Question
4 answers
Scientific articles published in esteemed newspapers hold much information that can be of great value, often including biodiversity documentation, sporadic incidents, demographic data, and more, yet they are seldom cited and eventually consigned to oblivion.
Do you think that these data, at least those that are relevant and authentic, should be cited, and that a proper archiving and citation protocol should be established?
Relevant answer
Answer
There is certainly a marketplace for making scientific content digestible by the public. Unfortunately, many quality newspapers don't accurately reference or link directly to the underlying research; generally you can find the original source, but occasionally the true source can't be found.
  • asked a question related to Data Science
Question
3 answers
I am looking to start my Master's dissertation in Data Science, with a five-month time frame. What interesting topics would you suggest, ones for which I could get an open-source dataset to analyse?
Relevant answer
Answer
Well, may I suggest my improved version of the least squares regression method? It's new and it has a lot of potential, I believe.
  • asked a question related to Data Science
Question
3 answers
Looking for any work on structured prediction.
  • asked a question related to Data Science
Question
5 answers
How can the shipping and logistics industry be improved using data science and statistical analysis methods?
Relevant answer
Answer
Big data is used to control sensors on a ship and to perform predictive analysis. Enhanced decision-making through big data analytics is being actively implemented to avoid and predict additional costs, and can be used throughout the life of a ship.
  • asked a question related to Data Science
Question
9 answers
I am looking for a journal where I can publish my research paper. As a beginner, it is very difficult for me to publish in a Q1 or Q2 journal. That is why I am looking for a Scopus-indexed journal where I can more easily publish my paper. My research direction is data science, machine learning, or deep learning for the time being.
Relevant answer
Answer
You may follow these steps.
Refine the scope of your search to get more relevant journals using a combination of the following items:
1. Paper title
2. Paper abstract
3. Keywords
4. Field of research
5. Publication type (OA, S)
6. Journal impact
7. Review and publication time
  • asked a question related to Data Science
Question
7 answers
1. What are some high-quality research monographs (or books) on artificial intelligence, data science, or big data analytics? I hope to get three recommendations for each.
2. What are the important technologies or techniques developed in the 10 years up to 2021? I hope to learn about five of them. Please do not mention deep learning.
Thank you
2022-1-12
Relevant answer
Answer
Interesting query
  • asked a question related to Data Science
Question
5 answers
How can a scientist learn about data science and modelling applicable to solving a wide variety of problems? If possible, please share the contact details of places where one can do that, even as a visiting researcher.
Relevant answer
Answer
  • asked a question related to Data Science
Question
4 answers
Dear community, I need some sources for a data science or machine learning project related to analyzing Google Analytics and Facebook business data. Your help is appreciated.
  • asked a question related to Data Science
Question
4 answers
Actually, I am new to research and don't know where to start; I don't even know my problem statement, because I don't yet know the field I am going to research. I want to study some of the world's largest empires, like the Mongol, Ottoman, and British, and find the ratio of their cultural impact (food, religion, dress, norms, etc.) on different countries, like Afghanistan, Pakistan, China, or any country of Asia or Europe. This is social science research, but I want to frame it in terms of data science. Can anyone tell me where to start, what I will have to do, and where I can get a dataset? I am new to research, and although I have read some papers, I did not learn anything from them about models, so please give me a brief answer if you can.
Relevant answer
Answer
Your question is vague, but I will still try to provide some pointers.
About the field:
Before data science was a field (let's say before 2005), people still did research in social science and produced interesting insights, thanks to the discipline that makes this possible: statistics. Both social science and data science rely on statistics to perform various analyses, e.g., data exploration, hypothesis testing, etc.
So I would definitely recommend spending some time on important concepts in statistics, e.g., hypothesis testing, p-value interpretation, moments, distributions, statistical tests (t-test, chi-squared, ...), correlations, probability, and so on. These concepts will help you understand a given dataset and how its attributes relate to each other (i.e., how one variable impacts another). You could also test some hypotheses, e.g., whether religion has an impact on food habits, by looking at data on religion and food habits. Again, these are just pointers, which may or may not be helpful depending on your specific needs.
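To illustrate the hypothesis-testing pointer with the religion/food example: a chi-squared test of independence on a contingency table is the classic tool. A NumPy-only sketch with an invented table (real counts would come from your own data source):

```python
import numpy as np

# Hypothetical counts: rows = two religion groups, cols = two food habits
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: product of the margins over the total
row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row * col / observed.sum()

chi2 = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
# chi2 is about 16.7 with 1 degree of freedom, far above the 3.84 critical
# value at alpha = 0.05, so independence would be rejected for these counts
```

In practice `scipy.stats.chi2_contingency` does all of this (plus the p-value) in one call; the manual version just shows what the test computes.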
About the data:
There are a couple of places where you could look for some public datasets:
- Kaggle is a very interesting resource: https://www.kaggle.com/datasets
I hope this helps. Feel free to ask if you have specific question.
Cheers,
  • asked a question related to Data Science
Question
13 answers
A ground truth map is essential for the supervised classification of hyperspectral image cubes. However, data resources are limited; up till now, almost all the papers I've read use the Indian Pines, Washington D.C. Mall, Salinas, and Pavia University scenes. It seems all we have are those four or five data cubes. With the development of state-of-the-art algorithms for hyperspectral image classification, the data should be updated too. I hope somebody can provide me with new data including a ground truth map, or recommend a website for downloading some.
Relevant answer
  • asked a question related to Data Science
Question
3 answers
If I have experience but no certificate in machine learning, and have both experience and a certificate in medicine (urology), and I want to publish a paper involving an interaction between these two fields in either a urology or a CS journal: should I add a co-author who is certified in data science or CS? If not, would it be necessary to prove my competency in ML by some other means?
Relevant answer
Answer
Publishing on a particular area does not require one to necessarily have a certificate in that area. However, if you find yourself to be inadequately equipped with the requisite knowledge to do a good job in that specific field, you may overcome that by collaborating with someone with that knowledge so that together you can bring out a very good and scientific work.
  • asked a question related to Data Science
Question
4 answers
Which free software for text analytics tasks (such as text mining and sentiment analysis) can be recommended for social science researchers and students who do not have much background in programming or data science?
I understand that R could be the obvious answer for many, given its capabilities, I am specifically looking to shortlist 4 to 5 GUI/point-and-click options which can be recommended to early researchers and post graduate students in social sciences, especially Psychology.
I have experimented with KNIME and Orange, but won't certify them as 'friendly enough'. This could be because I did not spend enough time on them, though.
Relevant answer
Answer
Hi again, Chinchu C Mullanvathukkal. Interesting how the ResearchGate algorithm works: this paper was just recommended to me, and maybe the tool it describes is just what you are looking for:
  • asked a question related to Data Science
Question
7 answers
I have recently come across many online forums recommending Kaggle for people wanting to get into machine learning or data science. However, I would like some scholarly opinions on this website: https://www.kaggle.com/
Relevant answer
Answer
Kaggle is very good for beginners in machine learning and deep learning. The certification courses are excellent.
  • asked a question related to Data Science
Question
4 answers
Currently, artificial intelligence/machine learning (AI/ML) and other technologies (mathematics or statistics) have been introduced into various engineering fields for data science research. We all know that in the past few decades, civil, mining, and other fields have accumulated a large amount of data, indoors or on-site. But because these data are directly or indirectly related to projects, enterprises, or state secrets, they are highly private; in other words, this is a matter of data sensitivity and security. So, with the rapid development of AI/ML, how should we make the trade-off between the scientific goals of data science and data security?
Relevant answer
Answer
Data is taking over almost every sector, even the ones you may not immediately associate with it. One of those is construction/civil engineering. Today, that data is complemented by a huge amount of information generated by sources such as building engineering logs, cranes, construction workers, earth movers, and materials logistics. Before the AI revolution, construction software did a great job of recording static information, such as Computer Aided Drafting (CAD) schematics, expenses, invoices, employee details, and project details. Today, on the other hand, construction professionals and civil engineers need the kind of answers hidden in unstructured data, like free-form text (for example, email and Word documents), printed documents, and analog sensor data.
Regards,
Shafagat
  • asked a question related to Data Science
Question
11 answers
Dear Colleagues,
If you are researcher who is studying or already published on Industry 4.0 or digital transformation topic, what is your hottest issue in this field?
Your answers will guide us in linking the perceptions of experts with bibliometric analysis results.
Thanks in advance for your contribution.
  • asked a question related to Data Science
Question
4 answers
The main goal behind this question is to extract best practices on how to set up a data science / AI project.
Would anyone be generous enough to share the set of tools, or even a boilerplate, that may help or guide anyone working in the field and applying data science and AI to any domain?
Relevant answer
Answer
It depends very much on the dataset you are going to use. Please specify the task you are trying to solve. The process is quite straightforward.
  • asked a question related to Data Science
Question
9 answers
I have a dataset X filled with 4x4 matrices that I am generating through a complex model. However, the values of this dataset are slightly off due to some unknown factor that is being taken into consideration by the software that I am using for the generation. I know this because I have another dataset y filled with 4x4 matrices that contain the actual values. I decided to build a ML model that can predict the right output by using X and y to train the model.
At first, I decided to use the sequential neural network API from Keras to create the model.
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers
# X and y are pandas DataFrames, one flattened matrix per row
## Normalize with MinMaxScaler (one scaler per dataset, so each can be inverted later) ##
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()
X = pd.DataFrame(x_scaler.fit_transform(X), columns=X.columns)
y = pd.DataFrame(y_scaler.fit_transform(y), columns=y.columns)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=True)
model = keras.Sequential()
model.add(layers.Dense(12, input_dim=12, activation='linear', name='layer_1'))
model.add(layers.Dense(100, activation='linear', name='layer_2'))
model.add(layers.Dense(150, activation='linear', name='layer_3'))
model.add(layers.Dense(100, activation='linear', name='layer_4'))
model.add(layers.Dense(12, activation='linear', name='output_layer'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=5, shuffle=True, verbose=2)
In this case, I used neural networks because that is really all I have ever worked with, but I have a feeling as though that might not have been the correct option. What do you guys think?
Relevant answer
Answer
I also agree with Serkan Ballı's suggestion of AutoML, but it doesn't include all the models you may want to try, like CatBoost/XGBoost. You can start with AutoML; it will take considerable time and needs excellent computational resources. In return, you'll get details of each model and its performance from an exhaustive hyper-parameter search.
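One more detail worth checking before reaching for AutoML: because every layer in the posted network uses a linear activation, the stack collapses mathematically to a single linear map, so an ordinary least-squares fit is an equivalent (and much cheaper) baseline. A NumPy sketch on synthetic data standing in for the 12-value rows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
W_true = rng.normal(size=(12, 12))
Y = X @ W_true + 0.01 * rng.normal(size=(200, 12))  # nearly linear relation

# Least-squares fit of one linear map: the same function family that a
# stack of Dense layers with linear activations can represent
W_fit, *_ = np.linalg.lstsq(X[:180], Y[:180], rcond=None)
mse = float(np.mean((X[180:] @ W_fit - Y[180:]) ** 2))
# If this baseline already matches the network, nonlinear activations
# (relu, tanh, ...) are what the network would need to do any better
```

If the real relation between the two sets of matrices is nonlinear, swapping the hidden activations for `relu` should beat this baseline; if not, the linear fit is the simpler model.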
  • asked a question related to Data Science
Question
5 answers
Hi,
I am looking for some article/book/clip about performing and interpreting Network Analysis using R in psychology studies. I would be grateful if you could help me in this matter.
Thank you
Relevant answer
Answer
See resources from Sacha Epskamp (http://sachaepskamp.com/) and Eiko Fried (https://eiko-fried.com/).
  • asked a question related to Data Science
Question
11 answers
The accuracy of my model (an MLP) decreased when I removed duplicated features (two identical columns) from the dataset. Should I remove them or keep them? Accuracy was supposed to improve; any idea why this happened?
Thank you.
Relevant answer
Answer
The following article can be useful too.
  • asked a question related to Data Science
Question
9 answers
  • If I wanted to build up a career in data science, which one should I focus on, statistics concepts or technical skills in Python or R?
  • Is it true that SQL is the most important skill for a data scientist?
  • Suggest some papers which might be helpful for research in data science.
Thank you!
Relevant answer
Answer
I would recommend both Python and R. I wouldn't say SQL is the most important skill for a data scientist. It is one of the most important components when it comes to storing data. Also, nowadays most data scientists are interested in Big Data, for which SQL is becoming obsolete. Perhaps you should look at NoSQL databases such as Neo4j, ArangoDB, etc. instead. In terms of analysis, you should check out Machine Learning and Deep Learning techniques. A number of articles can be found online. I suggest that you first choose a field of interest and then browse the resources accordingly. Providing articles without knowing your personal interests is very difficult, however, the following links may be useful. Thanks
  • asked a question related to Data Science
Question
7 answers
For example, AI can automate the initial interpretation of images and shift the clinical workflow of radiographic detection and management decisions. But what are the clinical challenges for its application?
#AI #cancer #clinical #oncology #biomedical
Relevant answer
Answer
The first is the sample-size issue: a predictive model is trained and built with limited imaging data from a particular hospital or region, so it faces bias from various factors when dealing with a wider range of diagnoses, and it may not adapt when making diagnostic decisions for images from other regions.
The second is the inability to balance the patient's treatment intentions, the patient's financial situation, and different regional customs when making treatment decisions.
  • asked a question related to Data Science
Question
5 answers
Dear Community, in order to analyze a website and its features, I want to extract some data from it. This is the first time I have heard of web scraping. Can you guide me on how to do it using Python for competitive intelligence?
thank you
Relevant answer
Answer
I suggest that you try as many tutorials as you can find, since web scraping can be a bit tricky at times. Even more difficult is the fact that websites vary so much in terms of layout. Basic knowledge of markup languages like HTML and XML, however, can be extremely helpful. When it comes to competitive intelligence, it is no different; you just need to select the website and write your script accordingly. You should also check the website's policies to make sure scraping is allowed; you can do this by appending /robots.txt to the site's base URL. Please see below if it helps. Thanks
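To give a feel for the parsing step, here is a minimal link extractor built only on Python's standard html.parser; for a real site you would first fetch the page (e.g. with urllib or requests) and honor robots.txt, whereas this sketch runs on an inline HTML snippet:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags as the parser walks the HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Inline snippet standing in for a fetched page
sample_html = '<html><body><a href="/pricing">Pricing</a><a href="/features">Features</a></body></html>'
parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # ['/pricing', '/features']
```

Libraries such as BeautifulSoup or Scrapy wrap this kind of parsing with far more convenience, but the underlying idea is the same.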
  • asked a question related to Data Science
Question
5 answers
I almost asked this as a "technical question" but I think it's more of a discussion topic. Let me describe where I get lost in this discussion, and what I'm seeing in practice. For context, I'm a quantitative social scientist, not "real statistician" nor a data scientist per se. I know how to run and interpret model results, and a little statistical theory, but I wouldn't call myself anything more than an applied statistician. So take that into account as you read. The differences between "prediction-based" modeling goals and "inference-based" modeling goals are just starting to crystalize for me, and I think my background is more from the "inference school" (though I wouldn't have thought to call it that until recently). By that I mean, I'm used to doing theoretically-derived regression models that include terms that can be substantively interpreted. We're interested in the regression coefficient or odds ratio more than the overall fit of the model. We want the results to make sense with respect to theory and hypotheses, and provide insight into the data generating (i.e., social/psychological/operational) process. Maybe this is a false dichotomy for some folks, but it's one I've seen in data science intro materials.
The scenario: This has happened to me a few times in the last few years. We're planning a regression analysis project and a younger, sharper, data-science-trained statistician or researcher suggests that we set aside 20% (or some fraction like that) of the full sample as a test sample, develop the model on the remaining 80%, and then validate it on the held-out 20%.
Why I don't get this (or at least struggle with it): My first struggle point is a conceptual/theoretical one. If you use a random subset of your data, shouldn't you get the same results on that data as you would with the whole data (in expectation) for the same reason you would with a random sample from anything? By that I mean, you'd have larger variances and some "significant" results won't be significant due to sample size of course, but shouldn't any "point estimates" (e.g., regression coefficients) be the same since it's a random subset? In other words, shouldn't we see all the same relationships between variables (ignoring significance)? If the modeling is using significance as input to model steps (e.g., decision trees), that could certainly lead to a different final model. But if you're just running a basic regression, why would anyone do this?
There are also some times when a test sample just isn't practical (i.e., a data set of 200 cases). And sometimes it's impractical because there just isn't time to do it. Let's set those aside for the discussion.
Despite my struggles, there are some scenarios where the "test sample" approach makes sense to me. On a recent project we were developing relatively complex models, including machine learning models, and our goal was best prediction across methods. We wanted to choose which model predicted the outcome best. So we used the "test and validate" approach. But I've never used it on a theory/problem-driven study where we're interested in testing hypotheses and interpreting effect sizes (even when I've had tens of thousands of cases in my data file). It always just seems like a step that gets in the way. FWIW, I've been discussing this technique in terms of data science, but I first learned about it when learning factor analysis and latent variable models. The commonality is how "model-heavy" these methods are relative to other kinds of statistical analysis.
So...am I missing something? Being naive? Just old-fashioned?
If I could phrase this as a question, it's "Why should I use test and validation samples in my regression analyses? And please answer with more than 'you might get different results on the two samples' " :)
Thanks! Looking forward to your insights and perspective. Open to enlightenment! :)
Relevant answer
Answer
As was mentioned above by Alexander Kolker , there are actually two scenarios for using your models. If you want to assess the reliability and the operability of your model or algorithm that produces predictions, you need to compare predictions and actuals. The design of this comparison should be as close as possible to what you'd like to implement in practice. There are also different setups that can be used, e.g., the rolling-origin evaluation when you re-estimate coefficients each time a new observation appears. When it gets to multiple objects/time series, the task becomes a bit more difficult as you need to use more sophisticated error metrics to evaluate your models. Please let me recommend our recent work on this topic where we consider the rolling-origin setup:
The key element in model validation is the choice of metrics or indicators to compare alternative models. For regression analysis with the test/validation approach it is usually MAE or MSE, but see the notes in the article above on how to choose between the MAE and MSE. In the article we also recommend the use of metrics for bias, such as the mean error (ME), median error (MdE), and the Overestimation Percentage corrected (OPc), which is the new metric we propose. The idea is that if you obtain biased predictions, you can theoretically improve them, so your model is not optimal.
But if you do not want to split your data, you can use information criteria, such as AICc or BIC. This then corresponds to the second scenario described by Alexander Kolker .
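As a concrete illustration, the point-forecast metrics discussed above (MAE, MSE, and the mean error as a simple bias measure) can be computed as follows; the numbers here are synthetic:

```python
def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mse(actual, pred):
    """Mean squared error; penalizes large misses more heavily than MAE."""
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def mean_error(actual, pred):
    """Bias: positive means the model under-predicts on average."""
    return sum(a - p for a, p in zip(actual, pred)) / len(actual)

actual = [10.0, 12.0, 9.0, 11.0]
pred   = [ 9.0, 13.0, 8.0, 10.0]
print(mae(actual, pred))         # 1.0
print(mean_error(actual, pred))  # 0.5  -> the predictions are biased low
```

A nonzero mean error on the held-out sample is the signal that the model is systematically over- or under-predicting and can, in principle, be improved.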
  • asked a question related to Data Science
Question
5 answers
Dear Community, I have a column in my dataset containing IP addresses. I'm trying to do some exploratory data analysis on my dataframe, but I don't know how to explore or describe the IP address column. Should I delete it?
Relevant answer
Answer
Skip that column: df <- subset(mydata, select = -c(columnnames))
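In Python/pandas, the equivalent of the R line above is df.drop(columns=['ip_address']). Before deleting, though, it may be worth deriving coarse features from the raw strings; here is a sketch using Python's standard ipaddress module (the column and field names are hypothetical):

```python
import ipaddress

def ip_features(ip_str):
    """Derive coarse, analyzable features from a raw IP string."""
    ip = ipaddress.ip_address(ip_str)
    return {
        "version": ip.version,                # 4 or 6
        "is_private": ip.is_private,          # e.g. internal vs external traffic
        "first_octet": int(ip_str.split(".")[0]) if ip.version == 4 else None,
    }

print(ip_features("192.168.1.10"))
# {'version': 4, 'is_private': True, 'first_octet': 192}
```

If such derived features show no relationship with anything you care about, dropping the column is a reasonable final step.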
  • asked a question related to Data Science
Question
4 answers
Actually, I am doing research in data science, where I want to explore new classifiers based on attention mechanisms and transformers for classification. I have read a lot on the topic. Can anybody please suggest some recent research papers on transformer NN models for classification?
  • asked a question related to Data Science
Question
4 answers
All,
I need your suggestions/ideas on M.Tech dissertation topics in Data Science in NMS/EMS, preferably in the telecommunication domain. Thanks in advance.
Relevant answer
Answer
If you are interested in assessing the performance of modelling approaches, please let me recommend these works:
If your topic assumes the comparison of alternative approaches, the FEW-L1 workflow presented in the article above can be used.
  • asked a question related to Data Science
Question
5 answers
While reading a research paper I found that, to detect abnormality in a signal, the authors used a sliding window and, in each window, divided mean^2 by the variance. After searching the internet I found a term called the Fano factor, but that is the inverse of this. Could anyone give an intuitive idea behind the quantity mean^2/variance?
Relevant answer
Answer
This criterion has some advantages:
It has no units; it is a dimensionless coefficient, so you can compare different signals.
I think it stems from the coefficient of variation (CV): it is the inverse of the square of the CV, and the CV is a standardized dispersion index.
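A quick numerical check of that relationship, using only Python's standard statistics module, confirms that mean^2/variance equals 1/CV^2:

```python
import statistics

signal = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]   # synthetic window of samples
m = statistics.mean(signal)                # 5.0
v = statistics.pvariance(signal)           # population variance
cv = statistics.pstdev(signal) / m         # coefficient of variation
ratio = m ** 2 / v                         # the quantity from the paper
print(abs(ratio - 1 / cv ** 2) < 1e-9)     # True
```

A large value means the signal is steady relative to its level; a sudden rise in variance (an abnormality) drives the ratio down.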
  • asked a question related to Data Science
Question
13 answers
Please, if you are a data scientist or have first-hand experience, write about your own experience here; please do not reply by sending links from other pages.
Nowadays, most data scientists dealing with soil, water, environment, or climate subjects tend to use one of two programming languages: R or Python. I am aware that each has its own advantages and disadvantages. Please share your thoughts on the following questions:
- When dealing with a data science problem, which one do you feel more comfortable with?
- In your work, which one has caused you fewer errors?
- Generally speaking, which language is more efficient?
- Which language provides better and newer updates for libraries and packages?
- If you were to teach one to your future students, which would you prefer?
- Which one will become extinct or less important in the future?
Sincerely
Relevant answer
Answer
Both are good. Both have their pros and cons. You can execute R in Python and vice versa.
  • asked a question related to Data Science
Question
9 answers
Data Science has evolved as a new branch of knowledge. Some people define it as applied statistics, whereas others define it as an umbrella under which several existing interrelated branches have been grouped together. Being a recent topic, we may find edited volumes and, obviously, several papers. But do we have standard fundamental books on this topic? This is essential because, when we frame a syllabus, it is mandatory to provide some books as texts which cover most of the contents of the syllabus.
I would like to have single source text books or multiple text books covering the entire branch of knowledge.
Relevant answer
Answer
Dear Prof. Andrea,
Thank you very much for providing the list of books.
Regards,
B. K. Tripathy
  • asked a question related to Data Science
Question
4 answers
How do we calculate data processing time, response time, and cost in the CloudAnalyst simulator?
Relevant answer
Answer
Hello all!
Please send me some articles on processing time in cloud computing: definition, calculation, simulation, ... Thank you very much!
  • asked a question related to Data Science
Question
4 answers
I have protein data consisting of Degree, Closeness, Betweenness, and Eigenvector centrality values in an Excel sheet. My assignment is to make a prediction, with an accuracy value, using a machine learning approach. I'm not an expert in programming; I have downloaded Anaconda and R and don't know which will be best.
- Can anybody please guide me on where to get the algorithms, or provide me some materials?
- Can anybody please tell me how to define training data and test data from my whole dataset?
- What type of classifier will be best for high accuracy?
- Step-by-step instructions would be easiest to understand.
A big big thank you in advance
Relevant answer
Answer
There are plenty of options for using ML in R. I would recommend the caret package. It provides a quick and efficient data split (training, testing) and offers a multitude of machine learning algorithms, both supervised and unsupervised. Excellent tutorials are available that explain how to conduct ML step by step. Here is a comprehensive guide that outlines all the key steps.
The caret Package (topepo.github.io)
Once you get into caret package there are several advanced options to ensemble or stack your models.
Good luck with your project.
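In R, caret's createDataPartition handles the train/test split; the same idea in plain Python looks like the sketch below (with scikit-learn you would use sklearn.model_selection.train_test_split instead):

```python
import random

def train_test_split(rows, test_frac=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))           # stand-in for 100 protein records
train, test = train_test_split(rows)
print(len(train), len(test))      # 80 20
```

The classifier is then fitted on the training rows only, and accuracy is reported on the held-out test rows.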
  • asked a question related to Data Science
Question
4 answers
How valuable is it to master Excel VBA in 2021, when most fields are dominated by data science and Python?
  • asked a question related to Data Science
Question
6 answers
Thinking about online and MOOC-type certificates for R programming and data analysis, are there any that are recognized and respected by potential graduate schools and employers?
I guess if people can recommend those for SAS or Python, that would be useful as well.
Relevant answer
Answer
Salvatore S. Mangiafico Did you find any reputable certificates for R or Python?
  • asked a question related to Data Science
Question
3 answers
I am learning ML and data science for crunching financial market data for my trading. I want to build a terminal that takes live data from the NSE and performs certain tasks (calculations, graph representations, running an ML model on the data), but I don't know whether this is possible with just ML and Python or whether I have to go down the whole software-development road. Please help me figure out what I need to learn for this and how to do it.
Respectfully
Relevant answer
  • asked a question related to Data Science
Question
5 answers
Hi,
I saw that some public administration study programs include the following course. It would be interesting to learn what the course is about.
Relevant answer
Answer
Jaroslavai,
Please see our new article in the STEPP journal:
  • asked a question related to Data Science
Question
7 answers
I would like to dive into the research domain of explainable AI. What are some of the recent trending methodologies in this domain? What can be a good start to dive into this field?
Relevant answer
Answer
Go to Google Scholar and type in some papers you know about, for example Shapley Values:
Review the papers that cite the "The many Shapley values for model explanation" paper.
  • asked a question related to Data Science
Question
4 answers
Some people in Academia criticize those who use neural networks for using a tool without knowing what is happening inside. While I agree (with academics like Cynthia Rudin) that we should benchmark these models against white box models and statistical methods, I think this opinion against black boxes has its origin in conflating data science with pure sciences.
In my opinion data science is not a science (for example in Popperian or Kuhnian sense).
Although data science is highly mathematical, it shares methodologies with the humanities and social sciences, where a declaration of limitations is usually preferred over attempts to reach "objective" results as in the pure sciences.
A neural network is an instrument, like a radio, for capturing a pattern of particular complexity from a sample of data. The main engineering task at the application level is to engineer the sample (dataset), where the model is a mere tool.
I think a well-formulated philosophy of deep learning is in need.
Relevant answer
Answer
I agree with your statement that "data science is not a science (for example in Popperian or Kuhnian sense)."
Indeed, data science in its current state seeks a lot of empirical statistical associations, data correlations, and patterns in the hope that they can be helpful for discovering some hidden insights. But this is the reason why data science cannot be called a real science at all.
In general, science can be broadly defined as a logical system of getting and organizing insights in some area of knowledge as a set of principles and explanations that can be used for making previously unknown but testable predictions. In order to qualify as a principle, an insight must be both highly general (applicable to many settings) and stable (relevant now and in future developments).
What are the specific data science principles that make it a real science? Data science is just a loose collection of various computational methods applied to fit empirical data sets. All generated or collected data represent the past, i.e. what has already happened. However, it is well known that the future is never defined solely by the past, regardless of the amount of past data.
A discovered empirical association can, of course, be a starting point in research, but such research never becomes real science until it includes a causal relationship between the variables (factors or features), i.e. understanding. A neural network totally lacks that important feature. That's why its power is inherently limited.
  • asked a question related to Data Science
Question
11 answers
I would like to know how CNNs differ from ANNs and DL in general. What is the actual concept behind each of them?
  • asked a question related to Data Science
Question
10 answers
I am looking for appropriate books or platforms to learn data science using python from a basic level. I would be happy to get some genuine responses. Thanks.
Relevant answer
Answer
I find this website, https://www.w3schools.com/python/, really useful for Python and many other programming languages.
  • asked a question related to Data Science
Question
3 answers
Hello everyone
One of the obstacles for me in an ML project is the data cleaning phase. I am fairly good at model implementation, but I have major issues with data preprocessing. Could someone give me some advice?
Thanks
Relevant answer
Answer
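A minimal cleaning pass usually covers duplicate rows, missing values, and type coercion. Here is a sketch in plain Python on a list of dicts; the records and field names are purely illustrative, and in practice pandas does the same work with drop_duplicates, dropna, and astype:

```python
def clean(records):
    """Drop exact duplicates and rows with missing values; coerce numeric fields."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen:
            continue                                   # exact duplicate row
        seen.add(key)
        if any(v in (None, "") for v in r.values()):
            continue                                   # missing value
        out.append({**r, "age": int(r["age"])})        # string -> int coercion
    return out

raw = [
    {"name": "a", "age": "31"},
    {"name": "a", "age": "31"},   # duplicate
    {"name": "b", "age": ""},     # missing value
    {"name": "c", "age": "45"},
]
print(clean(raw))  # [{'name': 'a', 'age': 31}, {'name': 'c', 'age': 45}]
```

Whether to drop rows with missing values or impute them instead is a modeling decision that depends on how much data you can afford to lose.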
  • asked a question related to Data Science
Question
10 answers
For example, can I import two different datasets, one for training the model and the other for testing it? In the classifier, can I use X_train, y_train from the training dataset and X_test, y_test from the testing dataset? Is that legitimate?
Does it affect model accuracy or not?
Relevant answer
Answer
Dataset splitting is performed to validate the predictive ML model's performance. From the description above, you want to import two different datasets, one for training the model and the other for testing it; that is, you want to test directly on the test dataset after training. If possible, also use a validation dataset for the model; it is always good practice. Without checking and validating the model, it is not right to rely on its predictions, and in sensitive areas like healthcare any mistake in prediction can lead to major fatalities due to wrong decisions taken by the model in real-life use. Validating the ML model at the training and development stage helps it make the right predictions. The validation set is also termed the development set, which makes sense since this dataset helps during the "development" stage of the model.
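To make the two-dataset setup concrete, here is a minimal sketch in plain Python: a model is fitted on one dataset and evaluated on a separately collected one. The data are synthetic; with scikit-learn you would simply pass X/y from each file and skip train_test_split entirely:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Train on one file's data, evaluate on a separately collected file.
train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]      # follows y = 2x + 1
test_x,  test_y  = [4, 5],       [9, 11]

a, b = fit_line(train_x, train_y)
mse = sum((a * x + b - y) ** 2 for x, y in zip(test_x, test_y)) / len(test_x)
print(round(a), round(b), mse)  # 2 1 0.0
```

The key caveat is that the two datasets must come from the same distribution; if the test file was collected under different conditions, the measured accuracy reflects that shift as well as the model.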
  • asked a question related to Data Science
Question
10 answers
Thinking that someone will hand you a problem on a plate, and that you will play around with some model and be paid as a scientist (YouTube is full of such tutorial videos), is not going to happen; that kind of task will be done by "artificial intelligence" itself, automatically. Being a scientist means formulating a new problem that is not even seen as a problem yet, analyzing it, convincing others of its potential benefit, and then showing how it can be solved.
Relevant answer
Answer
I don't think we can talk about a data science bubble. On the contrary, I think there will be a consolidation of the role of the data scientist in every civil, academic and economic field. What I believe is that we cannot completely automate, or delegate to artificial intelligence, the human capacity for interpreting a problem and the cause-and-effect processes between the problem under examination and the surrounding world, i.e. its material, moral, economic and social effects. The analysis of these processes cannot and must not be left to thinking machines; the consequences are potentially devastating for mankind. We can automate methods and the statistical analysis of solutions, but we must not delegate problem analysis without the human presence of a data scientist.
  • asked a question related to Data Science
Question
11 answers
What could be a research topic for the application of data science in the finance sector?
Relevant answer
Answer
Can anyone suggest some titles on enterprise-scale financial data analysis?
  • asked a question related to Data Science
Question
4 answers
Hi All
I'm looking for papers on self-driving cars that discuss current capabilities and work: how much data processing these cars do, and what steps and techniques are applied to improve them.
Relevant answer
Answer
@Farzad possibly this could help your cause.
Yeshodara, N. S., Nagojappa, N. S., & Kishore, N. (2014, October). Cloud based self driving cars. In 2014 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM) (pp. 1-7). IEEE.
  • asked a question related to Data Science
Question
10 answers
One of the common problems in data science is gathering data from various sources in a somewhat cleaned (semi-structured) format and combining metrics from those sources for higher-level analysis. Looking at other people's efforts, especially other questions on this site, it appears that many people in this field are doing somewhat repetitive work; for example, analyzing tweets, Facebook posts, Wikipedia articles, etc. is part of a lot of big data problems.
Some of these data sets are accessible using public APIs provided by the source site, but usually some valuable information or metrics are missing from these APIs, and everyone has to do the same analyses again and again. For example, although clustering users may depend on the use case and selection of features, having a base clustering of Twitter/Facebook users could be useful in many big data applications, yet it is neither provided by the API nor available publicly in independent data sets.
Is there any index or publicly available hosting site containing valuable data sets that can be reused in solving other big data problems? I mean something like GitHub (or a group of sites/public datasets, or at least a comprehensive listing) for data science. If not, what are the reasons for not having such a platform: the commercial value of data, the need to frequently update data sets, ...? Can we not have an open-source model for sharing data sets devised for data scientists?
Relevant answer
Answer
Please generate a separate document to share this information on datasets.
It will be a great help to new researchers who spend a long time searching for such databases.
regards
  • asked a question related to Data Science
Question
5 answers
How can network data science address the social problem of intersectional inequality?
Kindly give your valuable suggestions and help me find some literature on it.
Relevant answer
Answer
There are two solutions: agent-based modeling and combining some networks.
  • asked a question related to Data Science
Question
6 answers
I need a dataset with world or country macroeconomic indicators and/or global indexes, yearly, to build a machine learning predictive model. Who can help? Thanks a lot.
Relevant answer
Answer
  • asked a question related to Data Science
Question
5 answers
Hello everyone
Based on my studies, I have found that it is not possible for all researchers to reproduce the same conditions for producing samples. Therefore, each researcher, by reporting working conditions and effective parameters, tries to produce a sample and perform a series of experiments to extract response data based on their experiment design. The issue with such reports arises when one intends to study and then compare the results: due to differences in parameters, comparisons between studies are not possible. My question is: is there a general way to normalize response data based on multiple independent parameters?
Relevant answer
Answer
Hello Milad,
If I understand your query, you're asking whether some adjustment(s) could be made to results such that comparisons of results of one study with those of another could be "apples to apples" (e.g., as if equivalent). The most optimistic answer is a qualified "maybe."
If a large corpus of such studies existed, and you could evaluate the singular and combinatorial effects of each and every relevant nuisance variable with respect to observed outcomes (as sometimes people attempt in a crude way when they regress meta-analysis results on study characteristics), then you might be able to impose statistical adjustments. Otherwise, you'd only be able to impose partial "control" for study differences, and possibly these adjustments would over- or under-correct. This is definitely a case in which "one size will not fit all (cases)."
Good luck with your work.
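One common form of the partial statistical adjustment described above is residualizing: removing the fitted linear effect of a nuisance parameter from the responses before comparing studies. A sketch in plain Python, with temperature as a hypothetical nuisance parameter:

```python
def residualize(responses, nuisance):
    """Remove the linear effect of one nuisance parameter via OLS, keeping the mean."""
    n = len(responses)
    mx = sum(nuisance) / n
    my = sum(responses) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(nuisance, responses)) \
            / sum((x - mx) ** 2 for x in nuisance)
    return [y - slope * (x - mx) for x, y in zip(nuisance, responses)]

# Responses measured at different temperatures (hypothetical nuisance parameter)
temps = [20, 25, 30, 35]
vals  = [1.0, 2.0, 3.0, 4.0]     # perfectly linear in temperature
print(residualize(vals, temps))  # every value adjusted to the common mean 2.5
```

As noted above, this only controls for the nuisance effects you model, and only linearly; interactions between parameters would need a richer adjustment model.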
  • asked a question related to Data Science
Question
6 answers
I'm working on flood prediction using machine learning as part of my master's degree in data science. I would like to know the basic process for finding data for this, and what models I can use to accomplish this project.
It would also be great to know some of the gaps in the flood prediction models developed by other researchers so far.
Relevant answer
Answer
My suggestion departs from the norm. Do you have some causal understanding/model in mind to explain when flooding occurs, when it doesn't, and when it stops occurring? To me it is the addition of water beyond what is absorbed or removed. I'm no flood scientist, but I can confidently say floods are CAUSED by water flux in > water flux out--conservation of mass, that is water mass. Anyone who fills a bath tub knowingly/unknowingly is applying this principle.
A non-causal model is not very useful in my mind since it's probably unintelligently trying to get to the same root of the problem without consciously seeking a causal link. A model that makes good predictions may be causal, but not necessarily so (think Ptolemy), and may have embarrassing/costly moments of inexplicable model failures; sometimes this is better than nothing, but avoid if you can.
As far as data collection, the flux problem (input and output of water) depends on the region you're evaluating. Where does water enter and leave the area? What are the limits of absorption/drainage? Can they be measured? If so, this ought to be your data collection strategy. Check out Weather Underground for a source of water influx (rain): https://www.wunderground.com/history.
For a master's degree it may be best to keep it simple: select an area where rain is the only influx of water and drainage is the only outflux--this means deserts. Yes, they flood; Tucson, AZ does once a year, pretty much. Looking at past floods (in the news archives) and the associated rain rate (inches per unit time) ought to get you an upper limit. Looking at when floods did NOT occur and the associated rain rate ought to get you a lower limit. If this proves too simple, then gradually add complexity: maybe look at regions with streams, where rain/snow-melt upstream matters, or regions with absorption properties, where the recent history of rain matters.
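The upper/lower rain-rate limits described above amount to a simple threshold classifier; here is a sketch with made-up thresholds (real values would come from the news-archive and rain-rate data):

```python
def flood_risk(rain_rate, lower=0.5, upper=2.0):
    """Classify a rainfall rate (inches/hour) against thresholds learned
    from past flood and non-flood events. Thresholds here are illustrative."""
    if rain_rate >= upper:
        return "likely flood"
    if rain_rate <= lower:
        return "unlikely"
    return "uncertain"

print(flood_risk(2.5), flood_risk(0.2), flood_risk(1.0))
# likely flood unlikely uncertain
```

The "uncertain" band between the limits is exactly where a richer causal model (streams, absorption, recent rain history) would earn its keep.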
  • asked a question related to Data Science
Question
4 answers
I am trying to build an ML model for oil price prediction. What important factors do you recommend using to predict big price jumps (volatility), such as those in 2008 and 2011-2012? Thanks a lot.
Relevant answer
Answer
Policy announcements concerning the energy sector and its regulation.
For instance, any country's "ban on combustion-engine cars by 2030" will force electric mobility and stop or sharply reduce oil-based mobility.
Political decisions to authorize or forbid fracking, or other forms of energy production...
Diplomatic tensions impacting oil markets (producing countries, consuming countries, the balance between the two...)
  • asked a question related to Data Science
Question
8 answers
How machine learning can be helpful for Public Health and Social Science? Please mention some prospective research areas.
Relevant answer
Answer
I agree Anthony Vipin Das
  • asked a question related to Data Science
Question
2 answers
I want to develop a curriculum that includes a consortium of subjects like AI, Data Analytics, Machine Learning, Evolutionary Optimization techniques and the like. Data Science is one among them.
Relevant answer
Answer
Thank you for the information.
  • asked a question related to Data Science
Question
7 answers
Predicting human behavior is challenging. When using machine learning algorithms in human behavior research, the NRMSE will be higher than when we are investigating more precise areas (NRMSE < 0.1), but what is the acceptable range?
Relevant answer
Answer
Md. Qaiser Reza Dear Mr Qaiser Reza,
Thank you for your response. That link is a very useful source, and I myself participated in the answers too. But I am trying to find the acceptable error range specifically in the human behavior field.
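For reference, NRMSE is usually computed as the RMSE divided by the range (or, in some conventions, the mean) of the observed values; a small sketch with synthetic data:

```python
def nrmse(actual, predicted):
    """RMSE normalized by the range of the observed values."""
    n = len(actual)
    rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5
    return rmse / (max(actual) - min(actual))

actual    = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.8, 3.1, 4.3, 4.9]
print(round(nrmse(actual, predicted), 4))  # 0.0487
```

Because the normalizer differs between papers (range vs mean vs standard deviation), any "acceptable range" should be compared only across studies using the same convention.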
  • asked a question related to Data Science
Question
5 answers
Some models like NIMA evaluate photos aesthetically, but I believe these evaluations are based more on the basics of photography, like photos being in focus and so on. I need to analyze more sophisticated photos.
Relevant answer
Answer
Gennaro Vessio Thank you, so far, these are the best sources I've seen.