Questions related to Data Modeling
We're conducting research that requires a trash-filling model as input: trash bin locations and fill levels across a city. Is there an existing method for building a realistic model that generates this kind of data?
If there is no single correct tool, does anyone know of a data generator that can be fed existing data and then randomly generate new data following the same pattern?
We would also be grateful for any trash-filling dataset that we could cite.
Let us consider a sales invoice (a "factor") like this:
Gender | Age | Street | Item 1 | Count 1 | Item 2 | Count 2 | ... | Item N | Count N | Total Price (Label)
Male | 22 | S1 | Milk | 2 | Bread | 5 | ... | - | - | 10 $
Female | 10 | S2 | Coffee | 1 | - | - | ... | - | - | 1 $
We want to predict the total price of an invoice from the buyer's demographic information (such as gender, age, and job) together with the purchased items and their counts. Note that we assume the price of each item is unknown and that prices change over time (so our dataset will also include a date).
The main question is how to use a dataset containing transactional data (items) whose order does not matter. For example, a customer who buys item1 and item2 is equivalent to one who buys item2 and item1, so the order in which values appear in the item columns should make no difference.
This dataset contains both multivariate and transactional data. How can we predict the label more accurately?
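One common way to make the item columns order-independent is to replace them with one count column per distinct item. The sketch below is only an illustration of that idea, not a full answer to the prediction question; the function name `encode_basket` and the three-item vocabulary are assumptions made for the example.

```python
from collections import Counter

def encode_basket(items, vocabulary):
    """Turn [(item, count), ...] into a fixed-length count vector.

    The position of each count depends only on the vocabulary, never on
    the order in which the items appear in the record.
    """
    totals = Counter()
    for item, count in items:
        totals[item] += count
    return [totals.get(item, 0) for item in vocabulary]

# Hypothetical vocabulary built from the items in the example rows.
vocabulary = ["Bread", "Coffee", "Milk"]

row_a = encode_basket([("Milk", 2), ("Bread", 5)], vocabulary)
row_b = encode_basket([("Bread", 5), ("Milk", 2)], vocabulary)
print(row_a, row_b)
```

Both rows encode to the same vector, so any regression model fed these columns (plus the demographic columns and the date) sees the two purchases as identical.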
I have been trying to find a way to fit two functions simultaneously using nonlinear least squares (NLLS): I have to find the optimum values of 3 variables, common to both models, that best fit both of them. I typically use Python's scipy.optimize.least_squares function for NLLS work, which offers the Levenberg–Marquardt algorithm (method='lm').
I tried some specialised multi-objective optimization packages (like pymoo), but they don't seem suitable for my problem as they rely on evolutionary algorithms that output a set of solutions (I only need one optimum solution per variable) and they are made to work for conflicting objectives.
I also tried taking the sum of the norms of the residuals of the two functions (turning it into a single-objective problem) and minimizing that with various gradient-based and gradient-free algorithms from Python's scipy.optimize.minimize function, but I found this norm becomes so huge (even with parameter bounds!) that I get an overflow error (34, result too large), crashing the programme sooner or later. It didn't crash using the truncated Newton method, but the results produced were rubbish (and from running an optimization on this same data with a simpler model, I know they shouldn't be!)
I have to perform this fit for a few thousand data sets per experiment, so it has to be quite robust.
Surprisingly, I cannot find a way to do multi-objective NLLS (only linear regression). I have found some papers on this, but I'm not a mathematician, so understanding them and applying them in Python is quite out of my depth.
Has anyone had a similar problem to solve?
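One possible approach, sketched here under assumptions, is to keep using scipy.optimize.least_squares and simply concatenate the residual vectors of both models into one, so that a single set of three parameters is fitted to both data sets at once. The two model functions, the synthetic data, the starting point and the bounds below are all hypothetical stand-ins for the real problem.

```python
import numpy as np
from scipy.optimize import least_squares

# Two hypothetical models that share the same three parameters (a, b, c).
def model_1(x, a, b, c):
    return a * np.exp(-b * x) + c

def model_2(x, a, b, c):
    return a / (1.0 + b * x) + c

def stacked_residuals(params, x1, y1, x2, y2):
    # One residual vector covering both data sets; least_squares minimises
    # the sum of squares of all its entries.
    r1 = model_1(x1, *params) - y1
    r2 = model_2(x2, *params) - y2
    return np.concatenate([r1, r2])

# Synthetic, noise-free data generated from known parameters.
true_params = np.array([2.0, 1.5, 0.5])
x1 = np.linspace(0.0, 4.0, 40)
x2 = np.linspace(0.0, 4.0, 30)
y1 = model_1(x1, *true_params)
y2 = model_2(x2, *true_params)

fit = least_squares(
    stacked_residuals,
    x0=[1.0, 1.0, 0.0],
    bounds=([0.0, 0.0, -5.0], [10.0, 10.0, 5.0]),
    args=(x1, y1, x2, y2),
)
print(fit.x)
```

Passing the residual vector itself, rather than a pre-computed norm, lets the solver work with the individual residuals. If the two data sets differ greatly in magnitude or noise level, dividing each residual block by an estimate of its noise scale before concatenating keeps one data set from dominating the fit.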
Because of the nature of temporal data, strategies like k-fold cross-validation are not appropriate: randomly assigned folds would let a model train on observations that come after the ones it is tested on, and we cannot ignore the time dimension. In this discussion we want to explore ideas for testing models on temporal data.
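As one idea to start the discussion, a forward-chaining (expanding-window) split keeps every training set strictly earlier than its test set. The sketch below assumes the samples are already sorted by time; the function name and its parameters are illustrative.

```python
def expanding_window_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) with train always before test."""
    fold_size = (n_samples - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(expanding_window_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

scikit-learn ships a similar splitter as `sklearn.model_selection.TimeSeriesSplit`, which may be more convenient when the rest of the pipeline already uses that library.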
I would like to analyze my morphological data in R for the following: outlier identification and removal, heritability, inferential statistics, data modelling...
My data is a bit problematic since replicates are present only for the checks, not for the entries. Any ideas or advice on which model to use?
I’m working with panel data on Foreign Direct Investment (FDI), using FDI flows as the endogenous variable and, among others, the FDI stock in the previous year as one of the explanatory variables. If we used the lagged endogenous variable as an explanatory variable, we would have a dynamic panel data model and should use a suitable estimator (say, Arellano-Bond). However, in my case the explanatory variable is not the lagged endogenous variable (flow [yt-1-yt-2]) but the lagged stock of FDI (yt-1). Should this case be considered a dynamic model too? Should it be estimated using Arellano-Bond or a similar estimator to avoid inconsistency, and is there any specific alternative for this type of specification?
We have too much data piled up, ready to be processed using data science. We have too many ready-to-use tools and technologies, easy-to-use programming languages that do not require a considerable software engineering background, and "Data Scientist" has become a sexy title for a sometimes undefined position that many companies are hiring to fill.
The philosophy and concept of Data Science is changing from the area researchers and scientists knew to a new area of applying ready-to-use tools and technologies to create decision support insights, sometimes a quick jump into inaccurate conclusions due to spurious correlations, the lack of domain knowledge, bad automation, etc.
While taking a one day to one week course might be enough to apply for a data science job these days, the question is how real data science can be understood, rescued, and correctly applied, particularly in industry, considering today's inconsistent mentality about Data Science?
The slope of the curve (in this case a first-order model) gives an indication of the dissolution constant.
I have a scenario where I need to decide whether two or more variables can be combined into a single derived variable. For example, if we have 100 samples of distance traveled and the time taken to travel that distance, can we combine these two variables into a single one, such as Speed = Distance/Time, for all 100 samples? I need a mechanism or test that gives statistical confirmation that Speed is a valid single derived variable. Is there any significance test or other mechanism for this?
Any help would be appreciated and Thanks in Advance!
I have a function as follows:
I attached a simple graph of the function. As it is apparent from the graph the CURVATURE of the function increases from ZERO to a finite value (around x=0.1-0.2) and then it decreases to reach a value of ZERO. I did my best to draw the CURVATURE of this function using the following formula:
However, using this formula I could not reproduce the expected trend of the curvature. Do you have any idea what the problem is?
I can work with MATLAB and Excel.
Your help is appreciated in advance.
I am working on big data (engine data) modelling and have opted for regime-based modelling. The operating space is divided into different regimes or zones, and a local model network (LMN) is identified for each of these regimes.
Hence, every regime has an LMN with its own local models.
I have chosen to identify these regimes using the LOLIMOT algorithm.
Now the question:
What would be the best approach for combining these regime-based LMNs into a global model?
Thank you for your time!
Kind Regards Gaurav
I am working on panel data modeling research that aims to find the labor force dynamics affecting the employment of young, newly graduated women engineers in Turkey. We used the latest Labor Force Survey to build our model. For country-level inference, Turkstat provides a weight for each observation, so that each observation represents a specific share of the Turkish population. However, after filtering for young, newly graduated women who hold an engineering degree and work in an engineering-related role, my sample fell from 370,000 to 726. At that point I think I am no longer trying to make inferences about the Turkish population, so should I keep using the weights in the modeling and discussion? Or should I treat the data as an ordinary random-sample survey and give each individual the same weight?
I have a scenario where many people use a piece of software, and while using it they are assigned values between 0 and 1 based on good or bad usage. For example, good use would give the "usage" parameter a value of 0.99, and the "mistakes performed" parameter would have a value of 0.01.
I have 15 such parameters in the data, which directly or indirectly quantify the usage of the software. I need to score all the people using this software with a mathematical, statistical, or machine learning model.
How should I build a relationship among the 15 variables without a dependent variable and score everyone on a scale of 1 to 100?
How can this equation be built from the data alone, without a dependent variable?
Any discussion or reply is appreciated! Thanks in advance!
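One simple starting point, offered as a sketch rather than a recommendation, is a composite index: orient every parameter so that higher means better, average them, and rescale to 1..100. The equal weighting and the `higher_is_better` flags are assumptions that would need domain input; data-driven weights (for example from a principal component analysis) are a possible refinement.

```python
import numpy as np

def usage_scores(values, higher_is_better):
    """Return one score in [1, 100] per user from parameters in [0, 1]."""
    values = np.asarray(values, dtype=float)
    flags = np.asarray(higher_is_better, dtype=bool)
    # Flip "bad" parameters so that 1.0 always means good usage.
    oriented = np.where(flags, values, 1.0 - values)
    composite = oriented.mean(axis=1)  # equal weights: an assumption
    return 1.0 + 99.0 * composite

# Two hypothetical parameters: "usage" (good) and "mistakes performed" (bad).
data = [[0.99, 0.01],
        [0.50, 0.50],
        [0.00, 1.00]]
scores = usage_scores(data, higher_is_better=[True, False])
print(scores)
```

With all 15 parameters the call is the same, with one flag per column.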
I am trying to understand the impact of weather and calendar effects on the number of visitors to a particular e-commerce website. I already have historical web traffic data for the website and weather data for the corresponding city.
Which data mining technique is recommended for analyzing the patterns and making predictions?
I am considering a classification technique and would appreciate your recommendations. Is there a specific classification model that I can use?
I am currently using KNIME.
I want to collect data on the production of biodiesel from Jatropha curcas seed using various catalysts. Please guide me.
I want to test the Mitscherlich model on data I obtained from papers.
I have data on agronomic yield and an environmental variable (along with the standard deviation and the number of replicates), and I want to test whether they follow a Mitscherlich relation.
The Mitscherlich equation is Y = Ymax * (1 - e^(-b*x))
where Y is the yield, x is my environmental variable, and Ymax and b are the two parameters of my model: the maximum yield and the curve parameter.
To fit it without considering the SD, I use the nls function, but I would like to know whether it is possible to use the SD, or a weight for the data.
Which R function can do that?
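In R, nls itself accepts a `weights` argument, so inverse-variance weights built from the SD and the number of replicates can be passed there. The same idea is sketched below in Python with scipy.optimize.curve_fit, whose `sigma` argument takes the uncertainty of each point; all numbers are hypothetical placeholders, and using the standard error of the mean as the uncertainty is an assumption about how the published values were summarised.

```python
import numpy as np
from scipy.optimize import curve_fit

def mitscherlich(x, y_max, b):
    return y_max * (1.0 - np.exp(-b * x))

# Hypothetical values standing in for means, SDs and replicate counts.
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
y_mean = mitscherlich(x, 10.0, 0.6)
sd = np.array([0.4, 0.5, 0.6, 0.5, 0.7])
n_rep = np.array([4, 4, 3, 4, 3])

# Standard error of each mean; curve_fit weights residuals by 1 / sigma.
sem = sd / np.sqrt(n_rep)

params, cov = curve_fit(mitscherlich, x, y_mean, p0=[8.0, 1.0],
                        sigma=sem, absolute_sigma=True)
y_max_hat, b_hat = params
print(y_max_hat, b_hat)
```

Points with a small standard error pull the curve more strongly than noisy ones, which is the effect the question asks for.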
If we train a data model once on a dataset using a machine learning algorithm, save the model, and then train it again using the same algorithm and the same dataset and data ordering, will the first model be the same as the second?
I would propose a classification of ML algorithms based on their "determinism" in this respect. At one extreme we would have:
(i) those which always produce an identical model when trained from the same dataset with the records presented in the same order and on the other end we would have:
(ii) those which produce a different model each time with a very high variability.
Two reasons why a resulting model varies could be (a) a random walk somewhere in the machine learning algorithm itself, or (b) sampling from a probability distribution to assign a component of an optimization function. More examples would be welcome!
Also, it would be great to make an inventory of the main ML algorithms based on their "stability" with respect to retraining under the same conditions (i.e. same data in the same order), e.g. decision tree induction vs support vector machines vs neural networks. Any suggestions for an initial list and ranking would be great!
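To make the distinction concrete, here is a small sketch of a learner whose only source of randomness is the order in which it visits the records. It is a toy stochastic gradient descent for linear regression written for this illustration, not any particular library's implementation. With a fixed seed, retraining reproduces the model exactly; with a different seed, the weights differ slightly.

```python
import numpy as np

def train_sgd(X, y, seed, epochs=20, lr=0.05):
    """Linear regression by SGD; the only randomness is the shuffle order."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * data_rng.normal(size=50)

w_a = train_sgd(X, y, seed=1)
w_b = train_sgd(X, y, seed=1)   # same seed: same shuffles
w_c = train_sgd(X, y, seed=2)   # different seed: different shuffles
print(np.array_equal(w_a, w_b), np.array_equal(w_a, w_c))
```

Algorithms of type (i) behave like the first pair without needing a seed at all; algorithms of type (ii) behave like the second pair unless every random source is pinned.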
for quite a comprehensive list of methods.
Am starting to collect a small set of diabetes relevant (T1) time series data (movement, blood glucose, sleep, etc.) Have talked to many people who are doing such research, but few people seem to know how to process the data. Some have suggested ARIMA modelling, perhaps with ML for a variable window, but looking for any ideas or ideally someone who wants to co-author a research paper. Please get in touch if you have ideas, know relevant papers, or are interested. Thanks, Dmitri
I am working on my master's thesis for a department of a fairly large company. It is about creating a data warehouse for the department's process metrics. I need to base my research on the state of the art and state of the practice in the data warehousing field. However, I can't figure out what counts as state of the art and state of the practice in this area.
Could you please help me clarify what type of information I should categorize as state of the art or state of the practice, and possibly point out resources where I should look for it? My thesis is mainly about the "database" part of a DW/BI system, that is, dimensional modeling, metadata model creation, and ETL process creation.
If one reducer is handling more than one key, how are the key-value pairs (received by the reducer) differentiated for different keys?
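In Hadoop-style MapReduce the framework sorts the incoming pairs by key before reduction, and the reduce function is then called once per distinct key with all of that key's values. The following is a plain-Python simulation of that sort-and-group step, written only to illustrate the idea; `run_reducer` is a hypothetical name and not part of any MapReduce API.

```python
from itertools import groupby
from operator import itemgetter

def run_reducer(pairs, reduce_fn):
    """Simulate one reducer that receives pairs for several keys."""
    # Sort phase: all pairs with the same key become adjacent.
    ordered = sorted(pairs, key=itemgetter(0))
    results = {}
    # Group phase: reduce_fn is called once per distinct key.
    for key, group in groupby(ordered, key=itemgetter(0)):
        results[key] = reduce_fn(key, [value for _, value in group])
    return results

pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]
totals = run_reducer(pairs, lambda key, values: sum(values))
print(totals)
```

So the reducer never has to separate keys itself: each call to the reduce function sees exactly one key and the list of values that belong to it.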
I am looking for some machine learning/statistical techniques/algorithms/functions with the following capabilities:
- can be called as an API from C++
- can model multiple numerical outputs for calibration modeling*
* For example, consider an industrial process which has two inputs
(temperature, tension) and two outputs (friction, radius), all of which are continuous numerical values. So, for a diversity of input configurations, we want to model what the outputs will be.
Also, we would like to model the reverse, that is, given some output values (friction, radius), we would like to know what the input values should be to obtain those output values.
My name is Ebiteme Joy Botu. I am currently studying for my PhD at the University of Greenwich, and would like to invite you to take part in my current study about ‘the identification of criteria required by pedestrians when traveling through a route’, which for purposes of this survey, will be referred to as ‘the pedestrian route criteria’.
In this study, you will be asked to answer a three part questionnaire pertaining to pedestrian routes. The survey should take you approximately 20 minutes or less to complete. The data collected will not contain any personal identifiable information about you. Please follow the link below to take part in the survey.
Thank you very much.
I'd like to use languages as models for sequences of spatial points (time series).
A point is defined in a coordinate system (x-y Cartesian coordinates or lat/long ellipsoidal coordinates).
An example could be a dataset of eye-tracking positions, or GPS positions of users moving through a city.
I want to fit the data to different thermodynamic models such as R-K-Soave, NRTL, etc.
How can I do that?
Is there any program or simulation tool that can help me?
Or is it possible in MATLAB?
I am working on building context-aware software where I have data models and context models and both of them are in different modules in my architecture.
Some concepts in the context module are the same as in the data module. Is this allowed from a data modelling point of view?
Thanks for your answers
I have 19 large files with an average size of 5 GB, and I want to split the data from all of them into about 35,000 smaller files based on some criteria.
Processed serially, one file takes 8 to 10 hours, and if the session stops or any other failure occurs, the work has to be redone again and again, which threatens the reliability of the data. Is there a way to run this in parallel, or some other faster approach?
This task is very important to me, as my main modelling goal depends on this data. It is a one-time process for my application.
I am working on Windows (moving to Linux is difficult right now). The files are database files; I can produce either MySQL database files or tables as .txt files. I want to split based on selection and filter criteria applied to these files (not after every n bytes). I am using R for this problem. The server I am working on has 8 cores and 32 GB of RAM.
Has anyone come across such a problem and can suggest an approach?
Django-registration has not been maintained for quite a while, and it needs some modification to work with Django 1.7, at least for my project. So this might be a good time to migrate away from django-registration. From a number of discussions I got the impression that django-allauth is the closest in design and among the most advanced alternatives to django-registration.
I found only one blog post about this kind of migration. A link to it says the move is simple, but the blog itself cannot be opened at the moment... Any other tips/notes/recommendations are most welcome!
As we all know, there are many software components in an automotive system. These components have different ports through which they exchange signals with the ports of other software components. The signals are represented simply as lines, and since these lines intersect each other, the resulting diagram of the data communication is untidy and difficult to understand. Is there any approach in information modelling that can bring simplicity to this complex representation?
Inmon, Kimball, Hefesto or another? I'm currently building a data warehouse to pave the way for data mining, the goal of this work is to improve the process of decision-making in education policy. This requires knowing what the best architecture is.
I'm working on a prediction model to support clinical practice. I ran an experimental trial to identify predictor variables, and my model has a good adjusted R2, but I don't know what the next step should be to validate the model and apply it in real practice.
I was wondering if anyone can help me generate some hypothetical rotarod data to explore some hypotheses. Is there software available that I can use, or an existing data set/bank that I can access, to test my hypothesis? Any suggestions would be helpful and appreciated.
I have financial data of some firms and I want to establish relationship between a dependent variable that I have calculated and some independent variable from the firm financials. What are the alternatives available? I know linear regression is one but are there any recent more efficient methods?
Does it make sense to take the autocorrelation of a nonstationary signal or sequence? Or does it not matter, and if so, how is the interpretation affected (if at all)? Do we need to do some pre-processing before taking the autocorrelation of such a signal, or are the results valid as they are?
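A small numerical experiment can show what is at stake. The sketch below uses a random walk as an assumed example of a nonstationary sequence and compares its sample autocorrelation with that of its first difference; differencing is only one possible pre-processing step, chosen here because it suits this particular example.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (mean removed)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
steps = rng.normal(size=2000)
walk = np.cumsum(steps)            # random walk: nonstationary

acf_raw = sample_acf(walk, 20)             # decays very slowly
acf_diff = sample_acf(np.diff(walk), 20)   # close to zero beyond lag 0
print(acf_raw[[1, 10, 20]])
print(acf_diff[[1, 10, 20]])
```

The slowly decaying values for the raw walk mostly reflect the wandering level of the series rather than any repeatable dependence between nearby samples, which is why the raw autocorrelation of a nonstationary signal is hard to interpret.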