
# Data Modeling - Science topic

Explore the latest questions and answers in Data Modeling, and find Data Modeling experts.
Questions related to Data Modeling
• asked a question related to Data Modeling
Question
We're conducting research that requires a trash-filling model as input: trash bin locations and filling levels in a city. Is there any existing method for creating a realistic model to generate this data?
If there is no single right tool, perhaps someone knows of a data-generator model that can be fed real data and then randomly generate data with the same pattern?
We would also be happy to cite a trash-filling dataset if you have one.
We collected data from six communities in one city with the help of the local waste collectors who weighed the trash before it was thrown into the garbage truck. It meant more work for the waste collectors so we made sure there was an incentive for them to do this.
• asked a question related to Data Modeling
Question
Consider a sales record like this:
Gender | Age | Street | Item 1 | Count 1 | Item 2 | Count 2 | ... | Item N | Count N | Total Price (Label)
Male | 22 | S1 | Milk | 2 | Bread | 5 | ... | - | - | 10 $
Female | 10 | S2 | Coffee | 1 | - | - | ... | - | - | 1 $
....
We want to predict the total price of a transaction based on the buyer's demographic information (such as gender, age, job) and on the items bought and their counts. Note that we assume the individual item prices are unknown and change over time (so the dataset will also include a date column).
The main question is how to use this dataset, which contains transactional data (the items) whose combination order is not important. For example, somebody who buys item1 and item2 is equivalent to somebody who buys item2 and item1, so the values in the item columns should not depend on their order.
This dataset therefore contains both multivariate and transactional data. My question is: how can we predict the label more accurately?
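One way to make the item columns order-invariant is to re-encode each row as a fixed "bag of items" count vector over the full item vocabulary; a minimal sketch (the item names and vocabulary are hypothetical):

```python
# A fixed vocabulary makes the encoding order-invariant: each basket becomes
# a count vector, so (Milk, Bread) and (Bread, Milk) map to the same row.
VOCAB = ["Milk", "Bread", "Coffee"]   # hypothetical full item list

def encode_basket(items_with_counts):
    """items_with_counts: iterable of (item, count) pairs in any order."""
    vec = [0] * len(VOCAB)
    for item, count in items_with_counts:
        vec[VOCAB.index(item)] += count
    return vec

row1 = encode_basket([("Milk", 2), ("Bread", 5)])
row2 = encode_basket([("Bread", 5), ("Milk", 2)])   # same basket, other order
```

The demographic columns (gender, age, street) and the date can then be appended to this count vector to form the regression input.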
Hi Dr Behzad Soleimani Neysiani. I agree with Dr Qamar Ul Islam.
• asked a question related to Data Modeling
Question
I have been trying to find a way to fit two functions simultaneously using nonlinear least squares (I have to find the optimal 3 variables, common to both models, that best fit both of them). I typically use Python's scipy.optimize.least_squares function for NLLS work, which supports the Levenberg-Marquardt algorithm.
I tried some specialised multi-objective optimization packages (like pymoo), but they don't seem suitable for my problem, as they rely on evolutionary algorithms that output a set of solutions (I only need one optimal solution per variable) and they are designed for conflicting objectives.
I also tried taking the sum of the norms of the residuals of the two functions (making it a single-objective problem) and minimizing that with various gradient and non-gradient based algorithms from scipy.optimize.minimize, but I found this norm becomes so huge (even with parameter bounds!) that I get an overflow error (34, result too large), crashing the programme sooner or later. It didn't crash using the truncated Newton method, but the results produced were rubbish (and from running an optimization on this same data with a simpler model, I know they shouldn't be!).
I have to perform this fit for a few thousand data sets per experiment, so it has to be quite robust.
Surprisingly, I cannot find a way to do multi-objective NLLS (only for linear regression). I have found some papers on this, but I'm not a mathematician, so it's quite out of my depth to understand them and apply them in Python.
Has anyone had a similar problem to solve?
Many thanks!
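For the record, one standard way to fit two functions with shared parameters as a single NLLS problem is to stack both residual vectors and hand the concatenation to scipy.optimize.least_squares; a sketch with made-up models and synthetic data (model1/model2 are placeholders for the real functions):

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch: two models sharing one parameter vector p = (a, b, c); model1 and
# model2 are made-up placeholders for the real functions, and the data are
# synthetic. Stacking both residual vectors turns the joint fit into a
# single standard NLLS problem.
def model1(p, x):
    a, b, c = p
    return a * np.exp(-b * x) + c

def model2(p, x):
    a, b, c = p
    return a / (1.0 + b * x) - c

def residuals(p, x1, y1, x2, y2):
    return np.concatenate([model1(p, x1) - y1, model2(p, x2) - y2])

rng = np.random.default_rng(0)
x1 = np.linspace(0.0, 5.0, 50)
x2 = np.linspace(0.0, 5.0, 60)
true = np.array([2.0, 1.3, 0.4])
y1 = model1(true, x1) + 0.01 * rng.standard_normal(x1.size)
y2 = model2(true, x2) + 0.01 * rng.standard_normal(x2.size)

fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x1, y1, x2, y2))
```

If the two residual vectors live on very different scales, weight each block (e.g. divide by an estimated noise level) before concatenating; that also helps with the overflow problem described above.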
Two functions fitted simultaneously to one dataset? Do I understand correctly? What is the objective function?
• asked a question related to Data Modeling
Question
Given the nature of temporal data, strategies like k-fold cross-validation are not appropriate, since we cannot remove the dimension of time. In this discussion we want to explore ideas for testing models on temporal data.
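A common alternative that respects the time dimension is forward-chaining ("rolling origin") validation, where each test block strictly follows its training window; a minimal sketch:

```python
# Minimal sketch of forward-chaining ("rolling origin") validation splits:
# the model is always trained on the past and tested on the block that
# immediately follows, so the time dimension is never shuffled away.
def rolling_origin_splits(n_samples, n_splits, min_train):
    """Yield (train_indices, test_indices) pairs in temporal order."""
    fold = (n_samples - min_train) // n_splits
    for k in range(n_splits):
        train_end = min_train + k * fold
        yield list(range(train_end)), list(range(train_end, train_end + fold))

splits = list(rolling_origin_splits(n_samples=100, n_splits=4, min_train=20))
# every test block starts exactly where its training window ends
```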
Pooia Lalbakhsh you may want to employ deep learning models such as LSTM and GRU.
Good luck
• asked a question related to Data Modeling
Question
Chi-square test for data modelling.
Dear Esraa,
I cannot fully answer your question since I am also struggling with this problem. However, what I do understand is that the metric used in the chi-squared test scales with the scaling factor. This can quite easily be seen from chi² = Σ_i (O_i − E_i)² / E_i: scaling every count by a factor a gives Σ_i (a·O_i − a·E_i)² / (a·E_i) = a · chi², where a is the scaling factor.
A colleague of mine also mentioned that this test is normally used when you are considering counts and categories. I don't think this is the case in your example(?). Furthermore, this colleague also noted that with more counts your uncertainty decreases, and thus you will see that your significance increases as the scaling factor increases. I hope this helps.
• asked a question related to Data Modeling
Question
I would like to analyze my morphological data in R for the following: outlier identification and removal, heritability, inferential statistics, data modelling, etc.
My data are a bit problematic, since replicates are only present for the checks, not the entries. Any ideas or advice on which model to use?
This is a nice GitHub repository in R to do morphological analysis
You may also like to read this short article on morphological analysis
• asked a question related to Data Modeling
Question
Magnetic data modeling using an Excel sheet, a neural network, etc.
• asked a question related to Data Modeling
Question
Is there an Excel implementation of the Douglas Peucker algorithm?
Some improvements can be made to Ian's solution especially if you are working with students learning programming and/or VBA.
• Define classes for points and polylines
• Instantiate variables of those classes to avoid using variants (may lead to memory improvements)
That might be overkill for business analysts or people who just need something done in Excel, but an important step (I think!) for people wanting to learn better coding skills that may be applicable to other languages.
Suggestions welcome!
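For readers who want the algorithm outside Excel/VBA, the recursion itself is short; an illustrative sketch in Python (plain point tuples rather than classes):

```python
import math

# Illustrative Python version of Douglas-Peucker on (x, y) tuples; the same
# recursion is what the VBA versions discussed above implement.
def perpendicular_distance(pt, start, end):
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Keep points that deviate from the start-end chord by more than epsilon."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for k in range(1, len(points) - 1):
        d = perpendicular_distance(points[k], points[0], points[-1])
        if d > dmax:
            dmax, index = d, k
    if dmax > epsilon:
        left = douglas_peucker(points[:index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right           # avoid duplicating the split point
    return [points[0], points[-1]]
```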
• asked a question related to Data Modeling
Question
I'm working with panel data on Foreign Direct Investment, using FDI flows as the endogenous variable and, among others, the previous year's FDI stock as one of the explanatory variables. If we used the lagged endogenous variable as a regressor we would have a dynamic panel-data model and should use a suitable estimator (say, Arellano-Bond). However, in my case I am not using the lagged endogenous variable (the flow [yt-1 - yt-2]) as a regressor, but the lagged stock of FDI (yt-1). Should this case be considered a dynamic model too? Should it be estimated with Arellano-Bond or similar to avoid inconsistency, and is there any specific alternative for this type of specification?
Yes, this too is a dynamic model and applying Arellano-Bond (or the related GMM-type estimators) would be perfectly adequate.
• asked a question related to Data Modeling
Question
We have vast amounts of data piled up, ready to be processed using data science. We have plenty of ready-to-use tools and technologies, easy-to-use programming languages that do not require a substantial software-engineering background, and "Data Scientist", a sexy title for a sometimes ill-defined position that many companies are hiring to fill.
The philosophy and concept of data science is shifting from the field that researchers and scientists knew to a new practice of applying ready-to-use tools and technologies to create decision-support insights, sometimes with quick jumps to inaccurate conclusions due to spurious correlations, lack of domain knowledge, bad automation, etc.
While taking a one-day to one-week course might be enough to apply for a data-science job these days, the question is how real data science can be understood, rescued, and correctly applied, particularly in industry, given today's inconsistent mentality about data science.
I think that your post "The side effect of Data intoxication" and the question "how real data science can be understood, rescued, and correctly applied, particularly in industry ..." is a rare sober voice in a hyped world of buzzwords. The same question can be asked on artificial intelligence (AI), Big Data, machine learning, etc.
As for data science (DS), I believe its purpose is to infer insights (quantitative and qualitative) from an empirical collection of various data, usually obtained without preliminary planning (just because the data are available). However, data science does not offer a causal explanation of the relationships between the variables (factors, features); it merely explores empirical patterns and associations/correlations in past data and attempts a formal projection of these data to forecast future events. (By the way, is a forecast of the future always possible? See the attached presentation.)
Now, there are two distinct areas concerned with using the potential of data. One is focused on Technology for storing, processing and managing large amounts of data of various nature; this is the current trend. It leads to fitting a company's arsenal with data-savvy tools. Value is too often framed as something that increases solely by collecting more data, so investments in data-focused activities center around the tools. This leaves an organization with a big set of tools and a small amount of knowledge on how to convert data into something useful.
Another area is Methodology for making business decisions using modeling and simulation based on data specifically collected to address some business problems. This is called Business Analytics (and more closely related to genuine Data Science). It is getting momentum but still beyond the radar for too many companies.
The latter proceeds in four steps: (i) defining a business problem, (ii) identifying an analytic method (algorithm) or simulation approach, (iii) collecting the data required to feed the algorithm, and (iv) turning the solution into actionable managerial decisions.
In summary:
• Data should NOT be the starting point in business analytics.
• More data does not always result in more accurate predictions.
• Analytics is NOT a side effect of collecting, keeping and presenting/querying data in general.
• Data have business value only in a specific business context.
• Currently there is a trend of focusing on the data itself instead of the business context and methodology.
For more details, you can take a look at the attached presentation "Data Analytics for Solving Business Problems: Shifting Focus from the Technology Deployment to the Analytics Methodology".
• asked a question related to Data Modeling
Question
The slope of the curve (in this case a first-order model) gives an indication of the dissolution constant.
Dear Sejad Ayyoubi, not explicitly, as far as I am aware. However, it will report the estimates for the fitted model, and therefore you will have the estimate for k (d% = 100*(1-exp(-k*t))). Check under "Dissolution data modeling" and then choose one of the first-order model options.
Kind regards, Luis
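The first-order model quoted in the answer, d% = 100*(1-exp(-k*t)), can also be fitted directly, e.g. with scipy; a sketch on made-up dissolution data:

```python
import numpy as np
from scipy.optimize import curve_fit

# First-order dissolution model: d% = 100 * (1 - exp(-k * t)).
def first_order(t, k):
    return 100.0 * (1.0 - np.exp(-k * t))

# hypothetical dissolution-time data (minutes, % dissolved)
t = np.array([5.0, 10.0, 15.0, 30.0, 45.0, 60.0])
d = np.array([22.0, 39.0, 53.0, 78.0, 90.0, 95.0])

(k_hat,), _ = curve_fit(first_order, t, d, p0=[0.05])
```

The slope near t = 0 is approximately 100*k, which matches the remark above that the initial slope indicates the dissolution constant.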
• asked a question related to Data Modeling
Question
I have a scenario where it must be decided whether we can combine two or more variables into a single derived variable. For example: if we have 100 samples of distance traveled and the time taken to travel that distance, can we combine these two variables into a single derived variable such as Speed = Distance/Time for all 100 samples? I need some mechanism or test that provides statistical confirmation for forming Speed as a new derived variable. Is there any significance test or mechanism for this?
Any help would be appreciated. Thanks in advance!
Many indicators are constructed as functions of several variables not necessarily strongly correlated. The ultimate validity condition is their subject domain meaning. Frequently, they are used to measure a certain distance to an ideal modelling condition. Correlational methods, such as factor or principal components analyses, are very good exploratory techniques but they are restricted to linear relationships between the variables. So, you have a lot of work to do on your own.
• asked a question related to Data Modeling
Question
I have a function as follows:
y= a*x^b
a=7e-5
b=-0.755
I attached a simple graph of the function. As it is apparent from the graph the CURVATURE of the function increases from ZERO to a finite value (around x=0.1-0.2) and then it decreases to reach a value of ZERO. I did my best to draw the CURVATURE of this function using the following formula:
K=f"/(1+f'^2)^1.5
However, using this formula I could not reach the predictable trend of the curvature. Do you have any idea what is the problem?
I can work with MATLAB and Excel.
Kind regards,
Ebrahim
Check the power index in the denominator of your expression for the curvature. It should be 3/2 rather than 1/2. Good luck
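With the 3/2 exponent in place, the curvature of y = a*x^b can be checked numerically; a sketch with the stated a and b (the grid range is chosen for illustration):

```python
import numpy as np

# Numerical check of the curvature of y = a*x^b with the 3/2 exponent:
# K = |f''| / (1 + f'^2)^(3/2).
a, b = 7e-5, -0.755
x = np.linspace(1e-3, 1.0, 20000)
f1 = a * b * x**(b - 1)              # first derivative
f2 = a * b * (b - 1) * x**(b - 2)    # second derivative
K = np.abs(f2) / (1.0 + f1**2)**1.5
i = K.argmax()                        # curvature rises, peaks, then decays
```

Note that with these parameters the peak sits at quite small x; if a plotted curvature looks flat near the expected peak, the evaluation grid may simply be too coarse there.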
• asked a question related to Data Modeling
Question
Dear all,
Does anybody know how to participate in the WHO health statistics and health surveys, which are published annually on different healthcare topics and disorders? By participation, I mainly mean providing local data or modeling assessments.
Hi Dear;
Good Luck.
• asked a question related to Data Modeling
Question
Hi,
I am working on big-data (engine data) modelling and have opted for regime-based modelling. The operating space is divided into different regimes or zones, and a local model network (LMN) is identified for each of these regimes.
Hence, every regime has an LMN with its own local models.
I have chosen to identify these regimes using the LOLIMOT algorithm.
Now the question: what would be the best approach for combining these regime-based LMNs into a global model?
Kind Regards Gaurav
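One standard LOLIMOT-style construction blends the local models with normalized validity functions (a partition of unity), so the global model is a weighted sum of local predictions; a sketch with hypothetical regime centers and affine local models:

```python
import numpy as np

# Sketch of blending local models into one global model with normalized
# Gaussian validity functions (a partition of unity), as in LOLIMOT-style
# local model networks. Centers, widths and parameters are hypothetical.
def local_model(theta, x):
    # each local model is affine: y = theta0 + theta1 * x
    return theta[0] + theta[1] * x

def validity(x, centers, sigma):
    # Gaussian validity functions, normalized so each row sums to 1
    w = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)
    return w / w.sum(axis=1, keepdims=True)

centers = np.array([0.0, 1.0, 2.0])                       # regime centers
thetas = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, -0.2]])  # local parameters
x = np.linspace(-0.5, 2.5, 200)
w = validity(x, centers, sigma=0.4)

# global prediction: validity-weighted sum of the local model outputs
y_global = sum(w[:, i] * local_model(thetas[i], x) for i in range(len(centers)))
```

Near a regime center the corresponding local model dominates; between centers the prediction interpolates smoothly, which is usually what one wants from a global model built out of regime-wise LMNs.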
Ajit kumar Roy, why would you suggest it is not advisable? Any particular reasons or limitations?
Regards
Gaurav
• asked a question related to Data Modeling
Question
I am working on panel-data modeling research aiming to find the labor-force dynamics affecting the employment of young, newly graduated women engineers in Turkey. We used the latest Labor Force Survey to build our model; as you may know, for country-level inferences Turkstat provides weights for each observation, so that each observation represents a specific share of the Turkish population. However, after filtering for female, young, new graduates who hold an engineering degree and work in an engineering-related role, my sample fell from 370,000 to 726. From that point I think I am no longer making an inference about the Turkish population. Should I keep using the weights in the modeling and discussion, or should I treat the data as an ordinary random sample and give each individual the same weight?
One thing to keep in mind is that you can't use the original sampling weights anymore: since you excluded all but 726 of the 370,000 observations from the original data set, the original sampling weights are no longer valid. Calculating new sampling weights for your sub-sample would be very hard to do. On when and why to use sampling weights, I found this very useful: https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights
• asked a question related to Data Modeling
Question
I have a scenario where many people are using a piece of software, and while using it they are assigned values between 0 and 1 based on good or bad usage; e.g. good use would give the "usage" parameter a value of 0.99, and the "mistakes performed" parameter would have a value of 0.01.
I have 15 such parameters in the data which directly or indirectly quantify usage of the software. I need to score everyone using this software with some mathematical/statistical/machine-learning model.
How should I build a relationship among the 15 variables without a dependent variable, and score everyone on a scale of 1 to 100?
How can I build this equation using only the data, without a dependent variable?
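As one hedged illustration of unsupervised scoring: project the 15 parameters onto their first principal component and rescale to 1-100 (the data here are synthetic; whether PC1 really measures "good usage" has to be judged from its loadings):

```python
import numpy as np

# Hedged sketch: score users on 1-100 from 15 bounded parameters without a
# dependent variable, using the first principal component as a composite.
# The data are synthetic; whether PC1 really captures "good usage" must be
# judged from its loadings (Vt[0]) on the 15 parameters.
rng = np.random.default_rng(42)
X = rng.random((200, 15))                # 200 users x 15 parameters in [0, 1]

Xc = X - X.mean(axis=0)                  # center
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                         # projection on first component
if np.corrcoef(pc1, X.sum(axis=1))[0, 1] < 0:
    pc1 = -pc1                           # orient so higher raw values score higher
score = 1 + 99 * (pc1 - pc1.min()) / (pc1.max() - pc1.min())
```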
You can sum the scores as you would in a typical knowledge score, with 0 for bad/false and 1 for good/true answers; see the example in:
I do not understand why you want to transfer them to a 0-100 scale, though; it seems unnecessary to me.
How to use it? You can use it as a dependent variable and, for example, analyze predictors of software competence. If you have a learning program you can examine change, or you can run an RCT (one group gets the learning program) and compare groups.
Best
John-Kåre
• asked a question related to Data Modeling
Question
I would like to use GGMplus data from http://ddfe.curtin.edu.au/gravitymodels/GGMplus/data/ for modeling subsurface geological structures.
Hi,
I just scanned: Hirt, C., S.J. Claessens, T. Fecher, M. Kuhn, R. Pail, M. Rexer (2013), New ultra-high resolution picture of Earth's gravity field, Geophysical Research Letters, Vol. 40, doi: 10.1002/grl.50838.
It appears to me that GGMplus is the most recent free-air gravity compilation utilizing satellite data. It should be convertible to Bouguer gravity if you have a good elevation grid, and then convertible to isostatic gravity if you have a good isostatic correction grid.
• asked a question related to Data Modeling
Question
Hi
I am trying to understand the impact of weather and calendar on a particular e-commerce website visitors. I already have historical web traffic and weather data for a particular eCommerce website and city.
Which Data mining technique is recommended to analyze the patterns and to make predictions?
I am considering a classification technique and would appreciate your recommendations. Is there any specific classification model I could use?
Currently I am using the KNIME software.
Thank you
Thank you very much everyone, very interesting comments and recommendations.
I have discussed with my supervisor and he recommended association rule but also to be validated with other techniques. I am now cleaning the data and trying to create categories for the factors.
Is there any credible source where I can find meteorology term guidelines, if possible for Germany?
For example:
the temperature ranges for cold (chilly), sunshine, and warm.
Samer Sarsam I will do cross validation thank you .
Jesús S. Aguilar-Ruiz I would like to understand the relationship and also to predict . I will work the association rule but i will check also regression tree model. thank you.
Bojan Ploj I will discuss that with my supervisor
Milford Estay Thank you for your offer but for privacy reasons i am not allowed to do that plus i am new to data mining but i would like to learn.
Jorge Humberto Guanín Fajardo : Any particular supervised techniques in mind?
Thank you all for your help
Kib
• asked a question related to Data Modeling
Question
I want to collect data for the production of biodiesel from Jatropha curcas seed with various catalysts. Please guide me.
What you need to do is keep all the reaction conditions constant and vary only one condition for each catalyst acting on the Jatropha curcas seed. For example, you can keep temperature and pressure constant and vary the concentration of each catalyst acting on a specified amount of seed, then measure the amount of biodiesel produced at each catalyst concentration. The catalyst concentration serves as the input, and the biodiesel produced per specified amount of catalyst serves as the output. The model is then obtained from these inputs and outputs. Thereafter the simulation follows, proceeding to analysis and conclusions.
• asked a question related to Data Modeling
Question
I want to fit the Mitscherlich model to data I obtained from papers.
I have data on agronomic yield and an environmental variable (naturally I have the standard deviation and the number of repetitions), and I want to test whether there is a Mitscherlich relation.
The Mitscherlich equation is Y = Ymax * (1 - e^(-b*x))
where Y is the yield, x my environmental variable, and Ymax and b the two parameters of my model: the maximum yield and the curve parameter.
To model it without considering the sd, I use the nls function, but I would like to know whether it is possible to use the sd, or a weight for the data.
Which R function can do that?
Thanks
Thanks Wendel.
The regression applied in the article you sent is polynomial, which doesn't work for a nonlinear, non-polynomial equation.
Finally I found a solution for the fitting step, using the R nls function.
First I had to determine a weight for every point, based on the variance and the number of experimental repetitions: Wi = Ni / vi.
Finally I used the nls function, passing the weights into the function:
start2 <- list(ym = 10000, b = 0.01)        # my starting parameter values
n2 <- nls(Yield ~ ym * (1 - exp(-b * x)),   # my equation
          start = start2,
          weights = WPB,                    # this input weights my data
          trace = FALSE)
• asked a question related to Data Modeling
Question
If we train a data model once on a dataset using a machine learning algorithm, save the model, and then train it again using the same algorithm and the same dataset and data ordering, will the first model be the same as the second?
I would propose a classification of ML algorithms based on their "determinism" in this respect. At one extreme we would have (i) those which always produce an identical model when trained on the same dataset with the records presented in the same order; at the other end, (ii) those which produce a different model each time, with very high variability.
Two reasons why a resulting model varies could be (a) a random element somewhere in the machine learning algorithm itself, or (b) sampling from a probability distribution to assign a component of an optimization function. More examples would be welcome!
Also, it would be great to make an inventory of the main ML algorithms based on their "stability" with respect to retraining under the same conditions (i.e. same data in the same order), e.g. decision tree induction vs. support vector machines vs. neural networks. Any suggestions for an initial list and ranking would be great!
There is an element of chance in the training process. In some software, you can get reproducible answers by using something like set.seed( ) in the R language. Using the seed number again with the same data will then give the same result. Then you can report the software you used with the seed. However in general the different outcomes will be close together, but as with sampling, you will occasionally get outliers (depending on the seed you choose).
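The seeding point can be illustrated with a toy training loop: with the same seed the random initialization, and hence the final model, is reproduced exactly (numpy, synthetic data):

```python
import numpy as np

# Toy illustration: "training" is gradient descent on linear regression with
# a randomly initialized weight vector. Re-running with the same seed (and
# the same data in the same order) reproduces the model bit for bit.
def train(seed, X, y, steps=100, lr=0.1):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])      # the only random ingredient
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w1 = train(seed=123, X=X, y=y)
w2 = train(seed=123, X=X, y=y)   # same seed -> identical model
```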
• asked a question related to Data Modeling
Question
Analyzing Big data, which is recommended; R or Python?
Python (scikit-learn + pandas, or Anaconda), and you can always call R from Python (https://sites.google.com/site/aslugsguidetopython/data-analysis/pandas/calling-r-from-python).
• asked a question related to Data Modeling
Question
Am starting to collect a small set of diabetes relevant (T1) time series data (movement, blood glucose, sleep, etc.) Have talked to many people who are doing such research, but few people seem to know how to process the data.  Some have suggested ARIMA modelling, perhaps with ML for a variable window, but looking for any ideas or ideally someone who wants to co-author a research paper.  Please get in touch if you have ideas, know relevant papers, or are interested.   Thanks, Dmitri
PCA imposes no limiting a priori hypothesis, but you must be very careful in the choice of variables and statistical units. For example, if you have blood glucose measured at three times (6, 12 and 24 hours), your variables should be g6, g12 and g24, and your statistical units the individual subjects. If you also have exercise measured at, say, 11, 15 and 40 hours, you add three more variables named e11, e15 and e40. Each individual will then be a row vector with the corresponding value for each variable. Pay attention to filling in all the variables so there are no voids in the matrix, then proceed (but first, of course, read carefully one of the many existing introductory articles or books on PCA).
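The matrix layout described above can be sketched as follows (all values invented; columns are g6, g12, g24, e11, e15, e40 and rows are subjects):

```python
import numpy as np

# The matrix layout described above: one row per subject, one column per
# measurement (g6, g12, g24 for glucose; e11, e15, e40 for exercise).
# All values are invented for illustration.
columns = ["g6", "g12", "g24", "e11", "e15", "e40"]
data = np.array([
    [5.1, 7.8, 6.2, 20.0, 35.0, 10.0],   # subject 1
    [4.9, 8.1, 5.9, 18.0, 40.0, 12.0],   # subject 2
    [5.5, 7.2, 6.8, 25.0, 30.0,  9.0],   # subject 3
    [4.7, 8.4, 6.0, 22.0, 38.0, 11.0],   # subject 4
])
assert not np.isnan(data).any()          # no voids allowed in the matrix

# PCA via SVD of the standardized matrix
Z = (data - data.mean(axis=0)) / data.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / (s**2).sum()          # variance share per component
```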
• asked a question related to Data Modeling
Question
I am working on my master's thesis for quite a big company department. It is about creating a data warehouse for the department's process metrics. I need to ground my research in the state of the art and state of practice in the data-warehousing field. However, I can't figure out what counts as the state of the art and of practice in this area.
Could you please help me clarify what type of information I should categorize as state of the art or of practice, and possibly point out resources where I should look for them? My thesis is mainly about the "database" part of the DW/BI system: the dimensional modeling, metadata-model creation, and ETL-process creation.
Hi,
what is missing in your message is the subject of your thesis and its title, from which one can get a first idea about the context and motivation of your work. Does the department need a traditional data warehouse, or are there specific needs to be met? Is the company dealing with structured data only? Are there any data warehouse / big data integration needs? What is specific about the company's data and needs that cannot be met by existing data-warehousing approaches?
I think your starting point is rather the title and theme of your thesis. Once you have a clearer idea about what your company is looking for in data warehousing, you can draw up the main keywords and then drive your "state of the art" and "state of practice" according to those keywords.
• asked a question related to Data Modeling
Question
What are the functional dependencies of the breast cancer data set?
You are welcome.
• asked a question related to Data Modeling
Question
If one reducer is handling more than one key, how are the key-value pairs (received by the reducer) differentiated for different keys?
They are first grouped by key and then sorted. While processing key/value pairs in your reducer (data is passed to it sequentially), you have to implement the "splitting" of the incoming data sequence yourself. Here is an example (it's in Python, but can be used with Hadoop Streaming): https://gist.github.com/anonymous/f28f051bb2a7b1532d03
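In Python, the usual way to implement that splitting is itertools.groupby on the sorted key/value stream; a minimal sketch (the summing reduction is just an example):

```python
from itertools import groupby

# Within one reducer the key/value pairs arrive sorted by key, so the
# records for each key are consecutive; itertools.groupby detects the key
# changes. Summing the values per key is just an example reduction.
def reduce_stream(lines):
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(v) for _, v in group)

# in a real Hadoop Streaming job the lines would come from sys.stdin
demo = ["a\t1\n", "a\t2\n", "b\t5\n"]
result = list(reduce_stream(demo))
```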
• asked a question related to Data Modeling
Question
I am looking for some machine learning/statistical techniques/algorithms/functions with the following capabilities:
- can be called as an api from c++
- can model multiple numerical outputs for calibration modeling*
* For example, consider an industrial process which has two inputs (temperature, tension) and two outputs (friction, radius), all of which are continuous numerical values. So, for a diversity of input configurations, we want to model what the outputs will be.
Also, we would like to model the reverse, that is, given some output values (friction, radius), we would like to know what the input values should be to obtain those output values.
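As an illustration of the forward/inverse idea only: with a linear stand-in for the real process, the forward model can be fitted by least squares and the inverse recovered by solving the resulting system (all data here are synthetic; a nonlinear process would need root finding instead of a linear solve):

```python
import numpy as np

# Illustration only: a linear stand-in for the real process. The forward
# model maps (temperature, tension) -> (friction, radius); the inverse
# recovers the inputs that produce a desired output. All data are synthetic.
rng = np.random.default_rng(1)
inputs = rng.random((100, 2))                    # temperature, tension
W_true = np.array([[1.5, -0.3],
                   [0.2, 2.0]])                  # hypothetical process matrix
outputs = inputs @ W_true + 0.01 * rng.standard_normal((100, 2))

# forward model: multi-output least squares (with an intercept column)
A = np.hstack([inputs, np.ones((100, 1))])
W, *_ = np.linalg.lstsq(A, outputs, rcond=None)  # shape (3, 2)

def forward(x):
    """Predict (friction, radius) from (temperature, tension)."""
    return np.append(x, 1.0) @ W

# inverse: which inputs yield friction = 0.9 and radius = 1.1?
target = np.array([0.9, 1.1])
x_star = np.linalg.solve(W[:2].T, target - W[2])
```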
Hello,
Google just open sourced its deep learning library a few weeks ago. It's the same C++ library they use internally to train their different deep learning algorithms for image recognition, etc. Look at the following link for TensorFlow.
You could also have a look at DLib, also in C++.
• asked a question related to Data Modeling
Question
My name is Ebiteme Joy Botu. I am currently studying for my PhD at the University of Greenwich, and would like to invite you to take part in my current study about ‘the identification of criteria required by pedestrians when traveling through a route’, which for purposes of this survey, will be referred to as ‘the pedestrian route criteria’.
In this study, you will be asked to answer a three part questionnaire pertaining to pedestrian routes. The survey should take you approximately 20 minutes or less to complete. The data collected will not contain any personal identifiable information about you. Please follow the link below to take part in the survey.
Thank you very much.
Dear Botu,
I would like to add a note to the good comment provided by Peter.
There are different approaches/criteria for defining an urban wayfinding model. However, I think researchers often forget that people do not have a God's-eye view of the scenario. Therefore, criteria such as shortest distance may not be realistic if a person does not have a map of the scenario (i.e. a city map) and the skills to identify the shortest path in complex scenarios.
I have been studying this problem for building evacuation and have submitted a paper on this topic, but it is still under review. Please contact me if you want a copy of this work.
Regards
• asked a question related to Data Modeling
Question
I'd like to use languages as models for sequences of spatial points (time series).
A point is defined in a coordinate system (x-y Cartesian coordinates or lat/long ellipsoidal coordinates).
An example could be a dataset of eye-tracking positions, or GPS positions of users moving through a city.
Could you please state more precisely your understanding of "movement data" and "sequences"?
• asked a question related to Data Modeling
Question
Hi,
I want to fit the data to different thermodynamic models like RK-SOAVE, NRTL, etc.
How can I do that?
Is there any program or simulation that can help me? Is it possible in MATLAB?
Dear Colleagues,
Good Day,
suppose you have the following data:
x=[0.9 1.5 3 4 6 8 9.5];
y=[0.9 1.5 2.5 5.1 4.5 4.9 6.3];
Curve fitting with polynomials is done in MATLAB with the polyfit function, which uses the least-squares method.
For the same set of m points, the polyfit function can be used to fit polynomials of any order up to m-1. If n = 1 the polynomial is a straight line, if n = 2 it is a parabola, and so on. The polynomial passes through all the points if n = m-1 (the order of the polynomial is one less than the number of points). It should be pointed out that a polynomial that passes through all the points, or a polynomial of higher order, does not necessarily give a better fit overall: high-order polynomials can deviate significantly between the data points.
The basic form of the polyfit function is:

p = polyfit(x,y,n)
x is a vector with the horizontal coordinates
of the data points (independent variable).
y is a vector with the vertical coordinates of
the data points (dependent variable).
n is the degree of the polynomial.
p is the vector of the coefficients
of the polynomial
that fits the data.
p=polyfit(x,y,3)
xp=0.9:0.1:9.5;
yp=polyval(p,xp);
plot(x,y,'o',xp,yp)
xlabel('x'); ylabel('y')
See pages 267-274; if you need any help, just ask, please.
Hazim
• asked a question related to Data Modeling
Question
I need to apply the method of hidden cointegration as developed by Granger and Yoo (2002). How can it be done in EViews 8.0?
Dear friend, I have not used version 8 of EViews. Please pardon me.
• asked a question related to Data Modeling
Question
Hello,
I am working on building context-aware software where I have data models and context models, in different modules of my architecture.
Some concepts in the context module are the same as in the data module. Is this allowed from a data-modelling point of view?
Hi Ines, it is an interesting question. It would be even better if we had some examples or diagrams here to illustrate the issue. I teach (and do) systems modeling, and many of my students also create context diagrams alongside data models, process models and object models. My general answer is yes, it is sometimes possible to have the same concept in different models. If something is in the business context model but is also a major entity on which we want to record detailed data, it would make sense to have it in the data or database model as well. Location-related data, for example, can be quite detailed. Best wishes and good luck with your work.
• asked a question related to Data Modeling
Question
I have 19 large files (about 5 GB each on average), and I want to split their contents into roughly 35,000 smaller files based on some criteria.
Processing one file serially takes 8 to 10 hours, and if the session stops or anything else fails I have to start over, which threatens the reliability of the data. Is there any way to run this in parallel, or a faster approach altogether?
This task is very important to me, as my main modelling goal depends on this data. It is a one-time process for my application.
I am working on Windows (moving to Linux is difficult right now). The files are database files; I can produce either MySQL DB files or tables as .txt files. I want to split based on selection and filter criteria applied to these files (not after every n bytes). I am using R for this problem, on an 8-core server with 32 GB of RAM.
Please suggest an approach if anyone has come across such a problem.
Splitting your database seems like a very messy approach. Many database technologies support data access out of memory, and can also be very quick if your data is structured properly. I would suggest looking at HDF5 or something similar. Python has some good tools for this work with PyTables and Pandas. Good luck. http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas
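The asker works in R, but the streaming idea behind the answer above can be sketched in Python (suggested there via pandas/PyTables): split on a key column while reading line by line, so a 5 GB file never has to fit in memory. The column name station_id is invented for illustration; substitute the real filter column.

```python
# Split one large delimited file into many small files based on the value
# of a key column, streaming row by row so the file never loads whole.
# "station_id" is a hypothetical column name for this sketch.
import csv

def split_by_key(path, key="station_id", delimiter="\t"):
    writers = {}  # key value -> (file handle, DictWriter)
    with open(path, newline="") as src:
        reader = csv.DictReader(src, delimiter=delimiter)
        for row in reader:
            value = row[key]
            if value not in writers:
                out = open(f"{value}.txt", "w", newline="")
                w = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                   delimiter=delimiter)
                w.writeheader()
                writers[value] = (out, w)
            writers[value][1].writerow(row)
    for handle, _ in writers.values():
        handle.close()
```

For ~35,000 distinct keys the open-file-handle limit becomes a concern, so either sort the input by the key first or close handles periodically and reopen in append mode. The 19 input files could then be processed in parallel, e.g. one worker process per file, which also limits how much a single failure costs.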
• asked a question related to Data Modeling
Question
django-registration has not been maintained for quite a while, and it needs some modification to work with Django 1.7, at least for my project. So this might be a good time to migrate away from django-registration. From a number of discussions I got the impression that django-allauth is the closest in design, and among the most advanced, of the alternatives to django-registration.
I found only one blog post [1] about this kind of migration. A link to it says the move is simple, but the blog itself cannot be opened at the moment... Any other tips/notes/recommendations are most welcome!
To convert the site from django-registration to django-allauth, I only needed to change "'registration'," to "'allauth'," in INSTALLED_APPS within settings.py (well, I also moved it closer to the end of the list), and all instances of id_username to id_login. Altogether, 5 files including a test and a CSS file. I used id_username as the name of an id attribute and changed that too, for consistency. Three files (one line per file) would be enough: settings.py and two templates.
I used just basic authentication, and the templates for login pages are in the project.
I also found that the blog (see a link in the question) worked for a short while during the last week. It has something more on installing and on copying the templates, which I had ready before I started the migration.
Tested in Django 1.6.
Upgrade to Django 1.7 is not finished, it needs more work...
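The INSTALLED_APPS swap described above can be sketched as follows. Caveat: the answer only mentions replacing 'registration' with 'allauth'; the allauth.account app, the sites framework and SITE_ID shown here are typical additional requirements of django-allauth, not steps taken from the answer.

```python
# settings.py -- sketch of the swap described above
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.sites",   # django-allauth typically requires the sites framework
    # "registration",         # removed: django-registration
    "allauth",                # added: django-allauth
    "allauth.account",        # typically needed for the basic account flows
]
SITE_ID = 1
```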
• asked a question related to Data Modeling
Question
As we all know, there are many software components in an automobile system. These components have different ports through which they communicate, via signals, with ports of other software components. The signals are represented simply as lines, and since these lines intersect each other, the representation becomes very untidy and difficult to understand. Is there any idea from information modelling that could bring simplicity to this complex representation?
Hi again, also see Microsoft Research's Automatic Graph Layout library:
• asked a question related to Data Modeling
Question
What are the best possible ways to build a model form skew data which can be further used for estimation purpose?
In statistics, every model has its assumptions. It seems you are using linear regression, which rests on five assumptions:
1. All observations are independent.
2. The dependent variable Y has a linear relationship with the independent factors x1, x2, ...
3. The variance of Y is homogeneous.
4. All x1, x2, ... are constants measured without error.
5. All x1, x2, ... are independent of each other.
In other words, the residuals of the regression Y = a + b1*x1 + b2*x2 + ..., i.e. Yi - Ŷi, are independent and identically normally distributed.
Hence you should check whether the distribution of the residuals is skewed, rather than that of the Yi.
You should not rely on R^2 = 0.4 alone to judge whether the model fits well. If your data set is large and only a few independent variables are used, R^2 can be low. If a transformation is used, you need to compare the R^2 for the original and the transformed data to see how much it improves.
A low R^2 may be due to missing significant factors, among other reasons.
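The residual check described above can be sketched as follows. This is a minimal sketch on synthetic data, since the original data set is not given; the variables x1, x2 are stand-ins.

```python
# Check skewness of the residuals (not of Y itself), as suggested above.
import numpy as np

# Synthetic data standing in for the real observations
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 200)
x2 = rng.uniform(0, 5, 200)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 1, 200)

# Fit Y = a + b1*x1 + b2*x2 by ordinary least squares
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Sample skewness: near 0 for symmetric residuals; a large absolute
# value suggests a skewed error distribution (consider a transformation)
skew = np.mean((resid - resid.mean()) ** 3) / resid.std() ** 3
print(f"residual skewness: {skew:.3f}")
```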
• asked a question related to Data Modeling
Question
Inmon, Kimball, Hefesto or another? I'm currently building a data warehouse to pave the way for data mining, the goal of this work is to improve the process of decision-making in education policy. This requires knowing what the best architecture is.
It depends on what you need. No matter what the architecture, it comes back to the purpose of building the data warehouse: we need to deliver a database environment that gives the best SQL performance when accessing data from the data warehouse.
You can create a star schema or a snowflake, top-down or bottom-up; it depends on which gives your SQL the best performance against the data warehouse rather than against the OLTP system.
• asked a question related to Data Modeling
Question
I'm working on a prediction model to help clinical practice. I ran an experimental trial to identify predictor variables, and my model has a good adjusted R2, but I don't know what the next step should be to validate the model and apply it in real practice.
To check the performance of your model you can carry out cross-validation: leave one observation out, fit the model using the remaining n-1 observations, and compare the value predicted by the refitted model with the observation that was left out; repeat this for each observation in turn. (Many R functions have facilities for carrying out cross-validation.)
If the performance of your model is not satisfactory, you could refine it (by finding a more suitable model or removing outliers, if any exist) and repeat the performance check until you get a satisfactory model.
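The leave-one-out procedure described above can be sketched like this, on synthetic data (the clinical predictors are stand-ins for the real ones):

```python
# Leave-one-out cross-validation for a linear model, as described above.
import numpy as np

# Synthetic data: intercept column plus two made-up predictors
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.uniform(0, 1, (30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.1, 30)

errors = []
for i in range(len(y)):
    keep = np.arange(len(y)) != i          # drop observation i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    pred = X[i] @ coef                     # predict the held-out point
    errors.append((y[i] - pred) ** 2)

# Cross-validated RMSE: an honest estimate of out-of-sample error
rmse_cv = np.sqrt(np.mean(errors))
print(f"LOO RMSE: {rmse_cv:.3f}")
```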
• asked a question related to Data Modeling
Question
I was wondering if anyone can help to generate some hypothetical rotarod data to explore some hypothesis. Is there a software available out there that I can use or is there an existing data set/bank that I can access to test my hypothesis? Any suggestion will be quite helpful and appreciated.
Here is one of my research works in which I used the rota-rod model to check the skeletal muscle (SKM) relaxant activity of Hibiscus rosa-sinensis. I hope it gives you all the required details.
• asked a question related to Data Modeling
Question
I have financial data for some firms, and I want to establish a relationship between a dependent variable that I have calculated and some independent variables from the firms' financials. What alternatives are available? I know linear regression is one, but are there any more recent, more efficient methods?
MISUSE OF LINEAR MODELING
Linear regression has often been misused as the holy grail of proving a forecast relationship. There is always a built-in tendency to conclude that Y changes at the rate of "b" for every unit change in X, reading from the linear equation Y = a + bX + c. In practice, this equation gives only a rudimentary forecast from arrays X: (x1, x2, ..., xn) and Y: (y1, y2, ..., yn). Some financial or economic data may not be linear, and for non-linear data, linear modeling may not be appropriate.
CORRELATION COEFFICIENT
When working with financial information, some data may not behave in a straight line fashion, especially cost functions or productivity. Instead of using linear equation as a starting point, one may look at correlation coefficient as a tool to measure the association between two data arrays. In so doing, use the following as a guide:
X quantitative: Y quantitative = Pearson correlation
X quantitative: Y ordinal = Biserial correlation
X quantitative: Y nominal = Point biserial correlation
X ordinal: Y quantitative = Biserial correlation
X ordinal: Y ordinal = Spearman rho or tetrachoric correlation
X ordinal: Y nominal = Rank biserial correlation
X nominal: Y quantitative = Point biserial correlation
X nominal: Y ordinal = Rank biserial correlation
X nominal: Y nominal = Phi, L, C & Lambda
The correlation coefficient lies in the interval between negative one (-1) and positive one (+1), and the strength of association may be read from this range. For forecasting, if linear regression is not appropriate, then consider non-linear curve fitting.
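As a quick illustration of the Pearson and Spearman entries in the table above (a sketch on synthetic data): for a monotone but non-linear relationship, Spearman's rho stays near 1 while Pearson's r drops, which is one signal that a straight-line model is a poor fit.

```python
# Pearson vs Spearman on a monotone but non-linear relationship.
# Spearman is Pearson applied to the ranks; this simple ranking
# ignores ties, which is fine for continuous synthetic data.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = np.exp(x / 2) + rng.normal(0, 0.5, 100)   # strongly curved, monotone

pearson = np.corrcoef(x, y)[0, 1]

def ranks(a):
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(1, len(a) + 1)
    return r

spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
print(f"Pearson {pearson:.2f}, Spearman {spearman:.2f}")
```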
NON-LINEAR CURVE FITTING
For non-linear data dispersion, it is necessary to look beyond linear modeling: one needs non-linear curve fitting. One approach to curve fitting is to increase the order of the polynomial equation. Basic definitions follow: (1) the line equation y = a + bX + c is a polynomial of the first order; it is the best-fit equation for two points (hence, the shortest distance between two points is a straight line); (2) the equation y = aX^2 + bX + c is a second-order polynomial; it is the best fit of a curve bounded by three points; and (3) the equation y = aX^3 + bX^2 + cX + d is a polynomial of the third order; it is the best-fit polynomial connecting four points. Each point referred to here is known as a constraint, and a constraint may be a point, an angle or a curvature.
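The order-raising idea above can be sketched with numpy.polyfit, the NumPy counterpart of the MATLAB polyfit shown earlier in this thread. The data here is synthetic: an exact second-order polynomial, so the first-order fit leaves large residuals while the second- and third-order fits are essentially exact.

```python
# Compare polynomial fits of increasing order, as described above.
import numpy as np

x = np.linspace(0.9, 9.5, 30)
y = 0.5 * x**2 - 2 * x + 3          # exactly a 2nd-order polynomial

for order in (1, 2, 3):
    p = np.polyfit(x, y, order)     # least-squares coefficients
    resid = y - np.polyval(p, x)    # fit error at the data points
    print(f"order {order}: max |residual| = {np.abs(resid).max():.2e}")
```

In practice one stops raising the order as soon as the residuals stop improving meaningfully; higher orders than the data warrants only fit the noise.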
ATTACHED FILE
A file is attached for exercise in nonlinear curve fitting.
• asked a question related to Data Modeling
Question