# Data Model - Science topic

Questions related to Data Model

Dear Scholars,

I have a stationary dependent variable and non-stationary independent variables. I employed a panel ARDL model, but I would also like to run a static panel data model. To control for country differences, I decided to use a fixed effects model, but I could not find a clear answer about taking differences.

Should I take differences for all variables or just for non-stationary variables?

Thank you very much for your help.

Can someone direct me to a working link to download the Century model for SOC?

The link in the Colorado State University site below doesn't seem to work

I'm trying to get **data on loan officers from microfinance** (how many borrowers they approach, loan amount outstanding, portfolio risk, percentage of complete repayment, etc.). Can anyone suggest a database I could use for a panel data model? Thank you.

One of my big problems is finding articles that could suggest new thoughts or research directions for my work. Part of the problem is the amount of extraneous material (dirt) that is available. For example, when I see an abstract that is long (more than about 300 words in English), I simply ignore it. My experience tells me such work is usually unfounded, vague, or hand-waving. But there is a possibility that there is a grain of something that I'm ignoring. There is also the possibility that I'm missing some paper that may be valuable. Then there are all those ad hominem statements, to which I respond by simply ignoring those authors. I'd like to be more effective at finding new data/models while ignoring the dirt. How can I be more effective at distributing my research?

I studied a process using design of experiments. First, I screened the factors with a fractional factorial design. The results showed that 3 of the 5 factors are significant, and I also found significant curvature in the model. So I used a response surface method (Box-Behnken) with the 3 selected factors to better understand the process. The results showed that a linear model fits the data best. I am confused by these results. Why would the fractional factorial design show curvature while the process behaves linearly in the RSM design?

I have panel data comprising 5 cross-sections and 14 independent variables, with a time dimension of 10 years. When I run the pooled OLS and FE models I get results, but the random effects model gives an error saying that RE estimation requires the number of cross-sections > the number of coefficients for the between estimator, for estimation of the RE innovation variance. Can anyone help me obtain results for the random effects model?

I am working on the development of a PMMS model. To select the best-performing tools and models, several models need to be developed and validated. Can this be replaced by some optimization algorithms?

The 5 methods of estimating dynamic panel data models using "dynpanel" in R

*# Fit the dynamic panel data using the Arellano Bond (1991) instruments*

reg<-dpd(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,Produc,index=c("state","year"),1,4)

summary(reg)

*# Fit the dynamic panel data using an automatic selection of appropriate IV matrix*

*#reg<-dpd(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,Produc,index=c("state","year"),1,0)*

*#summary(reg)*

*# Fit the dynamic panel data using the GMM estimator with the smallest set of instruments*

*#reg<-dpd(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,Produc,index=c("state","year"),1,1)*

*#summary(reg)*

*# Fit the dynamic panel data using a reduced form of IV from method 3*

*#reg<-dpd(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,Produc,index=c("state","year"),1,2)*

*#summary(reg)*

*# Fit the dynamic panel data using the IV matrix where the number of moments grows with kT*

*# K: variables number and T: time per group*

*#reg<-dpd(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,Produc,index=c("state","year"),1,3)*

*#summary(reg)*

A common threshold for standardized coefficients in structural equation models is 0.1. But is this also valid for first difference models?

Can anyone suggest ensembling methods for the outputs of pre-trained models? Suppose there is a dataset containing cats and dogs, and three pre-trained models are applied: VGG16, VGG19, and ResNet50. How would you apply ensembling techniques such as bagging, boosting, or voting?
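As one possible starting point, voting can be applied directly to the class probabilities each network outputs for an image. The sketch below is only an illustration: the probability values are made up, the class order `[cat, dog]` is an assumption, and `soft_vote` / `hard_vote` are hypothetical helper names, not part of any library.

```python
from collections import Counter


def soft_vote(prob_lists):
    """Average class probabilities across models; return (winning class index, mean probabilities)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    mean_probs = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: mean_probs[c]), mean_probs


def hard_vote(prob_lists):
    """Each model votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical softmax outputs for one image, classes = [cat, dog].
vgg16 = [0.60, 0.40]
vgg19 = [0.55, 0.45]
resnet50 = [0.10, 0.90]

soft_label, mean_probs = soft_vote([vgg16, vgg19, resnet50])
hard_label = hard_vote([vgg16, vgg19, resnet50])
print("soft vote:", soft_label, "hard vote:", hard_label)
```

With these made-up numbers the two rules disagree: two weakly confident "cat" votes win the hard vote, while one very confident "dog" prediction wins the soft vote. That is the main design choice to make when combining fixed pre-trained models.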

Extended/edited from an earlier question for clarity.

I have temporally high resolution outputs of modelled climate data (model x, spanning 0-5000 ka. Low spatial resolution 0.5 degrees). Compared to other climate models, however, I have reason to believe it is under-predicting precipitation/temperature changes at certain time intervals. Is there a way to calibrate this with better quality records (i.e., those available via WorldClim/PaleoClim)?

For example, the response to the MIS 5e (120-130 ka BP) incursion of the African Summer Monsoon and Indian Summer Monsoon into the Saharan and Arabian deserts is very weak compared to the MIS 5e data from WorldClim/PaleoClim (and corroborated by palaeoclimatic data). Can I correct/calibrate model x with these more responsive models, and how should this be done?

Dear blockchain researchers,

In the classical Nakamoto Blockchain (BC) model, Transactions (TXs) are packaged in blocks and each block points to exactly one previous block (I'm not going to go into technical details here). This is a linear data model, which justifies the name 'chain'. In DAG-based BCs, TXs may or may not be packaged into blocks, and each TX/block (say 'a') is allowed/enforced to point to more than one parent. Consequently, several child blocks/TXs (say 'b', 'c' and 'd') are similarly allowed/enforced to randomly point to 'a' later. This is clearly a network-like data model.

Searching previous work, all the DAG-based BCs I found adopt a many-to-many cardinality model of blocks/TXs as described above. Some do propose that children must point to several parents for higher throughput and credibility. However, none of them proposed, specifically, a relaxed one-to-many parent-child dependency.

To clarify, I specifically mean that children are enforced to point to 'only one' parent, while each parent is allowed to be pointed to by several children. This leads to a tree-like DAG instead of a complicated dense network. I need some references that discuss such data modelling. It would be very helpful if a comparison were also made between different types of DL data models (1-to-many vs. many-to-one vs. 1-to-1 vs. many-to-many).

Any help, explanation, or suggestions are most welcome!
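To make the one-parent, many-children structure described above concrete, here is a minimal sketch of it as a data structure. `TreeLedger` and its method names are hypothetical and exist only for illustration; nothing here reflects an existing ledger implementation.

```python
from collections import defaultdict


class TreeLedger:
    """Toy ledger: each block stores exactly one parent; a parent may have many children."""

    def __init__(self, genesis="genesis"):
        self.parent = {genesis: None}        # block id -> its single parent id
        self.children = defaultdict(list)    # block id -> ids of blocks pointing to it

    def add_block(self, block_id, parent_id):
        if block_id in self.parent:
            raise ValueError(f"duplicate block {block_id!r}")
        if parent_id not in self.parent:
            raise ValueError(f"unknown parent {parent_id!r}")
        self.parent[block_id] = parent_id
        self.children[parent_id].append(block_id)

    def path_to_genesis(self, block_id):
        """Follow the single parent pointer back to the root."""
        path = []
        while block_id is not None:
            path.append(block_id)
            block_id = self.parent[block_id]
        return path


ledger = TreeLedger()
ledger.add_block("a", "genesis")
for child in ("b", "c", "d"):
    ledger.add_block(child, "a")      # 'b', 'c' and 'd' all point to the same parent 'a'

print(ledger.children["a"])
print(ledger.path_to_genesis("d"))
```

Because every block has a single parent, each block has a unique path back to the genesis block. A many-to-many DAG would need a list of parents per block and a traversal over several paths instead.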

Dear all,

I want to evaluate the accuracy of a model using observation data. My problem is that the correlation between the model and the observed data is good (greater than 0.7), but the RMSE is also very high (more than 100 mm per month for monthly rainfall data). How can I explain this? The model also has low bias.

How to explain this case?

Thank you all
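One way such a combination can arise is when the model gets the timing and the mean right but not the amplitude: correlation ignores scale, while RMSE does not. The sketch below uses made-up monthly rainfall values and a hypothetical model that reproduces only 40% of the observed variability, so it is an illustration of the mechanism, not a diagnosis of the actual data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(model, obs):
    """Root mean squared error."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def bias(model, obs):
    """Mean error (model minus observation)."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


# Hypothetical monthly rainfall observations in mm.
obs = [20, 60, 150, 320, 480, 350, 180, 40]
mean_obs = sum(obs) / len(obs)

# Hypothetical model: correct mean and timing, but only 40% of the observed amplitude.
model = [mean_obs + 0.4 * (o - mean_obs) for o in obs]

r = pearson(model, obs)
e = rmse(model, obs)
b = bias(model, obs)
print(f"correlation={r:.2f}  bias={b:.1f} mm  RMSE={e:.1f} mm")
```

Here the correlation is perfect and the bias is zero, yet the RMSE is roughly 94 mm, because wet months are underestimated and dry months overestimated. Comparing the standard deviations of the modelled and observed series is a quick check for this.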

The aim of my study is to investigate the impact of Integrated Reporting disclosure practices on operational performance (ROA) and firm value (Tobin's q). I have applied a panel data model for my analysis. In the descriptive statistics, the standard deviation of Tobin's q is high, i.e. 4.964. One of the reviewers commented that a high standard deviation of Tobin's q means that the variable is not normal, which may affect the results. However, I have read that normality is not required in panel data models. What justification should I give to satisfy the reviewer? Please also mention some references.

I am trying to build a model in RStudio, but I can't find a solution. My procedure is:

- data

- Shapiro-Wilk test for normality (it says the data have a non-normal distribution)

- log transformation

- Shapiro-Wilk test for normality again (it says the data still have a non-normal distribution)

What can I do?

Based on Hansen (1999), we can estimate a fixed effect threshold panel data model. In my model, the Hausman test favours random effects. What can I do?

Hi everybody,

I am trying to run the CESM-atm model, but I can't work out where the path for the data is. I am attaching an image of what the structure of the path should be.

By the way, I am running this model on my personal laptop, so I had to do the porting process beforehand, and I don't think that is really the problem here.

Could anyone explain what I must do to download the data for the model?

Thanks a lot!

There are many models that help us explain certain processes or phenomena in some detail, or even predict to some extent how, why, or what will happen to an organism if we carry out, for example, animal testing or try out a drug or a compound. There are many models, and this does not mean they are all used for the same reason. Some organisms are better suited than others. How many models are there, and why is a given model appropriate? I would like to start by adding a partial answer. I hope we can build a detailed answer to the question that fully explains the reasons for, or the main aspects of, using organisms as models.

A related case may be the use of E. coli for several purposes. Some of them require the use of variants or strains, and some need genetic engineering. Is this organism considered a model despite the transformations, or is it more related to the fact that it is widely used? How many strains are there, why would we use a particular strain, and is this related to the general use of model organisms? Are there more cases where we use organisms like this? Do we classify them under model organisms as a whole? Is there anything else you would like to add to the question?

I noticed that when using the gemtc package to fit a fixed effect model with likelihood = "normal" and link = "identity" (mean difference), the burn-in iterations specified in mtc.run ("n.adapt") are not taken into account.

Example (with "parkinson" data):

model <- mtc.model(parkinson, likelihood='normal', link='identity', linearModel = 'fixed')

res <- mtc.run(model, n.adapt = 20000, n.iter = 75000)

summary(res)

#Results on the Mean Difference scale

#Iterations = 1:75000

#Thinning interval = 1

#Number of chains = 4

#Sample size per chain = 75000

If the linear model is not specified, a random effects model is fitted by default. The random effects model works, and the other likelihood/link combinations work in both models.

Is there a way to use the package on the mean difference scale with a fixed effect model that includes burn-in iterations? Do you see any error in the way I used likelihood='normal' / link='identity'?

In the development of forecasting, prediction or estimation models, we have recourse to information criteria so that the model is parsimonious. So, why and when should one or the other of these information criteria be used?
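For concreteness, the two most common criteria are AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximized log-likelihood. The sketch below compares them on two hypothetical fitted models; the log-likelihood values are invented purely to show how the criteria can disagree.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian (Schwarz) information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical results from fitting two nested models to n = 100 observations.
n = 100
small = {"k": 3, "log_lik": -120.0}
large = {"k": 6, "log_lik": -116.5}

aic_small, aic_large = aic(small["log_lik"], small["k"]), aic(large["log_lik"], large["k"])
bic_small, bic_large = bic(small["log_lik"], small["k"], n), bic(large["log_lik"], large["k"], n)

print("AIC prefers:", "large" if aic_large < aic_small else "small")
print("BIC prefers:", "large" if bic_large < bic_small else "small")
```

In this made-up case AIC picks the larger model and BIC the smaller one, because the BIC penalty per parameter, ln(n), exceeds 2 once n is larger than about 7. Lower values are better for both criteria.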

I have a panel data set of 11 countries and 40 years, consisting of two groups: developing and developed countries. The chosen method will be applied to each group separately in order to compare the results of the two groups. Suggestions will be appreciated.

I am using transfer learning with **pre-trained models** in PyTorch for an image classification task. When I modified the output layer of the pre-trained model (e.g., alexnet) to suit our dataset and ran the code to view the modified architecture of alexnet, the output was "**none**".

I have non-stationary time-series data for variables such as Energy Consumption, Trade, Oil Prices, etc., and I want to study the impact of these variables on the growth in electricity generation from renewable sources (I have taken the natural logarithms of all the variables).

I performed a linear regression, which gave me spurious results (R-squared > 0.9).

After testing these time series for unit roots using the Augmented Dickey-Fuller test, all of them were found to be non-stationary, hence the spurious regression. However, the first differences of some of them, and the second differences of the others, were found to be stationary.

Now, when I estimate the new linear regressions with the proper order of integration for each variable (in order to have a stationary model), the statistical results are not good (high p-values for some variables and a low R-squared (0.25)).

My question is: how should I proceed now? Should I change my variables?

*EDIT: According to the literature suggested in the answers, IT IS NOT POSSIBLE, because the suggested methods require at least some calibration data, which, in my case, are not available.*

I am looking for a technique/function to estimate soil temperature from meteorological data only, for soils covered with crops.

In particular, I need to estimate soil temperature for a field with herbaceous crops at mid-latitudes (north Italy), but the models I found in literature are fitted for snow-covered and/or high-latitude soils.

I have daily values of air temperature (minimum, mean and maximum), precipitation, relative humidity (minimum, mean and maximum), solar radiation and wind speed.

Thank you very much

I am running an ARDL model in EViews and I need to know the following, if anyone could help!

1. Is the optimal number of lags for annual data (30 observations) 1 or 2, **or** should a VAR be applied to find the optimal number of lags?

2. When we apply the VAR, the maximum number of lags applicable was 5; beyond 5 we got a singular matrix error. The problem is that as we increase the maximum number of lags, the optimal number of lags increases too (when we choose 2 lags, we get 2 as the optimum; when we choose 5 lags, we get 5 as the optimum). What should be done?

Hello,

My friend is seeking a collaborator in psychology-related statistics. Current projects include personality traits and their relations to other variables (e.g., age). You will be responsible for data analysis for potential publications. Preferably, you should have some knowledge of statistics and be familiar with software used for analysis (e.g., MATLAB, R, SPSS). 10 hours a week are required. Leave your email address if interested.

I'm a community ecologist (for soil microbes), and I find hurdle models are really neat/efficient for modeling the abundance of taxa with many zeros and high degrees of patchiness (separate mechanisms governing likelihood of existing in an environment versus the abundance of the organism once it appears in the environment). However, I'm also very interested in the interaction between organisms, and I've been toying with models that include other taxa as covariates that help explain the abundance of a taxon of interest. But the abundance of these other taxa also behave in a way that might be best understood with a hurdle model. I'm wondering if there's a way of constructing a hurdle model with two gates - one that is defined by the taxon of interest (as in a classic hurdle model); and one that is defined by a covariate such that there is a model that predicts the behavior of taxon 1 given that taxon 2 is absent, and a model that predicts the behavior of taxon 1 given that taxon 2 is present. Thus there would be three models total:

Model 1: Taxon 1 = 0

Model 2: Taxon 1 > 0 ~ Environment, Given Taxon 2 = 0

Model 3: Taxon 1 > 0 ~ Environment, Given Taxon 2 > 0

Is there a statistical framework / method for doing this? If so, what is it called? / where can I find more information about it? Can it be implemented in R? Or is there another similar approach that I should be aware of?

To preempt a comment I expect to receive: I don't think co-occurrence models get at what I'm interested in. These predict the likelihood of taxon 1 existing in a site given the distribution of taxon 2. These models ask the question do taxon 1 and 2 co-occur more than expected given the environment? But I wish to ask a different question: given that taxon 1 does exist, does the presence of taxon 2 change the abundance of taxon 1, or change the relationship of taxon 1 to the environmental parameters?

Hello,

Does anyone have an idea about how to analyse my panel data on exchange rates and stock markets of six countries over ten years? My panel data set is long (T greater than N) and unbalanced. I'm initially using pooled regression and fixed effects models and the Wald test. But while reading, I noticed that panel data models are chosen according to the panel data structure, so I'm a bit confused. I would be glad to have more insight into which model best fits my data structure. Thanks in advance.

**I need your help** and expertise on the J48 decision tree algorithm, ideally a walk-through of the data analysis and the interpretation of the data model.

How will the data be consolidated, processed, and analyzed, and how is the data model interpreted?

Is there any book that explains which classical assumptions or diagnostic checks need to be tested for a panel data model?

I first conducted a fixed effect model using xtreg y x, fe and I found that all the variables are significant and R-squared is .51.

So I thought that maybe I should use two-step system GMM to account for endogeneity. But since I only have 3 years, when I include the lagged variable as a predictor using xtabond2 y l.y x y*, gmm ( l.y) iv (x y*), equation (level)) nodiffsargan twostep robust orthogonal small, the number of observations per group shrinks to two and I can't even run an AR(1) test or a Sargan test. The output also shows an insignificant lagged variable.

I am still new to dynamic panel data models. Do I need GMM with such a small sample and so few observations? Should I use something else? If I only report fixed effects results, would that be sufficient to be considered for publication?

I would love to hear your recommendation. Thank you very much,

I wish to investigate the effects of landscape parameters on avian community assemblages in an agricultural landscape. In order to conduct modelling in ArcGIS is it advisable to use BIOCLIM data in the Model Builder?

I'm not aiming for prediction; rather, I just want to see the effects of landscape parameters on the bird assemblages.

I have 21 JSON files containing more than 15 million rows, with approximately 10 features in each file. I first need to convert all the JSON files to CSV and combine the CSV files into one to get a high-dimensional dataset. For now, if I load an individual JSON file as CSV, I only get up to Excel's maximum of 1048576 rows, which means I lose the rest of the data. I know I can analyze them using the data model and Power Pivot in Excel. However, I need to load them into a CSV file first to do dimensionality reduction.

Any idea or suggestion on loading this much data in a csv, excel or any other accepted format which I can later use in Python?
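Since the data will end up in Python anyway, one option is to skip Excel and stream the records straight into a single CSV, which has no row limit. The sketch below assumes each file is in JSON Lines form (one JSON object per line); if the files are single large JSON arrays, a streaming parser would be needed instead. The function name `jsonl_to_csv`, the file names and the field names are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def jsonl_to_csv(json_paths, csv_path, fieldnames):
    """Stream records from JSON Lines files into one CSV, one row at a time."""
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        # Unknown keys are dropped and missing keys become empty cells.
        writer = csv.DictWriter(out, fieldnames=fieldnames,
                                extrasaction="ignore", restval="")
        writer.writeheader()
        for path in json_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    writer.writerow(json.loads(line))
                    rows_written += 1
    return rows_written


# Tiny demonstration with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "part1.json").write_text(
        '{"id": 1, "value": 0.5}\n{"id": 2, "value": 1.5}\n', encoding="utf-8")
    (tmp / "part2.json").write_text(
        '{"id": 3, "value": 2.5, "extra": "x"}\n{"id": 4}\n', encoding="utf-8")
    total = jsonl_to_csv(sorted(tmp.glob("part*.json")),
                         tmp / "combined.csv", ["id", "value"])
    csv_lines = (tmp / "combined.csv").read_text(encoding="utf-8").splitlines()

print(total, "rows written")
```

Only one line is held in memory at a time, so the same loop should scale to millions of rows. The combined file can later be read in pieces, for example with the `chunksize` parameter of `pandas.read_csv`.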

Hello!

I estimate the influence of some components of the global competitiveness index on the index itself for 12 countries over a period of 11 years. So, in my model I have N=12 and T=11, while the number of components is 32. The only model that provides acceptable test results for my data is the one-step dynamic panel. In my model I use log differences of the selected variables. Yes, it contains lagged dependent variables, but I am worried about whether the presence of lagged dependent variables and the acceptable test results are enough to justify selecting a dynamic panel data model.

What useful information about the training data can be extracted from a saved model file?

I am also asking from a security perspective. If someone has access to the saved model, what information can they gain?

I have to estimate a panel data model (19 countries and 37 years) with the xtscc command **(regression with Driscoll-Kraay standard errors)**. I want to know how I can choose the optimal lag for this estimation. Thank you for any suggestions.

I am working on this corporate panel data model: LEVERAGE_it = PROFITABILITY_it + NON-DEBT-TAX-SHIELD_it + SIZE_it + LIQUIDITY_it + TOBIN_Q_it + TANGIBILITY_it + u_it. Where:

leverage = long term debt/total assets

profitability = cash flows/total asset

non debt tax shield = depreciation/total asset

size = log (sales)

liquidity = current assets/current liabilities

tobin_q = mkt capitalization/total assets

tangibility = tangible fixed asset/total assets

What can I say about the exogeneity condition? Can I assume that expected value of the covariance between error term u_it and of X_i is zero? Why? A lot of papers make this assumption but do not explain why.

Thanks in advance for your response.

Can I use one Sentinel image for training my model and another one for testing? Since I have two or three images, I wanted to use one or two images for training and the rest for testing. I know that 70/15/15 is the ideal proportion, but I don't know how to implement it for three images. Also, is it possible to leave out the 15 percent for validation and just use 70/30?

Dear all,

I am working on the BACON model to establish the chronology of a lake core. However, I have a question seeking help from you.

Is it necessary to add my 210Pb data into the model? If yes, how to calculate the dating error ?

Thanks,

Mingda

Why do some researchers report results from static panel data models (OLS, FE and RE) alongside results from dynamic models (first-difference GMM and SYS-GMM) in their papers, even though they chose the GMM models as the best models for the research problem?

I have behavioral data (feeding latency) which is the dependent variable. There are 4 populations from which the behavioral data is collected. So population becomes a random effect. I have various environmental parameters like dissolved oxygen, water velocity, temperature, fish diversity index, habitat complexity etc. as the independent variables (continuous). I want to see which of these variables or combination of variables will have significant effect on the behavior.

Regarding interoperability of FEA tools:

1. Is the Dassault Abaqus input file format widely supported by other FEA tools (such as Ansys, Nastran, etc.)? Or does every FEA tool have a specific input file format that cannot be handled/used by a different tool?

2. Are there any interoperability issues between different versions of Nastran provided by different vendors (for instance, MSC Nastran and NX Nastran, etc.)? Or can we use the model developed in one Nastran version (e.g. MSC Nastran) easily in a different Nastran version (e.g. NX Nastran)?

Thanking you.

Obviously, the world is watching COVID-19 transmission very carefully. Elected officials and the press are discussing what "the models" predict. As far as I can tell, they are talking about the SIR model (Susceptible, Infected, Recovered). However, I can't tell if they are using a spatial model and, if so, whether the spatial model is point pattern or areal. This is critical because the disease has very obvious spatial autocorrelation and clustering in dense urban areas. However, there also appears to be a road network effect and a social network effect. For example, are they using a Bayesian maximum entropy SIR? Or a conditional autoregressive Bayesian spatio-temporal model? An agent-based model? A random walk?

I mean "they" generally. I'm sure different scholars are using different models, but right now I think I can find one spatio-temporal model, and what these scholars meant is that they did two cross sectional count data models (not spatial ones either) in two different time periods.
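For reference, the basic SIR model mentioned above has no spatial component at all: it tracks three totals for one well-mixed population. A minimal sketch with simple Euler time stepping is shown below; the parameter values (`beta`, `gamma`, population size) are illustrative assumptions, not estimates for any real outbreak.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I with Euler steps."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Illustrative parameters only: basic reproduction number beta/gamma = 3.
history = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infected:.0f} infected around day {peak_time:.0f}")
```

A spatial version would replace the single population with many linked units, for example one SIR system per area with coupling between neighbours, or individual agents on a network. That is where the point pattern versus areal distinction raised in the question comes in.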

Dear researchers,

For several years, OSM (OpenStreetMap) has been producing huge amounts of spatial data all around the world. As far as I know, some countries, such as Canada, have reorganized their NTDB (National Topographic Data Base) data models to harmonize them with OSM data layers and merge with them, and they also accept the ODbL (Open Database License).

I am wondering if it is possible to get a list of such countries.

Any help will be much appreciated.

Thank you very much for your time.

With Regards

Ali Madad

Sir,

My research topic is the crowding-in and crowding-out effects of public investment on private investment in emerging Asian economies. I have yearly panel data for 6 countries over 15 years, with 1 IV (public investment), 1 DV (private investment) and 8 control variables. As my panel data set is small, I need your suggestion on which panel data model in Stata is suitable for my data.

What variable(s) can be used as instruments for public health and education expenditure in testing for endogeneity in a static panel data model that regresses public health/education expenditure on economic growth? I am using random effects estimators since this is the most appropriate traditional panel technique (the Hausman test suggested this).

There are many variables in the literature that have a positive correlation with public health expenditure, for instance. However, these variables also have strong correlation with real GDP per capita growth rate and therefore are unsuitable instruments.

Dear all,

The panel data model I am going to analyse has some stationary and some non-stationary variables, and the non-stationary variables are integrated of order one. What would be the best estimation method to apply? Please discuss.

The conventional tests for system GMM are 1) a test for instrument validity and 2) a test for second-order serial correlation.

Are there pre-estimation tests that may be relevant, e.g. normality, heteroskedasticity, panel unit root tests, or panel cointegration tests?

I ask because almost 90% of the academic papers I reviewed seem to ignore these tests and focus mostly on the two above.

Hi, I'm testing a serial multiple mediation model with two mediators. I tried twice by testing different data but they all showed the same results that CMIN and df=0.

First, I ran a CFA to check the validity of the model, and the model fit was acceptable. Second, when I ran the mediation test (I built the causal model using only unobserved variables), the Notes for Model showed:

Number of distinct sample moments: 15

Number of distinct parameters to be estimated: 15

Degrees of freedom(15-15):0

Result(Default model)

Minimum was achieved

Chi-square= .000

Degrees of freedom= 0

Probablity level cannot be computed

Based on these, will this model be acceptable to test further hypotheses? Or will this model be meaningful to study? And I checked the literature by Byrne(2001).

The reference is: Byrne, B.M. 2001. *Structural equation modeling with AMOS: basic concepts, applications, and programming.* Mahwah, N.J.: Lawrence Erlbaum Associates.

It mentioned that "this kind of model is not scientifically interesting because it has no degrees of freedom and therefore can never be rejected" (Byrne, 2001). Could anyone give any comments and suggestions on this?

I think it might result from this particular type of causal relationship or coincidence? Because the CFA test of this model:

CMIN=1170.358

df=399

CFI= 0.919

TLI=0.911

SRMR=0.045

RMSEA=0.066

which might provide evidence that the data and model could match well. So, what might be the actual reasons?

Thank you all for any comments in advance!

Thanks!
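The zero degrees of freedom in the mediation model above can be reproduced by simple counting: with p variables in the covariance matrix there are p(p+1)/2 distinct sample moments, and the degrees of freedom are the moments minus the free parameters. The sketch below assumes that means and intercepts are not estimated, and that the structural model has 5 variables (inferred only from 5 × 6 / 2 = 15, not stated in the question). The function name is hypothetical.

```python
def sem_degrees_of_freedom(n_variables, n_free_params, include_means=False):
    """Return (distinct sample moments, degrees of freedom) for a covariance structure model."""
    moments = n_variables * (n_variables + 1) // 2   # variances plus covariances
    if include_means:
        moments += n_variables                       # one mean per variable
    return moments, moments - n_free_params


# Counts taken from the output quoted above: 15 moments, 15 parameters.
moments, df = sem_degrees_of_freedom(n_variables=5, n_free_params=15)
print(moments, df)

# Hypothetical variant: fixing one path to zero frees one degree of freedom.
_, df_constrained = sem_degrees_of_freedom(n_variables=5, n_free_params=14)
print(df_constrained)
```

A model with df = 0 is just-identified (saturated): it reproduces the sample covariances exactly, so chi-square is zero whatever the data look like. Individual path estimates and indirect effects can still be tested, but overall fit cannot be assessed unless at least one parameter is constrained.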

Hi, I am trying to model the effect of human perception on wildfire ignition in the United States. I want to use Google trend data to model society's perception of wildfire. Are there any similar studies that use Google trend data to model people's perception?

Five major differences between data lakes and data warehouses:

**1. Data Lakes Retain All Data**

During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes and profiling data. The result is a highly structured data model designed for reporting. A large part of this process includes making decisions about what data to include and to not include in the warehouse. Generally, if data isn’t used to answer specific questions or in a defined report, it may be excluded from the warehouse. This is usually done to simplify the data model and also to conserve space on expensive disk storage that is used to make the data warehouse performant.

In contrast, the data lake retains ALL data. Not just data that is in use today but data that may be used and even data that may never be used just because it MIGHT be used someday. Data is also kept for all time so that we can go back in time to any point to do analysis.

This approach becomes possible because the hardware for a data lake usually differs greatly from that used for a data warehouse. Commodity, off-the-shelf servers combined with cheap storage make scaling a data lake to terabytes and petabytes fairly economical.

**2. Data Lakes Support All Data Types**

Data warehouses generally consist of data extracted from transactional systems and consist of quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social network activity, text and images are largely ignored. New uses for these data types continue to be found but consuming and storing them can be expensive and difficult.

The data lake approach embraces these non-traditional data types. In the data lake, we keep all data regardless of source and structure. We keep it in its raw form and we only transform it when we’re ready to use it. This approach is known as “Schema on Read” vs. the “Schema on Write” approach used in the data warehouse.
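The difference between the two approaches can be sketched in a few lines. In the toy example below, the "lake" is just a list of raw JSON strings and the "warehouse" is a list that only accepts validated rows. The record contents and the function names `read_sales` and `write_sale` are made up for illustration.

```python
import json

# Data lake style: raw records are stored untouched, whatever their shape.
raw_store = [
    '{"user": "u1", "amount": "19.90", "ts": "2020-01-05"}',
    '{"user": "u2", "clicks": 7}',
    '{"user": "u1", "amount": "5.10", "ts": "2020-01-06", "note": "promo"}',
]


def read_sales(raw_records):
    """Schema on read: structure is imposed only when a question is asked."""
    for line in raw_records:
        record = json.loads(line)
        if "amount" in record:
            yield {"user": record["user"], "amount": float(record["amount"])}


def write_sale(table, record):
    """Schema on write: a record is validated and cast before it is stored."""
    table.append({"user": str(record["user"]), "amount": float(record["amount"])})


# Reading from the lake: the click record is simply skipped by this particular query.
sales = list(read_sales(raw_store))
total = sum(row["amount"] for row in sales)

# Writing to the warehouse: the same click record is rejected at load time.
warehouse_table = []
for line in raw_store:
    try:
        write_sale(warehouse_table, json.loads(line))
    except KeyError:
        pass  # does not fit the sales schema, so it never enters the warehouse

print(len(sales), round(total, 2), len(warehouse_table))
```

Both paths end with the same two sales rows, but only the lake still holds the click record for a later question nobody has asked yet.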

**3. Data Lakes Support All Users**

In most organizations, 80% or more of users are “operational”. They want to get their reports, see their key performance metrics or slice the same set of data in a spreadsheet every day. The data warehouse is usually ideal for these users because it is well structured, easy to use and understand and it is purpose-built to answer their questions.

The next 10% or so do more analysis on the data. They use the data warehouse as a source but often go back to source systems to get data that is not included in the warehouse and sometimes bring in data from outside the organization. Their favorite tool is the spreadsheet and they create new reports that are often distributed throughout the organization. The data warehouse is their go-to source for data but they often go beyond its bounds.

Finally, the last few percent of users do deep analysis. They may create totally new data sources based on research. They mash up many different types of data and come up with entirely new questions to be answered. These users may use the data warehouse but often ignore it as they are usually charged with going beyond its capabilities. These users include the Data Scientists and they may use advanced analytic tools and capabilities like statistical analysis and predictive modeling.

The data lake approach supports all of these users equally well. The data scientists can go to the lake and work with the very large and varied data sets they need while other users make use of more structured views of the data provided for their use.

**4. Data Lakes Adapt Easily to Changes**

One of the chief complaints about data warehouses is how long it takes to change them. Considerable time is spent up front during development getting the warehouse’s structure right. A good warehouse design can adapt to change but because of the complexity of the data loading process and the work done to make analysis and reporting easy, these changes will necessarily consume some developer resources and take some time.

Many business questions can’t wait for the data warehouse team to adapt their system to answer them. The ever increasing need for faster answers is what has given rise to the concept of self-service business intelligence.

In the data lake on the other hand, since all data is stored in its raw form and is always accessible to someone who needs to use it, users are empowered to go beyond the structure of the warehouse to explore data in novel ways and answer their questions at their pace.

If the result of an exploration is shown to be useful and there is a desire to repeat it, then a more formal schema can be applied to it and automation and reusability can be developed to help extend the results to a broader audience. If it is determined that the result is not useful, it can be discarded and no changes to the data structures have been made and no development resources have been consumed.
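The explore-first, formalize-later workflow described above is often called schema-on-read. The following is a minimal Python sketch of the idea, not a description of any particular data lake product. The raw records in `RAW_EVENTS`, their field names, and the `Order` schema are all hypothetical and exist only for illustration: an ad hoc exploration parses the raw records directly, and a typed schema is applied only once the result has proved worth repeating.

```python
import json
from dataclasses import dataclass

# Hypothetical raw records, kept exactly as they arrived (illustrative data only).
RAW_EVENTS = [
    '{"type": "order", "customer": "c1", "amount": "19.90"}',
    '{"type": "click", "customer": "c2", "page": "/home"}',
    '{"type": "order", "customer": "c2", "amount": "5.00", "coupon": "X"}',
]


def explore_order_total(raw_lines):
    """Ad hoc exploration: parse on read and ignore fields we do not need."""
    total = 0.0
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") == "order":
            total += float(record["amount"])
    return total


@dataclass(frozen=True)
class Order:
    """Formal schema, applied only after the exploration proved useful."""
    customer: str
    amount: float


def load_orders(raw_lines):
    """Repeatable load step: project raw records onto the Order schema."""
    orders = []
    for line in raw_lines:
        record = json.loads(line)
        if record.get("type") == "order":
            orders.append(Order(record["customer"], float(record["amount"])))
    return orders


print(explore_order_total(RAW_EVENTS))
print(load_orders(RAW_EVENTS))
```

If the exploration turns out not to be useful, `explore_order_total` is simply thrown away; the raw records were never altered.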

**5. Data Lakes Provide Faster Insights**

This last difference is really the result of the other four. Because data lakes contain all data and data types, and because they let users access data before it has been transformed, cleansed and structured, users can get to their results faster than with the traditional data warehouse approach.

However, this early access to the data comes at a price. The work typically done by the data warehouse development team may not be done for some or all of the data sources required to do an analysis. This leaves users in the driver’s seat to explore and use the data as they see fit, but the first tier of business users I described above may not want to do that work. They still just want their reports and KPIs.

I have done uniaxial testing on a biological tissue for 10% strain and have the data. Now I need to use the data to model the tissue in Abaqus. I believe the Fung anisotropic model suits the tissue best. I could not find any clear reference textbooks or sources on modelling from test results.

I was able to run an analysis in AMOS.

However, my data fit the model poorly, and given time constraints I doubt I would be able to double the number of responses (I only have 217).

What could I do? At the moment I have:

- CMIN/DF: 18.5

- CFI: 0.46

- RMSEA: 0.285

I tried going through the modification indices, but the only available modification, a covariance between two error terms, doesn't make sense and would only improve my model by 4.7.

Any suggestions?
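As a quick consistency check on the indices reported above, the short Python sketch below recomputes the RMSEA from the chi-square to degrees-of-freedom ratio and the sample size. It assumes the common point-estimate formula sqrt(max(chi2/df - 1, 0) / (N - 1)); some software divides by N instead, and the function name is only illustrative.

```python
import math


def rmsea_from_cmin_df(cmin_df, n_obs):
    """Point estimate of RMSEA from chi-square/df and the sample size.

    Assumes the common formula sqrt(max(chi2/df - 1, 0) / (N - 1)).
    """
    return math.sqrt(max(cmin_df - 1.0, 0.0) / (n_obs - 1))


# Values reported in the question above: CMIN/DF = 18.5 and N = 217.
print(round(rmsea_from_cmin_df(18.5, 217), 3))  # 0.285
```

The reported values agree with each other, so the poor RMSEA follows directly from the very large CMIN/DF. A ratio this far above 1 suggests the issue lies in the model specification rather than in a single missing error covariance.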

I’m working with panel data on Foreign Direct Investment, using FDI flows as the endogenous variable and, among others, the FDI stock in the previous year as one of the explanatory variables. If we use the lagged endogenous variable as an explanatory variable, we have a dynamic panel data model and should use a suitable estimator (say, Arellano-Bond). However, in my case I am not using the lagged endogenous variable (the flow, y_{t-1} - y_{t-2}) as a regressor, but the lagged stock of FDI (y_{t-1}). Should this case be considered a dynamic model too? Should it be estimated using Arellano-Bond or a similar estimator to avoid inconsistency, and is there any specific alternative for this type of specification?
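One way to look at this question is to write the specification out. The derivation below is a sketch under a simplifying assumption that the question does not state: that the flow equals the change in the stock, $F_{it} = S_{it} - S_{i,t-1}$. Here $S$ is the FDI stock (written $y$ in the question), $F$ is the flow, $x_{it}$ collects the other regressors, and $\mu_i$ is a country effect; all of these symbols are introduced only for illustration.

$$
\begin{aligned}
F_{it} &= \beta\, S_{i,t-1} + x_{it}'\gamma + \mu_i + \varepsilon_{it} \\
S_{it} - S_{i,t-1} &= \beta\, S_{i,t-1} + x_{it}'\gamma + \mu_i + \varepsilon_{it} \\
S_{it} &= (1+\beta)\, S_{i,t-1} + x_{it}'\gamma + \mu_i + \varepsilon_{it}
\end{aligned}
$$

Under that assumption the model is a dynamic panel in the stock, with autoregressive coefficient $1+\beta$, so $S_{i,t-1}$ is correlated with $\mu_i$ and with past errors just as a lagged dependent variable would be. In practice reported stocks also change through valuation effects, so the identity holds only approximately.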

I would like to incorporate semi-structured surveys, satellite tracking, and eBird records into a single species distribution model, while being able to control for potential limitations and biases of each sampling approach.

See these papers for background / theory of this approach:

Hello!

Does anybody know how to estimate variance components for GLMs in R?

It can be done easily for an ordinary linear model (e.g., using the VCA package), but I am not able to find any solution for GLMs.

I would be grateful for any advice or links; R code would be much appreciated.

Here is an example of data and model I have:

    # Simulated example: 200 rows, with Pred2 levels drawn from different incidence ranges
    N <- 200

    dummydata <- rbind(
      data.frame(
        incidence = sample(x = 0:5, size = N/2, replace = TRUE),
        size = 12,
        Pred1 = rep(c("X1", "X2", "X3", "X4"), each = 25),
        Pred2 = "T1"
      ),
      data.frame(
        incidence = sample(x = 6:10, size = N/2, replace = TRUE),
        size = 12,
        Pred1 = rep(c("X1", "X2", "X3", "X4"), each = 25),
        Pred2 = "T2"
      )
    )

    # Binomial GLM on (successes, failures) with interacting predictors
    mod <- glm(
      cbind(incidence, size - incidence) ~ Pred1 * Pred2,
      data = dummydata,
      family = binomial
    )

With best regards,

Igor

Thanks in advance!

If I have a unit root problem in both the dependent and the independent variables, which technique is recommended to get consistent estimates? (T is bigger than N.) Can I employ dynamic panel data models or first differences, and which of them is better? Also, if some independent variables are stationary, do I leave them in levels or use first differences?

Dear all,

I am trying to estimate a panel data model (fixed and random effects). The Durbin-Watson statistic is 0.2, far below 2.

I took the first difference of all variables in the model, but the Durbin-Watson statistic (D-W) is still below 2, at 0.38.

The variables are in log and percentage forms. Is this what causes the problem?

Can I solve this problem?

Can I remove the autocorrelation between the residuals?

Thanks a lot in advance.
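To make the statistic in this question concrete, here is a minimal Python sketch of how the Durbin-Watson statistic is computed from a residual series, together with the usual approximation DW ≈ 2(1 − ρ) for the implied first-order autocorrelation ρ. The residual values are made up for illustration, and the sketch treats a single time series; for a panel, the successive differences would have to be taken within each cross-sectional unit.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def implied_rho(dw):
    """Approximate first-order autocorrelation implied by DW ~ 2 * (1 - rho)."""
    return 1.0 - dw / 2.0


# Illustrative residuals (made up): slowly drifting, hence positively autocorrelated.
example_residuals = [1.0, 0.9, 0.8, 0.6, 0.3, -0.1, -0.4, -0.6, -0.8, -0.9]
print(durbin_watson(example_residuals))

# Values reported in the question above.
print(implied_rho(0.2))   # model in levels
print(implied_rho(0.38))  # model in first differences
```

Read this way, values of 0.2 and 0.38 both point to strong positive serial correlation in the residuals, which the log or percentage form of the variables would not by itself explain.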

I have a firm-level panel data set with N > 20,000 and T = 142. T is large because it is a monthly data set. I need to employ a dynamic panel data model, but I cannot use ARDL because it requires T > N, and I cannot use GMM because T is large. Does anyone have any idea which model to use? I also tried the CCEM in Stata, without success. Any advice would be appreciated.

How does the fusion of three different data models, such as relational (RDB), semantic and big data databases, take place, and how is it useful in real-time analysis?

I trained my model and obtained reasonable accuracy and loss values, but when I predict with the model, it gives the same probability for both classes.

Anyone looking for a discussion board and collaborator to mine data and model interactions in online discussions?

When building the data model in a blockchain architecture, how should individual data items be represented and saved inside the system, and how should the data as a whole be represented?

Good morning!

I am Ranjeet Kumar Mishra from India. I am working on thermochemical conversion of biomass and plastics into renewable fuels and chemicals. At present, I am working on kinetic analysis of biomass using the *Vyazovkin* model. I have TGA data but am not able to fit it to the *Vyazovkin* model. Please help me fit the TGA data to the model. Does it require any coding?

I am using the Arellano-Bond dynamic panel data GMM estimator with the 'xtdpd' command in Stata to determine the impact of health human capital on economic growth for 30 sub-Saharan African countries over a period of 20 years (1995 - 2014). In an alternative model, I lagged the health variable by 10 years to account for the delayed impact of health on growth. Could the coefficient be interpreted as a long-term effect in this case?

I have a dataset that was passed through a StandardScaler before being passed into a classification model for training. StandardScaler was also applied to the test data for model validation.

But the problem comes when I have to make real-world predictions with this model, since the data on which the prediction is to be made won't be scaled or standardized.

How do I go about using such a model to make predictions on real-world data?
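The usual approach is to keep the scaler that was fitted on the training data and pass every new observation through that same fitted scaler before prediction, rather than fitting a new one. The sketch below illustrates the principle in plain Python. `SimpleStandardScaler` is a hypothetical stand-in written for this example, not scikit-learn's class, and the data are made up. With scikit-learn, the fitted `StandardScaler` object plays this role through its `transform` method, and it can be saved alongside the model or combined with it in a `Pipeline`.

```python
import statistics


class SimpleStandardScaler:
    """Pure-Python stand-in for a standard scaler (illustration only)."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means = [statistics.fmean(col) for col in columns]
        # Population standard deviation; constant columns fall back to 1.0.
        self.stds = [statistics.pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows):
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Hypothetical training data: two features on very different scales.
train_rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaler = SimpleStandardScaler().fit(train_rows)  # fitted once, on training data only

# A new, unscaled real-world observation goes through the SAME fitted scaler
# before it is handed to the model's prediction step.
new_rows = [[2.5, 150.0]]
print(scaler.transform(new_rows))
```

The key design point is that the means and standard deviations come from the training set only; refitting on the incoming data would put it on a different scale from the one the model learned.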

I see that some manuscripts identify the break point in a panel by treating each cross-sectional unit individually, so that the general methods described by many researchers can be applied. But I think that each cross-sectional unit has an impact on the other variables. So the break point should be identified by considering the variables in the panel jointly.

I am having a problem with predictions from my model trained on ResNet50. I have 10 classes of Nepali digits (0 to 9). I trained the model for 100 epochs on around 40,000 samples. What is the issue with my model? Is it overfitting, or is the model I used simply too complex for 10 classes? I also tried predicting on the training set, but the prediction accuracy is very poor (around 10%).

I am interested in looking at the effect of land and wealth inequality (alongside a number of other explanatory variables) on the process of agricultural expansion in Latin America over 1990-2010. Data on land and wealth distribution are however sparse and available only for one point in time. Although there is some evidence that both land and wealth concentration have not varied significantly over 1990-2010, I was wondering if there would be a better way than panel data to analyse the data.

Thanks to everybody for your help.

This is my final-year research topic. I want to know how good the quality of the information that emerges from the OSM data model and from Ordnance Survey is for representing VGI.

What is the minimum acceptable R-squared, or the acceptable range of R-squared (0.XX), for a biological wastewater treatment model?

I am evaluating the performance of a lab-scale bioreactor in treating phenol-containing wastewater. While building a model of the performance I obtained, i.e., phenol removal efficiency versus the variables, I am not sure what range of R-squared is acceptable. Of course, the higher the R-squared, the better the model. However, each field has its own acceptable R-squared. So I would like to know the proper range and the minimum R-squared for a model of the biological wastewater treatment approach described above. Thank you.

Constraints: Data non-availability for other independent variables for older years.

I am searching for a suitable regression method for when the dependent variable is a fraction or proportion. I am using an unbalanced panel data set.