Since 2013, the Italian National Institute of Statistics (Istat) has been investigating the potential of Big Data sources for Official Statistics. Among such sources, Internet data originating from website content has been considered one of the most important for producing information about enterprises. In 2018, Istat started producing experimental statistics on the activities that enterprises carry out through their websites (web ordering, job vacancy advertisement, links to social media, etc.). They are a subset of the statistics currently produced by the “Survey on ICT usage and e-Commerce in Enterprises” and are computed from the contents of enterprise websites, acquired by web scraping tools and processed with text mining techniques. A machine learning approach is adopted to estimate models on the subset of enterprises for which both the survey and the web sources are available, with survey data serving as the training set for the machine learning task. The content scraped from successfully reached websites is then used as input to predict the target values by applying the fitted model. The experimental statistics are obtained using two different estimators: (i) a fully model-based estimator; (ii) an estimator that combines model-based and survey-based estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases not significantly different (i.e. the model and combined estimates lie within the confidence intervals of the survey estimates). Simulations have demonstrated that the Mean Square Errors of these new estimates are competitive with those produced in the traditional way.
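The relationship between the three sets of estimates can be illustrated with a small sketch. It is not Istat's actual procedure: the `Enterprise` record, the toy data, the unweighted proportions, the rule used to combine predictions with survey answers, and the normal-approximation confidence interval under simple random sampling are all assumptions made for illustration. Predictions are taken as given, standing in for the output of a classifier trained on the enterprises that have both survey and web data.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional, Tuple


@dataclass
class Enterprise:
    # Hypothetical record: 1/0 answer from the survey (None if not surveyed)
    survey_value: Optional[int]
    # Hypothetical 1/0 prediction from scraped website text (None if site not reached)
    predicted: Optional[int]


def survey_estimate(units: List[Enterprise]) -> Tuple[float, float, float]:
    """Survey proportion with an approximate 95% confidence interval.

    Assumes simple random sampling and no weights, which is a simplification.
    """
    values = [u.survey_value for u in units if u.survey_value is not None]
    n = len(values)
    p = sum(values) / n
    half_width = 1.96 * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def model_estimate(units: List[Enterprise]) -> float:
    """Proportion based only on model predictions for reached websites."""
    preds = [u.predicted for u in units if u.predicted is not None]
    return sum(preds) / len(preds)


def combined_estimate(units: List[Enterprise]) -> float:
    """Assumed combination rule: use the prediction where the website was
    reached, otherwise fall back to the survey answer."""
    values = [
        u.predicted if u.predicted is not None else u.survey_value
        for u in units
        if u.predicted is not None or u.survey_value is not None
    ]
    return sum(values) / len(values)


# Toy data (invented): six surveyed enterprises, eight with a reached website.
units = [
    Enterprise(1, 1), Enterprise(1, 1), Enterprise(0, 0), Enterprise(0, 1),
    Enterprise(1, None), Enterprise(0, None),
    Enterprise(None, 1), Enterprise(None, 0), Enterprise(None, 0), Enterprise(None, 1),
]

survey_p, ci_low, ci_high = survey_estimate(units)
model_p = model_estimate(units)
combined_p = combined_estimate(units)

# The comparison described in the abstract: do the new estimates fall
# inside the confidence interval of the survey estimate?
model_in_ci = ci_low <= model_p <= ci_high
combined_in_ci = ci_low <= combined_p <= ci_high
print(survey_p, (ci_low, ci_high), model_p, combined_p, model_in_ci, combined_in_ci)
```

In a production setting the estimates would be weighted and computed by domain, and the interval would reflect the actual sampling design; the sketch only shows how the model-based and combined values are checked against the survey interval.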