Conference Paper

Place, Big Data, and statistical methods

Authors:

Abstract

The ongoing digitalisation of our everyday lives is leading to a steady increase in so-called Big Data. These data make our lives increasingly traceable and (often unintentionally) publicly discernible. Much Big Data is spatial in nature, being generated through GPS, spatial language, or check-ins at places, which makes it accessible to spatial analysis. Unsurprisingly, a substantial number of scholarly articles have been published that make use of these novel data, providing a variety of interesting insights into aspects of everyday life that were previously difficult to access on a large scale. However, datasets such as tweets, geotagged photos, and check-ins differ from traditional, scientifically collected counterparts in that they come from platforms designed for mundane use. The main drivers of Big Data generation include communication with peers and friends, curation of digital alter egos, and enhancement of life comfort. The resulting, often unscientific nature of such spatial data has implications for supposedly rigorous quantitative analyses. Quantitative methods attempt to uncover structures in a nomothetic way. They assume that the information analysed is representative of joint underlying processes, which show uncertainties only due to varying contextual conditions or measurement error. However, Big Data is not collected via calibrated instruments, as would be the case with questionnaires or measuring devices. Rather, these data often reflect subjective mental geographies and thus information that is much closer to the human-geographical concept of intimate place than to the concept of abstract space.
Article
In this article, a new method called spatial amplifier filtering is proposed. The presented method is related to Moran eigenvector filtering and allows the accentuation of spatial structures in heterogeneous data sets. The spatial amplifier filtering technique is based on the inclusion of certain eigenvectors of a spatial weights matrix into a regression model. The application of this method can be seen as a pre-processing step prior to subsequent analyses, and to separate different types of spatially correlated components in a data set. For this purpose, three different types of so-called spatial amplifiers are proposed, each consisting of different subsets of eigenvectors of the weights matrix. These amplifiers can either emphasise the positive or negative spatial autocorrelation, or spatial structuring in general. In this way, it is possible to make desired spatial structures more visible, especially in spatially highly mixed data sets, whereby the focus here is on geosocial media data. In the empirical part of the article, it is first shown why georeferenced social media data are difficult to handle from a spatial analysis perspective, motivating the need for the method proposed. Subsequently, the technique of amplifier filtering is applied to two data sets: a census data set from Brazil and Twitter data from London. The results obtained show that the method is capable of strengthening existing spatial structures and mitigating potentially disturbing spatial randomness patterns and other nuisances. This facilitates the interpretation especially of the Twitter data used. While the analysis of the unfiltered Twitter data with established methods reveals little information about possible spatial structures in the tweets, the filtered data offer a much clearer picture with distinguishable clusters. In addition, the method also provides insights into the internal irregularity of spatial clusters and thus complements the toolbox for investigating spatial heterogeneity.
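The eigenvector idea underlying this abstract can be illustrated with a minimal sketch of ordinary Moran eigenvector selection. This is not the authors' spatial amplifier definition, which the abstract does not specify. The lattice size, the rook contiguity weights, the random test variable, and the helper names `rook_weights`, `moran_eigenvectors`, and `project_onto` are all assumptions made for illustration.

```python
import numpy as np

def rook_weights(rows, cols):
    """Binary rook-contiguity weights for a regular lattice (hypothetical test layout)."""
    n = rows * cols
    w = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                w[i, i + 1] = w[i + 1, i] = 1.0
            if r + 1 < rows:
                w[i, i + cols] = w[i + cols, i] = 1.0
    return w

def moran_eigenvectors(w):
    """Eigen-decomposition of the doubly centred weights matrix M W M."""
    n = w.shape[0]
    w_sym = (w + w.T) / 2.0                 # enforce symmetry
    m = np.eye(n) - np.ones((n, n)) / n     # centring projector
    eigvals, eigvecs = np.linalg.eigh(m @ w_sym @ m)
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]

def project_onto(y, basis):
    """Least-squares fit of the centred variable on the chosen eigenvectors."""
    yc = y - y.mean()
    coef, *_ = np.linalg.lstsq(basis, yc, rcond=None)
    return basis @ coef

rng = np.random.default_rng(1)
w = rook_weights(10, 10)
y = rng.normal(size=100)                    # assumed, spatially unstructured variable

eigvals, eigvecs = moran_eigenvectors(w)
pos = eigvecs[:, eigvals > 1e-8]            # patterns with positive autocorrelation
neg = eigvecs[:, eigvals < -1e-8]           # patterns with negative autocorrelation

smooth = project_onto(y, pos)               # positively autocorrelated component
rough = project_onto(y, neg)                # negatively autocorrelated component
```

Splitting the eigenvectors by the sign of their eigenvalue is the standard way to separate positively from negatively autocorrelated map patterns. Which subsets the proposed amplifiers actually use is described in the article itself.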
Conference Paper
Ambient user-generated geo-information like that from geosocial media is collected using liberal, unmoderated acquisition modes. This offers a high degree of freedom regarding content. However, the collected information is influenced by idiosyncratic spatial perceptions. The resulting datasets are thus heterogeneous and comprise different (often inseparable), spatially and temporally superimposed statistical populations. Traditional notions of stationarity, which are often required in spatial analysis, are therefore frequently violated, and conclusions about disclosed spatial structures might be misleading. This paper examines how the spatial superimposition of statistical populations influences the spatial autocorrelation estimator Moran’s I. The chosen approach yields insights beyond specific empirical datasets and offers full flexibility in parameterization. A synthetic point pattern is therefore constructed, which contains two overlapping, differently scaled sub-patterns. Normally distributed values drawn from populations with different means and variances are repeatedly assigned to these, and Moran’s I is calculated for 20,000 overall configurations. Each parameter value of one population is thereby set to a multiple of the corresponding parameter value of the other population. The results show strong influences of discrepancies in statistical parameter values of co-located populations on the characterization of spatial patterns. While differences in mean values change the magnitude of Moran’s I, differences in variances increase the range of the measure. The scale associated with the dominant population further influences the magnitude of Moran’s I. These results suggest that the spatial analysis of ambient user-generated geo-information from unmoderated acquisition modes may require the consideration of different superimposed statistical populations to ensure meaningful results.
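The simulation design described in this abstract can be approximated with a short sketch. It is a simplified stand-in and not the authors' experimental setup. The point counts, the two window sizes, the distance band radius, the baseline mean and standard deviation, and the 200 runs per configuration (far fewer than the 20,000 configurations reported) are all assumed values.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for a value vector and an (n, n) spatial weights matrix."""
    z = values - values.mean()
    return (len(values) / weights.sum()) * (z @ weights @ z) / (z @ z)

def distance_band_weights(coords, radius):
    """Binary weights: 1 for pairs closer than `radius`, no self-neighbours."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = (d < radius).astype(float)
    np.fill_diagonal(w, 0.0)
    return w

rng = np.random.default_rng(0)
n_a, n_b = 150, 150                                   # assumed sub-pattern sizes
coords_a = rng.uniform(0.0, 1.0, size=(n_a, 2))       # wide sub-pattern
coords_b = rng.uniform(0.35, 0.65, size=(n_b, 2))     # smaller, overlapping sub-pattern
w = distance_band_weights(np.vstack([coords_a, coords_b]), radius=0.1)

def simulate(mean_ratio, sd_ratio, runs=200, mean_a=10.0, sd_a=1.0):
    """Moran's I over repeated value assignments.

    Population B's mean and standard deviation are multiples of population A's.
    """
    results = np.empty(runs)
    for k in range(runs):
        a = rng.normal(mean_a, sd_a, size=n_a)
        b = rng.normal(mean_a * mean_ratio, sd_a * sd_ratio, size=n_b)
        results[k] = morans_i(np.concatenate([a, b]), w)
    return results

equal = simulate(mean_ratio=1.0, sd_ratio=1.0)
shifted = simulate(mean_ratio=3.0, sd_ratio=1.0)
spread = simulate(mean_ratio=1.0, sd_ratio=3.0)
for name, res in [("equal", equal), ("shifted mean", shifted), ("inflated sd", spread)]:
    print(f"{name}: mean I = {res.mean():.3f}, range = {res.max() - res.min():.3f}")
```

Varying `mean_ratio` and `sd_ratio` over a grid of multiples, rather than the three configurations shown, would come closer to the parameter sweep the abstract describes.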
Article
Twitter and related social media feeds have become valuable data sources for many fields of research. Numerous researchers have used social media posts for spatial analysis, since many of them contain explicit geographic locations. However, despite their widespread use within applied research, a thorough understanding of the underlying spatial characteristics of these data is still lacking. In this paper, we investigate how topological outliers influence the outcomes of spatial analyses of social media data. These outliers appear when different users contribute heterogeneous information about different phenomena simultaneously from similar locations. As a consequence, various messages representing different spatial phenomena are captured close to each other, and are at risk of being falsely related in a spatial analysis. Our results reveal indications of corresponding spurious effects when analyzing Twitter data. Further, we show how the outliers distort the range of outcomes of spatial analysis methods. This has significant influence on the power of spatial inferential techniques, and, more generally, on the validity and interpretability of spatial analysis results. We further investigate in detail how the issues caused by topological outliers are composed. We show that multiple disturbing effects act simultaneously and that these are related to the geographic scales of the involved overlapping patterns. Our results show that at some scale configurations, the disturbances added through overlap are more severe than at others. Further, their behavior turns into a volatile and almost chaotic fluctuation when the scales of the involved patterns become too different. Overall, our results highlight the critical importance of thoroughly considering the specific characteristics of social media data when analyzing them spatially.
Article
Georeferenced user-generated datasets like those extracted from Twitter are increasingly gaining the interest of spatial analysts. Such datasets often reflect a wide array of real-world phenomena. However, each of these phenomena takes place at a certain spatial scale. Therefore, user-generated datasets are of a multi-scale nature. Such datasets cannot be properly dealt with using the most common analysis methods, because these are typically designed for single-scale datasets where all observations are expected to reflect one single phenomenon (e.g., crime incidents). In this paper, we focus on the popular local G statistics. We propose a modified, scale-sensitive version of a local G statistic. Furthermore, our approach comprises an alternative neighborhood definition that enables the extraction of certain scales of interest. We compared our method with the original one on a real-world Twitter dataset. Our experiments show that our approach is better able to detect spatial autocorrelation at specific scales than the original method. Based on the findings of our research, we identified a number of scale-related issues that our approach is able to overcome. Thus, we demonstrate the multi-scale suitability of the proposed solution.
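For orientation, the original statistic that this abstract builds on can be sketched as follows. The code shows the standard Getis-Ord local G* with a binary distance band. It is not the modified scale-sensitive version or the alternative neighborhood definition proposed in the paper, neither of which the abstract specifies. The lattice, the hot-spot values, and the radius are assumed example inputs.

```python
import numpy as np

def local_g_star(values, coords, radius):
    """Standard Getis-Ord G*_i z-scores using a binary distance band.

    Each location counts as its own neighbour (distance 0 <= radius).
    """
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = (d <= radius).astype(float)
    mean = values.mean()
    s = np.sqrt((values ** 2).mean() - mean ** 2)      # global standard deviation
    w_sum = w.sum(axis=1)
    w_sq_sum = (w ** 2).sum(axis=1)
    numerator = w @ values - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Assumed toy data: a 10 x 10 lattice with a 3 x 3 block of high values in one corner.
coords = np.array([(r, c) for r in range(10) for c in range(10)], dtype=float)
values = np.zeros(100)
hot = (coords[:, 0] < 3) & (coords[:, 1] < 3)
values[hot] = 10.0

g = local_g_star(values, coords, radius=1.0)
```

The `radius` argument fixes the single scale at which neighbours are pooled, which is the limitation for multi-scale data that the abstract points to.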