Conference Paper

SparkTrails: A MapReduce Implementation of HypTrails for Comparing Hypotheses About Human Trails

Abstract

HypTrails is a Bayesian approach for comparing different hypotheses about human trails on the web. While a standard implementation exists, it exhibits performance issues when working with large-scale data. In this paper, we propose a distributed implementation of HypTrails based on Apache Spark that takes advantage of several structural properties inherent to HypTrails. The performance improves substantially. Our implementation is publicly available.
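The abstract does not say which structural properties are exploited. One property that follows directly from the Markov chain model with Dirichlet priors is that the log evidence is a sum of independent per-source-state terms, which fits a map/reduce pipeline. The following is a minimal single-machine sketch of that idea, not the authors' Spark code; the function names, the dictionary-based prior layout, and the toy trails are all hypothetical.

```python
from collections import defaultdict
from functools import reduce
from math import lgamma
from operator import add


def transitions(trail):
    """Map step: emit ((source, target), 1) for each consecutive pair of states."""
    return [((s, t), 1) for s, t in zip(trail, trail[1:])]


def count_by_row(pairs):
    """Shuffle/reduce step: sum transition counts, grouped by source state."""
    rows = defaultdict(lambda: defaultdict(int))
    for (s, t), n in pairs:
        rows[s][t] += n
    return rows


def row_log_evidence(counts, alphas):
    """Log evidence of one source state's outgoing transitions.

    counts: dict target -> observed count
    alphas: dict target -> Dirichlet pseudo-count (> 0); assumed to cover
            every target that appears in counts.
    """
    a_sum = sum(alphas.values())
    n_sum = sum(counts.values())
    return (lgamma(a_sum) - lgamma(a_sum + n_sum)
            + sum(lgamma(counts.get(t, 0) + a) - lgamma(a)
                  for t, a in alphas.items()))


def log_evidence(trails, prior):
    """Total log evidence as a sum of independent per-row terms."""
    pairs = [p for trail in trails for p in transitions(trail)]  # flat map
    rows = count_by_row(pairs)                                   # reduce by key
    per_row = map(lambda s: row_log_evidence(rows[s], prior[s]), rows)
    return reduce(add, per_row, 0.0)                             # final reduce


# Toy example with a uniform prior over two states (hypothetical data).
uniform = {"a": {"a": 1.0, "b": 1.0}, "b": {"a": 1.0, "b": 1.0}}
print(log_evidence([["a", "b", "a"]], uniform))
```

Because rows without observed transitions contribute zero, only source states that occur in the data need to be processed. In a Spark setting the same steps would plausibly map onto RDD operations such as `flatMap`, `reduceByKey`, `map`, and `reduce`, but that mapping is an assumption here.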

... The basic assumption is that individuals or groups might have a regular activity area, which indicates the inner similarity of social and geographic closeness (Cranshaw et al. 2010). Becker et al. (2016) introduce a Bayesian approach for comparing hypotheses about human trails on the web. Piatkowski et al. (2013) present a graphical model designed for efficient probabilistic modeling of spatiotemporal data, which can keep the accuracy as well as efficiency. ...
Article
Collective social media provides a vast amount of geo-tagged social posts, which contain various records on spatio-temporal behavior. Modeling spatio-temporal behavior on collective social media is an important task for applications like tourism recommendation, location prediction and urban planning. Properly accomplishing this task requires a model that allows for diverse behavioral patterns on each of the three aspects: spatial location, time, and text. In this paper, we address the following question: how to find representative subgroups of social posts, for which the spatio-temporal behavioral patterns are substantially different from the behavioral patterns in the whole dataset? Selection and evaluation are the two challenging problems for finding the exceptional subgroups. To address these problems, we propose BNPM, a Bayesian non-parametric model, to model spatio-temporal behavior and infer the exceptionality of social posts in subgroups. By training BNPM on a large number of randomly sampled subgroups, we can get the global distribution of behavioral patterns. For each given subgroup of social posts, its posterior distribution can be inferred by BNPM. By comparing the posterior distribution with the global distribution, we can quantify the exceptionality of each given subgroup. The exceptionality scores are used to guide the search process within the exceptional model mining framework to automatically discover the exceptional subgroups. Various experiments are conducted to evaluate the effectiveness and efficiency of our method. On four real-world datasets our method discovers subgroups coinciding with events, subgroups distinguishing professionals from tourists, and subgroups whose consistent exceptionality can only be truly appreciated by combining exceptional spatio-temporal and exceptional textual behavior.
... Furthermore, the calculation (of Bayesian evidence) could easily be distributed onto several computational units, cf. (Becker et al. 2016). ...
Article
Understanding edge formation represents a key question in network analysis. Various approaches have been postulated across disciplines ranging from network growth models to statistical (regression) methods. In this work, we extend this existing arsenal of methods with JANUS, a hypothesis-driven Bayesian approach that allows one to intuitively compare hypotheses about edge formation in multigraphs. We model the multiplicity of edges using a simple categorical model and propose to express hypotheses as priors encoding our belief about parameters. Using Bayesian model comparison techniques, we compare the relative plausibility of hypotheses which might be motivated by previous theories about edge formation based on popularity or similarity. We demonstrate the utility of our approach on synthetic and empirical data. JANUS is relevant for researchers interested in studying mechanisms explaining edge formation in networks from both empirical and methodological perspectives.
... While the method from Sect. 3.4 has converged quickly ( 50 iterations) so that we were able to calculate our results on regular consumer hardware in a few hours, parallelizations along the lines of Becker et al. (2016) may be useful for larger datasets. We have also experimented with other methods for approximating the marginal likelihood such as Chib (1995), but have found irregularities in the convergence behavior. ...
Article
Sequential traces of user data are frequently observed online and offline, e.g., as sequences of visited websites or as sequences of locations captured by GPS. However, understanding factors explaining the production of sequence data is a challenging task, especially since the data generation is often not homogeneous. For example, navigation behavior might change in different phases of a website visit, or movement behavior may vary between groups of users. In this work, we tackle this task and propose MixedTrails, a Bayesian approach for comparing the plausibility of hypotheses regarding the generative processes of heterogeneous sequence data. Each hypothesis represents a belief about transition probabilities between a set of states that can vary between groups of observed transitions. For example, when trying to understand human movement in a city, a hypothesis assuming tourists to be more likely to move towards points of interest than locals can be shown to be more plausible with observed data than a hypothesis assuming the opposite. Our approach incorporates these beliefs as Bayesian priors in a generative mixed transition Markov chain model, and compares their plausibility utilizing Bayes factors. We discuss analytical and approximate inference methods for calculating the marginal likelihoods for Bayes factors, give guidance on interpreting the results, and illustrate our approach with several experiments on synthetic and empirical data from Wikipedia and Flickr. Thus, this work enables a novel kind of analysis for studying sequential data in many application areas.
Conference Paper
The analysis of sequential patterns is a prominent research topic. In this paper, we provide a formalization of a graph-based approach, such that a directed weighted graph/network can be extended using a sequential state transformation function, that “interprets” the network in order to model state transition matrices. We exemplify the approach for deriving such interpretations, in order to assess these and according hypotheses in an industrial application context. Specifically, we present and discuss results of applying the proposed approach for topology and anomaly analytics in a large-scale real-world sensor-network.
Article
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful, for example, for improving underlying network structures, predicting user clicks, or enhancing recommendations. In this work, we present a method called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our method utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to calculate the evidence of the data under them. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method, and to compare the relative plausibility of hypotheses, we employ Bayes factors. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including Web site navigation, business reviews, and online music played. Our work expands the repertoire of methods available for studying human trails.
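The comparison step described in this abstract can be illustrated with a small sketch. For a first-order Markov chain with one Dirichlet prior per row of the transition matrix, the evidence has the standard closed form of a Dirichlet-multinomial marginal likelihood, and the log Bayes factor is the difference of two log evidences. The code below assumes the priors have already been elicited; the pseudo-counts and transition counts are made up for illustration, and the (trial) roulette elicitation itself is not shown.

```python
from math import lgamma


def row_log_evidence(counts, alphas):
    """Log marginal likelihood of one row of transition counts
    under a Dirichlet prior with pseudo-counts alphas (all > 0)."""
    return (lgamma(sum(alphas))
            - sum(lgamma(a) for a in alphas)
            + sum(lgamma(n + a) for n, a in zip(counts, alphas))
            - lgamma(sum(counts) + sum(alphas)))


def log_evidence(count_matrix, alpha_matrix):
    """Rows are independent given the prior, so log evidences add up."""
    return sum(row_log_evidence(c, a)
               for c, a in zip(count_matrix, alpha_matrix))


def log_bayes_factor(count_matrix, alpha_h1, alpha_h2):
    """Positive values favour hypothesis 1, negative values hypothesis 2."""
    return (log_evidence(count_matrix, alpha_h1)
            - log_evidence(count_matrix, alpha_h2))


# Hypothetical two-state example: the data mostly stay in the same state.
counts = [[8, 2],
          [1, 9]]
stay = [[5.0, 1.0],    # hypothesis 1: users tend to stay in their state
        [1.0, 5.0]]
switch = [[1.0, 5.0],  # hypothesis 2: users tend to switch states
          [5.0, 1.0]]

print(log_bayes_factor(counts, stay, switch) > 0)  # True: "stay" is more plausible
```

Working in log space with `lgamma` avoids overflow of the Gamma function for large counts, which matters once the trails contain more than a few hundred transitions per state.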
Conference Paper
Understanding human movement trajectories represents an important problem that has implications for a range of societal challenges such as city planning and evolution, public transport or crime. In this paper, we focus on geo-temporal photo trails from four different cities (Berlin, London, Los Angeles, New York) derived from Flickr that are produced by humans when taking sequences of photos in urban areas. We apply a Bayesian approach called HypTrails to assess different explanations of how the trails are produced. Our results suggest that there are common processes underlying the photo trails observed across the studied cities. Furthermore, information extracted from social media, in the form of concepts and usage statistics from Wikipedia, allows for constructing explanations for human movement trajectories.
Article
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful, e.g., for improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors to the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.