Article

Abstract

This paper describes a hierarchical, three-level modelling framework for monitoring social media. The first-level models capture immediate social reality: they represent the various virtual communities at social media sites and adhere to the social world models of the sites, i.e., the "site ontologies". The second-level model is a temporal multirelational graph that captures the static and dynamic properties of the first-level models from the perspective of the monitoring site. The third-level model is a temporal relational database scheme that represents the temporal multirelational graph within the database. The models are specified and instantiated at the monitoring site. An important contribution of the paper is the description of the mappings between the modelling levels and their schematic algorithmic implementation within the monitoring site. The paper also describes theoretical limits on the accuracy and timeliness of the monitoring activity, assuming that the monitoring is performed remotely over the internet.
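As a concrete illustration of the second- and third-level models, the sketch below (Python with the standard-library sqlite3) uses interval-stamped edges in place of the temporal multirelational graph and a single relational table in place of the temporal database scheme. The table layout, relation names and dates are illustrative assumptions, not the paper's actual schema.

```python
# Minimal sketch: interval-stamped edges as a stand-in for the second-level
# temporal multirelational graph, persisted in one relational table as a
# stand-in for the third-level temporal database scheme. Illustrative only.
import sqlite3

edges = [
    # (source, relation, target, valid_from, valid_to); one row per observed fact
    ("user:alice", "friend_of", "user:bob",    "2013-09-01", "9999-12-31"),
    ("user:bob",   "member_of", "group:blogs", "2013-09-05", "2013-11-20"),
]

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE edge (
    src TEXT, rel TEXT, dst TEXT,
    valid_from TEXT, valid_to TEXT)""")   # "9999-12-31" marks a still-valid fact
con.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?)", edges)

# Snapshot query: reconstruct the multirelational graph as of a given date.
day = "2013-10-01"
snapshot = con.execute(
    "SELECT src, rel, dst FROM edge WHERE valid_from <= ? AND ? < valid_to",
    (day, day)).fetchall()
print(snapshot)   # both edges are valid on 2013-10-01
```

A query of this form is what lets the monitoring site reconstruct the state of a monitored community at any past instant.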


... In this case user profiles are modeled as nodes in a graph, and relations between the users on a site are modeled as edges. As was discussed in [1], there are three levels of modeling involved in social media analysis. ...
... For example, Twitter has a simple model with a few implemented concepts (tweet, retweet, follow, picture, video, links), whereas Facebook has a much richer set of implemented concepts (friend, timeline, status, group, video, picture, etc.). These site ontologies, as they were called in [1], can again be modeled as graphs. Thus, graphs modeling friendship relations recorded on social media sites, together with further relations such as interactions between the virtual identities of individuals, may need to have over 1 billion nodes. ...
... The term graph is used to refer to formal models of the networks in the immediate reality, or of these networks as they are recorded on some social media sites, i.e., second-level models in the parlance of [1]. Often, graphs modeling real-world networks contain groups of densely connected nodes with fewer connections between the groups. ...
... For data collection, we used a list of ca. 100 prominent political bloggers as a starting point and collected the list of their friends and friends-of-friends. The data is stored in the SOMEA repository in a special graph form [3] and is retrieved from there for analysis. The paper is structured as follows. ...
... videos, text messages, photographs) that the members generate. In [3], we argued that virtual communities are modeled by social media sites in a way that is typical of each of them. These special models are called site ontologies. ...
... If we want to analyze social media sites outside of the sites themselves, we have to capture the concepts of the site ontologies in total or in part and populate the ontologies with the concept instances at a remote site. These form the second-level models in the parlance of [3]. The argument is that a complicated-enough graph, a temporal multidigraph, can represent all second-level models. ...
Conference Paper
Full-text available
In this article, we describe an analysis of a social network data set that was collected from the LiveJournal (LJ) site during autumn 2013. Initially, we collected 114 politically active LJ user profiles and, using graph search, their friends and friends-of-friends, i.e., those users who are two hops away from those on the original list. A graph was formed from the data in which a node exists for each collected profile, and for each "friend" and "friend-of-friend" relationship an arc connects the corresponding nodes. The graph features ca. 1.6M nodes and 52 million arcs. The desired result is the set of clusters in the graph and the connections between these clusters. Further analysis classified the in- and out-degrees and their effect on the clusters.
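The two-hop collection can be sketched as a bounded breadth-first crawl. The fetch_friends callable below is a hypothetical stand-in for the real LiveJournal profile fetch; the rest is a toy rendering of the procedure described above.

```python
# Sketch of a "friends and friends-of-friends" crawl: expand seeds two hops
# out, recording one arc per observed friendship. fetch_friends is assumed.
from collections import deque

def crawl_two_hops(seeds, fetch_friends):
    arcs, seen = [], set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        profile, depth = frontier.popleft()
        if depth == 2:                        # friends-of-friends are not expanded
            continue
        for friend in fetch_friends(profile):
            arcs.append((profile, friend))    # arc for each "friend" relationship
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return arcs

# Toy usage with a fixed friend map instead of live crawling:
friends = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(crawl_two_hops(["a"], lambda p: friends.get(p, [])))
```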
... In 2005, 150 exabytes of digital information were produced; in 2010 the figure was already 1,200 exabytes [21]. Nearly 70% of this content is user generated [46]. One of the enablers of the rapid growth of user-generated content is social media sites. ...
Article
Full-text available
Multipartite entity resolution aims at integrating records from multiple datasets into one entity. We derive a mathematical formulation for a general class of record linkage problems in multipartite entity resolution across many datasets as a combinatorial optimization problem known as the multidimensional assignment problem (MAP). As a motivation for our approach, we illustrate the advantage of multipartite entity resolution over sequential bipartite matching. Because the optimization problem is NP-hard, we apply two heuristic procedures, a Greedy algorithm and very large-scale neighborhood search, to solve the assignment problem and find the most likely matching of records from multiple datasets into a single entity. We evaluate and compare the performance of these algorithms and their modifications on synthetically generated data. We perform computational experiments to compare the performance of a recent heuristic, the very large-scale neighborhood search, with a Greedy algorithm, another heuristic for the MAP, as well as with two versions of a genetic algorithm, a general metaheuristic. Importantly, we perform experiments to compare two alternative methods of re-starting the search for the former heuristic, specifically a random-sampling multi-start and a deterministic design-based multi-start. We find evidence that design-based multi-start can be more efficient as the size of the databases grows large. In addition, we show that very large-scale search, especially its multi-start version, outperforms the simple Greedy heuristic. Hybridization of Greedy search with very large-scale neighborhood search improves the performance. Using multi-start with as few as three additional runs of very large-scale search offers some improvement in the performance of the very large-scale search procedure. Lastly, we propose an approach to evaluating the complexity of the very large-scale neighborhood search.
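To make the Greedy baseline concrete, here is a deliberately small sketch of greedy multidimensional assignment: candidate record tuples are ordered by a pairwise-distance cost, and the cheapest tuple whose records are all still unused is taken next. This illustrates the idea only; the exhaustive tuple enumeration would not scale to real instances, and it is not the authors' implementation.

```python
# Greedy heuristic sketch for the multidimensional assignment problem (MAP):
# merge one record from each dataset into an entity, cheapest tuples first.
from itertools import product

def greedy_map(datasets, dist):
    k = len(datasets)
    tuples = sorted(
        product(*[range(len(d)) for d in datasets]),
        key=lambda t: sum(dist(datasets[i][t[i]], datasets[j][t[j]])
                          for i in range(k) for j in range(i + 1, k)))
    used = [set() for _ in datasets]
    entities = []
    for t in tuples:                  # take cheapest tuple with all records free
        if all(t[i] not in used[i] for i in range(k)):
            entities.append(t)
            for i, r in enumerate(t):
                used[i].add(r)
    return entities

# Toy usage: three "datasets" of 1-D records, distance = absolute difference.
ds = [[1.0, 5.0], [1.1, 5.2], [0.9, 4.8]]
print(greedy_map(ds, lambda a, b: abs(a - b)))   # -> [(0, 0, 0), (1, 1, 1)]
```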
... At the first stage, they identified the list of relevant communities using a list of predefined keywords. Based on the created list of 14,777 communities, a dataset of 19,430,445 wall posts and 62,193,711 comments was collected using the social media monitoring software presented in Semenov and Veijalainen [105] and Semenov et al. [106]. To classify texts into positive and negative classes, the authors applied a rule-based approach with a vocabulary of 8,863 positive and 24,299 negative words in both Russian and Ukrainian. ...
Article
Full-text available
Sentiment analysis has become a powerful tool in processing and analysing expressed opinions on a large scale. While the application of sentiment analysis to English-language content has been widely examined, applications to the Russian language remain less well studied. In this survey, we comprehensively reviewed the applications of sentiment analysis of Russian-language content and identified current challenges and future research directions. In contrast with previous surveys, we targeted the applications of sentiment analysis rather than existing sentiment analysis approaches and their classification quality. We synthesised and systematically characterised existing applied sentiment analysis studies by their source of analysed data, purpose, employed sentiment analysis approach, and primary outcomes and limitations. We presented a research agenda to improve the quality of applied sentiment analysis studies and to expand the existing research base in new directions. Additionally, to help scholars select an appropriate training dataset, we performed an additional literature review and identified publicly available sentiment datasets of Russian-language texts.
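The rule-based classification mentioned in the citing context above reduces, at its core, to lexicon matching. A minimal sketch, assuming toy word lists in place of the cited 8,863-positive/24,299-negative vocabularies:

```python
# Lexicon-based polarity sketch: count positive and negative word hits.
POSITIVE = {"good", "great", "happy"}     # stand-ins for the real vocabularies
NEGATIVE = {"bad", "terrible", "sad"}

def classify(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("great day but terrible traffic"))   # -> neutral
```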
... The inclusion of social media in particular in monitoring activities, which are performed remotely over the internet, is nowadays considered an inevitable trend (Heath and Palenchar, 2009). Semenov and Veijalainen (2013) have recently developed a modelling framework as a basis for social media monitoring to capture longitudinal online network development. At the same time, several researchers have demonstrated the heavy impact of emotions on discourse concerning issues and crises (e.g., Bronstein, 2013; Liu and Kim, 2011; Stieglitz and Dang-Xuan, 2013). ...
Article
This research explores the relation between a crisis and public discussion on related issues. In organisational crisis communication a single-issue strategy is often proposed. Such a strategy, however, may not be adequate in complex crises where the crisis lifecycle is likely to encompass shorter lifecycles of issues that generate attention. Decomposing the online crisis debate into a pattern of issues supports understanding of public perceptions, and hence of crisis response and communication. This is investigated through an analysis of Facebook posts prompted by the loss of Malaysia Airlines flight MH370 in 2014. The analysis shows that during the crisis a variety of related issues arose that became topics of public debate. Compassion for victims dominated in the early stages of the crisis, while later on reputation-related issues took over. The insights gained help in understanding the results of social media monitoring during complex organisational crises and facilitate organisational decision making.
... The website, with its ontology being similar to that of other forum-hosting platforms [59], was crawled in 2015 [58]. The users' personal profile data were not saved and the usernames were anonymized. ...
Article
Full-text available
This paper addresses the challenge of strategically maximizing the influence spread in a social network, by exploiting cascade propagators termed “seeds”. It introduces the Seed Activation Scheduling Problem (SASP) that chooses the timing of seed activation under a given budget, over a given time horizon, in the presence/absence of competition. The SASP is framed as a blogger-centric marketing problem on a two-level network, where the decisions are made to buy sponsored posts from prominent bloggers at calculated points in time. A Bayesian evidence diffusion model – the Partial Parallel Cascade (PPC) model – allows the network nodes to be partially activated, proportional to their accumulated evidence levels. The SASP under the PPC model is proven NP-hard. A mixed-integer program is presented for the SASP, along with an efficient column generation heuristic. The paper sets up its problem instances in real-world settings, taking web-based marketing as an application example. Favorable optimality gaps are achieved for SASP solutions on networks based on observed user interactions in pro-health discussion forums. The presented analyses highlight a trade-off between early and late seed activation in igniting and maintaining influence cascades over time. The results reveal the importance of early seeds for campaigns that favor longevity, e.g., in service industry, and the importance of late seeds for campaigns with deadline(s), e.g., in political competitions.
... But it is necessary to consider the category of the data while performing data collection. Usually there are three categories of digitally encoded data: public (accessible to everyone), semipublic (accessible to a small group of individuals), and private (accessible only to the owner) [2]. In the case of semipublic and private data, authentication and authorization are needed for access. ...
Article
The WWW has become a widely used platform for social networks and social media sites and, with that, the oasis of a huge amount of data. This data repository therefore draws tremendous attention from corporations, governments, NGOs, social workers, politicians, etc., who want either to promote their products or to convey their message to a targeted community. But identifying community structure in the social graph has become a challenging issue for social network and graph theory researchers, given the pervasive usage of instant messaging systems and the fundamental shift in how content is published on these social media sites. Although researchers have paid considerable attention to introducing algorithms for identifying community structure, most of them are not suitable for dealing with large-scale social network data in real time. This paper presents a model for community detection from a social graph using real-time data analytics. We introduce data analytic algorithms that can analyze contextual data; these algorithms can analyze large-scale social interaction data and detect communities based on a user-supplied threshold value. Experimental results show that the proposed algorithms can identify the expected number of meaningful communities from the social graph.
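As a rough sketch of threshold-driven community detection of the kind described above (not the paper's actual algorithms), one can keep only the interactions whose weight clears the user-supplied threshold and read communities off the connected components of what remains:

```python
# Threshold-then-components sketch: union-find over sufficiently strong ties.
def communities(weighted_edges, threshold):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x
    for u, v, w in weighted_edges:
        if w >= threshold:
            parent[find(u)] = find(v)        # union endpoints of a strong tie
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

edges = [("a", "b", 5), ("b", "c", 4), ("c", "d", 1), ("d", "e", 6)]
print(communities(edges, threshold=3))       # -> [{'a','b','c'}, {'d','e'}]
```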
Book
Full-text available
Open access link http://urn.fi/URN:ISBN:978-951-39-7147-2 (115 pages, with contributions by Irna van der Molen and Markus Mykkänen). Keywords: continuity management, corporate communication, crisis communication, disasters, emergencies, issue arenas, issues management, monitoring, organisational resilience, social media. This book is characterised by a broad approach towards corporate communication, emphasising change and crisis. The focus is not on crises as an exceptional situation but rather on broader volatility in the environment. The purpose of this book is to increase the understanding of multi-stakeholder communication concerning organisational issues and crises. From the perspective of organisational management, this book clarifies how communication contributes to organisational resilience—the ability to adapt to a changing environment and mitigate emergency crises. In today's world, change is not the exception but a constant presence. Moreover, issues and risks occur that may grow to become crises. Coping with change and unexpected events is what the concept of 'resilience' is about. Organisational resilience is the basis for the long-term viability of organisations in a turbulent environment. Communication, in various ways, is a bridging activity that supports the capacity of the organisation to function despite risks and disruptive incidents. Attention is needed to a resilient culture and collective mindfulness, in particular in high-reliability organisations. This book explains that the roots of current crises are complex. As many crises combine different kinds of threats, cooperation with other actors is needed for their mitigation. Communication brings such actors together. Communication has often aimed at enhancing dyadic relations between an organisation and its stakeholder groups. The issue arena approach instead focuses on competitive multi-actor interaction and posits that people primarily have a stake in issues that matter to them, rather than in organisations. Issues spread fast in social media and, hence, may result in organisational crises. To understand fast-changing public views, developing digital communication and monitoring online discourse are vital. In addition, the diversity of environmental dynamics and crises requires a range of different communication strategies. Research can offer a better understanding of evolving multi-actor interaction concerning issues, risks and crises, and support communicative decision making. This also calls for attention to methodological and ethical constraints in using big data for monitoring purposes. Finally, the book advocates the use of simulations and serious gaming to investigate multi-actor interaction in turbulent environments.
Article
Does the intensity of a social conflict affect political division? Traditionally, social cleavages are seen as the underlying cause of political conflicts. It is clear, however, that a violent conflict itself can shape partisan, social, and national identities. In this paper, we ask whether social conflicts unite or divide society by studying the effects of Ukraine's military conflict with Russia on online social ties between Ukrainian provinces (oblasts). To do so, we collected original data on the cross-regional structure of politically relevant online communication among users of the VKontakte social networking site. We analyze a panel of provinces spanning the most active phases of domestic protests and military conflict and isolate the effects of province-specific war casualties on the nature of inter-provincial online communication. The results show that war casualties elicit a strong emotional response in the corresponding provinces, but do not necessarily increase the level of social cohesion in inter-provincial online communication. We find that the intensity of the military conflict spurs online activism, but activates regional rather than nation-wide network connections. We also find that the military conflict tends to polarize some regions of Ukraine, especially in the East. Our research brings attention to underexplored areas in the study of civil conflict and political identities by documenting the ways the former may affect the latter.
Article
Full-text available
Surprisingly cruel mass murders and attacks have been witnessed in the educational institutions of the Western world since the 1970s. These are often referred to as 'school shootings'. There have been over 300 known incidents around the world, and the number is growing. Social network sites (SNSs) have enabled the perpetrators to express their views and intentions. Our result is that since about 2005, all major school shooters have had a presence in SNSs, and some have left traces that would have made it possible to evaluate their intentions to carry out a rampage. A further hypothesis is that future school shooters will behave in a similar manner and would thus be traceable in the digital sphere. In this paper, we try to take advantage of this tendency and study the presence of school-shooting-related information in various SNSs and its relation to past and perhaps future cases.
Article
Full-text available
The tOWL language is a temporal web ontology language based on OWL-DL without nominals. The language enables the representation of time and time-related aspects, such as state transitions. The design choices of the language pose new challenges from a temporal perspective. One such challenge is the representation of temporal cardinality. Another challenge consists of optimising the temporal representations in order to reduce the number of axioms. One such optimisation is temporal coalescing, which merges concepts that are associated with time intervals that either meet or share at least one instant with each other. In this paper we formally introduce these concepts into the tOWL language and illustrate how they can be applied.
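Temporal coalescing, as described above, merges intervals that meet or share at least one instant. A minimal sketch over integer instants with inclusive endpoints (the tOWL machinery itself operates on ontology axioms rather than bare tuples):

```python
# Coalesce intervals that overlap or meet (end + 1 == next start).
def coalesce(intervals):
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:          # meets or overlaps
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(coalesce([(1, 3), (4, 6), (10, 12), (5, 8)]))        # -> [(1, 8), (10, 12)]
```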
Book
Full-text available
Contents: Graphs; Groups; Transitive Graphs; Arc-Transitive Graphs; Generalized Polygons and Moore Graphs; Homomorphisms; Kneser Graphs; Matrix Theory; Interlacing; Strongly Regular Graphs; Two-Graphs; Line Graphs and Eigenvalues; The Laplacian of a Graph; Cuts and Flows; The Rank Polynomial; Knots; Knots and Eulerian Cycles; Glossary of Symbols; Index.
Article
Full-text available
This article details the networked production and dissemination of news on Twitter during snapshots of the 2011 Tunisian and Egyptian Revolutions as seen through information flows—sets of near-duplicate tweets—across activists, bloggers, journalists, mainstream media outlets, and other engaged participants. We differentiate between these user types and analyze patterns of sourcing and routing information among them. We describe the symbiotic relationship between media outlets and individuals and the distinct roles particular user types appear to play. Using this analysis, we discuss how Twitter plays a key role in amplifying and spreading timely information across the globe.
Article
Full-text available
The understanding of dynamics of data streams is greatly affected by the choice of temporal resolution at which the data are discretized, aggregated, and analyzed. Our paper focuses explicitly on data streams represented as dynamic networks. We propose a framework for identifying meaningful resolution levels that best reveal critical changes in the network structure, by balancing the reduction of noise with the loss of information. We demonstrate the applicability of our approach by analyzing various network statistics of both synthetic and real dynamic networks and using those to detect important events and changes in dynamic network structure.
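The core mechanic, sweeping candidate window widths over a timestamped interaction stream and watching how a network statistic behaves at each resolution, can be sketched as follows. The stream and the per-window edge count are deliberately simple stand-ins for the statistics analysed in the paper.

```python
# Aggregate a timestamped edge stream at several candidate resolutions.
from collections import Counter

stream = [(0, "a", "b"), (2, "a", "c"), (31, "b", "c"), (33, "a", "b"), (65, "c", "d")]

for width in (10, 30, 60):                    # candidate temporal resolutions
    counts = Counter(t // width for t, _, _ in stream)
    series = [counts.get(w, 0) for w in range(max(counts) + 1)]
    print(f"window={width:>2}: edges per window = {series}")
```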
Article
Full-text available
The proposed survey discusses the topic of community detection in the context of Social Media. Community detection constitutes a significant tool for the analysis of complex networks by enabling the study of mesoscopic structures that are often associated with organizational and functional characteristics of the underlying networks. Community detection has proven to be valuable in a series of domains, e.g. biology, social sciences, bibliometrics. However, despite the unprecedented scale, complexity and the dynamic nature of the networks derived from Social Media data, there has only been limited discussion of community detection in this context. More specifically, there is hardly any discussion on the performance characteristics of community detection methods as well as the exploitation of their results in the context of real-world web mining and information retrieval scenarios. To this end, this survey first frames the concept of community and the problem of community detection in the context of Social Media, and provides a compact classification of existing algorithms based on their methodological principles. The survey places special emphasis on the performance of existing methods in terms of computational complexity and memory requirements. It presents both a theoretical and an experimental comparative discussion of several popular methods. In addition, it discusses the possibility for incremental application of the methods and proposes five strategies for scaling community detection to real-world networks of huge scales. Finally, the survey deals with the interpretation and exploitation of community detection results in the context of intelligent web applications and services.
Article
Full-text available
Traditionally, consumers used the Internet to simply consume content: they read it, they watched it, and they used it to buy products and services. Increasingly, however, consumers are utilizing platforms--such as content sharing sites, blogs, social networking, and wikis--to create, modify, share, and discuss Internet content. This represents the social media phenomenon, which can now significantly impact a firm's reputation, sales, and even survival. Yet, many executives eschew or ignore this form of media because they don't understand what it is, the various forms it can take, and how to engage with it and learn. In response, we present a framework that defines social media by using seven functional building blocks: identity, conversations, sharing, presence, relationships, reputation, and groups. As different social media activities are defined by the extent to which they focus on some or all of these blocks, we explain the implications that each block can have for how firms should engage with social media. To conclude, we present a number of recommendations regarding how firms should develop strategies for monitoring, understanding, and responding to different social media activities.
Article
Full-text available
The purpose of this White Paper of the EU Support Action "Visioneer" (see www.visioneer.ethz.ch) is to address the following goals: 1. Develop strategies to quickly increase the objective knowledge about social and economic systems. 2. Describe requirements for efficient large-scale scientific data mining of anonymized social and economic data. 3. Formulate strategies for collecting stylized facts extracted from large data sets. 4. Sketch ways to successfully build up centers for computational social science. 5. Propose plans for creating centers for risk analysis and crisis forecasting. 6. Elaborate ethical standards regarding the storage, processing, evaluation, and publication of social and economic data.
Conference Paper
Full-text available
A social network is an abstract concept consisting of a set of people and relationships linking pairs of humans. A new multidimensional model, which covers three main dimensions: relation layer, time window and group, is proposed in the paper. These dimensions have a common set of nodes, typically corresponding to human beings. Relation layers, in turn, reflect various relationship types extracted from different user activities gathered in computer systems. The time dimension corresponds to the temporal variability of the social network. Social groups are extracted by means of clustering methods and group people who are close to each other. An atomic component of the multidimensional social network is a view – a small social sub-network, which lies in the intersection of all dimensions. A view describes the state of one social group, linked by one type of relationship (one layer), and derived from one time period. The multidimensional model of a social network is similar to the general concept of a data warehouse, in which a fact corresponds to a view. Aggregation possibilities and usage of the model are also discussed in the paper. Keywords: social network; multidimensional social network; multi-layered social network; network model
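A view in this model is the sub-network obtained by fixing one relation layer and one time window; group extraction then operates on the result. A minimal sketch with invented field names:

```python
# Extract one view: filter events by layer and time window.
events = [
    # (layer, time, src, dst)
    ("email", 5, "a", "b"), ("email", 7, "b", "c"),
    ("chat",  6, "a", "c"), ("email", 40, "c", "d"),
]

def view(events, layer, t_start, t_end):
    return [(s, d) for (l, t, s, d) in events
            if l == layer and t_start <= t < t_end]

print(view(events, layer="email", t_start=0, t_end=30))   # -> [('a','b'), ('b','c')]
```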
Article
Full-text available
The concept of Social Media is top of the agenda for many business executives today. Decision makers, as well as consultants, try to identify ways in which firms can make profitable use of applications such as Wikipedia, YouTube, Facebook, Second Life, and Twitter. Yet despite this interest, there seems to be very limited understanding of what the term “Social Media” exactly means; this article intends to provide some clarification. We begin by describing the concept of Social Media, and discuss how it differs from related concepts such as Web 2.0 and User Generated Content. Based on this definition, we then provide a classification of Social Media which groups applications currently subsumed under the generalized term into more specific categories by characteristic: collaborative projects, blogs, content communities, social networking sites, virtual game worlds, and virtual social worlds. Finally, we present 10 pieces of advice for companies which decide to utilize Social Media.
Conference Paper
Full-text available
We introduce the CAPER project (Collaborative information, Acquisition, Processing, Exploitation and Reporting), partially funded by the European Commission. The goal of CAPER is to create a common platform for the prevention of organized crime through sharing, exploitation and linking of Open and Closed information Sources. CAPER will support collaborative multilingual analysis of unstructured and audiovisual contents, based on Text Mining and Visual Analytics technologies. CAPER will allow Law Enforcement Agencies (LEAs) to share informational, investigative and experiential knowledge.
Conference Paper
Full-text available
Models of web data persistency are essential tools for the design of efficient information extraction systems that repeatedly collect and process the data. This study models the persistence of web data through the measurement of URL and content persistence across several snapshots of a national community web, collected for 3 years. We found that the lifetimes of URLs and contents are modelled by logarithmic functions. We gathered statistics on the structure of the web, identified reasons for URL death and characterized persistent URLs and contents. The lasting contents tend to be referenced by different URLs during their lifetime, while half of the contents referenced by persistent URLs do not change.
Conference Paper
Full-text available
This paper describes the architecture and a partial implementation of a system designed for the monitoring and analysis of communities at social media sites. The main contribution of the paper is a novel system architecture that facilitates long-term monitoring of the diverse social networks existing and emerging at various social media sites. It consists of three main modules: the crawler, the repository and the analyzer. The first module can be adapted to crawl different sites based on an ontology describing the structure of each site. The repository stores the crawled and analyzed persistent data using efficient data structures; it can be implemented using special-purpose graph databases and/or an object-relational database. The analyzer hosts modules that can be used for various graph and multimedia content analysis tasks. The results can again be stored in the repository, and so on. All modules can run concurrently.
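The three-module decomposition can be rendered as the skeleton below. All class and method names are illustrative assumptions rather than the system's real API; the bodies only mark where crawling, storage and analysis would plug in.

```python
# Skeleton of a crawler/repository/analyzer pipeline (names are assumptions).
class Crawler:
    def __init__(self, site_ontology):
        self.site_ontology = site_ontology     # drives per-site page parsing
    def crawl(self, seed_urls):
        yield from ()                          # would yield parsed records

class Repository:
    def __init__(self):
        self.graph = []                        # stand-in for a graph/OR database
    def store(self, record):
        self.graph.append(record)

class Analyzer:
    def run(self, repository):
        return {"records": len(repository.graph)}   # place for graph analysis

repo = Repository()
for record in Crawler(site_ontology={}).crawl(["https://example.org"]):
    repo.store(record)                         # modules could also run concurrently
print(Analyzer().run(repo))
```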
Conference Paper
Full-text available
As social networks are becoming ubiquitous on the Web, the Semantic Web goals indicate that it is critical to have a standard model allowing exchange, interoperability, transformation, and querying of social network data. In this paper we show that RDF/SPARQL meets these desiderata. Building on developments in social network analysis, graph databases and the Semantic Web, we present a social networks data model based on RDF, and a query and transformation language based on SPARQL meeting the above requirements. We study its expressive power and complexity, showing that it behaves well, and present an illustrative prototype.
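The modelling idea, social ties as RDF triples queried and transformed with SPARQL, can be tried directly with the open-source rdflib library; the triples below are illustrative, not the paper's data.

```python
# Social network as RDF, queried with SPARQL (rdflib).
from rdflib import Graph, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.alice, FOAF.knows, EX.bob))      # a social tie is just a triple
g.add((EX.bob, FOAF.knows, EX.carol))

# Who is reachable in two "knows" hops?
q = "SELECT ?x ?z WHERE { ?x foaf:knows ?y . ?y foaf:knows ?z . }"
for row in g.query(q, initNs={"foaf": FOAF}):
    print(row.x, "-- two hops -->", row.z)
```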
Conference Paper
Full-text available
Temporal streams of interactions are commonly aggregated into dynamic networks for temporal analysis. Results of this analysis are greatly affected by the resolution at which the original data are aggregated. The mismatch between the inherent temporal scale of the underlying process and that at which the analysis is performed can obscure important insights and lead to wrong conclusions. To this day, there is no established framework for choosing the appropriate scale for temporal analysis of streams of interactions. Our paper offers the first step towards the formalization of this problem. We show that for a general class of interaction streams it is possible to identify, in a principled way, the inherent temporal scale of the underlying dynamic processes. Moreover, we state important properties of these processes that can be used to develop an algorithm to identify this scale. Additionally, these properties can be used to separate interaction streams for which no level of aggregation is meaningful versus those that have a natural level of aggregation.
Conference Paper
Full-text available
Evolving the database that is at the core of an Information System represents a difficult maintenance problem that has only been studied in the framework of traditional information systems. However, the problem is likely to be even more severe in web information systems, where open-source software is often developed through the contributions and collaboration of many groups and individuals. Therefore, in this paper, we present an in-depth analysis of the evolution history of the Wikipedia database and its schema; Wikipedia is the best-known example of a large family of web information systems built using the open-source software MediaWiki. Our study is based on: (i) a set of Schema Modification Operators that provide a simple conceptual representation for complex schema changes, and (ii) simple software tools to automate the analysis. This framework allowed us to dissect and analyze the 4.5 years of Wikipedia history, which was short in time, but intense in terms of growth and evolution. Beyond confirming the initial hunch about the severity of the problem, our analysis suggests the need for developing better methods and tools to support graceful schema evolution. Therefore, we briefly discuss documentation and automation support systems for database evolution, and suggest that the Wikipedia case study can provide the kernel of a benchmark for testing and improving such systems.
Conference Paper
Full-text available
In this paper we introduce graph-evolution rules, a novel type of frequency-based pattern that describe the evolution of large networks over time, at a local level. Given a sequence of snapshots of an evolving graph, we aim at discovering rules describing the local changes occurring in it. Adopting a definition of support based on minimum image we study the problem of extracting patterns whose frequency is larger than a minimum support threshold. Then, similar to the classical association rules framework, we derive graph-evolution rules from frequent patterns that satisfy a given minimum confidence constraint. We discuss merits and limits of alternative definitions of support and confidence, justifying the chosen framework. To evaluate our approach we devise GERM (Graph Evolution Rule Miner), an algorithm to mine all graph-evolution rules whose support and confidence are greater than given thresholds. The algorithm is applied to analyze four large real-world networks (i.e., two social networks, and two co-authorship networks from bibliographic data), using different time granularities. Our extensive experimentation confirms the feasibility and utility of the presented approach. It further shows that different kinds of networks exhibit different evolution rules, suggesting the usage of these local patterns to globally discriminate different kind of networks.
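A toy instance of one such local pattern is the triangle-closing rule: a new edge is "supported" when its endpoints already shared a neighbour in the previous snapshot. The sketch below counts that support between two snapshots; it is far simpler than the GERM miner itself.

```python
# Count how many new edges close a triangle that was open at time t.
def triangle_closing_support(snapshot_t, snapshot_t1):
    adj = {}
    for u, v in snapshot_t:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    new_edges = set(map(frozenset, snapshot_t1)) - set(map(frozenset, snapshot_t))
    closing = sum(1 for e in new_edges
                  for u, v in [tuple(e)]
                  if adj.get(u, set()) & adj.get(v, set()))   # shared neighbour
    return closing, len(new_edges)

g_t  = [("a", "b"), ("b", "c")]
g_t1 = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(triangle_closing_support(g_t, g_t1))   # -> (1, 2)
```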
Article
Full-text available
We explore problems related to computing graph distances in the data-stream model. The goal is to design algorithms that can process the edges of a graph in an arbitrary order given only a limited amount of working memory. We are motivated by both the practical challenge of processing massive graphs such as the web graph and the desire for a better theoretical understanding of the data-stream model. In particular, we are interested in the trade-offs between model parameters such as per-data-item processing time, total space, and the number of passes that may be taken over the stream. These trade-offs are more apparent when considering graph problems than they were in previous streaming work that solved problems of a statistical nature. Our results include the following: (1) Spanner construction: there exists a single-pass, Õ(tn^(1+1/t))-space, Õ(t²n^(1/t))-time-per-edge algorithm that constructs a spanner. (2) BFS trees: computing a breadth-first search tree requires Ω(n^(1+1/k)) space. Since constructing BFS trees is an important subroutine in many traditional graph algorithms, this demonstrates the need for new algorithmic techniques when processing graphs in the data-stream model. (3) Graph-distance lower bounds: any t-approximation of the distance between two nodes requires Ω(n^(1+1/t)) space. We also prove lower bounds for determining the length of the shortest cycle and other graph properties. (4) Techniques for decreasing per-edge processing: we discuss two general techniques for speeding up the per-edge computation time of streaming algorithms while increasing the space by only a small factor.
Article
Full-text available
The data warehousing and OLAP technologies are now moving toward handling complex data that mostly originate from the Web. However, integrating such data into a decision-support process requires their representation in a form processable by OLAP and/or data mining techniques. We present in this paper a complex data warehousing methodology that exploits XML as a pivot language. Our approach includes the integration of complex data into an ODS, in the form of XML documents; their dimensional modeling and storage in an XML data warehouse; and their analysis with combined OLAP and data mining techniques. We also address the crucial issue of performance in XML warehouses.
Article
Full-text available
Many dynamic applications are built upon large network infrastructures, such as social networks, communication networks, biological networks and the Web. Such applications create data that can be naturally modeled as graph streams, in which edges of the underlying graph are received and updated sequentially in a form of a stream. It is often necessary and important to summarize the behavior of graph streams in order to enable effective query processing. However, the sheer size and dynamic nature of graph streams present an enormous challenge to existing graph management techniques. In this paper, we propose a new graph sketch method, gSketch, which combines well studied synopses for traditional data streams with a sketch partitioning technique, to estimate and optimize the responses to basic queries on graph streams. We consider two different scenarios for query estimation: (1) A graph stream sample is available; (2) Both a graph stream sample and a query workload sample are available. Algorithms for different scenarios are designed respectively by partitioning a global sketch to a group of localized sketches in order to optimize the query estimation accuracy. We perform extensive experimental studies on both real and synthetic data sets and demonstrate the power and robustness of gSketch in comparison with the state-of-the-art global sketch method.
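The classical synopsis underlying this line of work is the count-min sketch; a minimal version for edge-frequency estimation is shown below. gSketch's contribution, partitioning such a global sketch into localized ones, is not attempted here.

```python
# Minimal count-min sketch for edge-frequency estimation on a graph stream.
import random

class CountMin:
    def __init__(self, width=256, depth=4, seed=7):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]
    def add(self, item, count=1):
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, item)) % self.width] += count
    def estimate(self, item):          # may overestimate, never underestimates
        return min(self.table[row][hash((salt, item)) % self.width]
                   for row, salt in enumerate(self.salts))

cm = CountMin()
for edge in [("a", "b")] * 100 + [("b", "c")] * 3:
    cm.add(edge)
print(cm.estimate(("a", "b")), cm.estimate(("b", "c")))   # ~100, ~3
```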
Article
Full-text available
Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a great need for multi-purpose techniques able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method has performance comparable to the best existing algorithms on artificial benchmark graphs. Several applications on real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.
Article
Full-text available
Community effects on the behaviour of individuals, the community itself and other communities can be observed in a wide range of applications. This is true in scientific research, where communities of researchers have increasingly to justify their impact and progress to funding agencies. While previous work has tried to explain and analyse such phenomena, there is still a great potential for increasing the quality and accuracy of this analysis, especially in the context of cross-community effects. In this work, we propose a general framework consisting of several different techniques to analyse and explain such dynamics. The proposed methodology works with arbitrary community algorithms and incorporates meta-data to improve the overall quality and expressiveness of the analysis. We suggest and discuss several approaches to understand, interpret and explain particular phenomena, which themselves are identified in an automated manner. We illustrate the benefits and strengths of our approach by exposing highly interesting in-depth details of cross-community effects between two closely related and well established areas of scientific research. We finally conclude and highlight the important open issues on the way towards understanding, defining and eventually predicting typical life-cycles and classes of communities in the context of cross-community effects.
Article
Full-text available
The resource description framework (RDF) is a metadata model and language recommended by the W3C. This paper presents a framework to incorporate temporal reasoning into RDF, yielding temporal RDF graphs. We present a semantics for these kinds of graphs which includes the notion of temporal entailment and a syntax to incorporate this framework into standard RDF graphs, using the RDF vocabulary plus temporal labels. We give a characterization of temporal entailment in terms of RDF entailment and show that the former does not yield extra asymptotic complexity with respect to nontemporal RDF graphs. We also discuss temporal RDF graphs with anonymous timestamps, providing a theoretical framework for the study of temporal anonymity. Finally, we sketch a temporal query language for RDF, along with complexity results for query evaluation that show that the time dimension preserves the tractability of answers.
Article
Full-text available
A large body of work has been devoted to defining and identifying clusters or communities in social and information networks. We explore from a novel perspective several questions related to identifying meaningful communities in large social and information networks, and we come to several striking conclusions. We employ approximation algorithms for the graph partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities. In particular, we define the network community profile plot, which characterizes the "best" possible community--according to the conductance measure--over a wide range of size scales. We study over 100 large real-world social and information networks. Our results suggest a significantly more refined picture of community structure in large networks than has been appreciated previously. In particular, we observe tight communities that are barely connected to the rest of the network at very small size scales; and communities of larger size scales gradually "blend into" the expander-like core of the network and thus become less "community-like." This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, it is exactly the opposite of what one would expect based on intuition from expander graphs, low-dimensional or manifold-like graphs, and from small social networks that have served as testbeds of community detection algorithms. We have found that a generative graph model, in which new edges are added via an iterative "forest fire" burning process, is able to produce graphs exhibiting a network community profile plot similar to what we observe in our network datasets.
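The conductance measure behind the network community profile plot has a short definition: the number of edges crossing the boundary of a node set S divided by the smaller of the edge-endpoint volumes on either side. A self-contained sketch:

```python
# Conductance phi(S) = cut(S) / min(vol(S), vol(V - S)) for an undirected graph.
def conductance(edges, S):
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_S = sum((u in S) + (v in S) for u, v in edges)   # endpoints inside S
    vol_rest = 2 * len(edges) - vol_S
    return cut / min(vol_S, vol_rest)

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
print(conductance(edges, {"a", "b", "c"}))   # tight cluster -> low value (1/3)
```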
Article
Social media sites have appeared in cyberspace during the last 5-7 years and have attracted hundreds of millions of users. The sites are often viewed as instances of Web 2.0 technologies and support easy uploading and downloading of user-generated contents. This content contains valuable real-time information about the state of affairs in various parts of the world that is often public or at least semipublic. Many governments, businesses, and individuals are interested in this information for various reasons. In this paper we describe how ontologies can be used in constructing monitoring software that would extract useful information from social media sites and store it over time for further analysis. Ontologies can be used in at least two roles in this context. First, the crawler accessing a site must know the "native ontology" of the site in order to be able to parse the pages returned by the site in question, extract the relevant information (such as the friends of a user) and store it into the persistent generic (graph) model instance at the monitoring site. Second, ontologies can be used in data analysis to capture and filter the collected data to find information and phenomena of interest. This includes influence analysis, grouping of users, etc. In this paper we mainly discuss the construction of the ontology-guided crawler.
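The "native ontology" role can be illustrated as a small mapping that tells the crawler which fields of a parsed page correspond to which relations of the generic graph model. The mapping format and field names below are invented for illustration.

```python
# Ontology-guided extraction sketch: site fields -> generic graph relations.
SITE_ONTOLOGY = {
    # generic relation -> key in the site's parsed page record (assumed names)
    "friend_of": "friends",
    "member_of": "communities",
}

def extract_edges(page_record, ontology=SITE_ONTOLOGY):
    src = page_record["profile_id"]
    for relation, field in ontology.items():
        for target in page_record.get(field, []):
            yield (src, relation, target)       # edge in the generic model

page = {"profile_id": "u1", "friends": ["u2", "u3"], "communities": ["poetry"]}
print(list(extract_edges(page)))
```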
Conference Paper
Nowadays, the WWW contains a number of social media sites, which are growing rapidly. One of the main features of social media sites is that they allow their users to create and modify the contents of the site through the offered WWW interfaces. Such contents are referred to as user-generated content, and their type varies from site to site. Social media sites can be modeled as constantly evolving multirelational directed graphs. In this paper we discuss persistent data structures for such graphs, and present and analyze queries performed against the structures. We also estimate the space requirements of the proposed data structures and compare them with the naive approach of storing each complete snapshot of the graph separately. We also investigate query performance against our data structure. We present analytical estimation results and simulation results, and discuss the structure's performance when it is used to store the entire contents of LiveJournal.
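One persistent structure in this general spirit is a single time-ordered event log from which any snapshot can be replayed, instead of storing each complete snapshot separately. The sketch below shows the trade-off in miniature; the details are not the paper's exact structures.

```python
# Rebuild the graph at time t by replaying a time-sorted edge event log.
def snapshot_at(event_log, t):
    edges = set()
    for ts, op, u, v in event_log:              # log sorted by timestamp
        if ts > t:
            break
        (edges.add if op == "+" else edges.discard)((u, v))
    return edges

log = [(1, "+", "a", "b"), (2, "+", "b", "c"), (5, "-", "a", "b"), (6, "+", "c", "d")]
print(snapshot_at(log, 4))   # -> {('a', 'b'), ('b', 'c')}
print(snapshot_at(log, 6))   # -> {('b', 'c'), ('c', 'd')}
```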
Chapter
Data sets originating from many different real world domains can be represented in the form of interaction networks in a very natural, concise and meaningful fashion. This is particularly true in the social context, especially given recent advances in Internet technologies and Web 2.0 applications leading to a diverse range of evolving social networks. Analysis of such networks can result in the discovery of important patterns and potentially shed light on important properties governing the growth of such networks. It has been shown that most of these networks exhibit strong modular nature or community structure. An important research agenda thus is to identify communities of interest and study their behavior over time. Given the importance of this problem there has been significant activity within this field particularly over the last few years. In this article we survey the landscape and attempt to characterize the principle methods for community discovery (and related variants) and identify current and emerging trends as well as crosscutting research issues within this dynamic field.
Chapter
Social influence is the behavioral change of a person because of the perceived relationship with other people, organizations and society in general. Social influence has been a widely accepted phenomenon in social networks for decades. Many applications have been built around the implicit notion of social influence between people, such as marketing, advertisement and recommendations. With the exponential growth of online social network services such as Facebook and Twitter, social influence can for the first time be measured over a large population. In this chapter, we survey the research on social influence analysis with a focus on the computational aspects. First, we present statistical measurements related to social influence. Second, we describe the literature on social similarity and influence. Third, we present the research on social influence maximization, which has many practical applications including marketing and advertisement. Keywords: social network analysis; social influence analysis; network centrality; influence maximization
Article
Traditional approaches to temporal reasoning assume that time periods and time spans of events can be accurately represented as intervals. Real-world time periods and events, on the other hand, are often characterized by vague temporal boundaries, requiring appropriate generalizations of existing formalisms. This paper presents a framework for reasoning about qualitative and metric temporal relations between vague time periods. In particular, we show how several interesting problems, like consistency and entailment checking, can be reduced to reasoning tasks in existing temporal reasoning frameworks. We furthermore demonstrate that all reasoning tasks of interest are NP-complete, which reveals that adding vagueness to temporal reasoning does not increase its computational complexity. To support efficient reasoning, a large tractable subfragment is identified, among others, generalizing the well-known ORD Horn subfragment of the Interval Algebra (extended with metric constraints).
Article
Although there has recently been extensive research on collaborative networks and online communities, there is very limited knowledge about the actual evolution of online social networks (OSN). In this Letter, we study the structural evolution of a large online virtual community. We find that the scale growth of the OSN shows a non-trivial S shape, which may provide a proper exemplification of the Bass diffusion model. We reveal that the evolution of many network properties, such as density, clustering, heterogeneity and modularity, is non-monotone, and that a shrinking phenomenon occurs in the path length and diameter of the network. Furthermore, the OSN underwent a transition from the degree assortativity characteristic of collaborative networks to the degree disassortativity characteristic of many OSNs. Our study has revealed the evolutionary pattern of interpersonal interactions in a specific population and provides a valuable platform for theoretical modeling and further analysis.
Conference Paper
In many online social systems, social ties between users play an important role in dictating their behavior. One of the ways this can happen is through social influence, the phenomenon that the actions of a user can induce his/her friends to behave in a similar way. In systems where social influence exists, ideas, modes of behavior, or new technologies can diffuse through the network like an epidemic. Therefore, identifying and understanding social influence is of tremendous interest from both analysis and design points of view. This is a difficult task in general, since there are factors such as homophily or unobserved confounding variables that can induce statistical correlation between the actions of friends in a social network. Distinguishing influence from these is essentially the problem of distinguishing correlation from causality, a notoriously hard statistical problem. In this paper we study this problem systematically. We define fairly general models that replicate the aforementioned sources of social correlation. We then propose two simple tests that can identify influence as a source of social correlation when the time series of user actions is available. We give a theoretical justification of one of the tests by proving that with high probability it succeeds in ruling out influence in a rather general model of social correlation. We also simulate our tests on a number of examples designed by randomly generating actions of nodes on a real social network (from Flickr) according to one of several models. Simulation results confirm that our test performs well on these data. Finally, we apply them to real tagging data on Flickr, exhibiting that while there is significant social correlation in tagging behavior on this system, this correlation cannot be attributed to social influence.
Conference Paper
We present a detailed study of network evolution by analyzing four large online social networks with full temporal information about node and edge arrivals. For the first time at such a large scale, we study individual node arrival and edge creation processes that collectively lead to macroscopic properties of networks. Using a methodology based on the maximum-likelihood principle, we investigate a wide variety of network formation strategies, and show that edge locality plays a critical role in evolution of networks. Our findings supplement earlier network models based on the inherently non-local preferential attachment. Based on our observations, we develop a complete model of network evolution, where nodes arrive at a prespecified rate and select their lifetimes. Each node then independently initiates edges according to a "gap" process, selecting a destination for each edge according to a simple triangle-closing model free of any parameters. We show analytically that the combination of the gap distribution with the node lifetime leads to a power law out-degree distribution that accurately reflects the true network in all four cases. Finally, we give model parameter settings that allow automatic evolution and generation of realistic synthetic networks of arbitrary scale.
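The parameter-free triangle-closing step can be sketched in a few lines: the source node takes one random hop and then a second, so the created edge closes a triangle. This illustrates the mechanism only, not the full evolution model with node lifetimes and gaps.

```python
# One triangle-closing step: u -> random neighbour w -> random neighbour of w.
import random

def triangle_close(adj, u, rng=random):
    w = rng.choice(sorted(adj[u]))                   # first hop
    candidates = sorted(adj[w] - adj[u] - {u})       # second hop, skip existing ties
    return (u, rng.choice(candidates)) if candidates else None

adj = {"u": {"a"}, "a": {"u", "b", "c"}, "b": {"a"}, "c": {"a"}}
print(triangle_close(adj, "u"))                      # -> ('u', 'b') or ('u', 'c')
```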
Conference Paper
The challenge of monitoring massive amounts of data generated by communication networks has led to the interest in data stream processing. We study streams of edges in massive communication multigraphs, defined by (source, destination) pairs. The goal is to compute properties of the underlying graph while using small space (much smaller than the number of communicants), and to avoid bias introduced because some edges may appear many times, while others are seen only once. We give results for three fundamental problems on multigraph degree sequences: estimating frequency moments of degrees, finding the heavy hitter degrees, and computing range sums of degree values. In all cases we are able to show space bounds for our summarizing algorithms that are significantly smaller than storing complete information. We use a variety of data stream methods: sketches, sampling, hashing and distinct counting, but a common feature is that we use cascaded summaries: nesting multiple estimation techniques within one another. In our experimental study, we see that such summaries are highly effective, enabling massive multigraph streams to be effectively summarized to answer queries of interest with high accuracy using only a small amount of space.
Conference Paper
The Web is a dynamic, ever-changing collection of information. This paper explores changes in Web content by analyzing a crawl of 55,000 Web pages, selected to represent different user visitation patterns. Although change over long intervals has been explored on random (and potentially unvisited) samples of Web pages, little is known about the nature of finer-grained changes to pages that are actively consumed by users, such as those in our sample. We describe algorithms, analyses, and models for characterizing changes in Web content, focusing on both time (by using hourly and sub-hourly crawls) and structure (by looking at page-, DOM-, and term-level changes). Change rates are higher in our behavior-based sample than found in previous work on randomly sampled pages, with a large portion of pages changing more than hourly. Detailed content and structure analyses identify stable and dynamic content within each page. The understanding of Web change we develop in this paper has implications for tools designed to help people interact with dynamic Web content, such as search engines, advertising, and Web browsers.
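One simple way to quantify term-level change between two crawls of the same page is a Jaccard distance over word k-shingles. The sketch below uses that measure as a stand-in for the paper's own metrics; the tokenization and the choice of k are assumptions.

```python
# Term-level change between two crawls, scored by k-shingle Jaccard distance.
import re

def shingles(text, k=3):
    """Set of overlapping k-word windows from the page text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def term_level_change(old_text, new_text, k=3):
    """0.0 = unchanged, 1.0 = completely rewritten."""
    a, b = shingles(old_text, k), shingles(new_text, k)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

Applied to hourly crawls, a time series of such scores separates pages with stable cores and small dynamic regions from pages that are rewritten wholesale.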
Conference Paper
Spatio-temporal networks are spatial networks whose topology and parameters change with time. These networks are important for many critical applications, such as emergency traffic planning and route-finding services, and there is an immediate need for models that support the design of efficient algorithms for computing the frequent queries on such networks. This problem is challenging due to the potentially conflicting requirements of model simplicity and support for efficient algorithms. Time-expanded networks, which have been used to model dynamic networks, employ replication of the network across time instants, resulting in high storage overhead and algorithms that are computationally expensive. In contrast, the proposed time-aggregated graphs do not replicate nodes and edges across time; rather, they allow the properties of edges and nodes to be modeled as time series. Since the model does not replicate the entire graph for every instant of time, it uses less memory, and the algorithms for common operations (e.g., connectivity, shortest path) are computationally more efficient than those for time-expanded networks. One important query on spatio-temporal networks is the computation of shortest paths. Shortest paths can be computed either for a given start time or to find the start time and the path that lead to least-travel-time journeys (best start time journeys). Developing efficient algorithms for computing shortest paths in a time-varying spatial network is challenging because these journeys do not always display the greedy property or optimal substructure, making techniques like dynamic programming inapplicable. In this paper, we propose algorithms for shortest path computations in both contexts. We present analytical cost models for the algorithms and provide an experimental comparison of their performance with existing algorithms.
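For the fixed-start-time case, a Dijkstra-style search over arrival times remains valid when edges are FIFO (departing later never yields an earlier arrival). The sketch below assumes that property and a simple per-edge travel-time series; it is a plausible reading of the setting, not the paper's algorithm, and the best-start-time case needs more machinery, as the abstract notes.

```python
# Earliest-arrival search on a time-aggregated graph with discrete instants.
import heapq

def earliest_arrival(tag, src, dst, start_time):
    """tag: {u: [(v, [tt_0, tt_1, ...]), ...]} where tt_t is the travel time
    of edge (u, v) when departing at instant t (clamped at the series end)."""
    best = {src: start_time}
    pq = [(start_time, src)]
    while pq:
        t, u = heapq.heappop(pq)
        if t > best.get(u, float("inf")):
            continue                                  # stale queue entry
        if u == dst:
            return t
        for v, series in tag.get(u, []):
            tt = series[min(int(t), len(series) - 1)]  # travel time at departure
            arrive = t + tt
            if arrive < best.get(v, float("inf")):
                best[v] = arrive
                heapq.heappush(pq, (arrive, v))
    return None                                       # dst unreachable
```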
Conference Paper
Given applications such as location-based services and the spatio-temporal queries they may pose on a spatial network (e.g., road networks), the goal is to develop a simple and expressive model that honors the time dependence of the road network. The model must support the design of efficient algorithms for computing the frequent queries on the network. This problem is challenging due to the potentially conflicting requirements of model simplicity and support for efficient algorithms. Time-expanded networks, which have been used to model dynamic networks, employ replication of the network across time instants, resulting in high storage overhead and algorithms that are computationally expensive. In contrast, the proposed time-aggregated graphs do not replicate nodes and edges across time; rather, they allow the properties of edges and nodes to be modeled as time series. Since the model does not replicate the entire graph for every instant of time, it uses less memory, and the algorithms for common operations (e.g., connectivity, shortest path) are computationally more efficient than those for time-expanded networks.
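The core of the model is a data-structure decision: one graph plus per-edge time series instead of one graph copy per instant. A minimal container in that spirit might look as follows; the field names and the boolean presence series are illustrative assumptions.

```python
# Minimal time-aggregated graph: edge properties stored as time series.
from collections import defaultdict

class TimeAggregatedGraph:
    def __init__(self, horizon):
        self.horizon = horizon
        # (u, v) -> boolean presence series over [0, horizon)
        self.edge_present = defaultdict(lambda: [False] * horizon)
        self.adj = defaultdict(set)

    def add_edge_interval(self, u, v, t_start, t_end):
        """Mark edge (u, v) as present for instants [t_start, t_end)."""
        self.adj[u].add(v)
        series = self.edge_present[(u, v)]
        for t in range(max(t_start, 0), min(t_end, self.horizon)):
            series[t] = True

    def snapshot(self, t):
        """Edges present at instant t -- what a time-expanded model would
        store as a full graph copy for this instant."""
        return [e for e, s in self.edge_present.items() if s[t]]
```

A production version would run-length encode the presence series; the point here is only the storage contrast with time-expanded replication.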
Conference Paper
As web pages are created, destroyed, and updated dynamically, web databases must be updated frequently to keep their copies of the pages up-to-date. Understanding the change behavior of web pages helps administrators manage their web databases. This paper introduces a number of metrics that represent various aspects of the change behavior of web pages. We monitored approximately 1.8 million to 3 million URLs at two-day intervals for 100 days. Using the proposed metrics, we analyze the collected URLs and web pages. In addition, we propose a method that computes the probability that a page will be downloaded on the next crawl.
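One standard way to turn such periodic observations into a download probability (shown here as a plausible reading of the abstract, not necessarily this paper's method) is to model page changes as a Poisson process, estimate the rate from n checks of which x detected a change, and compute P(change before the next crawl).

```python
# Poisson change-rate estimation from periodic checks (Cho/Garcia-Molina-style
# bias-reduced estimator), then the probability of change by the next crawl.
import math

def estimate_change_rate(n_checks, n_changed, interval_days):
    """Estimated changes per day from n_checks observations at a fixed
    interval, n_changed of which detected a change."""
    frac_unchanged = (n_checks - n_changed + 0.5) / (n_checks + 0.5)
    return -math.log(frac_unchanged) / interval_days

def p_change_by_next_crawl(rate, days_until_next_crawl):
    return 1.0 - math.exp(-rate * days_until_next_crawl)

# e.g. a page checked every 2 days for 100 days (50 checks), changed 20 times:
rate = estimate_change_rate(50, 20, 2.0)
print(p_change_by_next_crawl(rate, 2.0))   # ~0.4
```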
Conference Paper
Social network analysis has attracted much attention in recent years. Community mining is one of the major directions in social network analysis. Most of the existing methods on community mining assume that there is only one kind of relation in the network, and moreover, the mining results are independent of the users' needs or preferences. However, in reality, there exist multiple, heterogeneous social networks, each representing a particular kind of relationship, and each kind of relationship may play a distinct role in a particular task. In this paper, we systematically analyze the problem of mining hidden communities on heterogeneous social networks. Based on the observation that different relations have different importance with respect to a certain query, we propose a new method for learning an optimal linear combination of these relations which can best meet the user's expectation. With the obtained relation, better performance can be achieved for community mining.
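A simple way to realize "learning a linear combination of relations" is to fit weights so that the combined relation matches a handful of user-labelled node pairs (1 = same community, 0 = not). The sketch below does this by ordinary least squares over the labelled entries; the data layout and labelling scheme are assumptions, not the paper's formulation.

```python
# Learn weights w so that sum_i w_i * A_i best matches user-labelled pairs,
# via the normal equations (X^T X) w = X^T y and a tiny Gauss-Jordan solver.
def learn_relation_weights(relations, labelled_pairs):
    """relations: list of m dicts {(u, v): strength}; labelled_pairs:
    {(u, v): 0 or 1}.  Returns the weight vector w of length m."""
    m = len(relations)
    xtx = [[0.0] * m for _ in range(m)]
    xty = [0.0] * m
    for pair, y in labelled_pairs.items():
        x = [rel.get(pair, 0.0) for rel in relations]   # one row of X
        for i in range(m):
            xty[i] += x[i] * y
            for j in range(m):
                xtx[i][j] += x[i] * x[j]
    for col in range(m):                                # Gauss-Jordan
        pivot = max(range(col, m), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        p = xtx[col][col] or 1e-12
        for r in range(m):
            if r != col:
                f = xtx[r][col] / p
                xtx[r] = [a - f * b for a, b in zip(xtx[r], xtx[col])]
                xty[r] -= f * xty[col]
    return [xty[i] / (xtx[i][i] or 1e-12) for i in range(m)]
```

The learned weighted sum of adjacency matrices can then be fed to any single-relation community mining algorithm.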
Article
This work aims at discovering community structure in rich media social networks through analysis of time-varying, multirelational data. Community structure represents the latent social context of user actions. It has important applications such as search and recommendation. The problem is particularly useful in the enterprise domain, where extracting emergent community structure on enterprise social media can help in forming new collaborative teams, in expertise discovery, and in the long term reorganization of enterprises based on collaboration patterns. There are several unique challenges: (a) In social media, the context of user actions is constantly changing and coevolving; hence the social context contains time-evolving multidimensional relations. (b) The social context is determined by the available system features and is unique in each social media platform; hence the analysis of such data needs to flexibly incorporate various system features. In this article we propose MetaFac (MetaGraph Factorization), a framework that extracts community structures from dynamic, multidimensional social contexts and interactions. Our work has three key contributions: (1) metagraph, a novel relational hypergraph representation for modeling multirelational and multidimensional social data; (2) an efficient multirelational factorization method for community extraction on a given metagraph; (3) an online method to handle time-varying relations through incremental metagraph factorization. Extensive experiments on real-world social data collected from an enterprise and the public Digg social media Web site suggest that our technique is scalable and is able to extract meaningful communities from social media contexts. We illustrate the usefulness of our framework through two prediction tasks: (1) in the enterprise dataset, the task is to predict users’ future interests on tag usage, and (2) in the Digg dataset, the task is to predict users’ future interests in voting and commenting on Digg stories. Our prediction significantly outperforms baseline methods (including aspect model and tensor analysis), indicating the promising direction of using metagraphs for handling time-varying social relational contexts.
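A much-simplified cousin of this factorization idea (an illustration only; MetaFac itself factorizes a relational hypergraph and handles time incrementally) jointly factorizes several relation matrices over the same users as A_r ≈ U V_r, sharing the user factor U so it captures community structure common to all relations.

```python
# Coupled nonnegative factorization with a shared user factor U,
# using standard NMF multiplicative updates.  Toy, dense, stdlib-only.
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def coupled_nmf(mats, k, iters=200, seed=0):
    rnd = random.Random(seed)
    n, eps = len(mats[0]), 1e-9
    U = [[rnd.random() for _ in range(k)] for _ in range(n)]
    Vs = [[[rnd.random() for _ in range(len(A[0]))] for _ in range(k)]
          for A in mats]
    for _ in range(iters):
        Ut = transpose(U)
        UtU = matmul(Ut, U)
        for A, V in zip(mats, Vs):                 # V <- V * (U^T A)/(U^T U V)
            num, den = matmul(Ut, A), matmul(UtU, V)
            for i in range(k):
                for j in range(len(V[0])):
                    V[i][j] *= num[i][j] / (den[i][j] + eps)
        num = [[0.0] * k for _ in range(n)]        # U <- U * (sum A V^T)
        vvt = [[0.0] * k for _ in range(k)]        #        / (U sum V V^T)
        for A, V in zip(mats, Vs):
            Vt = transpose(V)
            avt, vv = matmul(A, Vt), matmul(V, Vt)
            for i in range(n):
                for j in range(k):
                    num[i][j] += avt[i][j]
            for i in range(k):
                for j in range(k):
                    vvt[i][j] += vv[i][j]
        den = matmul(U, vvt)
        for i in range(n):
            for j in range(k):
                U[i][j] *= num[i][j] / (den[i][j] + eps)
    return U, Vs
```

Each user's row of U then gives soft community memberships consistent across all the input relations.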
Article
Although a large body of work is devoted to finding communities in static social networks, only a few studies examined the dynamics of communities in evolving social networks. In this paper, we propose a dynamic stochastic block model for finding communities and their evolution in a dynamic social network. The proposed model captures the evolution of communities by explicitly modeling the transition of community memberships for individual nodes in the network. Unlike many existing approaches for modeling social networks that estimate parameters by their most likely values (i.e., point estimation), in this study, we employ a Bayesian treatment for parameter estimation that computes the posterior distributions for all the unknown parameters. This Bayesian treatment allows us to capture the uncertainty in parameter values and therefore is more robust to data noise than point estimation. In addition, an efficient algorithm is developed for Bayesian inference to handle large sparse social networks. Extensive experimental studies based on both synthetic data and real-life data demonstrate that our model achieves higher accuracy and reveals more insights in the data than several state-of-the-art algorithms.
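The generative side of such a model is easy to state in code: memberships evolve via a transition matrix, and edges at each step depend only on the current memberships. The paper's contribution is Bayesian inference over this process; the sketch below covers only the forward (data-generating) direction, with illustrative parameters.

```python
# Forward simulation of a dynamic stochastic block model.
import random

def simulate_dsbm(n, steps, pi, trans, block_p, seed=0):
    """pi: initial community distribution; trans[c][d]: P(c -> d) per step;
    block_p[c][d]: edge probability between communities c and d."""
    rnd = random.Random(seed)
    def draw(dist):                       # sample an index from a distribution
        r, acc = rnd.random(), 0.0
        for i, p in enumerate(dist):
            acc += p
            if r < acc:
                return i
        return len(dist) - 1
    z = [draw(pi) for _ in range(n)]      # initial memberships
    snapshots = []
    for _ in range(steps):
        edges = {(i, j) for i in range(n) for j in range(i + 1, n)
                 if rnd.random() < block_p[z[i]][z[j]]}
        snapshots.append((list(z), edges))
        z = [draw(trans[c]) for c in z]   # memberships transition
    return snapshots
```

Inference inverts this: given only the edge snapshots, recover the posterior over memberships and transitions, which is where the Bayesian treatment and its robustness to noise come in.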
Article
Recent work in Artificial Intelligence (AI) is exploring the use of formal ontologies as a way of specifying content-specific agreements for the sharing and reuse of knowledge among software entities. We take an engineering perspective on the development of such ontologies. Formal ontologies are viewed as designed artifacts, formulated for specific purposes and evaluated against objective design criteria. We describe the role of ontologies in supporting knowledge sharing activities, and then present a set of criteria to guide the development of ontologies for these purposes. We show how these criteria are applied in case studies from the design of ontologies for engineering mathematics and bibliographic data. Selected design decisions are discussed, and alternative representation choices are evaluated against the design criteria.
Gross, G. (2012). White House launches big data R&D effort. Computerworld.
Gantz, J. F., Reinsel, D., Chute, C., Schlichting, W., McArthur, J., Minton, S., Xheneti, I., et al. (2007). The Expanding Digital Universe: A Forecast of Worldwide Information Growth Through 2010.
Fortunato, S. (2009). Community detection in graphs. arXiv:0906.0612. doi:10.1016/j.physrep.2009.11.002