Conference Paper

Parallel Crawling for Online Social Networks

Authors: Duen Horng Chau, Shashank Pandit, Samuel Wang, Christos Faloutsos

Abstract

Given a huge online social network, how do we retrieve information from it through crawling? Even better, how do we improve the crawling performance by using parallel crawlers that work independently? In this paper, we present a framework of parallel crawlers for online social networks that utilizes a centralized queue. To show how this works in practice, we describe our implementation of the crawlers for an online auction website. The crawlers work independently, so the failure of one crawler does not affect the others. The framework ensures that no redundant crawling occurs. Using the crawlers we built, we visited a total of approximately 11 million auction users, about 66,000 of whom were completely crawled.
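To make the design concrete, here is a minimal sketch (our illustration, not the authors' code) of independent crawler threads sharing a centralized queue; a shared visited set rules out redundant crawling, and the fetch_profile function and seed IDs are hypothetical placeholders for the site-specific scraping logic.

```python
import queue
import threading

# Centralized queue of user IDs to crawl; a shared visited set
# guarantees that no user is crawled twice (no redundant crawling).
frontier = queue.Queue()
visited = set()
visited_lock = threading.Lock()

def fetch_profile(user_id):
    """Hypothetical placeholder: download a user's profile page and
    return the list of neighbor user IDs found on it."""
    raise NotImplementedError

def crawler():
    while True:
        user_id = frontier.get()
        try:
            for neighbor in fetch_profile(user_id):
                with visited_lock:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        frontier.put(neighbor)
        except Exception:
            pass  # a failing crawler thread does not affect the others
        finally:
            frontier.task_done()

def run(seeds, n_crawlers=8):
    for seed in seeds:
        visited.add(seed)
        frontier.put(seed)
    for _ in range(n_crawlers):
        threading.Thread(target=crawler, daemon=True).start()
    frontier.join()  # block until the frontier is exhausted
```

Because each worker only talks to the shared queue, a worker that dies simply stops pulling work; the others continue unaffected, matching the fault-isolation property claimed in the abstract.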


... Moreover, most of the data is repeatedly overwritten due to the limited memory of handheld devices. Therefore, device- and operating-system-based forensic recovery does not provide complete information; this fact is acknowledged by several authors (Chau et al., 2007; Cho and Garcia-Molina, 2002; Ding et al., 2013; Psallidas et al., 2013; Wong et al., 2014). Despite the incomplete information, device analysis is instrumental in retrieving information such as additional profiles, passwords and artifacts deleted by the user that may not be retrieved otherwise. ...
... Later studies suggested the use of web crawlers for online data extraction, including complete history from media sites, to overcome the limitation of incomplete retrieval (Chau et al., 2007; Cho and Garcia-Molina, 2002; Ding et al., 2013; Psallidas et al., 2013; Wong et al., 2014). Web crawlers provide a more detailed collection due to their systematic browsing, in addition to finding and following subsequent hyperlinks. ...
Article
In recent years, examination of social media networks has become an integral part of investigations. Law enforcement agencies and legal practitioners frequently use social networks to quickly access information related to the participants of an illicit incident. However, the forensic process requires the collection and analysis of information that is immense, heterogeneous, and spread across multiple social networks. The process is technically intricate because online social networks (OSNs) are heterogeneous and unstructured, creating cognitive challenges and massive workloads for investigators. It is therefore imperative to develop automated and reliable solutions to assist investigators. Capturing forensic information in structured form is crucial for automation, sharing, and interoperability. This paper introduces the design of a multi-layer framework, from collection to evidence analysis. The central component of this framework is a hybrid ontology approach that involves multiple ontologies to manage the unstructured data and integrate various social media data collections. This approach aims to find evidence by automated methods that are trustworthy and therefore admissible in a court of law.
... Hence, the chances of completely recovering the forensic data from gadgets are low. This fact is further acknowledged in numerous other studies (Chau et al., 2007; Cho and Garcia-Molina, 2002; Ding et al., 2013; Psallidas et al., 2013; Wong et al., 2014). ...
... Initially, web crawlers were suggested to extract online data from social media sites (Cho and Garcia-Molina, 2002; Chau et al., 2007; Ding et al., 2013; Psallidas et al., 2013; Wong et al., 2014). A web crawler starts with a target URL, systematically browses through that web page, and identifies hyperlinks for recursive visits. ...
Article
Social Media (SM) evidence is a new and rapidly emerging frontier in digital forensics. The trail of digital information on social media, if explored correctly, can offer remarkable support in criminal investigations. However, exploring social media for potential evidence and presenting these proofs in court is not a straightforward task. Social media evidence must be collected by a legally and scientifically appropriate forensic process that also respects the privacy rights of individuals. Following the legal process is a challenging task for legal practitioners and investigators due to the highly dynamic and heterogeneous nature of social media. Forensic investigators can conduct effective investigations and collect legally sound evidence efficiently if they are provided with sophisticated tools to manage the diversity and size of social media content. This article explains the current state of evidence acquisition, admissibility, and jurisdiction in social media forensics. It also describes the immediate challenges for the collection, analysis, presentation, and validation of social media evidence in legal proceedings. Furthermore, the research gaps in the domain and a few research objectives with potential research directions are presented.
... In order to provide direct access to the needed research data for some research communities, both researchers and organizations have sought compromise solutions that respect Twitter policies (Abdulrahman et al. 2011, McCreadie et al. 2012). However, the proposed direct access methods did not cover all of the researchers' needs (Chau et al. 2007). To deal with this problem, many researchers have implemented their own platforms integrating new extraction techniques that deal with the restrictions of the Twitter interfaces. ...
... However, the issue that arises in this case is how these crawlers can be managed in parallel. Chau et al. (2007) managed their parallel crawlers using a centralized coordinator and a master data server managing the sub-lists of the user queue to be processed by each crawler. Canali et al. (2011) integrated a centralized engine module coordinating the different parallel crawling tasks by exploiting the MapReduce programming paradigm. ...
Thesis
During crisis events such as disasters, the need for real-time information retrieval (IR) from microblogs remains inevitable. However, the huge amount and variety of information shared in real time during such events over-complicate this task. Unlike existing IR approaches based on content analysis, we propose to tackle this problem by using user-centric IR approaches, while solving the wide spectrum of methodological and technological barriers inherent to: 1) the collection of the evaluated users' data, 2) the modeling of user behavior, 3) the analysis of user behavior, and 4) the prediction and tracking of prominent users in real time. In this context, we detail the approaches proposed in this dissertation, leading to the prediction of prominent users who are likely to share the targeted relevant and exclusive information on the one hand, and enabling emergency responders to have real-time access to the required information in all formats (i.e., text, image, video, links) on the other. These approaches focus on three key aspects of prominent user identification. Firstly, we studied the efficiency of state-of-the-art and newly proposed raw features for characterizing user behavior during crisis events. Based on the selected features, we designed several engineered features qualifying user activities by considering both their on-topic and off-topic shared information. Secondly, we proposed a phase-aware user modeling approach taking into account changes in user behavior as the event evolves over time. This user modeling approach comprises the following novel aspects: (1) modeling microblog users' behavior evolution by considering the different event phases, (2) characterizing users' activity over time through a temporal sequence representation, and (3) time-series-based selection of the most discriminative features characterizing users at each event phase. Thirdly, based on this user modeling approach, we train various prediction models to learn to differentiate between prominent and non-prominent users' behavior during a crisis event. The learning task was performed using SVM and MoG-HMM supervised machine learning algorithms. The efficiency and efficacy of these prediction models were validated using the data collections extracted by our multi-agent system MASIR during two flooding events that occurred in France, and the ground truths related to these collections.
... The common method to gather relevant and usable information is following the hyperlinks and expanding the domain. Later, with the creation and explosive growth of OSNs, attention largely shifted to harvesting public information on these networks [12,13,14]. Chau et al. [12] were able to crawl approximately 11 million auction users on eBay with a parallel crawler. ...
... Later, with the creation and explosive growth of OSNs, attention largely shifted to harvesting public information on these networks [12,13,14]. Chau et al. [12] were able to crawl approximately 11 million auction users on eBay with a parallel crawler. The record for successful crawling belongs to Kwak et al. [14], who gathered 41.7 million public user profiles, 1.47 billion social relations and 106 million tweets. ...
Article
Full-text available
We investigate a graph probing problem in which an agent has only an incomplete view G' ⊊ G of the network and wishes to explore the network with the least effort. In each step, the agent selects a node u in G' to probe. After probing u, the agent gains the information about u and its neighbors. All the neighbors of u become observed and are probable in the subsequent steps (if they have not been probed). What is the best probing strategy to maximize the number of nodes explored in k probes? This problem serves as a fundamental component of other decision-making problems in incomplete networks, such as information harvesting in social networks, network crawling, network security, and viral marketing with incomplete information. While a few methods have been proposed for the problem, none performs consistently well across different network types. In this paper, we establish a strong (in)approximability result for the problem, proving that no algorithm can guarantee a finite approximation ratio unless P=NP. On the bright side, we design learning frameworks to capture the best probing strategies for individual networks. Our extensive experiments suggest that our framework can learn efficient probing strategies that consistently outperform previous heuristics and metric-based approaches.
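For illustration, the following sketch implements one simple metric-based baseline of the kind such learned strategies are compared against (it is not the paper's learning framework): greedily probe the observed-but-unprobed node with the most observed incident edges. The probe(u) oracle is a hypothetical stand-in for the network query.

```python
def probe_greedy(probe, seeds, k):
    """Greedy probing baseline: spend k probes, always probing the
    observed, not-yet-probed node with the highest observed degree.
    `probe(u)` is a hypothetical oracle returning u's neighbor list."""
    probed = set()
    observed = {}  # node -> number of observed incident edges
    for s in seeds:
        observed.setdefault(s, 0)
    for _ in range(k):
        candidates = [u for u in observed if u not in probed]
        if not candidates:
            break
        u = max(candidates, key=lambda v: observed[v])
        probed.add(u)
        for w in probe(u):  # all neighbors of u become observed
            observed[w] = observed.get(w, 0) + 1
    return probed, set(observed)
```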
... Specifically, we define that different research topics address social professional networks and are divided into issues and tasks. The issues emerge from the need for crawling, storing, managing and treating the data from the networks [17,20,38,43,46,48,51,74,88,91,115,135,138]. The tasks then represent the ways such networks can be analyzed, used, improved and applied in different contexts [4,5,14,33,34,45,54,55,62,76,82,85,97,111,119,122]. ...
... The focus is on obtaining the data from social networks or other sources. Current approaches include data from social network websites [20,43,88,91], digital libraries [17,51,138] and the web [48,115] (i.e., researchers use data from digital libraries and the web to build the structure of social networks). ...
Article
Social professional networks provide features not available in other networks. For example, LinkedIn and AngelList facilitate professional networking, and GitHub enables committing and sharing code. Such social networks also provide data with information about users, their behavior, interactions and posted content. Here, we aim to foster a deeper understanding of social professional network types, definitions, features, analyses and applications, while providing a useful taxonomy of their use.
... Graph traversal is defined as visiting every vertex and edge exactly once, in a well-defined order [110,111,112]. When using a particular graph algorithm, one must ensure that each vertex of the graph is visited exactly once [113,114,115]. ...
Thesis
There are various pharmaceutical products, such as cosmetics, drugs, and others. Some products mix drugs and herbs without their interactions being known. Some people believe that consuming drugs and herbs together is more effective at curing their diseases, and that they can reduce the portion of drugs and save money by using herbs, since herbs are easy to find. Drug-herb interactions (DHIs) are the interactions between conventional drugs and herbal medicines. DHIs mostly take place with prescribed drugs, dietary supplements, and a small portion of foods. Information on DHIs is scattered across heterogeneous databases and website resources, and some of the databases require a subscription or payment to gather the information. As a result, gathering information about DHIs on herbs and drugs is necessary for a variety of purposes, particularly for researchers. An improved web crawler is proposed to collect information from resources on the Internet. In web crawlers, indexing algorithms are used to determine a page's relevance according to its priority. Breadth First Search (BFS) and PageRank indexing algorithms are used to index the heterogeneous websites' entries. We believe that employing the focused web crawler will improve efficiency.
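As a hedged illustration of the PageRank side of such prioritization (a generic power-iteration sketch, not the thesis's implementation), a crawl frontier can be ordered by scores computed over the link graph discovered so far; the toy links dictionary below is made up.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank. `links` maps each page to the list of
    pages it links to; pages missing from `links` are treated as
    dangling. Returns a dict page -> score."""
    pages = set(links) | {q for ts in links.values() for q in ts}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - d) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = d * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Example: order a crawl frontier by descending PageRank score.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
frontier = sorted(scores, key=scores.get, reverse=True)
```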
... Furthermore, TONIC operates in an initially unknown SN graph, while in topology-based link prediction a large part of the network topology is given. Obtaining personal information from a SN by intelligent crawling (Chau et al. 2007; Mislove et al. 2007; Korolova et al. 2008) or by other methods (Bonneau, Anderson, and Danezis 2009) has been discussed previously, where the goal was to uncover large portions of a SN. In TONIC we wish to retrieve information about a specific target and avoid further crawling, requiring a different problem formulation and different heuristics. ...
Article
In this paper we introduce the Target Oriented Network Intelligence Collection (TONIC) problem, which is the problem of finding profiles in a social network that contain information about a given target via automated crawling. We formalize TONIC as a search problem and propose a best-first approach for solving it. Several heuristics are presented to guide this search. These heuristics are based on the topology of the currently known part of the social network. The efficiency of the proposed heuristics and the effect of the graph topology on their performance are experimentally evaluated on the Google+ social network.
... We already have many different, efficient types of web crawlers, described in the citations in the previous section. All these crawlers extract data, store it, and execute queries, offering faster ways to search/crawl the web through different methods such as focused crawlers, incremental web crawlers, etc. ...
Experiment Findings
Full-text available
The world wide web is a massive, preferable and appropriate source of information, and its users are increasing exponentially, faster than ever before. The web consists of structured and unstructured data. A web crawler crawls from one page to another, fetching data and files, loading content and indexing it. An efficient crawler can get our hands on some vital information, but still: “Importance of security still not understood by the developers” (Cyber Security Tech Accord). As said above, the web consists of various structured and unstructured data containing sensitive information, and due to lack of awareness consumers are unaware of the seriousness of the threat and how they can be affected by it. Thus, a better approach is required to move forward with both entities in hand, network and security. We present the base structure for a web crawler or spider that automatically browses a website and all the hyperlinks it contains in an automated manner, followed by a security test of everything we fetched. The proposed web crawler is based on domain/directory-based web crawling to parse and go through content, while a parallel thread runs to test for vulnerabilities within all the content being discovered. In the wrong hands, it provides a faster way for a black-hat hacker to get inside the system, install malware, inject code or perform other attack vectors; in safe hands it can help to go through the whole website at once and discover all the vulnerable areas.
... Web 2.0. The advent of the user-generated content philosophy and the participatory culture that was brought by Web 2.0 sites such as blogs, forums and social media, formed a new generation of specialized crawlers that focused on forum [29][30][31][32][33][34], blog/microblog [35,36], and social media [37][38][39][40] spidering. The need for specialized crawlers for these websites emerged from the quality and creation rate of content usually found in forums/blogs, the well-defined structure that is inherent in forums/blogs that makes it possible to even develop frameworks for creating blog crawlers [41], and the implementation particularities that make other types of crawlers inappropriate or inefficient for the task. ...
Article
Full-text available
In today’s world, technology has become deep-rooted and more accessible than ever over a plethora of different devices and platforms, ranging from company servers and commodity PCs to mobile phones and wearables, interconnecting a wide range of stakeholders such as households, organizations and critical infrastructures. The sheer volume and variety of the different operating systems, the device particularities, the various usage domains and the accessibility-ready nature of the platforms creates a vast and complex threat landscape that is difficult to contain. Staying on top of these evolving cyber-threats has become an increasingly difficult task that presently relies heavily on collecting and utilising cyber-threat intelligence before an attack (or at least shortly after, to minimize the damage) and entails the collection, analysis, leveraging and sharing of huge volumes of data. In this work, we put forward inTIME, a machine learning-based integrated framework that provides a holistic view of the cyber-threat intelligence process and allows security analysts to easily identify, collect, analyse, extract, integrate, and share cyber-threat intelligence from a wide variety of online sources including clear/deep/dark web sites, forums and marketplaces, popular social networks, trusted structured sources (e.g., known security databases), or other datastore types (e.g., pastebins). inTIME is a zero-administration, open-source, integrated framework that enables security analysts and security stakeholders to (i) easily deploy a wide variety of data acquisition services (such as focused web crawlers, site scrapers, domain downloaders, social media monitors), (ii) automatically rank the collected content according to its potential to contain useful intelligence, (iii) identify and extract cyber-threat intelligence and security artifacts via automated natural language understanding processes, (iv) leverage the identified intelligence to actionable items by semi-automatic entity disambiguation, linkage and correlation, and (v) manage, share or collaborate on the stored intelligence via open standards and intuitive tools. To the best of our knowledge, this is the first solution in the literature to provide an end-to-end cyber-threat intelligence management platform that is able to support the complete threat lifecycle via an integrated, simple-to-use, yet extensible framework.
... BFS is one of the simplest graph traversal algorithms for mining social network graphs. This algorithm is easy to implement and has proved to be optimal for sampling large social network graphs by crawling the OSNs (Catanese et al., 2011;Chau et al., 2007;Erlandsson et al., 2015;Mislove et al., 2007). The BFS algorithm has been employed in the first module of our crawler to extract the complete friends and FOF network of a seed user (see Figure 2). ...
Article
Full-text available
The rapid proliferation and extensive use of online social networks (OSNs) like Facebook, Twitter, Instagram, etc., has attracted the attention of academia and industry, since these networks store massive information. But acquiring data from these OSNs, a prerequisite for conducting any research on them, is a daunting task, owing to privacy concerns on the one hand and the complexity of the underlying technologies of these complex networks on the other. This paper presents the design and implementation of a crawler based on browser simulation for extraction of Facebook users' profile data while preserving privacy. The breadth-first-search (BFS) algorithm was also adopted for sampling around 0.235 million Facebook users. Though the main purpose of this work is the design of a crawler, the results are also briefly presented in terms of various social network metrics and analysed from different aspects of privacy.
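A minimal sketch of such a BFS-based friends and friends-of-friends (FOF) extraction, assuming a hypothetical get_friends(u) function that stands in for the browser-simulation scraping step:

```python
from collections import deque

def bfs_crawl(seed, get_friends, max_depth=2):
    """Breadth-first crawl from `seed`, collecting the friend and
    friend-of-friend (FOF) network. `get_friends(u)` is a hypothetical
    stand-in for the profile-scraping step."""
    edges = []
    depth = {seed: 0}
    q = deque([seed])
    while q:
        u = q.popleft()
        if depth[u] >= max_depth:
            continue
        for v in get_friends(u):
            edges.append((u, v))
            if v not in depth:  # each vertex is visited once
                depth[v] = depth[u] + 1
                q.append(v)
    return edges
```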
... The presence of social media networks increasingly bridges people to have many connections, both offline and online. Facilities for collaboration and social interaction are very popular today, as evidenced by the swelling number of social network users on sites such as Facebook, Twitter, Flickr, and so on (5). ...
Article
Abstract: National-scale HIV data show that the heterosexual group is among the main groups most at risk of HIV/AIDS. The increase is striking: in 2015 the figure was still 4,241 cases, and it more than doubled in 2016, reaching 13,063 cases. Mapping data of social media interactions, particularly in the Kendari region, show around 800 accounts interacting on gay-related topics, which is indicated to influence the prevalence of HIV/AIDS cases in Kendari City. This study aims to map the interactions of gay risk behavior as an early warning system for HIV/AIDS cases. Social Network Analysis is the study of human relationships using graph theory. Applied in an application, Social Network Analysis can depict relations between individuals by visualizing degree centrality (central points), betweenness centrality (short paths), and closeness centrality, i.e., the average shortest path, of account interactions on Facebook pages. For the Facebook platform, the calculations show that the account most influential on social network interactions is the Gay Kendari account, which leads in degree centrality, betweenness centrality, and closeness centrality. The Gay Kendari account is the most influential in Facebook social network interactions. Through social network analysis, this study provides a picture of MSM/gay risk-behavior relations as an early warning system for HIV/AIDS cases in Kendari City. Keywords: social network analysis, gay, early warning system, HIV/AIDS
... The BFS is easy to implement and efficient; it produces accurate results if applied to social graphs that can be modeled as unweighted graphs. Therefore, it has been applied in a large number of studies of the topology and structure of Online Social Networks [43,223,88,234,36]. ...
Thesis
The Internet of Things (IoT) is leading to a paradigm shift within the logistics industry. The advent of IoT has been changing the logistics service management ecosystem. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in real time while a delivery is in progress. Real-time data collection enables service providers to track and manage their shipment process efficiently. Its key advantage is that it enables providers to act proactively to prevent outcomes such as delivery delays caused by unexpected or unknown events. Furthermore, providers today tend to use data stemming from external sources such as Twitter, Facebook, and Waze, because these sources provide critical information about events such as traffic, accidents, and natural disasters. Data from such external sources enrich the dataset and add value to the analysis. Moreover, collecting them in real time provides an opportunity to use the data for on-the-fly analysis and to prevent unexpected outcomes (such as delivery delay) at run time. However, data are collected raw and need to be processed for effective analysis. Collecting and processing data in real time is an enormous challenge, mainly because data stem from heterogeneous sources at very high speed. The high speed and variety of the data create challenges for complex processing operations such as cleansing, filtering, and handling incorrect data. The variety of data (structured, semi-structured, and unstructured) raises challenges in processing data both in batch style and in real time, since different types of data may require different processing techniques. A technical framework that enables the processing of heterogeneous data is heavily challenging and not currently available. In addition, performing data processing operations in real time is heavily challenging; efficient techniques are required to handle high-speed data, which cannot be done using conventional logistics information systems. Therefore, in order to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in both real time and batch style is critically important. In this thesis, we developed and experimented with two data processing solutions: SANA and IBRIDIA. SANA is built on a Multinomial Naïve Bayes classifier, whereas IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm, a hybrid technology that enables data collection and processing in batch style and in real time. SANA is a service-based solution that deals with unstructured data. It serves as a multi-purpose system to extract relevant events, including the context of each event (such as place, location, time, etc.), and it can also be used to perform text analysis over the targeted events. IBRIDIA was designed to process unknown data stemming from external sources and cluster them on the fly in order to gain an understanding of the data, which assists in extracting events that may lead to delivery delay. According to our experiments, both approaches show a unique ability to process logistics data. However, SANA is more promising, since its underlying technology (the Naïve Bayes classifier) outperformed IBRIDIA from a performance-measurement perspective.
SANA generates graph knowledge from events collected immediately in real time, without any need to wait, thus obtaining maximum benefit from these events. IBRIDIA, in turn, is valuable within the logistics domain for identifying the most influential category of events affecting delivery. Unfortunately, with IBRIDIA we must wait for a minimum number of events to arrive, and we always face a cold start. Because we are interested in re-optimizing the route on the fly, we adopted SANA as our data processing framework.
... Also called 'social media crawling' programs (Chau et al. 2007), which collect the data published on social media through the use of one or more software agents called crawlers. 5 A hashtag is a type of label used on some web services and social networks as a thematic aggregator; its function is to make it easier for users to find messages on a specific theme or content. ...
Chapter
The web is by now an informational tool of primary importance, conveying multiple images of social reality that influence public opinion. Migration, an important phenomenon largely perceived as critical, is an object of daily debate at both the media and the political level, and is very present on the web. To understand it, citizens increasingly turn to the web as an easily accessible source of information, with all the risks, however, connected to an uncritical reading of that information. A study of how much, and which, terms are searched on Google offers an interesting overview of the image of the migrant and of the way it is explored in our present day, besides providing a measure of the influence of specific keywords in the construction of collective consciousness.
... Boldi et al. have proposed an efficient solution to improve the performance of the crawling process using multiple distributed crawlers running in parallel [3], [4]. By optimizing the number of parallel agents, the performance of the crawler can be remarkably improved [5], [6]. ...
... The set of these interactions on OSNs (Lampe et al. 2008, Olsen and Kraft 2009) forms a large volume of data (Péres et al. 2015). Processing this large volume of data requires data processing with great computational efficiency (Chau et al. 2007). ...
... Concerning the data collection and crawling procedure, the authors in [98] briefly presented a framework of parallel crawlers for OSNs employing a centralized queue. Their crawlers used a breadth-first approach for fast crawling of eBay profiles. ...
... Other approaches, such as [30,7,17,11], focus mainly on sampling cost. Specifically, [30] analyzes how rapidly a crawler can reach nodes and links; [7] proposes a framework of parallel crawlers based on Breadth First Search (BFS); [17] investigates the impact of different sampling techniques on the computation of the average node degree of a network; [11] studies several crawling strategies and determines the sampling quality guaranteed by them and the computation effort they require. ...
... Regular use of OSNs expands communication and interaction among their users, making the physical distance between their locations insignificant. This communication produces a growing volume of data, and increasingly efficient computational techniques, such as parallel and distributed processing, are needed to adequately process and retrieve OSN content (Chau et al. 2007). The use of OSNs on mobile devices accelerates the growth of this data volume (Péres et al. 2015), and may even report bio-sensing information. ...
... A study in (Chau, Pandit, Wang, & Faloutsos, 2007) presented a parallel framework to collect data from online auction websites. To further shorten the time for data collection, we created multiple user sessions on the agent machines to run multiple scripts simultaneously. ...
Article
Obtaining the desired dataset is still a prime challenge faced by researchers analyzing Online Social Network (OSN) sites. Application Programming Interfaces (APIs) provided by OSN service providers for retrieving data impose several unavoidable restrictions which make it difficult to get a desirable dataset. In this paper, we present an iMacros technology-based data crawler, IMcrawler, capable of collecting every piece of information which is accessible through a browser from the Facebook website. The proposed crawler addresses most of the challenges associated with web data extraction approaches and with most of the APIs provided by OSN service providers. Two broad sections have been extracted from a Facebook user profile, namely, 'Personal Information' and 'Wall Activities'.
... The data used in this study are secondary data, consisting of all user posts on the platforms containing country-branding content with the keyword and/or hashtag "Wonderful Indonesia" and having at least one interaction with another user (such as a like, retweet, or mention). Data from the Google Plus and Facebook platforms were collected using a data scraping technique [19] with a scraper extension in the Google Chrome browser. Data from the Twitter platform were collected using a data crawling technique [20] with R Studio, accessing the application programming interface (API). ...
Article
Full-text available
In an effort to reach the tourism industry's growth target and increase the number of foreign tourist arrivals, the Ministry of Tourism has made various efforts, one of which is to expand online marketing using social networking sites. This study models, analyzes and evaluates the process of information dissemination about the country branding "Wonderful Indonesia" on the top social networking platforms Google Plus, Twitter and Facebook, using a social network analysis approach. The networks are visualized using the undirected graph method; network property values are then calculated; and centrality values are measured to identify influential actors in the network. The results show that the interaction patterns spreading the "Wonderful Indonesia" country branding on the three top platforms are fragmented into sub-networks (communities): 37 communities on Google Plus, 272 communities on Twitter and 54 communities on Facebook. Twitter leads in six network property attributes and is therefore considered to have better dissemination performance than the networks on Google Plus and Facebook. Based on the centrality calculations, the account Tri Rini Nuringtyas on Google Plus, the account SportourismID on Twitter and the account PlaneTourIndonesia on Facebook are the most influential actors, and they can be leveraged by the Ministry of Tourism of the Republic of Indonesia to increase the spread of the "Wonderful Indonesia" country branding and tourism campaign.
... The collection of data generated and hosted on social media is carried out through a procedure known as 'social media crawling' (Chau et al. 2007). The actual data collection is performed by one or more software agents called crawlers. ...
Chapter
Full-text available
In this chapter we introduce the characteristics and peculiarities of social data. We focus on the critical issues involved in collecting and analyzing such data, examining some methodologies that allow social data to be used for predictive analyses or for understanding phenomena of broad social impact. The discussion developed in the remainder of this chapter illustrates the methodology used for the analyses contained in this volume, with particular reference to the scenario of the 2015 Tuscany regional elections.
... Table 2 shows an example of URI parts. Chau et al. [9] proposed parallel crawling for online social networks.
Chapter
Full-text available
The world wide web (WWW) is a huge collection of unorganized documents. To build a database from this unorganized network, web crawlers are often used. The crawler, which interacts with millions of web pages, needs to be efficient in order to make a search engine powerful. This requirement necessitates the parallelization of web crawlers. In this work, a fuzzy-based technique for uniform resource locator (URL) assignment in a dynamic web crawler is proposed that utilizes the task-splitting property of the processor. In order to optimize the performance of the crawler, the proposed scheme addresses two important aspects: (i) creation of a crawling framework with load balancing among parallel crawlers, and (ii) making the crawling process faster by using parallel crawlers with efficient network access. Several experiments were conducted to monitor the performance of the proposed scheme. The results prove the effectiveness of the proposed scheme.
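The chapter's fuzzy assignment logic is not reproduced here, but the underlying URL-partitioning idea can be illustrated with a plain hash-based assignment, a common baseline: hashing the host keeps each site with one crawler, which balances load and prevents two crawlers from fetching the same URL.

```python
import hashlib
from urllib.parse import urlparse

def assign_crawler(url, n_crawlers):
    """Host-based partitioning baseline: all URLs on one host go to the
    same crawler, so per-site politeness is easy and no URL is fetched
    by two crawlers."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers

# Example: route discovered URLs to 4 parallel crawlers.
for url in ["http://example.com/a", "http://example.org/b"]:
    print(url, "->", assign_crawler(url, 4))
```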
... An ever-increasing volume of data is produced [5] from these varied and growing forms of interaction [2,6-9]. Better retrieval and analysis of these data require more efficient computational techniques that employ parallel and distributed processing [10]. ...
Conference Paper
Social Applications (AS) work in an integrated way with Online Social Networks (OSNs) and allow both the creation of rich content to share on OSNs and the extraction and mining of the data shared on them. This paper presents the AS FitRank, which creates customizable rankings of physical activities shared on Facebook, with up to 60 variations of ranking types. It extracts data from the posts made by the main physical-activity-monitoring AS and lets users share their rankings on their Facebook profiles. Its innovation is to group users of eleven physical-activity-monitoring AS in a single ranking, enabling greater socialization among them and greater motivation for sports practice and for fighting sedentary lifestyles, encouraging its users to change their behavior toward a healthier life. FitRank works with the concept of gamification, in the form of a social competition for healthy behavior. It is part of a doctoral thesis in progress that will mine the extracted data to compose the User's Behavioral Pattern (PCU) regarding physical activity, with the goal of correlating healthy habits with physical activity and predicting the user's healthy behavior, motivated by the use of Facebook. This paper presents preliminary results of FitRank's use, with sociodemographic information as well as the physical activities performed by its users.
... Nowell et al. [6] develop approaches to link prediction based on measures for analyzing the "proximity" of nodes in a network. Chau et al. [7] developed a framework of parallel crawlers for online social networks, utilizing a centralized queue. ...
Conference Paper
Full-text available
Nowadays, online social networks have become an important part of people's everyday lives. Most people share their feelings, views, likes and dislikes using social networks. By analysing social network data properly, it is possible to identify the behavioural patterns of users. Considering this fact, in this paper we present a framework to analyse social network data to identify human behaviours. We have developed a framework for the collection and analysis of large data volumes by crawling public data obtained from the users of online social networks. The system can analyze data posted by users in two different languages. Experimental results show that social networks are a good resource for estimating people's attitudes.
... Kwak et al. [9] crawled the whole Twitter snapshot in July 2009: 42M profiles and 1.5B links. Chau et al. [5] proposed a framework of parallel crawlers with a centralized queue. Ye et al. [16] proposed greedy and lottery strategies for crawling static social graph snapshots and studied the strategies on Flickr, LiveJournal, Orkut, and Youtube. ...
Conference Paper
We study the problem of graph tracking with limited information. In this paper, we focus on updating a social graph snapshot. Say we have an existing partial snapshot, G1, of the social graph stored at some system. Over time G1 becomes out of date. We want to update G1 through a public API to the actual graph, restricted by the number of API calls allowed. Periodically recrawling every node in the snapshot is prohibitively expensive. We propose a scheme where we exploit indegrees and outdegrees to discover changes to the actual graph. When there is ambiguity, we probe the graph and verify edges. We propose a novel strategy designed for limited information that can be adapted to different levels of staleness. We evaluate our strategy against recrawling on real datasets and show that it saves an order of magnitude of API calls while introducing minimal errors.
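A simplified sketch of this degree-probing idea, under the assumption of a hypothetical API with a cheap degree query (get_degrees) and an expensive adjacency query (get_edges); the actual strategy in the paper is more refined:

```python
def update_snapshot(snapshot, get_degrees, get_edges, budget):
    """Recrawl only nodes whose reported (in, out) degrees differ from
    the stored snapshot, spending at most `budget` API calls.
    `snapshot` maps node -> (in_deg, out_deg, out_neighbors).
    `get_degrees` / `get_edges` are hypothetical API wrappers."""
    calls = 0
    for node, (in_d, out_d, _) in list(snapshot.items()):
        if calls >= budget:
            break
        calls += 1
        cur_in, cur_out = get_degrees(node)  # cheap probe
        if (cur_in, cur_out) != (in_d, out_d):
            if calls >= budget:
                break
            calls += 1
            neighbors = get_edges(node)  # expensive verification
            snapshot[node] = (cur_in, cur_out, neighbors)
    return snapshot
```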
... In this case, v1,1 is a friend of v1,2 and v1,2 is also a friend of v1,1; Type 4 is considered a bilateral friendship. In this paper, bilateral friendships were extracted using breadth-first search (BFS), a well-known graph traversal algorithm that has been widely used as a crawling strategy to extract data from social networks [8,15-19]. A BFS program was written and used to extract data from Sina Weibo. ...
Article
Full-text available
With the rapid growth of users in social networking services, data is generated in thousands of terabytes every day. Practical frameworks for data extraction from social networking sites have not been well investigated yet. In this paper, a methodology for data extraction with respect to Sina Weibo is discussed. In order to design a proper method for data extraction, the properties of complex networks and the challenges when extracting data from complex networks are discussed first. Then, the reason for choosing Sina Weibo as the data source is given. After that, the methods for data gathering are introduced and the techniques for data sampling and data clean-up are discussed. Over 1 million users and hundreds of millions of social relations between them were extracted from Sina Weibo using the methods proposed in this paper.
... These interactions [3-7] generate a large volume of data [8], and retrieving and analyzing these data requires highly efficient computational resources [9]. Facebook is the OSN with the largest number of users [10], and the sociability among its users provokes feelings and emotional contagion among them [11,12]. ...
Conference Paper
The development of Social Applications (AS) that work with bio-sensors (GPS, accelerometers, gyroscopes, heart monitors, smart wristbands and smart watches) and allow increased sharing of relevant information on social networks is growing with the popularity of social networks combined with the use of AS on mobile devices. Thus, a User's Behavioral Pattern (PCU) can be defined for a particular focus of study. This paper presents the work in progress of a doctoral dissertation that will study the temporal evolution of the use of AS in the publications on Facebook users' profiles. The case study concerns mining the publications of AS for physical activities (fitness) to correlate healthy habits with physical activity, in order to predict the user's healthy behavior and hence an improvement in their quality of life. To do this, we are developing a data extraction tool through an AS for Facebook, whose attraction is the generation of competitive rankings customized by users, which can be published to their Facebook profiles. Given the competitive nature of people, good adoption of this AS is expected, which will allow data mining to define the PCU of healthy habits; this PCU can then be used to encourage users to have a better quality of life and, in this sense, decrease physical inactivity and the risk of diseases associated with inactivity.
... For example, in 2007, Chau et al. [52] emphasized how easy it is to retrieve information from OSNs. Chau et al. described their implementation of crawlers for OSNs. ...
Article
Full-text available
The serious privacy and security problems related to online social networks (OSNs) are what fueled two complementary studies as part of this thesis. In the first study, we developed a general algorithm for the mining of data of targeted organizations by using Facebook (currently the most popular OSN) and socialbots. By friending employees in a targeted organization, our active socialbots were able to find new employees and informal organizational links that we could not find by crawling with passive socialbots. We evaluated our method on the Facebook OSN and were able to reconstruct the social networks of employees in three distinct, actual organizations. Furthermore, in the crawling process with our active socialbots we discovered up to 13.55% more employees and 22.27% more informal organizational links in contrast to the crawling process that was performed by passive socialbots with no company associations as friends. In our second study, we developed a general algorithm for reaching specific OSN users who declared themselves to be employees of targeted organizations, using the topologies of organizational social networks and utilizing socialbots. We evaluated the proposed method on targeted users from three actual organizations on Facebook, and two actual organizations on the Xing OSN (another popular OSN platform). Eventually, our socialbots were able to reach specific users with a success rate of up to 70% on Facebook, and up to 60% on Xing.
... • Describes various crawling modes, including the performance evaluation criteria for each crawling mode. (Survey table entry 2: Chau et al. [9], Parallel crawling for online social networks.)
Conference Paper
Vehicular Ad-hoc Networks (VANETs) are prone to several types of attacks due to their decentralized nature and mobility. The involvement of traffic and human beings makes it important to secure VANETs to the maximum extent. There is an urgent need to move one step beyond merely preventing attacks: we need a mechanism that can assess and analyze the risk priority of an attacker's ways of attacking and a defender's ways of defending. An attacker may have several ways to achieve the goal of causing damage, and every way has a different severity and impact. Assessing all these ways can help the defender gain insight into the psyche of the attacker, and this assessment will help society guard against such unwanted attacks. In this paper, a risk priority assessment model of the SSL SYN attack in VANETs is designed using an attack-defense tree model. The attack-defense tree model is used to analyze and present the approaches used by attackers to carry out an SSL SYN attack. A game-theoretic approach is adopted by the attacker and defender to maximize their objectives. The rationality of the attack-defense model using game theory is investigated using the risk priority number, on the basis of severity, occurrence and detection ratings. The mathematical assessment shows the impact of this work.
... The essential role played by scalability in this context was recognized quite early by Chau et al. [12]: in their work, a master machine and several parallel, independent crawling agents were used to capture users' profiles on eBay. The related results confirm that the real bottleneck in crawling performance is the download rate of the Internet connection. ...
Book
Full-text available
The Internet offers major advances for many fields, including marketing, and marketing intelligence is an important part of the marketing information system. In this study, we propose and design a marketing intelligence system that relies on the Internet as a data source, drawing on business intelligence systems and data warehouse storage techniques, in order to arrive at a methodology for the proposed system and a practical application of it to the Syrian stock exchange market, which offers integrated and reliable data. A complete ETL process was designed and built to construct a data warehouse of market data, and a Cube was designed for our case study that combines macro- and micro-level market data. Data mining algorithms, such as classification and time-series algorithms, were studied and applied. The results were good for the predictive models, which found strong classification relations among the market's internal variables, while no strong relation was found between the different variables of the research environments; good results were also obtained for predicting share prices using the time-series algorithm.
Article
Social media embeddedness relationships consist of online social networks formed by self-organized individual actors and significantly affect many aspects of our lives. Given the high cost and inefficiency of using the population networks generated by social media embeddedness relationships to study practical issues, sampling techniques have become more important than ever. Our work consists of three parts. We first comprehensively analyze current sampling selection methods, evaluation indexes, and evaluation methods in terms of technological evolution. In the second part, we systematically conduct sampling tests using representative large-scale social media datasets. The test results indicate that unequal-probability sampling methods can construct similar sample networks at the macroscale and microscale and outperform the equal-probability methods. However, non-negligible sampling errors at the mesoscale seriously affect sampling reliability and validity. MANOVA tests show that the direct cause of the sampling errors is the low in-degree nodes with medium-high betweenness located between the core and the periphery, and current sampling methods cannot accurately sample such complex interconnected structures. In the third part, we summarize the pros and cons of current sampling methods and provide suggestions for future work.
Article
Full-text available
Nowadays, millions of people use Online Social Networks (OSNs) like Twitter, Facebook and Sina Microblog to express opinions on current events. The widespread use of these OSNs has also led to the emergence of social bots. What is more, social bots can be so effective that some of them turn into influential users. In this paper, we studied the automated construction technology and infiltration strategies of social bots in Sina Microblog, aiming at building friendly and influential social bots to resist malicious interpretations. Firstly, we studied the critical technology of Sina Microblog data collection, which indicates that its defense mechanism is vulnerable. Then, we constructed 96 social bots in Sina Microblog and researched the influence of different infiltration strategies, such as different attribute settings and various types of interactions. Finally, our social bots gained 5546 followers during the 42-day infiltration period with a 100% survival rate. The results show that the proposed infiltration strategies are effective and can help social bots escape detection by the Sina Microblog defense mechanism. This study sounds an alarm for the Sina Microblog defense mechanism and provides a valuable reference for social bot detection.
Article
Full-text available
This report attempts to provide a descriptive analysis of the online social media networks Facebook, Twitter and LinkedIn based on graph theory concepts. The first part of the study focuses on identifying various activities of the above networks and constructing a suitable graph model for each social medium to represent its activities. The next part addresses the evaluation of the constructed models of each social medium separately, in order to identify their applicability for analyzing behavioral patterns and characteristics of the users. The final part proposes a method to provide relevant information on online social networks to outsiders, which will be helpful for competent decision making without violating user privacy.
Article
Online social networks (OSNs) are structures that help users to interact, exchange, and propagate new ideas. The identification of the influential users in OSNs is a significant process for accelerating the propagation of information that includes marketing applications or hindering the dissemination of unwanted contents, such as viruses, negative online behaviors, and rumors. This article presents a detailed survey of influential users’ identification algorithms and their performance evaluation approaches in OSNs. The survey covers recent techniques, applications, and open research issues on analysis of OSN connections for identification of influential users.
Article
Sampling is a commonly used technique for studying structural properties of online social networks (OSNs). Due to privacy, business, and performance concerns, OSN service providers impose limitations on data access for third parties. The implication of this practice is that one needs to come up with an applicable sampling scheme that can function under these limitations to efficiently estimate structural properties of interest. In this paper, we study how accurately some important properties of graphs can be estimated under a limited data access model. More specifically, we consider random neighbor access (RNA) model as a rather limited data access model in OSNs. In the RNA model, the only query available to get data from the studied graph is the random neighbor query which returns the id of a random neighbor for a given vertex id. We propose various sampling schemes and estimators for average degree and network size under the RNA model. We conduct extensive experiments on both real world OSN graphs and synthetic graphs (1) to measure the performance of the proposed estimators and (2) to identify the factors affecting the accuracy of our estimators. We find that while the average degree estimators can make accurate estimations with reasonable sample sizes despite the extreme data access limitations of the RNA model, network size estimators require quite large sample sizes for accurate estimations.
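To illustrate what estimation under such a restricted model can look like, here is one simple collision-based (birthday paradox) degree estimator that uses only random-neighbor queries; it is an illustration under our own assumptions, not necessarily one of the paper's estimators.

```python
from itertools import combinations

def estimate_degree(random_neighbor, node, k=200):
    """Birthday-paradox degree estimate under random-neighbor access:
    draw k random neighbors of `node`; with degree d, the expected
    number of pairwise collisions is C(k,2)/d, so d ~ C(k,2)/collisions.
    `random_neighbor(u)` is the hypothetical RNA query."""
    samples = [random_neighbor(node) for _ in range(k)]
    collisions = sum(1 for a, b in combinations(samples, 2) if a == b)
    if collisions == 0:
        return float("inf")  # degree too large to resolve with k samples
    return (k * (k - 1) / 2) / collisions

# Averaging such estimates over a set of sampled nodes gives a crude
# average-degree estimate under this access model.
```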
Article
Full-text available
The number of people and organizations using online social networks as a new way of communication is continually increasing. The messages that users write in networks and their interactions with other users leave a digital trace that is recorded. In order to understand what is going on in these virtual environments, systems are needed that collect, process, and analyze the information generated. The majority of existing tools analyze information related to an online event once it has finished or at a specific point in time (i.e., without considering an in-depth analysis of the evolution of users' activity during the event); they focus on statistics about the quantity of information generated in an event. In this article, we present a multi-agent system that automates the process of gathering data from users' activity in social networks and performs an in-depth analysis of the evolution of social behavior at different levels of granularity in online events, based on network theory metrics. We evaluated its functionality by analyzing users' activity in events on Twitter.
Conference Paper
Full-text available
Recent years have witnessed the dramatic popularity of online social networking services, in which millions of members publicly articulate mutual "friendship" relations. Guided by ethnographic research of these online communities, we have designed and implemented a visualization system for playful end-user exploration and navigation of large scale online social networks. Our design builds upon familiar node link network layouts to contribute customized techniques for exploring connectivity in large graph structures, supporting visual search and analysis, and automatically identifying and visualizing community structures. Both public installation and controlled studies of the system provide evidence of the system's usability, capacity for facilitating discovery, and potential for fun and engaged social activity
Article
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
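For a concrete feel for such metrics, the sketch below computes two commonly used quantities for a parallel crawler, overlap (redundant downloads) and coverage (fraction of reachable pages downloaded); the definitions here follow our reading of the general literature, not a verbatim reproduction of the paper's metrics.

```python
def overlap(downloads):
    """downloads: list of page-ID sets, one per crawling process.
    Overlap = (N - I) / I, where N counts all downloads and I the
    distinct pages; 0 means no page was fetched twice."""
    total = sum(len(d) for d in downloads)
    distinct = len(set().union(*downloads))
    return (total - distinct) / distinct

def coverage(downloads, universe):
    """Coverage = I / U: fraction of the U reachable pages downloaded."""
    distinct = len(set().union(*downloads))
    return distinct / len(universe)

# Example: two crawlers that both fetched page "c".
print(overlap([{"a", "b", "c"}, {"c", "d"}]))                  # 0.25
print(coverage([{"a", "b", "c"}, {"c", "d"}], set("abcdef")))  # ~0.67
```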
Figure 2. A sample eBay user profile page, listing the recent feedback received by the user. User IDs have been edited to protect privacy.

Heer, J., and Boyd, D. Vizster: Visualizing Online Social Networks. IEEE Symposium on Information Visualization (InfoVis), 2005.