Polyvios Pratikakis’s research while affiliated with University of Crete and other places


Publications (62)


Analysis of Server Throughput For Managed Big Data Analytics Frameworks
  • Preprint

June 2025

Emmanouil Anagnostakis

·

Polyvios Pratikakis

Managed big data frameworks, such as Apache Spark and Giraph, demand a large amount of memory per core to process massive datasets effectively. The memory pressure that arises from big data processing leads to high garbage collection (GC) overhead. Big data analytics frameworks attempt to remove this overhead by offloading objects to storage devices. At the same time, infrastructure providers, trying to address the same problem, allocate more memory per instance, leaving cores underutilized. For frameworks, avoiding GC by offloading to storage devices leads to high Serialization/Deserialization (S/D) overhead. For infrastructure, the result is decreased resource usage. These limitations prevent managed big data frameworks from effectively utilizing the CPU, thus leading to low server throughput. We conduct a methodological analysis of server throughput for managed big data analytics frameworks. More specifically, we examine whether reducing GC and S/D can help increase the effective CPU utilization of the server. We use a system called TeraHeap that moves objects from the Java managed heap (H1) to a secondary heap over a fast storage device (H2) to reduce the GC overhead and eliminate S/D over data. We focus on analyzing the system's performance under the co-location of multiple memory-bound instances to utilize all available DRAM and study server throughput. Our detailed methodology includes choosing the DRAM budget for each instance and how to distribute this budget between H1 and the Page Cache (PC). We try two different distributions of the DRAM budget, one with more H1 and one with more PC, to study the needs of both approaches. We evaluate both techniques under three different memory-per-core scenarios using Spark and Giraph with native JVM or JVM with TeraHeap. We do this to check how throughput changes when memory capacity increases.



TeraHeap: Exploiting Flash Storage for Mitigating DRAM Pressure in Managed Big Data Frameworks

October 2024

·

11 Reads

ACM Transactions on Programming Languages and Systems

Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive datasets that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back for processing. In this article, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of objects in analytics frameworks. TeraHeap relies on three concepts: (1) It eliminates S/D by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing analytics frameworks to leverage object knowledge to populate H2. (3) It reduces GC cost by fencing the collector from scanning H2 objects while maintaining the illusion of a single managed heap, ensuring memory safety. We implement TeraHeap in OpenJDK8 and OpenJDK17 and evaluate it with fifteen widely used applications in two real-world big data frameworks, Spark and Giraph. We find that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph. Also, it can still provide better performance by consuming up to 4.6× and 1.2× less DRAM than native Spark and Giraph, respectively. TeraHeap can also be used for in-memory frameworks, and applying it to the Neo4j Graph Data Science library improves its performance by up to 26%. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid DRAM-NVM memories, by up to 69%.



Russo-Ukrainian War: Prediction and explanation of Twitter suspension

June 2023

·

22 Reads

On 24 February 2022, Russia invaded Ukraine, starting what is now known as the Russo-Ukrainian War and initiating an online discourse on social media. Twitter, as one of the most popular social networks, with an open and democratic character, enables a transparent discussion among its large user base. Unfortunately, this often leads to violations of Twitter's policies, propaganda, abusive actions, civil integrity violations, and consequently to the suspension and deletion of user accounts. This study focuses on the Twitter suspension mechanism and the analysis of shared content and features of the user accounts that may lead to suspension. Toward this goal, we have obtained a dataset containing 107.7M tweets, originating from 9.8M users, using the Twitter API. We extract the categories of shared content of the suspended accounts and explain their characteristics through the extraction of text embeddings in conjunction with cosine similarity clustering. Our results reveal scam campaigns taking advantage of trending topics regarding the Russo-Ukrainian conflict for Bitcoin and Ethereum fraud, spam, and advertisement campaigns. Additionally, we apply a machine learning methodology including a SHapley Additive explainability model to understand and explain how user accounts get suspended.
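
The abstract above mentions clustering text embeddings by cosine similarity. As a rough illustration only, the sketch below groups vectors whose cosine similarity to a cluster's first member exceeds a threshold. The greedy strategy, the function names, the 0.9 threshold, and the toy two-dimensional vectors are all assumptions made for this example; the paper's actual embedding model and clustering procedure are not described on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def threshold_clusters(vectors, threshold=0.9):
    """Greedy clustering (an assumed, simplified scheme).

    Each vector joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster. Returns a list of clusters, each a list of vector indices.
    """
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy "embeddings" (hypothetical data): two near-parallel pairs.
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
toy_clusters = threshold_clusters(toy_vectors)
```

In practice the vectors would come from a text embedding model, and a purpose-built clustering library would likely replace the greedy loop.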


BotArtist: Twitter bot detection Machine Learning model based on Twitter suspension

May 2023

·

57 Reads

·

1 Citation

Twitter, as one of the most popular social networks, offers a means for communication and online discourse, which unfortunately has been the target of bots and fake accounts, leading to the manipulation and spreading of false information. Towards this end, we gather a challenging, multilingual dataset of social discourse on Twitter, originating from 9M users regarding the recent Russo-Ukrainian war, in order to detect the bot accounts and the conversation involving them. We collect the ground truth for our dataset through the Twitter API suspended accounts collection, containing approximately 343K bot accounts and 8M normal users. Additionally, we use a dataset provided by Botometer-V3 with 1,777 Varol, 483 German accounts, and 1,321 US accounts. Besides the publicly available datasets, we also collect two independent datasets around popular discussion topics of the 2022 energy crisis and the 2022 conspiracy discussions. Both datasets were labeled according to the Twitter suspension mechanism. We build a novel ML model for bot detection using the state-of-the-art XGBoost model. We combine the model with a high volume of tweets labeled according to the Twitter suspension mechanism ground truth. This requires a limited set of profile features, allowing labeling of the dataset in different time periods from the collection, as it is independent of the Twitter API. In comparison with Botometer, our methodology achieves an average 11% higher ROC-AUC score over two real-case scenario datasets.


Fig. 1. Example of a Twitter list.
Fig. 2. Illustrative user-list graph containing 11 users (brand and celebrity accounts) grouped into three Twitter lists.
Fig. 3. Illustration of the data set's similarity cloud.
Fig. 4. List-user graph depicting three brand users and three celebrity users from our test set.
Fig. 5. Illustration of the Jaccard-normalized list similarity.


Mining Twitter lists to extract brand-related associative information for celebrity endorsement

May 2023

·

36 Reads

·

2 Citations

European Journal of Operational Research



What Tweets and YouTube comments have in common? Sentiment and graph analysis on data related to US elections 2020

January 2023

·

202 Reads

·

18 Citations

Most studies analyzing political traffic on Social Networks focus on a single platform, while campaigns and reactions to political events produce interactions across different social media. Ignoring such cross-platform traffic may lead to analytical errors, missing important interactions across social media that, for example, explain the cause of trending or viral discussions. This work links Twitter and YouTube social networks using cross-postings of video URLs on Twitter to discover the main tendencies and preferences of the electorate, distinguish users and communities’ favouritism towards an ideology or candidate, study the sentiment towards candidates and political events, and measure political homophily. This study shows that Twitter communities correlate with YouTube comment communities: that is, Twitter users belonging to the same community in the Retweet graph tend to post YouTube video links with comments from YouTube users belonging to the same community in the YouTube Comment graph. Specifically, we identify Twitter and YouTube communities, measure their similarities and differences, and show the interactions and the correlation between the largest communities on YouTube and Twitter. To achieve that, we have gathered a dataset of approximately 20M tweets and the comments of 29K YouTube videos; we present the volume, the sentiment, and the communities formed in YouTube and Twitter graphs, and publish a representative sample of the dataset, as allowed by the corresponding Twitter policy restrictions.


Discovery and Classification of Twitter Bots

May 2022

·

258 Reads

·

7 Citations

SN Computer Science

Online social networks (OSN) are used by millions of users daily. This user base shares and discovers different opinions on popular topics. The social influence of large groups may be affected by user beliefs or be attracted by the interest in particular news or products. A large number of users, gathered in a single group or number of followers, increases the probability of influencing more OSN users. Botnets, collections of automated accounts controlled by a single agent, are a common mechanism for exerting maximum influence. Botnets may be used to better infiltrate the social graph over time and create an illusion of community behaviour, amplifying their message and increasing persuasion. This paper investigates Twitter botnets, their behavior, their interaction with user communities, and their evolution over time. We analyze a dense crawl of a subset of Twitter traffic, amounting to nearly all interactions by Greek-speaking Twitter users for a period of 36 months. The collected users are labeled as botnets, based on long-term and frequent content similarity events. We detect over a million events where seemingly unrelated accounts tweeted nearly identical content at almost the same time. We filter these concurrent content injection events and detect a set of 1850 accounts that repeatedly exhibit this pattern of behavior, suggesting that they are fully or in part controlled and orchestrated by the same entity. We find botnets that appear for brief intervals and disappear, as well as botnets that evolve and grow, spanning the duration of our dataset. We analyze the statistical differences between the bot accounts and the human users, as well as the botnet interactions with the user communities and the Twitter trending topics.
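
The abstract above describes detecting events where different accounts post nearly identical content at almost the same time, then keeping accounts that do so repeatedly. The sketch below is one simplified way this idea might look, not the paper's method: it treats "nearly identical" as an exact match after lowercasing and whitespace normalization, and the 60-second window, the `min_events` cutoff, the function names, and the sample tweets are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace (a stand-in for fuzzy matching)."""
    return " ".join(text.lower().split())


def find_copy_events(tweets, window_seconds=60):
    """Find sets of distinct users posting the same text close in time.

    tweets: iterable of (user, timestamp_in_seconds, text).
    Returns a list of user sets, one per detected event.
    """
    by_text = defaultdict(list)
    for user, ts, text in tweets:
        by_text[normalize(text)].append((ts, user))

    events = []
    for posts in by_text.values():
        posts.sort()
        group = [posts[0]]
        for ts, user in posts[1:]:
            # Extend the group while within the window of its first post.
            if ts - group[0][0] <= window_seconds:
                group.append((ts, user))
                continue
            users = {u for _, u in group}
            if len(users) > 1:
                events.append(users)
            group = [(ts, user)]
        users = {u for _, u in group}
        if len(users) > 1:
            events.append(users)
    return events


def repeated_accounts(events, min_events=2):
    """Accounts that take part in at least `min_events` copy events."""
    counts = defaultdict(int)
    for users in events:
        for user in users:
            counts[user] += 1
    return {user for user, n in counts.items() if n >= min_events}


# Hypothetical sample data: accounts "a" and "b" copy each other twice.
sample_tweets = [
    ("a", 0, "Buy now!"),
    ("b", 10, "buy  NOW!"),
    ("c", 500, "buy now!"),
    ("a", 1000, "Hello world"),
    ("b", 1005, "hello world"),
    ("d", 1003, "something else"),
]
sample_events = find_copy_events(sample_tweets)
suspects = repeated_accounts(sample_events)
```

Account "c" posts the same text as "a" and "b" but outside the time window, so it is not counted; a real analysis would need fuzzy text matching and thresholds tuned to the data.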


Citations (36)


... In addition, we identify topics related to cryptocurrencies and NFTs posted by spam accounts attempting to exploit popular hashtags which is also supported by related work [53]. ...

Reference:

Exploring Crisis-Driven Social Media Patterns: A Twitter Dataset of Usage During the Russo-Ukrainian War
Russo-Ukrainian War: Prediction and explanation of Twitter suspension

... M-Turk's database includes around 500,000 participants across 200 countries, which improves generalizability [29]. M-Turk has received considerable attention from scholars publishing in journals in the business, management, and engineering fields due to the reach of its database, including the European Journal of Operational Research [58], Industrial Marketing Management [71], and Decision Support Systems [30]. Subject recruitment and experiment design can considerably impact the quality of data [76]. ...

Mining Twitter lists to extract brand-related associative information for celebrity endorsement

European Journal of Operational Research

... Column Cache [36] utilizes Parquet format and execution plan information in Spark SQL for efficient data reading. TeraHeap [37] reduces overhead by using garbage collection hints when serializing and deserializing objects temporarily in data analytics frameworks. Nozawa et al. [38] use format information to optimize sharing of Apache Arrow files across multiple processes. ...

TeraHeap: Reducing Memory Pressure in Managed Big Data Frameworks
  • Citing Conference Paper
  • March 2023

... On the other hand, the comments related to the emotional design of the kindergarten teacher's PD videos did not go beyond the ethical value framework and there were no negative comments as topics such as news and politics (La Gatta et al., 2023;Shevtsov et al., 2023), marketing (Xiao, 2023), and myths and folklore (Zarenti & Katsadoros, 2023); which receive higher trolling with sarcastic or suspicious comments according to YouTube (YouTube, 2022a). This is consistent with Cook et al. (2023) who indicated that social media, including YouTube, is always a target for trolls. ...

What Tweets and YouTube comments have in common? Sentiment and graph analysis on data related to US elections 2020

... further analysis. Relating to the removal of bots (i.e., automated tweets), it is known that bot-generated tweets often exhibit recurring patterns in their content [69], such as in advertisements and services (like weather or traffic news). Consequently, in the next step, we removed bot-generated tweets with frequently repeated word pairs for 12 landmarks. ...

Discovery and Classification of Twitter Bots

SN Computer Science

... Using a bigger dataset (57.3 million tweets, 7.7 million users), Shevtsov et al. investigated the frequency of tweets in the Russian-Ukrainian War (Shevtsov et al., 2022). In an analysis of tweets from December 31, 2021 to March 3, 2022 of the same war, Agarwal et al. used the "bing" dictionary. ...

Twitter Dataset on the Russo-Ukrainian War

... Likewise, the authors agree in considering the retweet the most important indicator of the interest or news value of messages on the social network, as well as of the processes of social influence and persuasion (Lim and Lee-Won, 2017; Moya-Sánchez and Herrera-Damas, 2016; Park and Kaye, 2017; Sotiropoulos et al., 2019; Suh et al., 2010; Wu and Shen, 2015), recognizing that the retweet allows a greater audience reach beyond one's direct followers in an unpredictable way, and is thus more significant than the number of followers, which only represents a number of potential readers (Moya-Sánchez and Herrera-Damas, 2016). ...

TwitterMancer: Predicting User Interactions on Twitter
  • Citing Conference Paper
  • September 2019

... In the past years, experimental studies [12], [28]- [30], [70] have been conducted that provide qualitative insights into the characteristics of different graph partitioning algorithms to help selecting one. However, the studies are not sufficient for automatic partitioner selection for a given scenario and do not include an automatic method to incorporate new partitioning and graph processing algorithms. ...

Cut to Fit: Tailoring the Partitioning to the Computation

... GSMB was the first algorithm to learn an MB without requiring a complete Bayesian network. Inter-IAMB [86] alternates between the phases of IAMB, while FBED [83] and PFBP [89] are more advanced versions of IAMB. Simultaneous MB learning algorithms are computationally efficient as they minimize the number of tests performed; however, they require more data for each test, making them less effective for small sample sizes. ...

A greedy feature selection algorithm for Big Data of high dimensionality

Machine Learning

... Researchers should be aware that disclosing Twitter data breaches terms of service and is subject to banning from the site. Due to this punitive practice, well-studied datasets such as the Edinburgh Twitter Corpus [13] and SNAP [14] are no longer publicly available. The unavailability of public Twitter data seriously affects measuring the reproducibility of current surveys. ...

twAwler: A lightweight twitter crawler
  • Citing Article
  • April 2018