About
270 Publications
178,226 Reads
4,338 Citations
Introduction
I am a Senior Lecturer at Queen Mary University of London, and a Fellow at the Alan Turing Institute. My research is in the area of Internet Data Science, looking at topics ranging from cybersecurity to online user behaviour.
Publications (270)
Social Web 2.0 features have become a vital component in a variety of multimedia systems, e.g., YouTube and Last.fm. Interestingly, adult video websites are also starting to adopt these Web 2.0 principles, giving rise to the term "Porn 2.0". This paper examines a large Porn 2.0 social network, through data covering 563k users. We explore a number...
Commercial Virtual Private Network (VPN) services have become a popular and convenient technology for users seeking privacy and anonymity. They have been applied to a wide range of use cases, with commercial providers often making bold claims regarding their ability to fulfil each of these needs, e.g., censorship circumvention, anonymity and protec...
The Internet has evolved into a huge video delivery infrastructure, with websites such as YouTube and Netflix appearing at the top of most traffic measurement studies. However, most traffic studies have largely kept silent about an area of the Internet that (even today) is poorly understood: adult media distribution. Whereas ten years ago, such ser...
This paper proposes a new delivery-centric abstraction, which extends the existing content-centric networking API. A delivery-centric abstraction allows applications to generate content requests agnostic to location or protocol, with the additional ability to stipulate high-level requirements regarding such things as performance, security and reso...
A content-centric network is one which supports host-to-content routing, rather than the host-to-host routing of the existing Internet. This paper investigates the potential of caching data at the router-level in content-centric networks. To achieve this, two measurement sets are combined to gain an understanding of the potential caching benefits o...
Large language models (LLMs) demonstrate the ability to simulate human decision-making processes, enabling their use as agents in modeling sophisticated social networks, both offline and online. Recent research has explored collective behavioral patterns and structural characteristics of LLM agents within simulated networks. However, empirical comp...
This paper characterizes the self-disclosure behavior of Reddit users across 11 different types of self-disclosure. We find that at least half of the users share some type of disclosure in at least 10% of their posts, with half of these posts having more than one type of disclosure. We show that different types of self-disclosure are likely to rece...
Livestreaming by VTubers -- animated 2D/3D avatars controlled by real individuals -- has recently garnered substantial global followings and achieved significant monetary success. Despite prior research highlighting the importance of realism in audience engagement, VTubers deliberately conceal their identities, cultivating dedicated fan communitie...
The Fediverse, a group of interconnected servers providing a variety of interoperable services (e.g. micro-blogging in Mastodon), has gained rapid popularity. This sudden growth, partly driven by Elon Musk's acquisition of Twitter, has, however, created challenges for administrators. This paper focuses on one particular challenge: content moderation, e...
The Web3 ecosystem is increasingly evolving to multi-chain, with decentralized applications (dApps) distributing across different blockchains, which drives the need for cross-chain bridges for blockchain interoperability. However, it further opens new attack surfaces, and media outlets have reported serious attacks related to cross-chain bridges. N...
Real-time video analytics on edge devices has gained increasing attention across a wide range of business areas. However, edge devices usually have limited computing resources. Consequently, conventional approaches to video analytics either deploy simplified models on the edge (resulting in low accuracy) or transmit video content to the cloud (resu...
Ruoyu Li, Qing Li, Tao Lin, [...], Yong Jiang
Device fingerprinting can be used by Internet Service Providers (ISPs) to identify vulnerable IoT devices for early prevention of threats. However, due to the wide deployment of middleboxes in ISP networks, some important data, e.g., 5-tuples and flow statistics, are often obscured, rendering many existing approaches invalid. It is further challeng...
The pitfalls of centralized social networks, such as Facebook and Twitter/X, have led to concerns about control, transparency, and accountability. Decentralized social networks have emerged as a result with the goal of empowering users. These decentralized approaches come with their own tradeoffs, and therefore multiple architectures exist. In this...
We present the first measurement of the user-effect and privacy impact of "Related Website Sets," a recent proposal to reduce browser privacy protections between two sites if those sites are related to each other. An assumption (both explicitly and implicitly) underpinning the Related Website Sets proposal is that users can accurately determine if...
The rise of generative AI is transforming the landscape of digital imagery, and exerting a significant influence on online creative communities. This has led to the emergence of AI-Generated Content (AIGC) social platforms, such as Civitai. These distinctive social platforms allow users to build and share their own generative AI models, thereby enh...
Harnessing the potential of large language models (LLMs) like ChatGPT can help address social challenges through inclusive, ethical, and sustainable means. In this paper, we investigate the extent to which ChatGPT can annotate data for social computing tasks, aiming to reduce the complexity and cost of undertaking web research. To evaluate ChatGPT'...
How similar are politicians to those who vote for them? This is a critical question at the heart of democratic representation and particularly relevant at times when political dissatisfaction and populism are on the rise. To answer this question we compare the online discourse of elected politicians and their constituents. We collect a two and a ha...
Threads, a new microblogging platform from Meta, was launched in July 2023. In contrast to prior new platforms, Threads was born out of an existing parent platform, Instagram, for which all users must already possess an account. This offers a unique opportunity to study platform evolution, to understand how one existing platform can support the "b...
Multimodal out-of-context news is a common type of misinformation on online media platforms. This involves posting a caption, alongside an invalid out-of-context news image. Reflecting its importance, researchers have developed models to detect such misinformation. However, a common limitation of these models is that they only consider the scenario...
The advent of 5G and interactive live broadcasting has led to a growing trend of people preferring real-time interactive video services on mobile devices, particularly mobile phones. In this work, we measure the performance of Google congestion control (GCC) in cellular networks, which is the default congestion control algorithm for Web Real-Time C...
The recent development of decentralised and interoperable social networks (such as the "fediverse") creates new challenges for content moderators. This is because millions of posts generated on one server can easily "spread" to another, even if the recipient server has very different moderation policies. An obvious solution would be to leverage mod...
An important concept in organisational behaviour is how hierarchy affects the voice of individuals, whereby members of a given organisation exhibit differing power relations based on their hierarchical position. Although there have been prior studies of the relationship between hierarchy and voice, they tend to focus on more qualitative small-scale...
Online comments within news articles are a key way people share opinions. Discovering insightful comments can, however, be challenging for readers. A solution to this problem is using comment curation, whereby professional editors select the highest quality comments manually, referred to as "editor-picks". This paper studies the growing use of...
Since the Russian invasion of Ukraine, a large volume of biased and partisan news has been spread via social media platforms. As this may lead to wider societal issues, we argue that understanding how partisan news sharing impacts users' communication is crucial for better governance of online communities. In this paper, we perform a measurement st...
There have been numerous recent attempts to “decentralize” social media platforms, loosely referred to as Web3. Such ideas, often underpinned by blockchain solutions, offer decentralized equivalents of well-known services (e.g., forums, social networks, video sharing sites, microblogs). One particularly challenging function to implement in such a d...
Organizational responsibilities can give people power but also expose them to scrutiny. This tension leads to divergent predictions about the use of potentially sensitive language: power might license it, while exposure might inhibit it. Analysis of people's language use in a large corpus of organizational emails using standardized Linguistic Inqui...
Web games, which are directly playable within web browsers, have recently garnered substantial popularity, particularly among younger demographics. The absence of a paywall for these games has raised concerns regarding potentially privacy-compromising monetization strategies. Comprehensive investigations have been carried out into domains like pai...
Uploading videos from low-cost cameras to the cloud for retrospective analysis presents challenges in privacy, network, and computation. To address these issues and achieve low latency, we propose READY, a novel client-cloud collaborative system. READY aims to enhance the quality of uploaded frames by selectively uploading only the frames relevant...
WallStreetBets (WSB), a Reddit community, has a key impact on real stock markets, as evidenced by the GameStop short squeeze in 2021. In this work, we characterise the content and user properties that impact engagement in WSB. We show that regardless of WSB's association with emojis and less formal terms, the engagement among community members depend...
The ever-growing volume of IoT traffic brings challenges to IoT anomaly detection systems. Existing anomaly detection systems perform all traffic detection on the control plane, which struggles to scale to the growing rates of traffic. In this paper, we propose HorusEye, a high throughput and accurate two-stage anomaly detection framework. In the f...
The Metaverse connects our physical reality with virtual worlds. Social VR platforms facilitate the creation of such virtual worlds, enabling activities such as interactive teaching, conferences, and community gatherings. These activities can be performed in mixed-mode, with some participants physically present in the same location. In this paper,...
On the 21st of February 2022, Russia recognised the Donetsk People's Republic and the Luhansk People's Republic, three days before launching an invasion of Ukraine. Since then, an active debate has taken place on social media, mixing organic discussions with coordinated information campaigns. The scale of this discourse, alongside the role that inf...
The term ghost booking has recently emerged as a new way to conduct humanitarian acts during the conflict between Russia and Ukraine in 2022. The phenomenon describes the events where netizens donate to Ukrainian citizens through no-show bookings on the Airbnb platform. Impressively, the social fundraising act that used to be organized on donation-...
From health to education, income impacts a huge range of life choices. Earlier research has leveraged data from online social networks to study precisely this impact. In this paper, we ask the opposite question: do different levels of income result in different online behaviors? We demonstrate it does. We present the first large-scale study of Next...
Large cloud service providers have built an increasing number of geo-distributed data centers (DCs) connected by Wide Area Networks (WANs). These DC-WANs carry both high-priority traffic from interactive services and low-priority traffic from bulk transfers. Given that a DC-WAN is an expensive resource, providers often manage it via traffic enginee...
The release of ChatGPT has uncovered a range of possibilities whereby large language models (LLMs) can substitute human intelligence. In this paper, we seek to understand whether ChatGPT has the potential to reproduce human-generated label annotations in social computing tasks. Such an achievement could significantly reduce the cost and complexity...
From health to education, income impacts a huge range of life choices. Many papers have leveraged data from online social networks to study precisely this. In this paper, we ask the opposite question: do different levels of income result in different online behaviors? We demonstrate it does. We present the first large-scale study of Nextdoor, a pop...
With the deployment of a growing number of smart home IoT devices, privacy leakage has become a growing concern. Prior work on privacy-invasive device localization, classification, and activity identification has proven the existence of various privacy leakage risks in smart home environments. However, they only demonstrate limited threats in real...
The acquisition of Twitter by Elon Musk has spurred controversy and uncertainty among Twitter users. The move raised as many praises as concerns, particularly regarding Musk's views on free speech. As a result, a large number of Twitter users have looked for alternatives to Twitter. Mastodon, a decentralized micro-blogging social network, has attra...
As an alternative to Twitter and other centralized social networks, the Fediverse is growing in popularity. The recent, and polemical, takeover of Twitter by Elon Musk has exacerbated this trend. The Fediverse includes a growing number of decentralized social networks, such as Pleroma or Mastodon, that share the same subscription protocol (Activity...
We share the largest dataset for the Pakistani Twittersphere consisting of over 49 million tweets, collected during one of the most politically active periods in the country. We collect the data after the deposition of the government by a No Confidence Vote in April 2022. This large-scale dataset can be used for several downstream tasks such as pol...
The metaverse is a network of shared virtual environments where people can interact synchronously through their avatars. To enable this, it is necessary to accurately capture and recreate (physical) human motion. This is used to render avatars correctly, reflecting the motion of their corresponding users. In large-scale environments this must be do...
Like websites, mobile apps import a range of external resources from various third-party domains. In succession, the third-party domains can further load resources hosted on other domains. For each mobile app, this creates a dependency chain underpinned by a form of implicit trust between the app and transitively connected third-parties. Hence, a s...
There has been a significant expansion in the use of online social networks (OSNs) to support people experiencing mental health issues. This paper studies the role of Instagram influencers who specialize in coaching people with mental health issues. Using a dataset of 97k posts, we characterize such users' linguistic and behavioural features. We ex...
Recent years have witnessed growing consolidation of web operations. For example, the majority of web traffic now originates from a few organizations, and even micro-websites often choose to host on large pre-existing cloud infrastructures. In response to this, the "Decentralized Web" attempts to distribute ownership and operation of web services m...
With the growth of the cryptocurrency ecosystem, there is expanding evidence that counterfeit cryptocurrency has also appeared. In this paper, we empirically explore the presence of counterfeit cryptocurrencies on Ethereum and measure their impact. By analyzing over 190K ERC-20 tokens (or cryptocurrencies) on Ethereum, we have identified 2,117 coun...
The "Decentralised Web" (DW) is an evolving concept, which encompasses technologies aimed at providing greater transparency and openness on the web. The DW relies on independent servers (aka instances) that mesh together in a peer-to-peer fashion to deliver a range of services (e.g. micro-blogs, image sharing, video streaming). However, toxic conte...
Reddit consists of sub-communities that cover a focused topic. This paper provides a list of relevant subreddits for the ongoing Russo-Ukrainian crisis. We perform an exhaustive subreddit exploration using keyword search and shortlist 12 subreddits as potential candidates that contain nominal discourse related to the crisis. These subreddits contai...
The Domain Name System (DNS) is fundamental to the operation of the Internet. Providing an up-to-date view of DNS behavior in-the-wild is thus important for various Internet stakeholders. Among the behavioral characteristics, failure is one of the most important aspects, because failures within DNS can have a dramatic impact on the wider Internet, mos...
The Internet Engineering Task Force (IETF) has developed many of the technical standards that underpin the Internet. The standards development process followed by the IETF is open and consensus-driven, but is inherently both a social and political activity, and latent influential structures might exist within the community. Exploring and understand...
Hate speech has proliferated on social media platforms in recent years. While this has been the focus of many studies, most works have exclusively focused on a single language, generally English. Low-resourced languages have been neglected due to the dearth of labeled resources. These languages, however, represent an important portion of the data d...
Social media is often used to disseminate information during crises, including wars, natural disasters and pandemics. This paper discusses the challenges faced during crisis situations, which social media can both contribute to and ameliorate. We discuss the role that information polarisation plays in exacerbating problems. We then discuss how cert...