About
80
Publications
59,309
Reads
8,178
Citations
Publications (80)
Recent advances in machine learning offer new ways to represent and study scholarly works and the space of knowledge. Graph and text embeddings provide a convenient vector representation of scholarly works based on citations and text. Yet, it is unclear whether their representations are consistent or provide different views of the structure of scie...
Research and development investments are key to scientific and economic development and to the well-being of society. Because scientific research demands significant resources, national scientific investment is a crucial driver of scientific production. As scientific production becomes increasingly multinational, it is critically important to study...
Narrative is a foundation of human cognition and decision making. Because narratives play a crucial role in societal discourses and spread of misinformation and because of the pervasive use of social media, the narrative dynamics on social media can have profound societal impact. Yet, systematic and computational understanding of online narratives...
Recent advances in machine learning research have produced powerful neural graph embedding methods, which learn useful, low-dimensional vector representations of network data. These neural methods for graph embedding excel in graph machine learning tasks and are now widely adopted. However, how and why these methods work -- particularly how network...
ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from...
Social contagion is a ubiquitous and fundamental process that drives social changes. Although social contagion arises as a result of cognitive processes and biases, the integration of cognitive mechanisms with the theory of social contagion remains an open challenge. In particular, studies on social phenomena usually assume contagion dynamics to...
Visiting multiple prescribers is a common method for obtaining prescription opioids for nonmedical use and has played an important role in fueling the United States opioid epidemic, leading to increased drug use disorder and overdose. Recent studies show that centrality of the bipartite network formed by prescription ties between patients and presc...
Science is essential to innovation and economic prosperity. Although studies have shown that national scientific development is affected by geographic, historic and economic factors, it remains unclear whether there are universal structures and trajectories of national scientific development that can inform forecasting and policy-making. Here, by e...
On social media, due to complex interactions between users' attention and recommendation algorithms, the visibility of users' posts can be unpredictable and vary wildly, sometimes creating unexpected viral events for 'ordinary' users. How do such events affect users' subsequent behaviors and long-term visibility on the platform? We investigate thes...
What does science do, what could science do, and how can we make science work? If we want to know the answers to these questions, we need to be able to uncover the mechanisms of science, going beyond metrics that are easily collectible and quantifiable. In this perspective piece, we link metrics to mechanisms by demonstrating how emerging metrics of scien...
Importance
During the pandemic, access to medical care unrelated to COVID-19 was limited because of concerns about viral spread and corresponding policies. It is critical to assess how these conditions affected modes of pain treatment, given the addiction risks of prescription opioids.
Objective
To assess the trends in opioid prescription and nonp...
The COVID-19 pandemic is a global crisis that has been testing every society and exposing the critical role of local politics in crisis response. In the United States, there has been a strong partisan divide between the Democratic and Republican parties’ narratives about the pandemic which resulted in polarization of individual behaviors and diverge...
Graph embedding maps a graph into a convenient vector-space representation for graph analysis and machine learning applications. Many graph embedding methods hinge on a sampling of context nodes based on random walks. However, random walks can be a biased sampler due to the structural properties of graphs. Most notably, random walks are biased by t...
We investigate predictors of anti-Asian hate among Twitter users throughout COVID-19. With the rise of xenophobia and polarization that has accompanied widespread social media usage in many nations, online hate has become a major social issue, attracting many researchers. Here, we apply natural language processing techniques to characterize social...
This quality improvement study assesses the comorbidities associated with COVID-19 diagnostic codes in US health insurance claims.
Every second, the thoughts and feelings of millions of people across the world are recorded in the form of 140-character tweets using Twitter. However, despite the enormous potential presented by this remarkable data source, we still do not have an understanding of the Twitter population itself: Who are the Twitter users? How representative of the...
Framing is a process of emphasizing a certain aspect of an issue over the others, nudging readers or listeners towards different positions on the issue even without making a biased argument. Here, we propose FrameAxis, a method for characterizing documents by identifying the most relevant semantic axes ("microframes") that are overrepresented in th...
Background and aims:
Prescription drug seeking (PDS) from multiple prescribers is a primary means of obtaining prescription opioids; however, PDS behavior has likely evolved in response to policy shifts, and there is little agreement about how to operationalize it. We systematically compared the performance of traditional and novel PDS indicators....
Network embedding is a general-purpose machine learning technique that encodes network structure in vector spaces with tunable dimension. Choosing an appropriate embedding dimension – small enough to be efficient and large enough to be effective – is challenging but necessary to generate embeddings applicable to a multitude of tasks. Existing strat...
Effective control of an epidemic relies on the rapid discovery and isolation of infected individuals. Because many infectious diseases spread through interaction, contact tracing is widely used to facilitate case discovery and control. However, what determines the efficacy of contact tracing has not been fully understood. Here we reveal that, compa...
Recent advancements in data science technologies have allowed researchers to utilize large-scale records of human mobility to study various topics from city growth models to tracing outbreaks and analyzing the labor market. In this paper, after introducing recent studies on human mobility using transportation data, we briefly review the existing st...
Science is considered essential to innovation and economic prosperity. Understanding how nations build scientific capacity is therefore crucial to promote economic growth and national development. Although studies have shown that national scientific development is affected by geographic, historic, and economic factors, it remains unclear whether th...
Graph embedding techniques, which learn low-dimensional representations of a graph, are achieving state-of-the-art performance in many graph mining tasks. Most existing embedding algorithms assign a single vector to each node, implicitly assuming that a single representation is enough to capture all characteristics of the node. However, across many...
The COVID-19 pandemic is a global crisis that has been testing every society and exposing the critical role of local politics in crisis response. In the United States, there has been a strong partisan divide which resulted in polarization of individual behaviors and divergent policy adoption across regions. Here, to better understand this divide, w...
To what extent can we predict the structure of online conversation trees? We present a generative model to predict the size and evolution of threaded conversations on social media by combining machine learning algorithms. The model is evaluated using datasets that span two topical domains (cryptocurrency and cyber-security) and two platforms (Reddi...
Importance
In response to the increase in opioid overdose deaths in the United States, many states recently have implemented supply-controlling and harm-reduction policy measures. To date, an updated policy evaluation that considers the full policy landscape has not been conducted.
Objective
To evaluate 6 US state-level drug policies to ascertain...
Framing is an indispensable narrative device for news media because even the same facts may lead to conflicting understandings if deliberate framing is employed. Therefore, identifying media framing is a crucial step to understanding how news media influence the public. Framing is, however, difficult to operationalize and detect, and thus tradition...
We propose FrameAxis, a method of characterizing the framing of a given text by identifying the most relevant semantic axes ("microframes") defined by antonym word pairs. In contrast to the traditional framing analysis, which has been constrained by a small number of manually annotated general frames, our unsupervised approach provides much more de...
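The idea of a semantic axis defined by an antonym pair can be illustrated with a minimal sketch. It assumes the axis is taken as the difference between the embedding vectors of the two poles and that a word is scored by its cosine similarity to that axis. These choices, the function names, and the toy 3-dimensional vectors are assumptions made for illustration; the abstract does not specify them, and the sketch does not reproduce the document-level analysis.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def semantic_axis(pos_vectors, neg_vectors):
    """Axis pointing from the negative pole to the positive pole (assumed definition)."""
    pos = mean_vector(pos_vectors)
    neg = mean_vector(neg_vectors)
    return [p - q for p, q in zip(pos, neg)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy embeddings; real use would load pretrained word vectors.
toy = {
    "safe": [1.0, 0.2, 0.0],
    "dangerous": [-1.0, 0.1, 0.0],
    "secure": [0.8, 0.3, 0.1],
    "risky": [-0.7, 0.0, 0.2],
}

# Axis for the antonym pair ("safe", "dangerous").
axis = semantic_axis([toy["safe"]], [toy["dangerous"]])

score_secure = cosine(toy["secure"], axis)  # positive: closer to the "safe" pole
score_risky = cosine(toy["risky"], axis)    # negative: closer to the "dangerous" pole
```

With these toy vectors, "secure" projects onto the positive side of the axis and "risky" onto the negative side.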
We propose a method for extracting hierarchical backbones from a bipartite network. Our method leverages the observation that a hierarchical relationship between two nodes in a bipartite network is often manifested as an asymmetry in the conditional probability of observing the connections to them from the other node set. Our method estimates both...
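The asymmetry in conditional probability mentioned above can be sketched on a toy bipartite network. The sketch assumes each node on one side is represented by the set of its neighbors on the other side, and it estimates P(a | b) as the share of b's neighbors that also connect to a. The data, the names, and this naive estimator are hypothetical; the method's actual estimation procedure is truncated in the abstract and is not reproduced here.

```python
def conditional_prob(neighbors, a, b):
    """Naive estimate of P(a | b): fraction of b's neighbors also linked to a."""
    nb = neighbors[b]
    return len(neighbors[a] & nb) / len(nb) if nb else 0.0

# Hypothetical bipartite network: skills on one side, users on the other.
neighbors = {
    "programming": {"u1", "u2", "u3", "u4"},
    "python": {"u1", "u2"},
}

# Everyone linked to "python" is also linked to "programming" ...
p_general_given_specific = conditional_prob(neighbors, "programming", "python")
# ... but only half of the "programming" users are linked to "python".
p_specific_given_general = conditional_prob(neighbors, "python", "programming")

# A large positive gap suggests "programming" sits above "python" in a hierarchy.
asymmetry = p_general_given_specific - p_specific_given_general
```

Here the two conditional probabilities are 1.0 and 0.5, so the gap of 0.5 points to "programming" as the more general node.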
This paper examines network prominence in a co-prescription network as an indicator of opioid doctor shopping (i.e., fraudulent solicitation of opioids from multiple prescribers). Using longitudinal data from a large commercially insured population, we construct a network where a tie between patients is weighted by the number of shared opioid presc...
Simulating and predicting planetary-scale techno-social systems poses heavy computational and modeling challenges. The DARPA SocialSim program set the challenge to model the evolution of GitHub, a large collaborative software-development ecosystem, using massive multi-agent simulations. We describe our best performing models and our agent-based sim...
Groups of firms often achieve a competitive advantage through the formation of geo-industrial clusters. Although many exemplary clusters are the subjects of case studies, systematic approaches to identify and analyze the hierarchical structure of geo-industrial clusters at the global scale are scarce. In this work, we use LinkedIn's employment hist...
Clustering is one of the most universal approaches for understanding complex data. A pivotal aspect of clustering analysis is quantitatively comparing clusterings; clustering comparison is the basis for many tasks such as clustering evaluation, consensus clustering, and tracking the temporal evolution of clusters. In particular, the extrinsic evalu...
The nature of what people enjoy is not just a central question for the creative industry, it is a driving force of cultural evolution. It is widely believed that successful cultural products balance novelty and conventionality: they provide something familiar but at least somewhat divergent from what has come before, and occupy a satisfying middle...
The neural network is a powerful computing framework that has been exploited by biological evolution and by humans for solving diverse problems. Although the computational capabilities of neural networks are determined by their structure, the current understanding of the relationships between a neural network’s architecture and function is still pr...
Groups of firms often achieve a competitive advantage through the formation of geo-industrial clusters. Although many exemplary clusters, such as Hollywood or Silicon Valley, have been frequently studied, systematic approaches to identify and analyze the hierarchical structure of the geo-industrial clusters at the global scale are rare. In this wor...
Social media and social networking platforms have flourished with the rapid development of mobile technology and the ubiquitous use of the Internet. As a result, memes, or pieces of information spreading from person to person, can be reshared among users quickly and gain huge popularity. As viral memes have tremendous social and economic impact, de...
In this final chapter, we consider the state-of-the-art for spreading in social systems and discuss the future of the field. As part of this reflection, we identify a set of key challenges ahead. The challenges include the following questions: how can we improve the quality, quantity, extent, and accessibility of datasets? How can we extract more i...
In this chapter, we apply the theoretical framework introduced in the previous chapter to study how the modular structure of the social network affects the spreading of complex contagion. In particular, we focus on the notion of optimal modularity, that predicts the occurrence of global cascades when the network exhibits just the right amount of mo...
Because word semantics can substantially change across communities and contexts, capturing domain-specific word semantics is an important challenge. Here, we propose SEMAXIS, a simple yet powerful framework to characterize word semantics using many semantic axes in word-vector spaces beyond sentiment. We demonstrate that SEMAXIS can capture nuance...
Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, cluster...
We investigate the predictability of successful memes using their early spreading patterns in the underlying social networks. We propose and analyze a comprehensive set of features and develop an accurate model to predict future popularity of a meme given its early spreading patterns. Our paper provides the first comprehensive comparison of existin...
Food occupies a central position in every culture and it is therefore of great interest to understand the evolution of food culture. The advent of the World Wide Web and online recipe repositories have begun to provide unprecedented opportunities for data-driven, quantitative study of food culture. Here we harness an online database documenting rec...
How does network structure affect diffusion? Recent studies suggest that the answer depends on the type of contagion. Complex contagions, unlike infectious diseases (simple contagions), are affected by social reinforcement and homophily. Hence, the spread within highly clustered communities is enhanced, while diffusion across communities is hampere...
Online social networks exhibit small-world network characteristics, implying that information can spread in the network quickly and widely. This ability to spread information rapidly has led to high expectations for word-of-mouth and viral campaigns in online social networks. However, a recent study of the Flickr social network has shown that popul...
The cultural diversity of culinary practice, as illustrated by the variety of regional cuisines, raises the question of whether there are any general patterns that determine the ingredient combinations used in food today or principles that transcend individual tastes and recipes. We introduce a flavor network that captures the flavor compounds shar...
Plants have unique features that evolved in response to their environments and ecosystems. A full account of the complex cellular networks that underlie plant-specific functions is still missing. We describe a proteome-wide binary protein-protein interaction map for the interactome network of the plant Arabidopsis thaliana containing about 6200 hig...
Many systems, from power grids and the internet, to the brain and society, can be modeled using networks of coupled overlapping modules. The elements of these networks perform individual and collective tasks such as generating and consuming electrical load or transmitting data. We study the robustness of these systems using percolation theory: a ra...
Many complex systems, from power grids and the internet, to the brain and society, can be modeled using modular networks. Modules, densely interconnected groups of elements, often overlap due to elements that belong to multiple modules. The elements and modules of these networks perform individual and collective tasks such as generating and consumi...
Networks have become a key approach to understanding systems of interacting objects, unifying the study of diverse phenomena including biological organisms and human society. One crucial step when studying the structure and dynamics of networks is to identify communities: groups of related nodes that correspond to functional subunits such as protei...
Social network analysis has long been an enduring topic of sociology. However, until the era of information technology, the availability of data, mainly collected by the traditional method of personal survey, was highly limited and prevented large-scale analysis. Recently, the exploding amount of automatically generated data has completely changed...
User generated content (UGC), now with millions of video producers and consumers, is reshaping the way people watch video and TV. In particular, UGC sites are creating new viewing patterns and social interactions, empowering users to be more creative, and generating new business opportunities. Compared to traditional video-on-demand (VoD) systems,...
Identifying modular network structure is generally a problem of finding the correct community membership of each node in a network. An alternative approach, clustering links, naturally accounts for real world characteristics such as strong community overlap, multi-partite structure, and hierarchical organization. By introducing a pair-wise link sim...
Modular and hierarchical organization are two of the most important organizing principles observed in many complex networks. It has often been assumed that detecting a hierarchy also implies finding modular structure. However, highly overlapping community structure, present in many real networks including social and biological networks, interferes...
Online social networking services are among the most popular Internet services according to Alexa.com and have become a key feature in many Internet services. Users interact through various features of online social networking services: making friend relationships, sharing their photos, and writing comments. These friend relationships are expected...
User Generated Content (UGC) is re-shaping the way people watch video and TV, with millions of video producers and consumers. In particular, UGC sites are creating new viewing patterns and social interactions, empowering users to be more creative, and developing new business opportunities. To better understand the impact of UGC systems, we have a...
Social networking services are a fast-growing business on the Internet. However, it is unknown if online relationships and their growth patterns are the same as in real-life social networks. In this paper, we compare the structures of three online social networking services: Cyworld, MySpace, and orkut, each with more than 10 million users, respect...
We study the nonequilibrium phase transition in a model for epidemic spreading on scale-free networks. The model consists of two particle species A and B, and the coupling between them is taken to be asymmetric; A induces B while B suppresses A. This model describes the spreading of an epidemic on networks equipped with a reactive immune system. We...
Today's social networking services have tens of millions of users, and are growing fast. Their sheer size poses a significant challenge in capturing and analyzing their topological characteristics. Snowball sampling is a popular method to crawl and sample network topologies, but requires a high sampling ratio for accurate estimation of certain metr...
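Snowball sampling, as mentioned above, can be sketched as a breadth-first crawl that starts from a seed node and keeps adding newly discovered neighbors until a node budget is reached. The function name, the adjacency-dictionary representation, the sorted neighbor order (used only to make the result deterministic), and the toy graph are assumptions for illustration, not the crawler used in the study.

```python
from collections import deque

def snowball_sample(adj, seed, max_nodes):
    """Return up to max_nodes nodes in the order a breadth-first crawl discovers them."""
    if max_nodes <= 0:
        return []
    visited = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        # Visit neighbors in sorted order so the sample is reproducible.
        for nbr in sorted(adj.get(node, ())):
            if nbr not in visited:
                visited.add(nbr)
                order.append(nbr)
                queue.append(nbr)
                if len(order) >= max_nodes:
                    break
    return order

# Hypothetical undirected toy graph as an adjacency dictionary.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

sample = snowball_sample(toy_graph, "a", 4)
```

On this toy graph, a budget of four nodes starting from "a" yields the seed, its two neighbors, and one second-wave node; the fraction of the network covered by such a crawl is the sampling ratio the abstract refers to.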