Guang Xiang’s research while affiliated with Carnegie Mellon University and other places


Publications (14)


Figure 1: Illustration of the λ+v operator. The light gray boxes show the parallel span and the dark boxes show the span's Viterbi alignment. In this example, the parallel message contains a "translation" of a b to A B.
Figure 2: Precision, recall, and accuracy curves for parallel data detection. The y-axis denotes the score for each metric, and the x-axis denotes the percentage of highest-scoring sentence pairs kept.
Microblogs as Parallel Corpora
  • Data

August 2013 · 282 Reads · 43 Citations

Guang Xiang · Chris Dyer · [...]

In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others "retweet" translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at
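
To make the detection idea concrete, the following is a minimal sketch of an IBM Model 1-style lexical score for judging whether a candidate (source, target) span pair is parallel; the toy translation table, the tokenization, and the NULL-word handling are illustrative assumptions, not the paper's actual model or data.

```python
import math

def model1_score(src_tokens, tgt_tokens, t_table, floor=1e-6):
    """Average log-probability that tgt_tokens translate src_tokens
    (IBM Model 1 with a NULL source word; unseen pairs get a floor)."""
    score = 0.0
    for tgt in tgt_tokens:
        # Sum translation probability over every source position plus NULL.
        p = sum(t_table.get((src, tgt), floor) for src in src_tokens + ["<NULL>"])
        score += math.log(p / (len(src_tokens) + 1))
    return score / max(len(tgt_tokens), 1)

# Toy lexical table: P(english_word | chinese_word)
t_table = {("你好", "hello"): 0.9, ("世界", "world"): 0.8}
# A high (less negative) score suggests the span pair is parallel.
print(model1_score(["你好", "世界"], ["hello", "world"], t_table))
```

In practice such a score would be computed over candidate span pairs within a message, keeping the best-scoring span if it clears a threshold.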

Figure 1: User posts containing keywords for the start of Radiation. Event keywords are in bold and temporal expressions are in italics.
Extracting Events with Informal Temporal References in Personal Histories in Online Communities

August 2013 · 126 Reads · 14 Citations

We present a system for extracting the dates of illness events (year and month of the event occurrence) from posting histories in the context of an online medical support community. A temporal tagger retrieves and normalizes dates mentioned informally in social media to actual month and year referents. Building on this, an event date extraction system learns to integrate the likelihood of candidate dates extracted from time-rich sentences with temporal constraints extracted from event-related sentences. Our integrated model achieves 89.7% of the maximum performance given the performance of the temporal expression retrieval step.
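
As a rough illustration of the date normalization step, the sketch below resolves two common informal patterns ("N months ago", "last <Month>") to a (year, month) pair anchored at the post's timestamp; the patterns and resolution rules are hypothetical simplifications, not the paper's tagger.

```python
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def normalize(mention, post_date):
    """Resolve an informal date mention to (year, month), or None."""
    m = re.match(r"(\d+) months? ago", mention)
    if m:  # e.g. "3 months ago": count back from the posting month
        y, mo = post_date.year, post_date.month - int(m.group(1))
        while mo <= 0:
            y, mo = y - 1, mo + 12
        return (y, mo)
    m = re.match(r"last (\w+)", mention)
    if m and m.group(1).lower() in MONTHS:
        mo = MONTHS[m.group(1).lower()]
        # "last March": the most recent March strictly before the post month.
        y = post_date.year if mo < post_date.month else post_date.year - 1
        return (y, mo)
    return None

print(normalize("3 months ago", date(2013, 8, 15)))  # (2013, 5)
print(normalize("last March", date(2013, 8, 15)))    # (2013, 3)
```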



Leveraging high-level and low-level features for multimedia event detection

October 2012 · 118 Reads · 69 Citations

This paper addresses the challenge of Multimedia Event Detection by proposing a novel method for fusing high-level and low-level features based on collective classification. The method consists of three steps: training a classifier from low-level features; encoding high-level features into graphs; and diffusing the scores on the established graphs to obtain the final prediction. The final prediction is derived from multiple graphs, each of which corresponds to a high-level feature. The paper investigates two graph construction methods, using logarithmic and exponential loss functions respectively, and two collective classification algorithms, i.e., Gibbs sampling and Markov random walk. The theoretical analysis demonstrates that the proposed method converges and is computationally scalable, and the empirical analysis on the TRECVID 2011 Multimedia Event Detection dataset validates its outstanding performance compared to state-of-the-art methods, with an added benefit of interpretability.
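
The diffusion step can be sketched as iterative score smoothing over a similarity graph built from one high-level feature, in the spirit of the Markov-random-walk variant; the toy graph, damping factor, and fixed iteration count below are assumptions, not the paper's construction.

```python
import numpy as np

def diffuse(W, scores, alpha=0.85, iters=50):
    """Smooth initial classifier scores over a graph:
    s <- alpha * P s + (1 - alpha) * s0, with P the row-normalized adjacency."""
    P = W / W.sum(axis=1, keepdims=True)  # transition matrix
    s = scores.copy()
    for _ in range(iters):
        s = alpha * (P @ s) + (1 - alpha) * scores
    return s

# Toy 3-video graph: video 0 is similar to videos 1 and 2.
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
low_level_scores = np.array([0.9, 0.1, 0.2])  # from the low-level classifier
print(diffuse(W, low_level_scores))  # neighbors of the confident video are pulled up
```

With one such graph per high-level feature, the smoothed score vectors would then be combined into the final prediction.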


Detecting offensive tweets via topical feature discovery over a large scale twitter corpus

October 2012 · 733 Reads · 315 Citations

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, it achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using logistic regression, significantly outperforming the popular keyword-matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline, at about 3.77%. Our approach provides an alternative to the large-scale hand-annotation efforts required by fully supervised learning approaches.
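
A minimal sketch of the pipeline's general shape, using scikit-learn stand-ins: topics are learned from unlabeled tweets, and each tweet's topic mixture then serves as the feature vector for a logistic regression classifier. The corpus, topic count, and model settings below are toy placeholders, not the paper's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# A huge unlabeled crawl in the paper; four toy tweets here.
unlabeled = ["you are such a troll", "lovely day for a walk",
             "stop being a troll", "what a lovely day"]
labeled = ["you troll", "lovely walk"]  # small labeled set
labels = [1, 0]                         # 1 = offensive, 0 = clean

vec = CountVectorizer()
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.fit_transform(unlabeled))      # topics learned without labels

X = lda.transform(vec.transform(labeled))  # topic mixtures as features
clf = LogisticRegression().fit(X, labels)
print(clf.predict_proba(X)[:, 1])          # probability a tweet is offensive
```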


A Supervised Approach to Predict Company Acquisition with Factual and Topic Features Using Profiles and News Articles on TechCrunch

January 2012 · 2,418 Reads · 49 Citations

Proceedings of the International AAAI Conference on Web and Social Media

Merger and Acquisition (M&A) prediction has been an interesting and challenging research topic over the past few decades. However, past work has only adopted numerical features in building models, and the valuable textual information from the great variety of social media sites has not been touched at all. To fully explore this information, we used the profiles and news articles for companies and people on TechCrunch, the leading and largest public database for the tech world, which anybody can edit. Specifically, we explored topic features via topic modeling techniques, as well as a set of other novel features of our design, within a machine learning framework. We conducted experiments of the largest scale in the literature, and achieved a high true positive rate (TP) between 60% and 79.8% with a false positive rate (FP) mostly between 0% and 8.3% over company categories with a small number of missing attributes in the CrunchBase profiles.
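
The hybrid-feature idea can be sketched as concatenating numeric profile attributes with per-company topic mixtures before classification; the feature names, toy values, and choice of a random forest below are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy factual features per company: [funding_rounds, num_competitors, age_years]
factual = np.array([[3, 5, 7.0],
                    [1, 2, 2.5]])
# Toy topic mixtures derived from each company's news articles (e.g. LDA output)
topics = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
y = [1, 0]  # 1 = acquired, 0 = not acquired

X = np.hstack([factual, topics])  # hybrid feature vector
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict_proba(X)[:, 1])
```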


CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

September 2011 · 826 Reads · 541 Citations

ACM Transactions on Information and System Security

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is frail in terms of new phish, and the latter suffers from the scarcity of effective features and a high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms. Specifically, we propose CANTINA+, the most comprehensive feature-based approach in the literature, including eight novel features, which exploits the HTML Document Object Model (DOM), search engines, and third-party services with machine learning techniques to detect phish. Moreover, we designed two filters to help reduce the FP and achieve a runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate. We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish and over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
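
The login form filter lends itself to a short, self-contained sketch: pages with no password input are passed through as legitimate before any classification runs. The version below uses Python's standard html.parser and is a simplification of the DOM inspection the paper describes.

```python
from html.parser import HTMLParser

class LoginFormDetector(HTMLParser):
    """Track whether the page contains a form with a password field."""
    def __init__(self):
        super().__init__()
        self.has_login_form = False
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._in_form = True
        elif self._in_form and tag == "input":
            if (dict(attrs).get("type") or "").lower() == "password":
                self.has_login_form = True

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False

def needs_classification(html):
    """Only pages with a login form go on to the feature-based classifier."""
    detector = LoginFormDetector()
    detector.feed(html)
    return detector.has_login_form

print(needs_classification('<form><input type="password"></form>'))  # True
print(needs_classification('<p>No form here</p>'))                    # False
```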


Modeling People’s Place Naming Preferences in Location Sharing

September 2010 · 42 Reads · 51 Citations

Most location sharing applications display people's locations on a map. However, people use a rich variety of terms to refer to their locations, such as "home," "Starbucks," or "the bus stop near my house." Our long-term goal is to create a system that can automatically generate appropriate place names based on real-time context and user preferences. As a first step, we analyze data from a two-week study involving 26 participants in two different cities, focusing on how people refer to places in location sharing. We derive a taxonomy of different place naming methods, and show that factors such as a person's perceived familiarity with a place and the entropy of that place (i.e. the variety of people who visit it) strongly influence the way people refer to it when interacting with others. We also present a machine learning model for predicting how people name places. Using our data, this model is able to predict the place naming method people choose with an average accuracy higher than 85%.
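
As a sketch of the prediction task, the toy model below maps context features such as perceived familiarity and place entropy to a naming method; the feature encoding, labels, and the decision-tree classifier are illustrative assumptions rather than the paper's model.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy rows: [familiarity (1-5), place entropy (0-1), sharing with close tie (0/1)]
X = [[5, 0.1, 1],   # very familiar, low-entropy place, close tie
     [2, 0.9, 0],   # unfamiliar, high-entropy public place, weak tie
     [4, 0.5, 1],
     [1, 0.8, 0]]
y = ["semantic", "address", "semantic", "address"]  # place naming methods

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[5, 0.2, 1]]))  # likely "semantic" (e.g. "home")
```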


A Hierarchical Adaptive Probabilistic Approach for Zero Hour Phish Detection

September 2010 · 44 Reads · 25 Citations

Lecture Notes in Computer Science

Phishing attacks are a significant threat to users of the Internet, causing tremendous economic loss every year. In combating phish, industry relies heavily on manual verification to achieve a low false positive rate, which, however, tends to be slow in responding to the huge volume of unique phishing URLs created by toolkits. Our goal here is to combine the best aspects of human verified blacklists and heuristic-based methods, i.e., the low false positive rate of the former and the broad and fast coverage of the latter. To this end, we present the design and evaluation of a hierarchical blacklist-enhanced phish detection framework. The key insight behind our detection algorithm is to leverage existing human-verified blacklists and apply the shingling technique, a popular near-duplicate detection algorithm used by search engines, to detect phish in a probabilistic fashion with very high accuracy. To achieve an extremely low false positive rate, we use a filtering module in our layered system, harnessing the power of search engines via information retrieval techniques to correct false positives. Comprehensive experiments over a diverse spectrum of data sources show that our method achieves 0% false positive rate (FP) with a true positive rate (TP) of 67.15% using search-oriented filtering, and 0.03% FP and 73.53% TP without the filtering module. With incremental model building capability via a sliding window mechanism, our approach is able to adapt quickly to new phishing variants, and is thus more responsive to the evolving attacks.
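
The shingling idea fits in a few lines: hash overlapping word n-grams of a page and compare shingle sets with Jaccard resemblance. The shingle size, Python's built-in hash, and the 0.5 threshold below are toy choices, not the paper's parameters.

```python
def shingles(text, k=4):
    """Set of hashed k-word shingles of a page's text."""
    words = text.split()
    return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

def resemblance(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

blacklisted = "please verify your paypal account by entering your password"
variant = "please verify your paypal account by entering your password today"
unrelated = "welcome to my personal cooking blog and recipe collection"

print(resemblance(shingles(blacklisted), shingles(variant)) > 0.5)    # True
print(resemblance(shingles(blacklisted), shingles(unrelated)) > 0.5)  # False
```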


A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval

October 2009 · 156 Reads · 134 Citations

Phishing is a significant security threat to the Internet, which causes tremendous economic loss every year. In this paper, we propose a novel hybrid phish detection method based on information extraction (IE) and information retrieval (IR) techniques. The identity-based component of our method detects phishing webpages by directly discovering the inconsistency between their identity and the identity they are imitating. The keywords-retrieval component utilizes IR algorithms, exploiting the power of search engines to identify phish. Our method requires no training data and no prior knowledge of phishing signatures or specific implementations, and is thus able to adapt quickly to constantly appearing new phishing patterns. Comprehensive experiments over a diverse spectrum of data sources with 11449 pages show that both components have a low false positive rate and that the stacked approach achieves a true positive rate of 90.06% with a false positive rate of 1.95%.
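
A toy sketch of the identity-consistency check: extract the brand a page claims to be and test whether it appears in the serving domain. The brand extractor here (the most frequent capitalized token) is a deliberately crude stand-in for the paper's information extraction component.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def claimed_identity(page_text):
    """Toy identity extractor: the most frequent capitalized word."""
    brands = re.findall(r"\b[A-Z][a-zA-Z]+\b", page_text)
    return Counter(w.lower() for w in brands).most_common(1)[0][0]

def identity_consistent(url, page_text):
    """Flag pages whose claimed brand never appears in their own domain."""
    domain = urlparse(url).netloc.lower()
    return claimed_identity(page_text) in domain

page = "Welcome to PayPal. PayPal protects your account. Sign in to PayPal."
print(identity_consistent("https://www.paypal.com/signin", page))      # True
print(identity_consistent("http://secure-login.example.net/a", page))  # False: likely phish
```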


Citations (13)


... In a similar work, Wen and Rose (2012) investigated the behavioral trajectory of participants by analyzing cancer-related events in online medical discourse. In subsequent work, Wen et al. (2013) created a temporal tagger to extract cancer-related event dates to explore treatment trajectories. ...

Reference:

Characterizing Information Seeking Events in Health-Related Social Discourse
Extracting Events with Informal Temporal References in Personal Histories in Online Communities

... Subsequently, Wei et al. (2008) pioneer a perspective centered on the acquisition of technological patents, introducing technological variables and utilizing ensemble learning algorithms to predict whether candidate target companies are likely to be acquired. Building upon this foundation, Xiang et al. (2021) further enrich the literature on predicting M&A targets. They incorporate valuable textual information extracted from news articles on Tech-Crunch to construct textual variables, supplementing the existing financial, managerial, and technological variables. ...

A Supervised Approach to Predict Company Acquisition with Factual and Topic Features Using Profiles and News Articles on TechCrunch

Proceedings of the International AAAI Conference on Web and Social Media

... The earliest automatic hate speech detection systems relied on different linguistic features such as lexical and syntactic representations (Chen et al., 2012), template-based and parts-of-speech (POS) tagging (Warner and Hirschberg, 2012), topic modelling (Xiang et al., 2012), or a combination of lexical, POS, character bigram and term frequency-inverse document frequency (Tf-idf) representations (Dinakar et al., 2012). With a focus on hate speech in English, the model performance of these early systems yielded moderate results with limited applications to other language conditions (Jahan and Oussalah, 2023). ...

Detecting offensive tweets via topical feature discovery over a large scale twitter corpus
  • Citing Conference Paper
  • October 2012

... Blitzer et al. [13] used structural correspondence learning to model the correlation between data features in different fields, and utilized key features for discrimination. Jiang et al. [14] further introduced latent space into multi-perspective data and extended its features to address the differences caused by multi-perspective data. A schematic illustration is shown in Figure 1 (a). ...

Leveraging high-level and low-level features for multimedia event detection
  • Citing Conference Paper
  • October 2012

... There are several studies and datasets which focus on the translation of social media texts, such as TweetMT (San Vicente et al., 2016), the tweet corpus proposed by Mubarak et al. (2020) and the Weibo corpus developed by Ling et al. (2013). However, none of these focus on the translation of emotions. ...

Microblogs as Parallel Corpora

... Due to this problem, various anti-phishing approaches have been proposed. These include feature-based techniques [2], [3], blacklist-based approaches [4], [5], [6], [7], and content-based approaches applying machine learning algorithms [8], [2]. However, false positive rates remain high, causing inaccuracy in online transactions. ...

Modeling Content from Human-Verified Blacklists for Accurate Zero-Hour Phish Detection
  • Citing Article

... Another body of research focuses on the examination of phishing sites and server characteristics and relies on blacklisting. Some of these works leverage crowdsourcing [25], [26] and reputation systems [27] to improve accuracy and speed. While these solutions have proven to be suitable against general phishing and known threats, they face significant limitations against spear-phishing, as blacklists do not generalize well to unknown threats [28]. ...

Smartening the Crowds: Computational Techniques for Improving Human Verification to Fight Phishing Scams

... To overcome the issues previously discussed, several papers suggest methods for creating and maintaining blacklists and whitelists (see, e.g., [20], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33]). The details of the main research efforts offered in this context are discussed in what follows, whereas an overview of these efforts is presented in Table 2. ...

A Hierarchical Adaptive Probabilistic Approach for Zero Hour Phish Detection
  • Citing Conference Paper
  • September 2010

Lecture Notes in Computer Science

... The literature divides these methods into two main groups: fully-automated methods and those with human assistance. Until the advent of advanced methods, video summarization techniques, such as Clustering-based Video Summarization [9] and Attention-based Video Summarization [19], were applied to the problem of automatic movie trailer generation. Because of this fact, all the approaches, which focus on movie trailer generation, were using video summarization techniques as competitors in the evaluation stage. ...

Clever Clustering vs. Simple Speed-Up for Summarizing BBC Rushes
  • Citing Conference Paper
  • September 2007

... Even though prior work has emphasized users' need to adapt the location accuracy [6,10,11], our study participants did not feel the need for changing the granularity of their shared location. Six participants explored the feature but all of them set it back to the default high accuracy before the end of the study, wondering why one would share their location while at the same time disguising it by reduced accuracy. ...

Modeling People’s Place Naming Preferences in Location Sharing
  • Citing Conference Paper
  • September 2010