Preprint

Linking Social Media Posts to News with Siamese Transformers

Authors:
Preprints and early-stage research may not have been peer reviewed yet.
To read the file of this research, you can request a copy directly from the author.

Abstract

Many computational social science projects examine online discourse surrounding a specific trending topic. These works often involve the acquisition of large-scale corpora relevant to the event in question to analyze aspects of the response to the event. Keyword searches present a precision-recall trade-off and crowd-sourced annotations, while effective, are costly. This work aims to enable automatic and accurate ad-hoc retrieval of comments discussing a trending topic from a large corpus, using only a handful of seed news articles.

No file available

Request Full-text Paper PDF

To read the file of this research,
you can request a copy directly from the author.

ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
This paper is part of a wider work studying the co-influences of Twitter and the production of information by traditional media. A strong prerequisite of this study is to collect, with the limitations of the Twitter API, tweets linked to media events that are representative of the real Twitter activity. This paper describes two proposed approaches to handle this important task. The first one, inspired by information retrieval , puts the focus on query formulation. It consists on bridging the vocabulary gap between traditional news articles and tweets, by iteratively modifying the queries sent to the Twitter API depending on the tweets retrieved by previous queries. The second approach consists in streaming a representative sample of all emitted tweets and dynamically clustering them in events. We also discuss approaches to evaluate the collected datasets under the point of view of their representativity of the real activity on Twitter.
Article
Full-text available
We propose sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, we show how its Jacobian can be efficiently computed, enabling its use in a network trained with backpropagation. Then, we propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. We reveal an unexpected connection between this new loss and the Huber classification loss. We obtain promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieve a similar performance as the traditional softmax, but with a selective, more compact, attention focus.
Article
Full-text available
Recurrent neural networks (RNNs) have proven to be powerful models in problems involving sequential data. Recently, RNNs have been augmented with "attention" mechanisms which allow the network to focus on different parts of an input sequence when computing their output. We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that it can solve some long-term memory problems (specifically, those where temporal order doesn't matter). In fact, we show empirically that our model can solve these problems for sequence lengths which are both longer and more widely varying than the best results attained with RNNs.
Article
Full-text available
We present a new technique called "t-SNE" that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualiza-tions produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets.
Article
Full-text available
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large data sets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a data set with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, data sets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale data set, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale data sets that could not be handled before.
Article
Full-text available
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
Article
With the rapid proliferation of social media, increasingly more people express their opinions and reviews (user-generated content (UGC)) on recent news articles through various online services, such as news portals, forums, discussion groups, and microblogs. Clearly, identifying hot topics that users greatly care about can improve readers’ news browsing experience and facilitate research into interaction analysis between news and UGC. Furthermore, it is of great benefit to public opinion monitoring and management for both industry and government agencies. However, it is extremely time consuming, if not impossible, to manually examine the large amount of available social content. In this article, we formally define the news comment alignment problem and propose a novel framework that: (1) automatically extracts topics from a given news article and its associated comments, (2) identifies and extends positive examples with different degrees of confidence using three methods (i.e., hypersphere, density, and cluster chain), and (3) completes the alignment between news sentences and comments through a weighted-SVM classifier. Extensive experiments show that our proposed framework significantly outperforms state-of-the-art methods.
Conference Paper
In recent years, the rise of social media such as Twitter has been changing the way people acquire information. Meanwhile, traditional information sources such as news articles are still irreplaceable. These have led to a new branch of study on understanding the relationship between news articles and social media posts and fusing information from these heterogeneous sources. In this paper, we present a system that is able to effectively and efficiently link news and relevant tweets. Specifically, given a news stream and a tweet stream, the system discovers tweets that are relevant to each news in the news stream.
Article
It is important to obtain public opinion about a news article. Microblogs such as Twitter are popular and an important medium for people to share ideas. An important portion of tweets are related to news or events. Our aim is to find tweets about newspaper reports and measure the popularity of these reports on Twitter. However, it is a challenging task to match informal and very short tweets with formal news reports. In this study, we formulate this problem as a supervised classification task. We propose to form a training set using tweets containing a link to the news and the content of the same news article. We preprocess tweets by removing unnecessary words and symbols and apply stemming by means of morphological analysers. We apply binary classifiers and anomaly detection to this task. We also propose a textual similarity-based approach. We observed that preprocessing of tweets increases accuracy. The textual similarity method obtains results with the highest recognition rate. Success increases in some cases when report text is used with tweets containing a link to the news report within the training set of classification studies. We propose that this study, which is made directly in consideration of tweet texts that measure the trends of national newspaper reports on social media, has a higher significance when compared to Twitter analyses made by using a hashtag. Given the limited number of scientific studies on Turkish tweets, this study makes a contribution to the literature.
Conference Paper
Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus completes the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.
Article
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
Conference Paper
With the rapid proliferation of social media, more and more people freely express their opinions (or comments) on news, products, and movies through online services such as forums, discussion groups, and microblogs. Those comments may be concerned with different aspects (topics) of the target Web document (e.g., a news page). It would be interesting to align the social comments to the corresponding subtopics contained in the Web document. In this paper, we propose a novel framework that is able to automatically detect the subtopics from a given Web document, and also align the associated social comments with the detected subtopics. This provides a new view of the Web standard document and its associated user generated content through topics, which facilitates the readers to quickly focus on those hot topics or grasp topics that they are interested in. Extensive experiments show that our proposed framework significantly outperforms the existing state-of-the-art methods in social content alignment.
Conference Paper
Comments constitute an important part of Web 2.0. In this paper, we consider comments on news articles. To simplify the task of relating the comment content to the article content the comments are about, we propose the idea of showing comments alongside article segments and explore automatic mapping of comments to article segments. This task is challenging because of the vocabulary mismatch between the articles and the comments. We present supervised and unsupervised techniques for aligning comments to segments the of article the comments are about. More specifically, we provide a novel formulation of supervised alignment problem using the framework of structured classification. Our experimental results show that structured classification model performs better than unsupervised matching and binary classification model.
Conference Paper
We propose a new paradigm for displaying comments: showing comments alongside parts of the article they correspond to. We evaluate the effectiveness of various approaches for this task and show that a combination of bag of words and topic models performs the best.
Neural article pair modeling for wikipedia sub-article matching
  • Muhao Chen
  • Changping Meng
  • Gang Huang
  • Carlo Zaniolo
Muhao Chen, Changping Meng, Gang Huang, and Carlo Zaniolo. Neural article pair modeling for wikipedia sub-article matching. Lecture Notes in Computer Science, page 3-19, 2019.
Learning phrase representations using rnn encoder-decoder for statistical machine translation
  • Kyunghyun Cho
  • Caglar Bart Van Merrienboer
  • Dzmitry Gulcehre
  • Fethi Bahdanau
  • Holger Bougares
  • Yoshua Schwenk
  • Bengio
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Going beyond corr-lda for detecting specific comments on news & blogs
  • Trapit Mrinal Kanti Das
  • Chiranjib Bansal
  • Bhattacharyya
Mrinal Kanti Das, Trapit Bansal, and Chiranjib Bhattacharyya. Going beyond corr-lda for detecting specific comments on news & blogs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14, pages 483-492, New York, NY, USA, 2014. ACM.
Bert: Pre-training of deep bidirectional transformers for language understanding
  • Jacob Devlin
  • Ming-Wei Chang
  • Kenton Lee
  • Kristina Toutanova
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2018.
  • Qipeng Guo
  • Xipeng Qiu
  • Pengfei Liu
  • Yunfan Shao
  • Xiangyang Xue
  • Zheng Zhang
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-transformer, 2019.
Bridging vocabularies to link tweets and news
  • Tuan-Anh Hoang-Vu
  • Aline Bessa
  • Luciano Barbosa
  • Juliana Freire
Tuan-Anh Hoang-Vu, Aline Bessa, Luciano Barbosa, and Juliana Freire. Bridging vocabularies to link tweets and news.
Convolutional neural networks for sentence classification
  • Yoon Kim
Yoon Kim. Convolutional neural networks for sentence classification. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
Pytorch metric learning
  • Kevin Musgrave
  • Ser-Nam Lim
  • Serge Belongie
Kevin Musgrave, Ser-Nam Lim, and Serge Belongie. Pytorch metric learning. https:// github.com/KevinMusgrave/pytorch_metric_learning, 2019.
Sampling matters in deep embedding learning
  • Chao-Yuan
  • R Wu
  • Alexander J Manmatha
  • Philipp Smola
  • Krahenbuhl
Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017.
Interactive attention for semantic text matching
  • Sendong Zhao
  • Yong Huang
  • Chang Su
  • Yuantong Li
  • Fei Wang
Sendong Zhao, Yong Huang, Chang Su, Yuantong Li, and Fei Wang. Interactive attention for semantic text matching. arXiv preprint arXiv:1911.06146, 2019.