Conference Paper

Text Style Transfer with Contrastive Transfer Pattern Mining

... Given the scarcity of parallel data (i.e., text pairs conveying the same content but differing in style) and the labor-intensive nature of annotating such pairs, existing research has predominantly focused on unsupervised TST. Recent contributions in this domain (Lee et al., 2021; Huang et al., 2021; Suzgun et al., 2022; Ramesh Kashyap et al., 2022; Han et al., 2023) have demonstrated significant progress. Despite notable success, these works primarily concentrate on the transfer of a single sentence, which we call short TST. ...
Preprint
Text style transfer (TST) aims to vary the style polarity of text while preserving the semantic content. Although recent advancements have demonstrated remarkable progress in short TST, it remains a relatively straightforward task with limited practical applications. The more comprehensive long TST task presents two challenges: (1) existing methods encounter difficulties in accurately evaluating content attributes across multiple words, leading to content degradation; (2) the conventional vanilla style classifier loss encounters obstacles in maintaining consistent style across multiple generated sentences. In this paper, we propose a novel method SC2, where a multilayer Joint Style-Content Weighed (JSCW) module and a Style Consistency loss are designed to address the two issues. The JSCW simultaneously assesses the amounts of style and content attributes within a token, aiming to acquire a lossless content representation and thereby enhancing content preservation. The multiple JSCW layers further progressively refine content representations. We design a style consistency loss to ensure that the multiple generated sentences consistently reflect the target style polarity. Moreover, we incorporate a denoising non-autoregressive decoder to accelerate the training. We conduct extensive experiments, and the results show significant improvements of SC2 over competitive baselines. Our code: https://github.com/jiezhao6/SC2.
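The style consistency idea lends itself to a small illustration. The sketch below is not the paper's exact objective; it is a minimal PyTorch instantiation of the stated goal, assuming per-sentence style-classifier logits are available: every generated sentence is pushed toward the target polarity, and disagreement between sentences is penalized.

```python
import torch
import torch.nn.functional as F

def style_consistency_loss(sentence_logits: torch.Tensor, target_style: int) -> torch.Tensor:
    """Illustrative loss over per-sentence style logits.

    sentence_logits: (num_sentences, num_styles) classifier outputs for the
    sentences generated from one document; target_style: desired polarity.
    Term (a) pushes each sentence toward the target style; term (b) penalizes
    variance of the style distributions across sentences, so the document does
    not drift in style halfway through. This is only a plausible instantiation
    of the idea described in the abstract, not the paper's formulation.
    """
    targets = torch.full(
        (sentence_logits.size(0),), target_style,
        dtype=torch.long, device=sentence_logits.device,
    )
    per_sentence = F.cross_entropy(sentence_logits, targets)   # (a)
    probs = F.softmax(sentence_logits, dim=-1)
    consistency = probs.var(dim=0).sum()                       # (b)
    return per_sentence + consistency
```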
... Text style transfer (TST) has a long history: the earliest attempts were frame language-based systems (McDonald and Pustejovsky, 1985) and schema-based natural language generation (Hovy, 1987) in the 1980s, while more recent attempts include CTPM (contrastive transfer pattern mining) (Han et al., 2023) and TST BT (text style transfer back translation) (Wei et al., 2023). The goal is to change the text style, such as formality or politeness, while preserving the sense of the input text. ...
Article
The k-means algorithm is generally the best-known and most widely used clustering method, and various extensions of k-means have been proposed in the literature. Although it is an unsupervised approach to clustering in pattern recognition and machine learning, the k-means algorithm and its extensions are always influenced by their initialization and require the number of clusters to be given a priori. That is, the k-means algorithm is not exactly an unsupervised clustering method. In this paper, we construct an unsupervised learning schema for the k-means algorithm so that it is free of initialization and parameter selection and can simultaneously find an optimal number of clusters. That is, we propose a novel unsupervised k-means (U-k-means) clustering algorithm that automatically finds an optimal number of clusters without any initialization or parameter selection. The computational complexity of the proposed U-k-means clustering algorithm is also analyzed. Comparisons between the proposed U-k-means and other existing methods are made. Experimental results and comparisons demonstrate these advantages of the proposed U-k-means clustering algorithm.
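For concreteness, a common baseline for choosing the number of clusters is to sweep k and score each clustering, which is exactly the extra selection step U-k-means is designed to avoid. The sketch below (scikit-learn k-means plus silhouette scoring) is only that baseline, not the U-k-means procedure itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_auto_k(X: np.ndarray, k_max: int = 10, seed: int = 0):
    """Pick k by silhouette score over a sweep of candidate cluster counts.

    Baseline for automatic cluster-number selection; U-k-means instead folds
    the selection into its clustering objective.
    """
    best_k, best_score, best_model = None, -1.0, None
    for k in range(2, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_k, best_score, best_model = k, score, model
    return best_k, best_model
```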
Conference Paper
This paper focuses on the task of sentiment transfer on non-parallel text, which modifies sentiment attributes (e.g., positive or negative) of sentences while preserving their attribute-independent content. Existing methods adopt an RNN encoder-decoder structure to generate a new sentence of the target sentiment word by word, which is trained on a particular dataset from scratch and has limited ability to produce satisfactory sentences. When people convert the sentiment attribute of a given sentence, a simple but effective approach is to only replace the sentiment tokens of the sentence with other expressions indicative of the target sentiment, instead of building a new sentence from scratch. Such a process is very similar to the task of Text Infilling or Cloze. With this intuition, we propose a two-step approach: Mask and Infill. In the mask step, we identify and mask the sentiment tokens of a given sentence. In the infill step, we utilize a pre-trained Masked Language Model (MLM) to infill the masked positions by predicting words or phrases conditioned on the context and target sentiment (in this paper, "content" and "context" are equivalent, and "style", "attribute", and "label" are equivalent). We evaluate our model on two review datasets, Yelp and Amazon, by quantitative, qualitative, and human evaluations. Experimental results demonstrate that our model achieves state-of-the-art performance on both accuracy and BLEU scores.
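A minimal sketch of the mask-and-infill idea with an off-the-shelf MLM follows. The sentiment lexicon and the model choice are placeholders: the paper identifies sentiment tokens with learned scores rather than a fixed word list, and fine-tunes the MLM so that infilling is conditioned on the target sentiment, which a vanilla pretrained model is not.

```python
from transformers import pipeline

# Toy lexicon standing in for the learned identification of sentiment tokens.
NEGATIVE_TOKENS = {"terrible", "awful", "bad", "worst", "boring"}

fill = pipeline("fill-mask", model="bert-base-uncased")

def mask_and_infill(sentence: str) -> str:
    """Mask the first sentiment-bearing token and infill it with a pretrained MLM."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,!?") in NEGATIVE_TOKENS:
            tokens[i] = fill.tokenizer.mask_token
            candidates = fill(" ".join(tokens))   # ranked infill candidates
            return candidates[0]["sequence"]      # highest-scoring completion
    return sentence                               # no sentiment token found

print(mask_and_infill("the screen is terrible"))
```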
Conference Paper
Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are trained less often than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available.
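The scoring primitive behind contrastive conditioning is the probability of a fixed translation hypothesis under modified versions of the source. A rough sketch, with a placeholder MT model and hand-written disambiguation cues rather than the paper's evaluation setup:

```python
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"   # placeholder model; any seq2seq MT model illustrates the idea
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name).eval()

def score(source: str, hypothesis: str) -> float:
    """Sum of token log-probabilities of a fixed hypothesis given a source."""
    enc = tok(source, return_tensors="pt")
    labels = tok(text_target=hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**enc, labels=labels).loss   # mean NLL over target tokens
    return -loss.item() * labels.size(1)

# The same hypothesis is scored under contrastive source variants that make a
# word sense explicit; the preferred variant reveals which sense the hypothesis
# expresses. The cue sentences here are hand-written toy examples.
hyp = "Er ging zur Bank, um Geld abzuheben."
print(score("He went to the bank to withdraw money.", hyp),
      score("He went to the river bank to withdraw money.", hyp))
```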
Article
We consider the task of text attribute transfer: transforming a sentence to alter a specific attribute (e.g., sentiment) while preserving its attribute-independent content (e.g., changing "screen is just the right size" to "screen is too small"). Our training data includes only sentences labeled with their attribute (e.g., positive or negative), but not pairs of sentences that differ only in their attributes, so we must learn to disentangle attributes from attribute-independent content in an unsupervised way. Previous work using adversarial methods has struggled to produce high-quality outputs. In this paper, we propose simpler methods motivated by the observation that text attributes are often marked by distinctive phrases (e.g., "too small"). Our strongest method extracts content words by deleting phrases associated with the sentence's original attribute value, retrieves new phrases associated with the target attribute, and uses a neural model to fluently combine these into a final output. On human evaluation, our best method generates grammatical and appropriate responses on 22% more inputs than the best previous system, averaged over three attribute transfer datasets: altering sentiment of reviews on Yelp, altering sentiment of reviews on Amazon, and altering image captions to be more romantic or humorous.
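The delete step can be made concrete with a salience statistic of the kind the paper describes: phrases much more frequent under one attribute than the other are treated as attribute markers and removed. A unigram-only sketch with one plausible smoothing (the paper uses longer n-grams and a tuned threshold):

```python
from collections import Counter

def salient_markers(pos_corpus, neg_corpus, lam: float = 1.0, threshold: float = 5.0):
    """Words strongly associated with the positive attribute (the 'delete' step).

    Salience of a word = (count in positive corpus + lam) / (count in negative
    corpus + lam); words above the threshold are treated as attribute markers.
    """
    pos_counts = Counter(w for s in pos_corpus for w in s.split())
    neg_counts = Counter(w for s in neg_corpus for w in s.split())
    return {w for w, c in pos_counts.items()
            if (c + lam) / (neg_counts[w] + lam) > threshold}

def delete_markers(sentence: str, markers) -> str:
    """Keep only attribute-independent content words."""
    return " ".join(w for w in sentence.split() if w not in markers)
```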
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
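The scaled dot-product attention this architecture is built on, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, fits in a few lines of PyTorch; multi-head projections, positional encodings, and the encoder-decoder stack are omitted here.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Core Transformer operation: softmax-weighted mixture of values,
    with weights given by scaled query-key dot products."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```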
Article
We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gain in speaker consistency as measured by human judges.
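A toy sketch of the persona mechanism: the speaker is represented by a learned embedding injected at every decoding step. The class name and dimensions below are illustrative; the actual model feeds the persona vector into an LSTM sequence-to-sequence decoder.

```python
import torch
import torch.nn as nn

class PersonaDecoderInput(nn.Module):
    """Adds a learned speaker embedding to each decoder input step, so the
    generated response is conditioned on who is speaking (illustrative only)."""
    def __init__(self, vocab_size: int, num_speakers: int, d_model: int = 256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.persona = nn.Embedding(num_speakers, d_model)

    def forward(self, token_ids: torch.Tensor, speaker_id: torch.Tensor):
        # token_ids: (batch, seq_len); speaker_id: (batch,)
        return self.tok(token_ids) + self.persona(speaker_id).unsqueeze(1)
```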
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has modest memory requirements, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, by which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when compared experimentally to other stochastic optimization methods.
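The update rule described in the abstract, written out as a single step over NumPy arrays; the hyper-parameter defaults follow the paper's recommendations.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and its
    elementwise square (v), bias-corrected, then a per-parameter scaled step."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction; t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```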
Dual contrastive learning: Text classification via label-aware data augmentation
  • Q Chen
  • R Zhang
  • Y Zheng
  • Y Mao
Q. Chen, R. Zhang, Y. Zheng, and Y. Mao. 2022. Dual contrastive learning: Text classification via label-aware data augmentation. arXiv preprint arXiv:2201.08702.
BERT: Pre-training of deep bidirectional transformers for language understanding
  • Jacob Devlin
  • Ming-Wei Chang
  • Kenton Lee
  • Kristina Toutanova
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171-4186.
Non-linguistic supervision for contrastive learning of sentence embeddings
  • Yiren Jian
  • Chongyang Gao
  • Soroush Vosoughi
Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2022b. Non-linguistic supervision for contrastive learning of sentence embeddings. arXiv preprint arXiv:2209.09433.
Supervised contrastive learning
  • Prannay Khosla
  • Piotr Teterwak
  • Chen Wang
  • Aaron Sarna
  • Yonglong Tian
  • Phillip Isola
  • Aaron Maschinot
  • Ce Liu
  • Dilip Krishnan
Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in neural information processing systems, 33:18661-18673.
Multiple-attribute text rewriting
  • Guillaume Lample
  • Sandeep Subramanian
  • Eric Smith
  • Ludovic Denoyer
  • Marc'Aurelio Ranzato
  • Y-Lan Boureau
Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2018. Multiple-attribute text rewriting. In International Conference on Learning Representations.
Revision in continuous space: Unsupervised text style transfer without adversarial learning
  • D Liu
  • J Fu
  • Y Zhang
  • C Pal
  • J Lv
D. Liu, J. Fu, Y. Zhang, C. Pal, and J. Lv. 2020. Revision in continuous space: Unsupervised text style transfer without adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8376-8383.
BLEU: a method for automatic evaluation of machine translation
  • Kishore Papineni
  • Salim Roukos
  • Todd Ward
  • Wei-Jing Zhu
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.
Coda: Contrast-enhanced and diversity-promoting data augmentation for natural language understanding
  • Y Qu
  • D Shen
  • Y Shen
  • S Sajeev
  • J Han
  • W Chen
Y. Qu, D. Shen, Y. Shen, S. Sajeev, J. Han, and W. Chen. 2020. Coda: Contrast-enhanced and diversity-promoting data augmentation for natural language understanding. In International Conference on Learning Representations.
Visualizing data using t-SNE
  • Laurens Van Der Maaten
  • Geoffrey Hinton
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(86):2579-2605.
Perplexity from PLM is unreliable for evaluating text quality
  • Yequan Wang
  • Jiawen Deng
  • Aixin Sun
  • Xuying Meng
Yequan Wang, Jiawen Deng, Aixin Sun, and Xuying Meng. 2022. Perplexity from PLM is unreliable for evaluating text quality. arXiv preprint arXiv:2210.05892.
Sequence level contrastive learning for text summarization
  • Shusheng Xu
  • Xingxing Zhang
  • Yi Wu
  • Furu Wei
Shusheng Xu, Xingxing Zhang, Yi Wu, and Furu Wei. 2022. Sequence level contrastive learning for text summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11556-11565.