Jerret Ross’s research while affiliated with IBM Research - Thomas J. Watson Research Center and other places

Publications (13)


Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets
  • Preprint

December 2019 · 41 Reads

Mingrui Liu · [...] · Jerret Ross · [...]
Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While the theory of adaptive gradient methods is well understood for minimization problems, the underlying factors driving their empirical success in min-max problems such as GANs remain unclear. In this paper, we aim to bridge this gap from both theoretical and empirical perspectives. First, we analyze a variant of Optimistic Stochastic Gradient (OSG) proposed in (Daskalakis et al., 2017) for solving a class of non-convex non-concave min-max problems and establish O(ε^-4) complexity for finding an ε-first-order stationary point; the algorithm requires invoking only one stochastic first-order oracle per iteration while matching the state-of-the-art iteration complexity of the stochastic extragradient method of (Iusem et al., 2017). Then we propose an adaptive variant of OSG named Optimistic Adagrad (OAdagrad) and reveal an improved adaptive complexity of Õ(ε^(-2/(1-α))), where Õ(·) hides a logarithmic factor of ε and α characterizes the growth rate of the cumulative stochastic gradient, with 0 ≤ α ≤ 1/2. To the best of our knowledge, this is the first work to establish adaptive complexity in non-convex non-concave min-max optimization. Empirically, our experiments show that adaptive gradient algorithms indeed outperform their non-adaptive counterparts in GAN training; moreover, this observation can be explained by the empirically observed slow growth rate of the cumulative stochastic gradient.
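The key ingredients described above are easy to state in code: the optimistic step replaces the current gradient g_t with the corrected direction 2*g_t - g_{t-1}, and OAdagrad additionally scales each coordinate by the root of the accumulated squared gradients. The NumPy sketch below is only an illustration of that idea; the function name, the toy bilinear game, and all hyperparameters are hypothetical, and the paper's exact algorithm may differ in details such as projection steps.

import numpy as np

def oadagrad_step(params, grad, prev_grad, grad_sq_sum, lr=0.1, eps=1e-8):
    # Adagrad-style accumulation of squared gradients -> per-coordinate step size.
    grad_sq_sum = grad_sq_sum + grad ** 2
    step = lr / (np.sqrt(grad_sq_sum) + eps)
    # Optimistic correction: move along 2*g_t - g_{t-1} instead of g_t.
    params = params - step * (2.0 * grad - prev_grad)
    return params, grad_sq_sum

# Hypothetical usage on the toy bilinear min-max game f(x, y) = x * y.
x, y = np.array([1.0]), np.array([1.0])
gx_prev, gy_prev = np.zeros(1), np.zeros(1)
sx, sy = np.zeros(1), np.zeros(1)
for _ in range(2000):
    gx, gy = y.copy(), -x.copy()      # descent direction for x, ascent direction for y
    x, sx = oadagrad_step(x, gx, gx_prev, sx)
    y, sy = oadagrad_step(y, gy, gy_prev, sy)
    gx_prev, gy_prev = gx, gy
print(float(x[0]), float(y[0]))       # iterates should stay bounded and drift toward (0, 0)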



Figure 2: Left and center panels: comparison of the ensembling methods on the COCO validation set using a synonym-based similarity matrix with top-K and randomized beam search. Right panel: comparison of ensembling methods when the predictions of the input models are shuffled according to the neighborhood structure defined by K. W. Barycenter ensembling is able to recover from the word shuffling and produce better captions than the simple averaging methods, which cannot exploit the provided side information. Human evaluation: We performed human evaluation on Amazon MTurk on a challenging set of images out of the context of MS-COCO (Dognin et al., 2018). We compared three ensembling techniques: arithmetic, geometric, and W. barycenter; for W. barycenter we used the similarity matrix K defined by visual word2vec (Kottur et al., 2016). For all three models we used randomized beam search. We asked MTurkers to score each caption on a scale of 1-5 and to choose the best caption based on correctness and detailedness. Caption examples are given in Fig. 6 (Appendix). Fig. 3 shows that W. barycenter has an advantage over the basic competing ensembling techniques.
Figure 5: Visualization of the word distributions of the W. barycenter for different similarity matrices K based on GloVe (rows denote the distance of K from the identity, ‖K − I‖_F, and the corresponding entropic regularization). Large entropic regularization generates a K close to an uninformative all-ones matrix, which eventually leads to a barycenter close to a uniform distribution, spreading the probability mass almost equally across all words.
Figure 6: Examples of captions for several images. BA: Wasserstein Barycenter, AM: Arithmetic mean, GM: Geometric mean, GT: Ground truth.
Wasserstein Barycenter Model Ensembling
  • Preprint
  • File available

February 2019 · 98 Reads

In this paper we propose to perform model ensembling in a multiclass or multilabel learning setting using Wasserstein (W.) barycenters. Optimal transport metrics, such as the Wasserstein distance, allow us to incorporate semantic side information such as word embeddings. Using W. barycenters to find the consensus between models allows us to balance confidence and semantics when aggregating their predictions. We show applications of Wasserstein ensembling in attribute-based classification, multilabel learning, and image caption generation. These results show that W. ensembling is a viable alternative to basic geometric or arithmetic mean ensembling.
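For intuition, the entropic-regularized W. barycenter of the models' output distributions can be computed with Sinkhorn-style iterative Bregman projections. The NumPy sketch below is a minimal illustration under that assumption; the function name and defaults are hypothetical and this is not the paper's implementation.

import numpy as np

def wasserstein_barycenter(P, K, weights=None, n_iters=200, eps=1e-16):
    # P: (n, m) array whose columns are the m model output distributions
    #    over n classes (e.g. softmax outputs over a vocabulary).
    # K: (n, n) symmetric kernel, e.g. exp(-C / reg), with the cost C built
    #    from semantic side information such as word-embedding distances.
    n, m = P.shape
    if weights is None:
        weights = np.full(m, 1.0 / m)
    u = np.ones((n, m)) / n
    for _ in range(n_iters):
        Ku = np.maximum(K @ u, eps)
        UKv = u * (K.T @ (P / Ku))     # scaled marginals, one column per model
        # Weighted geometric mean across models -> current barycenter estimate.
        bary = np.exp(np.log(np.maximum(UKv, eps)) @ weights)
        u = u * bary[:, None] / np.maximum(UKv, eps)
    return bary / bary.sum()

With K equal to the identity this reduces to the weighted geometric mean of the model distributions, while a very large entropic regularization (K approaching an all-ones matrix) flattens the barycenter toward the uniform distribution, matching the behavior described in the Figure 5 caption above.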


Citations (5)


... These biased GAI predictions may influence the final decisions made by humans. Belgodere et al. (2023) conducted a study on an admission bar exam dataset sourced from the Law School Admission Council (LSAC). This dataset records each student's personal information: gender, race, Law School Admission Test (LSAT) score, and undergraduate GPA. ...

Reference:

Potential Societal Biases of ChatGPT in Higher Education: A Scoping Review

Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

IEEE Journal on Emerging and Selected Topics in Circuits and Systems

... This would facilitate communication and the identification of cross-disciplinary opportunities, and would help chemistry to be viewed and understood globally. Achieving this goal will require drafting a map of chemical space representing all subfields of chemistry and their mutual relationships, which is not an easy task and for which multiple approaches to molecular representation, including artificial intelligence, might be required [42][43][44][45][46][47][48]. ...

Large-scale chemical language representations capture molecular structure and properties

Nature Machine Intelligence

... Many of the latest pre-trained chemical models employ self-supervised pre-training tasks on huge unlabeled datasets of 2D chemical structures. [44][45][46][47] Conversely, there are numerous instances of quasi-transfer learning involving pre-training on datasets of ab initio calculated properties of a size comparable to the available experimental datasets. 12,37 We propose extracting atomic features from models pre-trained for different chemical tasks on larger datasets, and we evaluate this approach by predicting experimental 13C chemical shifts. ...

Do Large Scale Molecular Language Representations Capture Important Structural Information?

... Unlike tabular data, where rows are independent, sequential tabular data exhibits dependent rows with dependent columns [14,28]. This data exhibits strong dynamic and static patterns, capturing local transient behavior and global identity [16]. For example, a customer's clickstream activity and the amount spent exhibit a dynamic pattern, while the type of product owned exhibits a static pattern replicated across the sequence for an extended period. ...

Tabular Transformers for Modeling Multivariate Time Series

... You et al. [31,32] detect visual concepts (regions, objects, attributes, etc.) and combine the visual features with the concepts to generate captions. Dai et al. [33][34][35] approach image captioning as conditional GAN training. Zhang et al. [36][37][38] integrate part-of-speech information to ensure captions better adhere to language habits and grammar rules. ...

Adversarial Semantic Alignment for Improved Image Captions
  • Citing Conference Paper
  • June 2019