Eric Gaussier

Eric Gaussier
  • PhD in Computer Science, Masters in Computer Science and Applied Mathematics
  • Professor (Full) at Grenoble Alpes University

About

290
Publications
90,689
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
12,568
Citations
Current institution
Grenoble Alpes University
Current position
  • Professor (Full)
Additional affiliations
Position
  • Professor (Full)
January 2008 - December 2010
Laboratoire d'Informatique de Grenoble
September 2006 - December 2011

Publications

Publications (290)
Preprint
In recent years, large language models (LLMs) have demonstrated exceptional power in various domains, including information retrieval. Most of the previous practices involve leveraging these models to create a single embedding for each query, each passage, or each document individually, a strategy exemplified and used by the Retrieval-Augmented Gen...
Preprint
The rapid development of large language models (LLMs) like Llama has significantly advanced information retrieval (IR) systems. However, using LLMs for long documents, as in RankLLaMA, remains challenging due to computational complexity, especially concerning input token length. Furthermore, the internal mechanisms of LLMs during ranking are still...
Preprint
Tree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesianadditive regression trees have been successfully used for regression problems in many applicationsand research studies. In this paper, we study ensemble versions of probabilisticregression trees that provide smooth approximations of the objective function by ass...
Conference Paper
We introduce in this survey the major concepts, models, and algorithms proposed so far to infer causal relations from observational time series, a task usually referred to as causal discovery in time series. To do so, after a description of the underlying concepts and modelling assumptions, we present different methods according to the family of ap...
Preprint
Full-text available
Constraint-based and noise-based methods have been proposed to discover summary causal graphs from observational time series under strong assumptions which can be violated or impossible to verify in real applications. Recently, a hybrid method (Assaad et al, 2021) that combines these two approaches, proved to be robust to assumption violation. Howe...
Preprint
Although neural information retrieval has witnessed great improvements, recent works showed that the generalization ability of dense retrieval models on target domains with different distributions is limited, which contrasts with the results obtained with interaction-based models. To address this issue, researchers have resorted to adversarial lear...
Article
On a wide range of natural language processing and information retrieval tasks, transformer-based models, particularly pre-trained language models like BERT, have demonstrated tremendous effectiveness. Due to the quadratic complexity of the self-attention mechanism, however, such models have difficulties processing long documents. Recent works deal...
Article
Research indicates that investors rely on various criteria to evaluate early-stage companies. However, past research in this area has focused on subsets of factors and does not distinguish between the predictors and causal determinants of start-up valuation. In our study, we applied machine learning and causal discovery to analyze a comprehensive d...
Article
Full-text available
In this study, we focus on mixed data which are either observations of univariate random variables which can be quantitative or qualitative, or observations of multivariate random variables such that each variable can include both quantitative and qualitative components. We first propose a novel method, called CMIh, to estimate conditional mutual i...
Article
Full-text available
This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raisi...
Conference Paper
This study addresses the problem of learning an extended summary causal graph from time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entr...
Article
We study here a way to approximate information retrieval metrics through a softmax-based approximation of the rank indicator function. Indeed, this latter function is a key component in the design of information retrieval metrics, as well as in the design of the ranking and sorting functions. Obtaining a good approximation for it thus opens the doo...
Preprint
Full-text available
This study addresses the problem of learning an extended summary causal graph on time series. The algorithms we propose fit within the well-known constraint-based framework for causal discovery and make use of information-theoretic measures to determine (in)dependencies between time series. We first introduce generalizations of the causation entrop...
Preprint
Full-text available
We consider in this paper the problem of predicting the ability of a startup to attract investments using freely, publicly available data. Information about startups on the web usually comes either as unstructured data from news, social networks, and websites or as structured data from commercial databases, such as Crunchbase. The possibility of pr...
Article
We introduce in this survey the major concepts, models, and algorithms proposed so far to infer causal relations from observational time series, a task usually referred to as causal discovery in time series. To do so, after a description of the underlying concepts and modelling assumptions, we present different methods according to the family of ap...
Preprint
On a wide range of natural language processing and information retrieval tasks, transformer-based models, particularly pre-trained language models like BERT, have demonstrated tremendous effectiveness. Due to the quadratic complexity of the self-attention mechanism, however, such models have difficulties processing long documents. Recent works deal...
Article
Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concep...
Chapter
We address, in the context of time series, the problem of learning a summary causal graph from observations through a model with independent and additive noise. The main algorithm we propose is a hybrid method that combines the well-known constraint-based framework for causal graph discovery and the noise-based framework that gained much attention...
Preprint
We address in this study the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new temporal mutual information measure defined on a window-based representation of time series. We then show how this measure relates to an entropy reduction principle that can be seen as a...
Preprint
Information retrieval (IR) systems traditionally aim to maximize metrics built on rankings, such as precision or NDCG. However, the non-differentiability of the ranking operation prevents direct optimization of such metrics in state-of-the-art neural IR models, which rely entirely on the ability to compute meaningful gradients. To address this shor...
Chapter
We address the problem of startup valuation from a machine learning perspective with a focus on European startups. More precisely, we aim to infer the valuation of startups corresponding to the funding rounds for which only the raised amount was announced. To this end, we mine Crunchbase, a well-established source of information on companies. We st...
Article
Full-text available
Query Expansion (QE) is widely applied to improve the retrieval performance of ad-hoc search, using different techniques and several data sources to find expansion terms. In Information Retrieval literature, selecting expansion terms remains a challenging task that relies on the extraction of term relationships. In this paper, we propose a new lear...
Article
Distributional semantic word representations are at the basis of most modern NLP systems. Their usefulness has been proven across various tasks, particularly as inputs to deep learning models. Beyond that, much work investigated fine-tuning the generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work investig...
Article
We study in this paper the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so that the two tasks are performed jointly. We pro...
Chapter
Different users may be interested in different clustering views underlying a given collection (e.g., topic and writing style in documents). Enabling them to provide constraints reflecting their needs can then help obtain tailored clustering results. For document clustering, constraints can be provided in the form of seed words, each cluster being c...
Chapter
While in standard clustering no side information is used, users might be interested in providing additional information to influence the clustering. In case of document clustering, additional information can take the form of pairwise constraints where a user provides additional information about pairs of documents as must-link and cannot-link const...
Preprint
The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora which have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, whi...
Article
Recently, several information retrieval (IR) models have been proposed in order to boost the retrieval performance using term dependencies. However, in the context of the Arabic language, most IR researchers have focused on the problem of stemming, which is highly challenging in this language. In this paper, we propose to explore whether term depen...
Article
Pseudo-relevance feedback (PRF) is a very effective query expansion approach, which reformulates queries by selecting expansion terms from top k pseudo-relevant documents. Although standard PRF models have been proven effective to deal with vocabulary mismatch between users’ queries and relevant documents, expansion terms are selected without consi...
Preprint
Full-text available
We propose in this paper a new, hybrid document embedding approach in order to address the problem of document similarities with respect to the technical content. To do so, we employ a state-of-the-art graph techniques to first extract the keyphrases (composite keywords) of documents and, then, use them to score the sentences. Using the ranked sent...
Preprint
Full-text available
This paper describes our submission to the E2E NLG Challenge. Recently, neural seq2seq approaches have become mainstream in NLG, often resorting to pre- (respectively post-) processing delexicalization (relexicalization) steps at the word-level to handle rare words. By contrast, we train a simple character level seq2seq model, which requires no pre...
Preprint
Full-text available
Tree-based ensemble methods, as Random Forests and Gradient Boosted Trees, have been successfully used for regression in many applications and research studies. Furthermore, these methods have been extended in order to deal with uncertainty in the output variable, using for example a quantile loss in Random Forests (Meinshausen, 2006). To the best...
Chapter
Query expansion approaches based on term correlation such as association rules (ARs) have proved significant improvement in the performance of the information retrieval task. However, the highly sized set of generated ARs is considered as a real hamper to select only most interesting ones for query expansion. In this respect, we propose a new learn...
Preprint
Full-text available
We study in this paper the problem of jointly clustering and learning representations. As several previous studies have shown, learning representations that are both faithful to the data to be clustered and adapted to the clustering algorithm can lead to better clustering performance, all the more so that the two tasks are performed jointly. We pro...
Article
Recurrent neural networks (RNNs) recently received considerable attention for sequence modeling and time series analysis. Many time series contain periods, e.g. seasonal changes in weather time series or electricity usage at day and night time. Here, we first analyze the behavior of RNNs with an attention mechanism with respect to periods in time s...
Conference Paper
Semantic annotation in the biomedical domain raises the problem of classifying texts with large-scale taxonomies, a problem sometimes referred to as extreme classification. In this presentation, we will give an overview of this problem and the main solutions proposed, with a focus on textual collections and the BioASQ challenge.
Article
The EASY-FCFS heuristic is the basic building block of job scheduling policies in most parallel High Performance Computing platforms. Despite its simplicity, and the guarantee of no job starvation, it could still be improved on a per-system basis. Such tuning is difficult because of non-linearities in the scheduling process. The study conducted in...
Article
Experimental results of cross-language information retrieval (CLIR) do not indicate why a model fails or how a model could be improved. One basic research question is thus whether it is possible to provide conditions by which one can evaluate any existing or new CLIR strategy analytically and one can improve the design of CLIR models. Inspired by t...
Article
Full-text available
Term mismatch is a common limitation of traditional information retrieval (IR) models where relevance scores are estimated based on exact matching of documents and queries. Typically, good IR model should consider distinct but semantically similar words in the matching process. In this paper, we propose a method to incorporate word embedding (WE) s...
Article
Predicting the diffusion of information in social networks is a key problem for applications like Opinion Leader Detection, Buzz Detection or Viral Marketing. Many diffusion models are direct extensions of the Cascade and Threshold models, initially proposed for epidemiology and social studies. In such models, the diffusion process is based on the...
Article
Comparable corpora serve as an important substitute for parallel resources in cases of under-resourced language pairs. Previous work mostly aims to find a better strategy to exploit existing comparable corpora, while ignoring the variety in corpus quality. The quality of comparable corpora affects a lot its usability in practice, a fact that has be...
Conference Paper
Full-text available
We propose here an extended attention model for sequence-to-sequence recurrent neural networks (RNNs) designed to capture (pseudo-)periods in time series. This extended attention model can be deployed on top of any RNN and is shown to yield state-of-the-art performance for time series forecasting on several univariate and multivariate time series.
Article
We study task composition in crowdsourcing and the effect of personalization and diversity on performance. A central process in crowdsourcing is task assignment, the mechanism through which workers find tasks. On popular platforms such as AMT, task assignment is facilitated by the ability to sort tasks by creation date or reward amount. Task compos...
Article
Latent factor models are increasingly popular for modeling multi-relational knowledge graphs. By their vectorial nature, it is not only hard to interpret why this class of models works so well, but also to understand where they fail and how they might be improved. We conduct an experimental survey of state-of-the-art models, not towards a purely co...
Preprint
Latent factor models are increasingly popular for modeling multi-relational knowledge graphs. By their vectorial nature, it is not only hard to interpret why this class of models works so well, but also to understand where they fail and how they might be improved. We conduct an experimental survey of state-of-the-art models, not towards a purely co...
Conference Paper
Full-text available
In this paper, we propose to complement the context vectors used in bilingual lexicon extraction from comparable corpora with concept vectors, that aim at capturing all the words related to the concepts associated with a given word. This allows one to rely on a representation that is less sparse, especially in specialized domains where the use of a...
Article
In this paper we are interested in finding good IR scoring functions by exploring the space of all possible IR functions. Earlier approaches to do so however only explore a small sub-part of the space, with no control on which part is explored and which is not. We aim here at a more systematic exploration by first defining a grammar to generate pos...
Article
Full-text available
In this paper, we study the use of recurrent neural networks (RNNs) for modeling and forecasting time series. We first illustrate the fact that standard sequence-to-sequence RNNs neither capture well periods in time series nor handle well missing values, even though many real life times series are periodic and contain missing values. We then propos...
Article
Full-text available
In statistical relational learning, knowledge graph completion deals with automatically understanding the structure of large knowledge graphs---labeled directed graphs---and predicting missing relationships---labeled edges. State-of-the-art embedding models propose different trade-offs between modeling expressiveness, and time and space complexity....
Conference Paper
Full-text available
The exchangeability assumption in topic models like Latent Dirichlet Allocation (LDA) often results in inferring inconsistent topics for the words of text spans like noun-phrases, which are usually expected to be topically coherent. We propose copulaLDA, that extends LDA by integrating part of the text structure to the model and relaxes the conditi...
Conference Paper
Traditional Information Retrieval (IR) models are based on bag-of-words paradigm, where relevance scores are computed based on exact matching of keywords. Although these models have already achieved good performance, it has been shown that most of dissatisfaction cases in relevance are due to term mismatch between queries and documents. In this pap...
Chapter
Recently, Word Embeddings have been introduced as a major breakthrough in Natural Language Processing (NLP) to learn viable representation of linguistic items based on contextual information or/and word co-occurrence. In this paper, we investigate Arabic document classification using Word and document Embeddings as representational basis rather tha...
Article
Full-text available
Multivariate time series naturally exist in many fields, like energy, bioinformatics, signal processing, and finance. Most of these applications need to be able to compare these structured data. In this context, dynamic time warping (DTW) is probably the most common comparison measure. However, not much research effort has been put into improving i...
Conference Paper
In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. As in previous studies, we propose to solve this problem through latent factorization. However, here we make use of complex valued embeddings. The composition of complex embeddings can handle a large variety of...
Conference Paper
We propose in this paper two new models for modeling topic and word-topic dependencies between consecutive documents in document streams. The first model is a direct extension of Latent Dirichlet Allocation model (LDA) and makes use of a Dirichlet distribution to balance the influence of the LDA prior parameters wrt to topic and word-topic distribu...
Conference Paper
Estimating the centroid of a set of time series under time warp is a major topic for many temporal data mining applications, as summarization a set of time series, prototype extraction or clustering. The task is challenging as the estimation of centroid of time series faces the problem of multiple temporal alignments. This work compares the major p...
Article
Full-text available
In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. As in previous studies, we propose to solve this problem through latent factorization. However, here we make use of complex valued embeddings. The composition of complex embeddings can handle a large variety of...
Article
Rank K Binary Matrix Factorization (BMF) approximates a binary matrix by the product of two binary matrices of lower rank, K. Several researchers have addressed this problem, focusing on either approximations of rank 1 or higher, using either the L1 or L2-norms for measuring the quality of the approximation. The rank 1 problem (for which the L1 and...
Article
Full-text available
We present the international training program in Data Science at master 2 level. This program is supported by both Grenoble Alpes University and Grenoble INP. In this article, we elaborate on the specific features of the program, its strategic position, operating and historical features, the detailed contents of courses and perspectives of evolutio...
Article
Full-text available
Diagonalization, or eigenvalue decomposition, is very useful in many areas of applied mathematics, including signal processing and quantum physics. Matrix decomposition is also a useful tool for approximating matrices as the product of a matrix and its transpose, which relates to unitary diagonalization. As stated by the spectral theorem, only norm...
Article
Presents information on the 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA'2015).
Article
Full-text available
In this paper, we study flat and hierarchical classification strategies in the context of largescale taxonomies. Addressing the problem from a learning-theoretic point of view, we first propose a multi-class, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an expla...
Article
Forecasting keyword activities in social networking sites has been the subject of many studies, as such activities represent, in many cases, a direct estimate of the spread of real-world phenomena, e.g. box-office revenues or flu epidemic. Most of these studies rely on point-wise, regression-like prediction algorithms and focus on few, usually unam...
Article
We present the international training program in Data Science at master 2 level. This program is supported by both Grenoble Alpes University and Grenoble INP. In this article, we elaborate on the specific features of the program, its strategic position, operating and historical features, the detailed contents of courses and perspectives of evolutio...
Article
Full-text available
To address multilingual document classification in an efficient and effective manner, we claim that a synergy between classical IR techniques such as vector model and some advanced data mining methods, especially Formal Concept Analysis, is particularly appropriate. We propose in this paper, a new statistical approach for extracting inter-language...
Article
Full-text available
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever increasing amount of data, they are characterized by uncertainties on some parameters like the job running times. The question raised in this work is: To what extent is it possible/useful to take into acc...
Conference Paper
Full-text available
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever increasing amount of data, they are characterized by uncertainties on some parameters like the job running times. The question raised in this work is: To what extent is it possible/useful to take into acc...
Conference Paper
Using the appropriate metric is crucial for the performance of most of machine learning algorithms. For this reason, a lot of effort has been put into distance and similarity learning. However, it is worth noting that this research field lacks theoretical guarantees that can be expected on the generalization capacity of the classifier associated to...
Article
Full-text available
The importance of metrics in machine learning has attracted a growing interest for distance and similarity learning, and especially the Mahalanobis distance. However, it is worth noting that this research field lacks theoretical guarantees that can be expected on the generalization capacity of the classifier associated to a learned metric. The theo...
Conference Paper
Full-text available
The paper presents a method of constructing a supervised topic model of a major conference. The supervised part is the expert information about document-topic correspondence. To exploit the expert information we use a regularization term which penalizes difference between a constructed and an expert-given model. To find optimal parameters we add th...
Conference Paper
Full-text available
Hyper-parameter tuning is a resource-intensive task when optimizing classification models. The commonly used k-fold cross validation can become intractable in large scale settings when a classifier has to learn billions of parameters. At the same time, in real-world, one often encounters multi-class classification scenarios with only a few labeled...
Conference Paper
In the context of large-scale problems, traditional multiclass classification approaches have to deal with class imbalancement and complexity issues which make them inoperative in some extreme cases. In this paper we study a transformation that reduces the initial multiclass classification of examples into a binary classification of pairs of exampl...
Article
Full-text available
The problem of summarizing a large collection of homogeneous items has been addressed extensively in particular in the case of geo-tagged datasets (e.g. Flickr photos and tags). In our work, we study the problem of summarizing large collections of heterogeneous items. For example, a user planning to spend extended periods of time in a given city wo...
Conference Paper
The problem of summarizing a large collection of homogeneous items has been addressed extensively in particular in the case of geo-tagged datasets (e.g. Flickr photos and tags). In our work, we study the problem of summarizing large collections of heterogeneous items. For example, a user planning to spend extended periods of time in a given city wo...
Conference Paper
Full-text available
In this paper we investigate the effect of the heuristic IR constraints on IR term-document scoring functions within the recently proposed function discovery framework. In the earlier study the constraints were empirically validated as a whole. Moreover, only the group of form constraints was utilized and the other prominent group, the adjustment c...
Conference Paper
Full-text available
We study here the problem of estimating the parameters of standard IR models (as BM25 or language models) on new collections without any relevance judgments, by using collections with already available relevance judgements. We propose different query representations that allow mapping queries (with and without relevance judgments, from different co...
Article
Full-text available
The importance of metrics in machine learning has attracted a growing interest for distance and similarity learning. We study here this problem in the situation where few labeled data (and potentially few unlabeled data as well) is available, a situation that arises in several practical contexts. We also provide a complete theoretical analysis of t...
Conference Paper
Full-text available
Averaging a set of time series is a major topic for many temporal data mining tasks as summarization, extracting prototype or clustering. Time series averaging should deal with the tricky multiple temporal alignment problem; a still challenging issue in various domains...

Network

Cited By