Mihai Lupu

Research Studios Austria · Research Studio Data Science

About

148
Publications
21,317
Reads
1,277
Citations
Additional affiliations
June 2011 - present
TU Wien
Position
  • Researcher


Publications (148)
Article
Full-text available
Machine learning research, particularly in genomics, is often based on wide shaped datasets, i.e. datasets having a large number of features, but a small number of samples. Such configurations raise the possibility of chance influence (the increase of measured accuracy due to chance correlations) on the learning process and the evaluation results....
Preprint
Full-text available
Machine learning research, particularly in genomics, is often based on wide shaped datasets, i.e. datasets having a large number of features, but a small number of samples. Such configurations raise the possibility of chance influencing the learning process and the evaluation results. Prior research underlined the problem of generalization of model...
Chapter
Personal data is a necessity in many fields for research and innovation purposes, and when such data is shared, the data controller carries the responsibility of protecting the privacy of the individuals contained in their dataset. The removal of direct identifiers, such as full name and address, is not enough to secure the privacy of individuals a...
Article
Image retrieval has been an active research domain for over 30 years and historically it has focused primarily on precision as an evaluation criterion. Similar to text retrieval, where the number of indexed documents became large and many relevant documents exist, it is of high importance to highlight diversity in the search results to provide bett...
Chapter
We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use...
Preprint
We replicate recent experiments attempting to demonstrate an attractive hypothesis about the use of the Fisher kernel framework and mixture models for aggregating word embeddings towards document representations and the use of these representations in document classification, clustering, and retrieval. Specifically, the hypothesis was that the use...
Conference Paper
Extensive research in de-anonymisation has shown that in datasets not containing any personally identifying information (PII)—name, address, etc.—individuals can be identified through quasi-identifiers (QIs)—attributes whose combination serves as a unique identifier. In order to deal with this issue, necessary anonymisation measures need to be take...
Article
Full-text available
The empirical nature of Information Retrieval (IR) mandates strong experimental practices. A keystone of such experimental practices is the Cranfield evaluation paradigm. Within this paradigm, the collection of relevance judgments has been the subject of intense scientific investigation. This is because, on one hand, consistent, precise, and numero...
Chapter
Full-text available
Machine learning research, e.g. genomics research, is often based on sparse datasets that have very large numbers of features, but small sample sizes. Such a configuration promotes the influence of chance on the learning process as well as on the evaluation. Prior research underlined the problem of generalization of models obtained based on such dat...
Conference Paper
We motivate the need for, and describe the contents of a novel patent research collection, publicly available and for free, covering multimodal and multilingual data from six patent authorities. The new patent test collection complements existing patent test collections, which are vertical (one domain or one authority over many years). Instead, the...
Article
We take a look back at the conferences we attended in 2018. As every year, we try to cover a mix of small and large, academic and industry conferences, to get the pulse of our patent information industry, but also to make sure that developments in computer-based information processing are on our radar. The buzz around anything “AI” that we observed...
Chapter
The training and use of word embeddings for information retrieval has recently gained considerable attention, showing competitive performance across various domains. In this study, we explore the use of word embeddings for patent retrieval, a challenging domain, especially for methods based on distributional semantics. We hypothesize that the previ...
Article
To support the objectives of the journal, to publish new research and insights covering a broad spectrum of Intellectual Property information retrieval and patent analytics related practices and methods, the editors, together with the team at IFI CLAIMS ® Patent Services, have put together a patent research collection, publicly available and for fr...
Preprint
Over the recent years, the availability of datasets containing personal, but anonymized information has been continuously increasing. Extensive research has revealed that such datasets are vulnerable to privacy breaches: being able to reveal sensitive information about individuals through deanonymization methods. Here, we provide a taxonomy of the...
Article
Full-text available
Since its inception in 2013, one of the key contributions of the CLEF eHealth evaluation campaign has been the organization of an ad-hoc information retrieval (IR) benchmarking task. This IR task evaluates systems intended to support laypeople searching for and understanding health information. Each year the task provides registered participants wi...
Article
Full-text available
Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the se...
Article
Full-text available
Nowadays, there is a proliferation of available information sources from different modalities—text, images, audio, video and more. Information objects are not isolated anymore. They are frequently connected via metadata, semantic links, etc. This leads to various challenges in graph-based information retrieval. This paper is concerned with the reac...
Article
We take a look back at the conferences we attended in 2017. There was a mix between small and large, academic and industry, that gives a broad and exciting view of the patent information usage landscape. The industry conferences have demonstrated, once again, the hot buzz around Industry 4.0 and anything “AI”, while the academic conferences show th...
Chapter
This chapter reports on the results and provides a brief overview of the topics addressed by the 25 lectures and 8 industrial talks given in the three Training Schools organized in the scope of the KEYSTONE (Semantic KEYword-based Search on sTructured data sOurcEs) COST action IC1302.
Article
Full-text available
We explore the use of unsupervised methods in Cross-Lingual Word Sense Disambiguation (CL-WSD) with the application of English to Persian. Our proposed approach targets the languages with scarce resources (low-density) by exploiting word embedding and semantic similarity of the words in context. We evaluate the approach on a recent evaluation bench...
Conference Paper
Full-text available
Every year more than 25 test collections are built among the main Information Retrieval (IR) evaluation campaigns. They are extremely important in IR because they become the evaluation praxis for the forthcoming years. Test collections are built mostly using the pooling method. The main advantage of this method is that it drastically reduces the nu...
Conference Paper
Exploitation of term relatedness provided by word embedding has gained considerable attention in recent IR literature. However, an emerging question is whether this sort of relatedness fits to the needs of IR with respect to retrieval effectiveness. While we observe a high potential of word embedding as a resource for related terms, the incidence o...
Article
Full-text available
Recent advances in neural word embedding provide significant benefit to various information retrieval tasks. However, as shown by recent studies, adapting the embedding models for the needs of IR tasks can bring considerable further improvements. The embedding models in general define the term relatedness by exploiting the terms' co-occurrences in s...
Chapter
Information retrieval on the (social) web moves from a pure term-frequency-based approach to an enhanced method that includes conceptual multimodal features on a semantic level. In this paper, we present an approach for semantic-based keyword search and focus especially on its optimization to scale it to real-world sized collections in the social m...
Article
Readers of this journal are well aware that automation technology has played a significant role in searching for patent information and, as artificial intelligence is once again (after the first, 1960s, and second, 1980s, golden eras of AI) a trending topic at both academic and industry conferences, the editorial team of this journal would like to...
Conference Paper
Word embedding promises a quantification of the similarity between terms. However, it is not clear to what extent this similarity value can be of practical use for subsequent information access tasks. In particular, which range of similarity values is indicative of the actual term relatedness? We first observe and quantify the uncertainty of word e...
Conference Paper
Full-text available
Query Auto Completion is the task of suggesting queries to the users of a search engine while they are typing a query in the search box. Over the recent years there has been a renewed interest in research on improving the quality of this task. The published improvements were assessed by using offline evaluation techniques and metrics. In this paper...
Conference Paper
Full-text available
The empirical nature of Information Retrieval (IR) mandates strong experimental practices. The Cranfield/TREC evaluation paradigm represents a keystone of such experimental practices. Within this paradigm, the generation of relevance judgments has been the subject of intense scientific investigation. This is because, on one hand, consistent, precis...
Conference Paper
Full-text available
Recent studies have reconsidered the way we operationalise the pooling method, by considering the practical limitations often encountered by test collection builders. The biggest constraint is often the budget available for relevance assessments and the question is how best – in terms of the lowest pool bias – to select the documents to be assessed...
Chapter
The NII Testbeds and Community for Information access Research (NTCIR) has been the first benchmarking campaign that created a test collection specifically for patent retrieval, in 2001/2002. Over the course of just over a decade, organisers and participants at NTCIR patent-related challenges have addressed the problem of mono- and multilingual pat...
Chapter
This chapter is the counterpart of the preceding chapter. It gives an overview of some of the most important terms and concepts used in search technology and information retrieval (IR) today. We hope it can be useful to readers who are not researchers in these areas. After a short dip into the history of the field, we start with a high level overvi...
Chapter
In this chapter we make some predictions for patent search in about 10 years’ time—in 2026. We base these predictions on the contents of the earlier part of the book, the observed differences between this second edition and the first edition of the book as well as on some data and trends not well represented in the book (for one reason or another)....
Chapter
Millions of existing patent documents and journal articles dealing with chemistry describe chemical structures by way of structure images (so-called Kekulé structures). While being human-readable, these structure images cannot be interpreted by a computer and are unusable in the context of most chemoinformatics applications: structure and substruct...
Article
Full-text available
In this paper we show how the performance of tweet clustering can be improved by leveraging character-based neural networks. The proposed approach overcomes the limitations related to the vocabulary explosion in the word-based models and allows for the seamless processing of the multilingual content. Our evaluation results and code are available on...
Conference Paper
Full-text available
Volatility prediction, an essential concept in financial markets, has recently been addressed using sentiment analysis methods. We investigate the sentiment of annual disclosures of companies in stock markets to forecast volatility. We specifically explore the use of recent Information Retrieval (IR) term weighting models that are effectively exten...
Conference Paper
Nowadays, there is a proliferation of information objects from different modalities—Text, Image, Audio, Video. Different types of relations between information objects (e.g. similarity or semantic) have motivated graph-based search in multimodal Information Retrieval. In this paper, we formulate a Random Walks problem along our model for multimodal...
Conference Paper
We reproduce recent research results combining semantic and information retrieval methods. Additionally, we expand the existing state of the art by combining the semantic representations with IR methods from the probabilistic relevance framework. We demonstrate a significant increase in performance, as measured by standard evaluation metrics.
Conference Paper
Full-text available
Pool bias is a well understood problem of test-collection based benchmarking in information retrieval. The pooling method itself is designed to identify all relevant documents. In practice, 'all' translates to 'as many as possible given some budgetary constraints' and the problem persists, albeit mitigated. Recently, methods to address this pool bi...
Conference Paper
Full-text available
Patent text is a mixture of legal terms and domain specific terms. In technical English text, a multi-word unit method is often deployed as a word formation strategy in order to expand the working vocabulary, i.e. introducing a new concept without the invention of an entirely new word. In this paper we explore query generation using natural languag...
Conference Paper
Full-text available
This paper provides an overview of the Retrieving Diverse Social Images task that is organized as part of the MediaEval 2016 Benchmarking Initiative for Multimedia Evaluation. The task addresses the problem of result diversification in the context of social photo retrieval where images, meta-data, text information, user tagging profiles and content...
Conference Paper
A recurring question in information retrieval is whether term associations can be properly integrated in traditional information retrieval models while preserving their robustness and effectiveness. In this paper, we revisit a wide spectrum of existing models (Pivoted Document Normalization, BM25, BM25 Verboseness Aware, Multi-Aspect TF, and Langua...
Conference Paper
Full-text available
In Information Retrieval, test collections are usually built using the pooling method. Many pooling strategies have been developed for the pooling method. Herein, we address the question of identifying the best pooling strategy when evaluating systems using precision-oriented measures in presence of budget constraints on the number of documents to...
Conference Paper
Full-text available
This paper details the collection, systems and evaluation methods used in the IR Task of the CLEF 2016 eHealth Evaluation Lab. This task investigates the effectiveness of web search engines in providing access to medical information for common people that have no or little medical knowledge. The task aims to foster advances in the development of se...
Conference Paper
Healthcare related queries are a treasure trove of information about the information needs of domain users, be they patients or doctors. However, unlike general queries, in order to make the most out of the information therein, such queries have to be processed within a medical terminology annotation pipeline. We show how this has been done in the...
Conference Paper
Full-text available
In this paper we introduce a new dataset, Div150Multi, that was designed to support shared evaluation of diversification techniques in different areas of social media photo retrieval and related areas. The dataset comes with associated relevance and diversity assessments performed by trusted annotators. The data consists of around 300 complex queri...
Article
Full-text available
CCS Concepts: • Information systems→Information retrieval; Web searching and information discovery; • Human-centered computing→Social networking sites.
Conference Paper
Full-text available
Recently, it has been discovered that it is possible to mitigate the Pool Bias of Precision at cutoff (P@n) when used with the fixed-depth pooling strategy, by measuring the effect of the tested run against the pooled runs. In this paper we extend this analysis and test the existing methods on different pooling strategies, simulated on a selection...
Conference Paper
We present results from questionnaire data that were collected from leading data analytics researchers and experts across Austria. The online survey addresses very pressing questions in the area of (big) data analysis. Our findings provide valuable insights about what top Austrian data scientists think about data analytics, what they consider as im...
Technical Report
Full-text available
This paper describes the contributions of Vienna University of Technology (TUW) to the MediaEval 2015 Retrieving Diverse Social Images challenge. Our approach consists of 3 phases: (1) Precision-oriented-phase: in which we focus only on the relevance of the documents; (2) Recall-oriented-phase: in which we focus only on the diversity aspect; (3) Me...
Conference Paper
When looking for information on the Web, the credibility of the source plays an important role in the information seeking experience. While data source credibility has been thoroughly studied for Web pages or blogs, the investigation of source credibility in image retrieval tasks is an emerging topic. In this paper, we first propose a novel dataset...
Conference Paper
Full-text available
BM25 is probably the most well known term weighting model in Information Retrieval. It has, depending on the formula variant at hand, 2 or 3 parameters (k1, b, and k3). This paper addresses b—the document length normalization parameter. Based on the observation that the two cases previously discussed for length normalization (multi-topicality and v...
Conference Paper
Full-text available
We approach the problem of retrievability from an analytical perspective, starting with modeling conjunctive and disjunctive queries in a Boolean model. We show that this represents an upper bound on retrievability for all other best match algorithms. We follow this with an observation of imbalance in the distribution of retrievability, using the...
Conference Paper
Full-text available
For many tasks in evaluation campaigns, especially those modeling narrow domain-specific challenges, lack of participation leads to a potential pooling bias due to the scarce number of pooled runs. It is well known that the reliability of a test collection is proportional to the number of topics and relevance assessments provided for each topic, bu...
Conference Paper
Full-text available
Creating systematic reviews is a painstaking task undertaken especially in domains where experimental results are the primary method of knowledge creation. For the review authors, analysing documents to extract relevant data is a demanding activity. To support the creation of systematic reviews, we have created DASyR—a semi-automatic document anal...
Conference Paper
Full-text available
We revisit text-based image retrieval for social media, exploring the opportunities offered by statistical semantics. We assess the performance and limitations of several complementary corpus-based semantic text similarity methods in combination with word representations. We compare results with state-of-the-art text search engines. Our deep learnin...
Conference Paper
This paper is concerned with potential recall in multimodal information retrieval in graph-based models. We provide a framework to leverage individuality and combination of features of different modalities through our formulation of faceted search. We employ a potential recall analysis on a test collection to gain insight on the corpus and further...
Conference Paper
Full-text available
In this paper we introduce a new dataset and its evaluation tools, Div150Cred, that was designed to support shared evaluation of diversification techniques in different areas of social media photo retrieval and related areas. The dataset comes with associated relevance and diversity assessments performed by human annotators. The data consists of 30...
Article
The velocity of multimodal information shared on the web has increased significantly. Many reranking approaches try to improve the performance of multimodal retrieval, however not in the direction of true relevancy of a multimodal object. Metropolis-Hastings (MH) is a method based on Markov chain Monte Carlo (MCMC) for sampling from a distribution when...
Article
Full-text available
Credibility, as the general concept covering trustworthiness and expertise, but also quality and reliability, is strongly debated in philosophy, psychology, and sociology, and its adoption in computer science is therefore fraught with difficulties. Yet its importance has grown in the information access community because of two complementing factors...
Conference Paper
Information retrieval on the (social) web moves from a pure term-frequency-based approach to an enhanced method that includes conceptual multimodal features on a semantic level. In this paper, we present an approach for semantic-based keyword search and focus especially on its optimization to scale it to real-world sized collections in the social m...
Book
Credibility in Information Retrieval presents a detailed analysis of existing credibility models from different information seeking research areas, with a focus on the Web and its pervasive social component. It shows that there is a very rich body of work pertaining to different aspects and interpretations of credibility, particularly for different...
Technical Report
Full-text available
The TUW-IMP team participated in the NTCIR-11 Math-2 task for retrieving mathematical formulae in scientific documents. This report describes our approach to solving the given math retrieval problem.
Conference Paper
Modeling data as a graph of objects is increasingly popular, as we move away from the relational DB model and try to introduce explicit semantics in IR. Conceptually, one of the main challenges in this context is how to “intelligently” traverse the graph and exploit the associations between the data objects. Two highly used methods in retrieving in...
Technical Report
Full-text available