Article

Database merging strategy based on logistic regression

Authors: Anne Le Calvé and Jacques Savoy

Abstract

With the development of network technology, users looking for information may send a request to various selected databases and then inspect multiple result lists. To avoid the need for inspecting multiple result lists, the database merging strategy merges the retrieval results produced by separate, autonomous servers into an effective, single ranked list. Our study deals with a particular aspect of this merging process, whereby only the rank of the retrieved records is available, and where a key points to different result lists. On the basis of this rather limited information, this paper describes the theoretical foundation and retrieval performance of our database merging approach based on logistic regression.


... Therefore, different kinds of score normalization methods (which map internal retrieval status values to relevance degrees) have been proposed and their effectiveness has been investigated in [2,22,28] among others. Certain methods [8,26,53] are able to deal with those cases where only ranking information is available. ...
... Therefore, we decide to use the cubic model and the binary logistic model to convert ranking information into scores for the empirical study. The binary logistic model [8,32] is good when binary relevance judgment is used, but it cannot be used for graded relevance judgment. The cubic model [53] is good for graded relevance judgment as well as binary relevance judgment. ...
... In the TREC 2008 Blog opinion task, we use a binary logistic regression model [8,32] to obtain relevance scores from ranking information, which uses the equation F(t) = 1 / (1 + e^(-a - b·ln(t))) to estimate relevance scores. The coefficients we obtain from regression analysis of all the runs are: a = 2.183, b = -0.718. ...
... Some non-linear methods for converting ranking into scores have also been discussed in Calvé and Savoy (2000), Nottelmann and Fuhr (2003), and Wu, Bi, and McClean (2007). The logistic regression model was investigated in Calvé and Savoy (2000) and Nottelmann and Fuhr (2003), and the cubic regression model was investigated in Wu et al. (2007) for the relation between documents' rank and degree of relevance. ...
... Borda and the fitting method have been reviewed in Section 2. In the following, let us briefly discuss how to use the logistic model to convert ranks into scores. The logistic model (Calvé & Savoy, 2000) can be expressed by the following equation ...
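The equation itself is cut off in the excerpt above, but the same model is given in full in a later excerpt on this page. Written out, the logistic model estimates the score of a document at a given rank as

score(rank) = e^(a + b·ln(rank)) / (1 + e^(a + b·ln(rank))) = 1 / (1 + e^(-a - b·ln(rank)))

where the coefficients a and b are obtained by fitting the model on training queries.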
... An alternative is to convert ranking information into scores if raw scores are not available or not reliable. Different models [1,3,12,15,22] have been investigated for such a purpose. In this piece of work, we use the logistic regression model for it. ...
... In this piece of work, we use the logistic regression model for it. This technique has been investigated for distributed information retrieval [3,15], but not for data fusion before. The rest of this paper is organized as follows: in Section 2 we review some related work on data fusion. ...
... For example, Borda count [1] works like this: for a ranked list of t documents, the first document in the list is given a score of t, the second document in the list is given a score of t − 1, ..., the last document in the list is given a score of 1. Thus all documents are assigned corresponding scores based on their rank positions and CombSum or the linear combination method can be applied accordingly. Some non-linear methods for score normalization have also been discussed in [3,12,15,22]. In [12], an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents were used, then a mixture model was defined to fit the real score distribution. ...
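To make the rank-based normalization just described concrete, here is a minimal Python sketch of Borda count followed by CombSum fusion; the function names and toy document lists are illustrative, not taken from any cited paper.

def borda_scores(ranked_list):
    # For a ranked list of t documents: the first gets score t,
    # the second t - 1, ..., and the last gets 1.
    t = len(ranked_list)
    return {doc: t - i for i, doc in enumerate(ranked_list)}

def combsum(ranked_lists):
    # CombSum: sum the Borda scores a document receives across all lists.
    fused = {}
    for ranked in ranked_lists:
        for doc, score in borda_scores(ranked).items():
            fused[doc] = fused.get(doc, 0) + score
    return sorted(fused, key=fused.get, reverse=True)

# Three component systems returning overlapping ranked lists:
print(combsum([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2", "d4"]]))
# -> ['d2', 'd1', 'd3', 'd4']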
Data
Full-text available
... The approach of combining results from multiple systems has been successfully utilized in the information retrieval community [1,5,18]. Simple methods like Borda Count [1] do not require training data and favor documents that are retrieved by more individual systems against documents that are retrieved by fewer or no systems. ...
... Simple methods like Borda Count [1] do not require training data and favor documents that are retrieved by more individual systems against documents that are retrieved by fewer or no systems. More sophisticated algorithms that utilize training data include Naive Bayesian method [1] and logistic regression model [5]. The Naive Bayesian method makes an independence assumption among results from multiple systems, which may be inaccurate in many cases. ...
... In fact, the exponential model can be seen as a multi-category extension of the logistic regression model for Meta retrieval systems in information retrieval [1,5]. The graphical representation of this probabilistic model is shown in Figure 1. ...
Article
The task of biomedical named-entity recognition is to identify technical terms in the domain of biology that are of special interest to domain experts. While numerous algorithms have been proposed for this task, biomedical named-entity recognition remains a challenging task and an active area of research, as there is still a large accuracy gap between the best algorithms for biomedical named-entity recognition and those for general newswire named-entity recognition. The reason for such discrepancy in accuracy results is generally attributed to inadequate feature representations of individual entity recognition systems and external domain knowledge. In order to take advantage of the rich feature representations and external domain knowledge used by different systems, we propose several Meta biomedical named-entity recognition algorithms that combine recognition results of various recognition systems. The proposed algorithms – majority vote, unstructured exponential model and conditional random field – were tested on the GENIA biomedical corpus. Empirical results show that the F score can be improved from 0.72, which is attained by the best individual system, to 0.96 by our Meta entity recognition approach.
... In some advanced IR applications such as filtering, resource selection and data fusion, people find that relevance probabilities of documents are highly desirable. Therefore, different kinds of score normalization methods, which map internal retrieval status values to probabilities of relevance, have been proposed and their effectiveness has been investigated in [3,9,10,11,12]. To the best of our knowledge, using relevance probabilities to improve the usability of information retrieval systems/digital libraries has not been paid much attention before. This is one of the major issues which will be discussed in this paper. ...
... Therefore more effort is needed if we want to estimate the probabilities of relevance or degrees of relevance of those retrieved documents. However, how to provide such scores by information retrieval systems/digital libraries is beyond the scope of this paper, though it is an important issue (see [3,10,11,12] and others for related discussion). Instead, we discuss why we need them and what we can do by using them. ...
... This can be estimated by using the relevance probability scores of the resultant list. For example, logistic functions [3] and cubic functions [20] have been found useful for such a purpose. We shall illustrate how to do this in Example 2 later in this section. ...
Conference Paper
Full-text available
In information retrieval systems and digital libraries, result presentation is a very important aspect. In this paper, we demonstrate that a ranked list of documents alone, though commonly used by many retrieval systems and digital libraries, is not the best way of presenting retrieval results. We believe that, in many situations, an estimated relevance probability score or an estimated relevance score should be provided for every retrieved document by the information retrieval system/digital library. With such information, the usability of the retrieval result can be improved, and the Euclidean distance can be used as a very good system-oriented measure for the effectiveness of retrieval results. The relationship between the Euclidean distance and some ranking-based measures is also investigated.
... Calvé and Savoy [3] investigated the relation between rank and probability of relevance in resultant document lists using logistic regression. They demonstrated that with proper training, the logistic model is more effective than the round-robin approach, which takes documents from all component retrieval systems in an interleaving fashion. ...
... Last but not least, Calvé and Savoy's work in [3] is the most relevant to this paper. Assuming that only ranking information is available for all the documents retrieved from component systems, they used the logistic regression model to estimate the relation between rank and probability of relevance, and demonstrated that the logistic model worked well for results merging. ...
... The above equation is given in SPSS, a statistical analysis package. Though it looks different from the one that appears in [3], the two are very similar. In [3], the equation is ...
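As a sketch of the merging procedure these excerpts describe, the following Python fragment converts ranks from several result lists into estimated relevance probabilities with the logistic model and sorts everything into one list. The coefficient values and document names are illustrative; in the cited work the coefficients come from training on sample queries.

import math

def relevance_probability(rank, a, b):
    # Logistic model: Pr(rel | rank) = 1 / (1 + e^(-a - b*ln(rank))).
    return 1.0 / (1.0 + math.exp(-a - b * math.log(rank)))

def merge(result_lists, coefficients):
    # One (a, b) pair per collection; documents from all collections
    # are sorted together by estimated probability of relevance.
    scored = []
    for docs, (a, b) in zip(result_lists, coefficients):
        for rank, doc in enumerate(docs, start=1):
            scored.append((relevance_probability(rank, a, b), doc))
    return [doc for _, doc in sorted(scored, reverse=True)]

print(merge([["a1", "a2", "a3"], ["b1", "b2"]],
            [(1.2, -0.9), (0.4, -0.7)]))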
Conference Paper
Full-text available
In a distributed information retrieval system, how to merge results from different text databases is an important issue, since it affects the effectiveness of the result considerably. In many cases, the underlying systems only provide a ranked list of documents for any information need. In this paper, we investigate the relation between rank and relevance in resultant document lists, and find that the cubic model is a good option for this. Extensive experimentation is conducted to evaluate the performance of the cubic model for results merging. The experimental results demonstrate that the cubic model is better than the logistic model, which was suggested by previous research.
... Many score normalization methods have been proposed: the zero-one method [34], the fitting method [61], Z-scores [41], the reciprocal rank [16], the logistic model [7] and so on. However, these score normalization methods are aimed at improving relevance-based performance, and diversity is not an issue. ...
... In order to investigate the generalizability and robustness of these fusion methods, we carry out more experiments by the following procedure: from all the runs submitted to the web diversity task of TREC in the same year, we randomly select 3-20 runs to test the effectiveness of these methods. For any given number between 3 and 20, 200 combinations are tested. Figs. 1 and 2 present the results with the TREC 2010 web diversity data set, with metrics ERR-IA@20 and α-nDCG@20; Tables 5 and 6 give the comparison between our methods and six other fusion methods that were proposed recently: ClustFuseCombSum [31], DDF [38], GA [22], and dis*p2 (Eq. (1)). ...
Article
Full-text available
Search result diversification of text documents is especially necessary when a user issues a faceted or ambiguous query to the search engine. A variety of approaches have been proposed to deal with this issue in recent years. In this article, we propose a group of fusion-based result diversification methods with the aim of improving performance with respect to both relevance and diversity. They are linear combinations of scores that are obtained from different component search systems. The weight of each search system is determined by considering three factors: performance, dissimilarity, and complementarity. There are two major contributions. Firstly, we find that all three factors (performance, complementarity, and dissimilarity) are useful for effective weighting of the linear combination. Secondly, we present the logarithmic function-based model for converting ranking information into scores. Experiments are carried out with four groups of results submitted to the TREC web diversity task. Experimental results show that some of the fusion methods that use the aforementioned techniques perform more effectively than the state-of-the-art fusion methods for result diversification.
... It has been found that the logistic model is good for score normalization of information retrieval results [4,15]. The logistic model is also used in this study. ...
... For each given number (3 to 10), 200 randomly selected combinations are tested. The experimental results are shown in Figures 1-2, in which each data point is the average of 200 combinations and 150 queries in each combination. ...
Conference Paper
Full-text available
In the web age, publishing information and opinions online is very easy and fast. Since the web is reachable by a huge number of grassroots people, the number and scale of social networking sites are growing at a tremendous speed. It is an interesting task to find out the information, news and events, opinions, etc., exchanged on these sites. Thus quite a few researchers focus on this and some related issues. One major characteristic of these social networking sites is their dynamic nature. When new things or themes appear, they are discussed intensively, and then forgotten very quickly. It is also true that the rise and decline of such sites may happen very quickly. How to cope with this dynamic environment is a challenging issue for information/opinion search services.
... A review of the literature allowed us to establish a categorization of the main aggregation models. Although several categorizations exist in the specialized literature [Le Calvé 2000, Aslam 2001, Montague 2002, Renda 2003, Liu 2007], we opted for a refinement of the existing classifications, adding new dimensions related to our problem (see Table 2.1). Indeed, following the classifications given in different works [Aslam 2001, Renda 2003], we distinguish score-based models from rank-based models, themselves split into models with and without learning [Aslam 2002]. ...
... This problem can be overcome when learning is used, as shown in the work of [Le Calvé 2000, Si 2002a], but we recall that a centralized sample cannot be representative of the whole set of peers because of their dynamicity (peers may appear or disappear after the sample has been built). ...
Article
A huge part of the impetus for various internet technologies through the Peer-to-Peer (P2P) paradigm can be seen as a reaction to content being centralized on servers in front of passive clients. One of the distinctive features of any P2P system is what we often call direct connectivity between equal peers. Peer-to-Peer has increased exchange flows between dynamic communities of users, which tend to grow rapidly. We are therefore dealing with large-scale distributed systems in which the volume of exchanged, shared and sought information becomes more and more impressive. Solving the aggregation problem in P2P information retrieval (P2PIR) systems in the same way as it is solved in Distributed Information Retrieval (DIR) would miss much of what makes the setting distinctive. In fact, the context changes in P2PIR, given the scale factor and the lack of a global view of the system in these networks, which extend naturally to thousands or even millions of peers. This implies the removal of a broker server, which is inadequate in this context, and raises the problem of finding new policies for aggregating results coming from heterogeneous peers into a single list while reflecting the user's expectations. All these reasons prompted us to explore an aggregation mechanism based on user profiles deduced from users' past behavior through their interaction with query results. Our contributions in this thesis focus on two complementary axes. First, we propose a new vision of results aggregation in a large-scale system; in this context, a profiles model and a hybrid score- and profiles-based approach are proposed. Second, we focus on the development of a framework for evaluating our approach in large-scale systems. In this thesis, we are mainly interested in the information retrieval problem in P2P systems (P2PIR), focusing more specifically on the problem of results aggregation in such systems.
... provides good results, especially with a value of 60 for c. Apart from the reciprocal function, the logistic function may also be used (Calvé and Savoy, 2000). The logistic model uses the function score(rank) = e^(a + b·ln(rank)) / (1 + e^(a + b·ln(rank))) = 1 / (1 + e^(-a - b·ln(rank))) (Equation 3) to calculate scores for all the documents at different ranks. ...
... Two score normalization methods, the reciprocal model (see Equation 2) and the logistic model (see Equation 3) are tested in this study. The reciprocal model is straightforward by setting c=60 as in Cormack et al. (2009), while the logistic model needs to obtain coefficients by training (Calvé and Savoy, 2000). Using the data in some original runs, we obtain the values of the two coefficients: a=0.718, and b=-2.183. ...
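A small Python sketch of the two normalization methods just described, using the coefficient values quoted in the excerpt above (the function names are illustrative):

import math

def reciprocal_score(rank, c=60):
    # Reciprocal model with c = 60, as in Cormack et al. (2009).
    return 1.0 / (c + rank)

def logistic_score(rank, a=0.718, b=-2.183):
    # Logistic model: score = 1 / (1 + e^(-a - b*ln(rank))).
    return 1.0 / (1.0 + math.exp(-a - b * math.log(rank)))

for rank in (1, 2, 5, 10, 100):
    print(rank, round(reciprocal_score(rank), 4), round(logistic_score(rank), 4))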
Article
Data fusion is currently used extensively in information retrieval for various tasks. It has proved to be a useful technology because it is able to improve retrieval performance frequently. However, in almost all prior research in data fusion, static search environments have been used, and dynamic search environments have generally not been considered. In this article, we investigate adaptive data fusion methods that can change their behavior when the search environment changes. Three adaptive data fusion methods are proposed and investigated. To test these proposed methods properly, we generate a benchmark from a historic Text REtrieval Conference data set. Experiments with the benchmark show that 2 of the proposed methods are good and may potentially be used in practice.
... The same indexing scheme and retrieval procedure are used for each collection involved in this study. This type of distributed context more closely reflects digital libraries or search engines available on the Internet than do meta search engines, where different search engines may collaborate in response to a given user request (Selberg, 1999; Le Calvé & Savoy, 2000). ...
... Our model uses a constant K in order to normalize the collection score as well as the natural logarithm, an order-preserving transformation used in similar contexts (Le Calvé & Savoy, 2000). Based on this collection score, our merging algorithm calculates the collection weight, denoted w_k, for the k-th collection as follows: ...
Article
Full-text available
For our participation in TREC-10, we focus on searching distributed collections and also on designing and implementing a new search strategy to find homepages. Presented in the first part of this paper is a new merging strategy based on retrieved list lengths, and in the second part a development of our approach to creating retrieval models able to combine both Web page and URL address information when searching online service locations.
... Different retrieval systems, which process text differently, can capture different features from documents and information needs. Indeed, experiments have shown that adequately combining different document representations (or results of different retrieval systems) can lead to better retrieval effectiveness than simply taking the document representation or the retrieval system which produces the best retrieval results on average [CS00]. Hypertext links. ...
... This probability can be assessed by fitting a logistic regression to a sample of training queries [CS00]. In our experiments with the CACM and WT collections, we used only rank as the explanatory variable, because the probability of relevance is more stable with the rank than with the score. ...
... Based on some of the opinions of these experts, the authors concluded that logistics management is an activity that plans, implements, and controls transportation, storage, and distribution activities efficiently and effectively to meet customer needs so that profits can be maximized. According to (Le Calvé & Savoy, 2000), logistics functions as a system that unites various components such as information flow, starting from suppliers (ordering and shipping), through information in the production process (inventory) or in information flow services within the company (coordination), to consumers (distribution of both goods and services) (Hemalatha et al., 2018; Yeo et al., 2015). So based on the theory above, the authors conclude that logistics management is an activity that serves to regulate a system so as to unite various components from suppliers to consumers, for the distribution of both goods and services (Javed & Wu, 2020; Stepaniuk, 2017; Starostka-Patyk, 1987; Uriarte-Miranda et al., 2018). ...
Article
Full-text available
This study aimed to determine the effect of fleet availability and control on the smooth delivery of PT. Cardig Logistics Indonesia. The survey method is used as a way to collect primary data. The population is taken from employees who work in the operational section, giving a sample of 30 people. To conduct this research, the writer used descriptive statistical analysis, multiple linear regression, the correlation coefficient, and the coefficient of determination to test the research hypothesis. The results showed that the availability of the fleet (X1) and controlling (X2) have a positive and significant effect on the smooth delivery (Y) of PT. Cardig Logistics Indonesia. From the research results, the more dominant variable is fleet availability.
... The linear regression model [15], [16], [17] is preferable as its prediction results are probabilistic estimates. The equation of this model is represented in Equation 4. ...
... We have used a pipelining technique on the two models, MELM-GRBFNN and Logistic Regression [25]. The training of the final blocking probability predictor using MELM-GRBFNN is described below: first, the centers of the MELM-GRBFNN are taken randomly from the patterns in the optimized training set; ...
... B.P. Dubey et al. [7] applied ANNs to the fast and consistent approximation of power distribution in pressurized heavy water reactors (PHWRs), demonstrating that there is great scope for the application of machine learning techniques in nuclear engineering. Similarly, different machine learning techniques have been employed in many fields [8][9][10][11][12][13][14] to overcome complex situations. ...
Article
Advancement in technology has created wide opportunities for researchers to utilize artificial intelligence in various fields. Numerous attempts have been made to use machine learning tools in the manufacturing and production sector. However, variation in the performance of techniques is creating a major quagmire for researchers. In many cases, some methods have shown similar results, while in others one has outperformed another. Choosing the most suitable technique for process modelling and optimization is still a challenging task for researchers. Hence, to present a direction for prospective investigators, this study reviews the performance of different machine learning techniques applied in the manufacturing sector by assessing many articles from the past two decades. Among the several machine learning techniques reviewed in this study, the application of artificial neural networks (ANN) in process modelling and optimization is quite noticeable because of their ability to predict the output quickly and accurately. The effectiveness and practicality of ANN models in manufacturing applications are reviewed, demonstrating their pivotal role in process modelling. Observations are reported in the study.
... When the returned snippets are found to be not sufficiently informative, additional information such as link statistics or the contents of documents are used for merging. Savoy et al. [1996] and Calvé and Savoy [2000] applied logistic regression [Hosmer and Lemeshow, 1989] to convert the ranks of documents returned by search engines into probabilities of relevance. Documents are then merged according to their estimated probabilities of relevance. ...
... Third, after results are returned from selected information sources, the individual ranked lists should be merged into a single final ranked list. This task is called results merging (Callan 2000; Cetintas and Si 2007; Kirsch 1997; Larson 2002; Le Calvé and Savoy 2000; Lu et al. 2005; Si and Callan 2003b; Xu and Callan 1998). ...
Article
Federated text search provides a unified search interface for multiple search engines of distributed text information sources. Resource selection is an important component of federated text search, which selects a small number of information sources that contain the largest number of relevant documents for a user query. Most prior research on resource selection focused on selecting information sources by analyzing static information of available information sources sampled in an offline manner. On the other hand, most prior research ignored a large amount of valuable information such as the results from past queries. This paper proposes a new resource selection technique (called qSim) that utilizes the search results of past queries for estimating the utilities of available information sources for a specific user query. The new algorithm calculates the query similarities between a specific query and all past queries, and then estimates the utilities of available information sources by the weighted combination of the results of past queries with respect to the query similarities. The new resource selection algorithm is practical, as it does not require relevance judgments of past queries and only utilizes a regression-based results merging method to rank the results of past queries.
... There are several approaches used to merge the relevant monolingual lists retrieved by the IR system: the classical Round-Robin and Raw-Scoring (Callan et al. 1995; Voorhees et al. 1995), the Normalized Raw-Scoring (Powell et al. 2000) or other methods based on machine learning such as Logistic Regression (Calvé and Savoy 2000). In any case, the merging algorithm decreases the precision of the multilingual system (depending on the collection, between 20 and 40%) (Savoy 2002). ...
Article
Given a user question, the goal of a Question Answering (QA) system is to retrieve answers rather than full documents or even best-matching passages, as most Information Retrieval systems currently do. In this paper, we present BRUJA, a QA system for the management of multilingual collections. BRUJA works with three languages (English, Spanish and French). The BRUJA architecture is not formed of three monolingual QA systems but instead uses English as an Interlingua for usual QA tasks such as question classification and answer extraction. In addition, BRUJA uses Cross Language Information Retrieval (CLIR) techniques to retrieve relevant documents from a multilingual collection. On the one hand, we have more documents to find answers from, but on the other hand, we are introducing noise into the system because of translations to the Interlingua (English) and the CLIR module. The question is whether the difficulty of managing three languages is worth it or whether a monolingual QA system delivers better results. We report on in-depth experimentation and demonstrate that our multilingual QA system gets better results than its monolingual counterpart whenever it uses good translation resources and, especially, CLIR techniques that are state-of-the-art.
... When the returned snippets are found to be not sufficiently informative, additional information such as link statistics or the contents of documents are used for merging. Savoy et al. [217] and Calvé and Savoy [48] applied logistic regression [135] to convert the ranks of documents returned by search engines into probabilities of relevance. Documents are then merged according to their estimated probabilities of relevance. ...
Article
Federated search (federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot easily index uncrawlable hidden web collections while federated search systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents are selected. This creates the collection selection problem. To be able to select suitable collections, federated search systems need to acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. The goal of this work is to provide a comprehensive summary of the previous research on the federated search challenges described above.
... It showed that the weighted linear combination with source values can achieve considerably better results than interleaving ranks. A logistic transformation model [CS00] was proposed to build logistic models for all the information sources using human-judged training data and to transform information-source-specific scores into source-independent scores. Lin et al. [LH04] have applied a similar strategy for fusing various outputs of news video collections and achieved considerable improvement over the round-robin combination. ...
Article
In recent years, the multimedia retrieval community is gradually shifting its emphasis from analyzing one media source at a time to exploring the opportunities of combining diverse knowledge sources from correlated media types and context. This thesis presents a conditional probabilistic retrieval model as a principled framework to combine diverse knowledge sources. An efficient rank-based learning approach has been developed to explicitly model the ranking relations in the learning process. Under this retrieval framework, we overview and develop a number of state-of-the-art approaches for extracting ranking features from multimedia knowledge sources. To incorporate query information in the combination model, this thesis develops a number of query analysis models that can automatically discover the mixing structure of the query space based on previous retrieval results. To adapt the combination function on a per-query basis, this thesis also presents a probabilistic local context analysis (pLCA) model to automatically leverage additional retrieval sources to improve initial retrieval outputs. All the proposed approaches are evaluated on multimedia retrieval tasks with large-scale video collections as well as meta-search tasks with large-scale text collections.
... Our ongoing work attempts to solve the same selection and fusion problems while assuming that the different collections are managed with different indexing and search strategies, the classical situation for meta search engines, and assuming that we have no previous queries with which to tune the model; for the opposite case, see (Le Calvé & Savoy 2000). ...
Article
Full-text available
The Web and digital libraries offer the possibility to send natural language queries to various information servers (corpora or search engines), raising the difficult problem of selecting the best document sources and merging the results provided by different servers. In this paper, a new approach for collection selection based on decision trees is described. Moreover, different merging and selection procedures have been evaluated, leading to an overview of the suggested approaches.
... With these additional explanatory variables, we can compute the corresponding subjectivity score for each sentence (Equation 2). As a better way to combine different judgments, we suggest following Le Calvé & Savoy (2000) and normalizing the scores using logistic regression. The logistic transformation given by each logistic regression model is defined as π(x) = e^(β0 + β1·x1 + ... + βk·xk) / (1 + e^(β0 + β1·x1 + ... + βk·xk)) (Equation 3), where the βi are the coefficients obtained from the fitting, the xi are the variables, and k is the number of explanatory variables. ...
Article
Full-text available
We propose an efficient text summarization technique that involves two basic operations. The first operation involves finding coherent chunks in the document and the second operation involves ranking the text in the individual coherent chunks and picking the sentences that rank above a given threshold. The coherent chunks are formed by exploiting the lexical relationship between adjacent sentences in the document. Occurrence of words through repetition or relatedness by sense relation plays a major role in forming a cohesive tie. The proposed text ranking approach is based on a graph theoretic ranking model applied to text summarization task.
... For database merging, large databases from various analysis functions are usually gathered together to examine the information relations between different databases. Therefore, managers and decision makers can select data from any of the databases and inspect multiple result lists [25]. In the merging process, both overlapping (e.g. usually names of individuals) and non-overlapping information are compared and combined [26]. ...
Article
A database is an integrated collection of logically related records. It plays an important role in computing when a lot of data and information needs to be stored and retrieved. Today, databases are used in many disciplines such as business, education, general administration, medicine and many more. Research in databases has advanced from file management systems to data warehousing, with discussion of several issues such as database sharing, integration, conversion, access and integrity. In this paper, we propose a new method of database conversion by providing a single master database that can accept multiple databases of any type through the use of Java Database Connectivity (JDBC) and an application program interface (API). The key contribution of this method is the ability to accept a single record, multiple records or the whole set of records of a database to be converted into any other database type. Thus, any existing form of database can be integrated and updated without the need to design a new database system to cope with the new technology. In this way, old or existing databases can be used for an unlimited lifetime and a broader scope of application domains.
... For database merging, large databases from various analysis functions are usually gathered together to examine the information relations between different databases. Therefore, managers and decision makers can select data from any of the databases and inspect multiple result lists [33]. In the merging process, both overlapping (e.g. usually names of individuals) and non-overlapping information are compared and combined [34]. ...
Article
Much research has been undertaken on database sharing, integration, conversion, merging and migration. In particular, database conversion has attracted researchers' attention due to the rapid change in computer technology. There are also several tools freely available on the Web for handling database conversion. All of these works focus on the relational database, which consists of an integrated collection of logically related records. Databases play an important role in computing when a lot of data and information needs to be stored and retrieved. Today, databases are used in many disciplines such as business, education, general administration, medicine and many more. Research in databases has advanced from file management systems to data warehousing, with discussion of how a database can be significantly sustainable and have the potential to be extended and modified to suit the current situation and technology. In this paper, we propose a new method of database conversion by providing a single master database that can accept multiple databases of any type through the use of Java Database Connectivity (JDBC) and an application program interface (API). The key contribution of this method is the ability to accept a single record, multiple records or the whole set of records of a database to be converted into any other database type. Thus, any existing form of database can be integrated and updated without the need to design a new database system to cope with the new technology. In this way, old or existing databases can be used for an unlimited lifetime and a broader scope of application domains.
... There has been some work on using logistic regression to learn merging models to normalize document scores but relevance judgments are required for training [2]. The Semi-Supervised Learning result-merging algorithm uses the documents obtained by query-based sampling as training data to learn score normalizing functions on a query-by-query basis. ...
Conference Paper
Peer-to-peer architectures are a potentially powerful model for developing large-scale networks of text-based digital libraries, but peer-to-peer networks have so far provided very limited support for text-based federated search of digital libraries using relevance-based ranking. This paper addresses the problems of resource representation, resource ranking and selection, and result merging for federated search of text-based digital libraries in hierarchical peer-to-peer networks. Existing approaches to text-based federated search are adapted and new methods are developed for resource representation and resource selection according to the unique characteristics of hierarchical peer-to-peer networks. Experimental results demonstrate that the proposed approaches offer a better combination of accuracy and efficiency than more common alternatives for federated search in peer-to-peer networks.
... As a fifth merging strategy, we might use logistic regression [1] to predict the probability of a binary outcome variable, according to a set of explanatory variables [4]. In our current case, we predict the probability of relevance of document D_k given both the logarithm of its rank (indicated by ln(rank_k)) and the original document score RSV_k, as indicated in Equation 2. Based on these estimated relevance probabilities (computed independently for each language using the S+ software [9]), we sort the records retrieved from separate collections in order to obtain a single ranked list. ...
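A hedged sketch of this fifth merging strategy in Python, using scikit-learn in place of the S+ software mentioned in the excerpt; the training data, judgments, and values below are purely illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

# For each training document: ln(rank), original score RSV, and a
# binary relevance judgment (illustrative values).
ln_rank = np.log([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
rsv = np.array([2.1, 1.8, 1.7, 0.9, 0.5, 2.4, 2.0, 1.1, 0.8, 0.3])
relevant = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])

X = np.column_stack([ln_rank, rsv])
model = LogisticRegression().fit(X, relevant)

# Estimated relevance probabilities; records retrieved from separate
# collections would be sorted on these to form a single ranked list.
print(model.predict_proba(X)[:, 1])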
Conference Paper
Full-text available
For our third participation in the CLEF evaluation campaign, our objective for both multilingual tracks is to propose a new merging strategy that does not require a training sample to access the multilingual collection. As a second objective, we want to verify whether our combined query translation approach would work well with new requests.
... As a fifth merging strategy, we might use logistic regression to predict the probability of a binary outcome variable, according to a set of explanatory variables [15]. In our current case, we predicted the probability of relevance for document D_k, given both the logarithm of its rank (indicated by ln(Rank_k)) and the original document score RSV_k, as indicated in Equation 2. Based on these estimated relevance probabilities (computed independently for each language using S+ software), we sorted the records retrieved from separate collections in order to obtain a single ranked list. ...
Conference Paper
Full-text available
In our fourth participation in the CLEF evaluation campaigns, our objective was to verify whether our combined query translation approach would work well with new requests and new languages (Russian and Portuguese in this case). As a second objective, we were to suggest a selection procedure able to extract a smaller number of documents from collections that seemed to contain no or only a few relevant items for the current request. We also applied different merging strategies in order to obtain more evidence about their respective relative merits.
... There has been some work on using logistic regression to learn merging models to normalize document scores but relevance judgments are required for training [2]. ...
Article
Peer-to-peer (P2P) networks integrate autonomous computing resources without requiring a central coordinating authority, which makes them a potentially robust and scalable model for providing federated search capability to large-scale networks of text-based digital libraries. However, peer-to-peer networks have so far provided very limited support for full-text federated search with relevance-based document ranking. This paper provides solutions to full-text federated search of text-based digital libraries in hierarchical peer-to-peer networks. Existing approaches to full-text search are adapted and new methods are developed for the problems of resource representation, resource selection, and result merging according to the unique characteristics of hierarchical peer-to-peer networks. Experimental results demonstrate that the proposed approaches offer a better combination of accuracy and efficiency than more common alternatives for federated search of text-based digital libraries in peer-to-peer networks.
... Some training queries were needed for setting up the model. Their experiments showed that the logistic regression approach was significantly better than the round-robin, raw-score, and normalised raw-score approaches (Calvé & Savoy, 2000). Finally, a quite effective, though not very efficient, results merging method is downloading all the documents estimated to be relevant from different resources and then using a local information retrieval system to re-rank these documents (Lawrence & Giles, 1998). ...
Article
How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together have a similar relevance to a given query. However, while this hypothesis has been demonstrated to hold in classical information retrieval environments, it has never been fully tested in heterogeneous distributed information retrieval environments. Heterogeneous document representations, the presence of document duplicates, and disparate qualities of retrieval results are major features of a heterogeneous distributed information retrieval environment that might disrupt the effectiveness of the cluster hypothesis. In this paper we report on an experimental investigation into the validity and effectiveness of the cluster hypothesis in highly heterogeneous distributed information retrieval environments. The results show that although clustering is affected by different retrieval result representations and quality, the cluster hypothesis still holds, and generating hierarchical clusters in highly heterogeneous distributed information retrieval environments is still a very effective way of presenting retrieval results to users.
Article
Full-text available
For several industries, traditional manufacturing processes are time-consuming and uneconomical due to the absence of the right tools to produce the products. In recent years, machine learning (ML) algorithms have become more prevalent in manufacturing for developing items and products with reduced labor cost, time, and effort. Digitalization with cutting-edge manufacturing methods and massive data availability have further boosted the necessity and interest in integrating ML and optimization techniques to enhance product quality. ML-integrated manufacturing methods increase the acceptance of new approaches; save time, energy, and resources; and avoid waste. ML-integrated assembly processes help create what is known as smart manufacturing, where technology automatically adjusts for any errors in real time to prevent any spillage. Though manufacturing sectors use different techniques and tools for computing, recent methods such as ML and data mining techniques are instrumental in solving challenging industrial and research problems. Therefore, this paper discusses the current state of ML techniques, focusing on modern manufacturing methods, i.e., additive manufacturing. The various categories, with particular focus on the design, process, and production control of additive manufacturing, are described in the form of a state-of-the-art review.
Article
Full-text available
Continuous exposure to stress leads to many health problems and substantial economic loss in companies. A lot of attention has been given to the development of wearable systems for stress monitoring to tackle its long-term effects such as confusion, high blood pressure, insomnia, depression, headache and inability to take decisions. Accurate detection of stress from physiological measurements embedded in wearable devices has been the primary goal in the healthcare industry. Advanced sensor devices with a high sampling rate have been proven to achieve high accuracy in many earlier works. However, there has been a little attempt to employ consumer-based devices with a low sampling rate, which potentially degrades the performance of detection systems. In this paper, we propose a set of new features, local maxima and minima (LMM), from heart rate variability and galvanic skin response sensors along with the voting and similarity-based fusion (VSBF) method, to improve the detection performance. The proposed feature set and fusion method are first tested on the acquired dataset which is collected using the wrist-worn devices with a low sampling rate in workplace environments and validated on a publicly available dataset, driveDB from PhysioNet. The experimental results from both datasets prove that the LMM features can improve the detection accuracy for different classifiers in general. The proposed VSBF method further boosts the recognition accuracy by 5.69% and 2.90% in comparison with the AdaBoost, which achieves the highest accuracy as a single classifier on the acquired, and the DriveDB dataset, respectively. Our analyses show that the stress detection system using the acquired dataset yields an accuracy of 92.05% and an F1 score of 0.9041. Based on the analyses, a soft real-time system is implemented and validated to prove the applicability of the proposed work for stress detection in a real environment.
Conference Paper
Full-text available
Database conversion is a process that transfers data from one database to another along with its structure. Since there are many database systems created by organizations or individuals, such systems can be of diverse types such as Access, Oracle and MySQL. The progression of technology requires some systems to be upgraded to a newer system (e.g. adding new record structures, changing platform) or migrated (e.g. adapting to a newer version of a system). In order to work with different types of databases, a common platform is needed for data integration or conversion due to their heterogeneity and platform diversity. This paper presents a computerized tool, namely FlexiDC, which is implemented using the Java programming language to provide a single platform for database conversion. This platform uses Oracle as a working platform that allows records from various formats and types of databases to be integrated and manipulated before producing a single database or multiple databases. Novelties of this work are column-level conversion and flexible changing of a data type. Therefore, the cost and time needed to deal with any database enhancement, migration, integration, conversion, or new development can be reduced in order to accommodate the changing requirements of existing databases.
Article
Many organizations invest heavily in heterogeneous databases according to organizational functions. These heterogeneous databases are stand-alone systems that do not interact with one another. The objective of this paper is to introduce a multi-database management system (MDBMS) that interacts with other heterogeneous DBMSs within the organization to integrate information processing. In this paper, we discuss the potential inconsistencies in integrating heterogeneous databases. We further extend the discussion to include issues in designing an MDBMS. With an MDBMS, data sharing across the organization reduces overheads and costs and thus provides a competitive advantage to global firms.
Article
We describe Dublin City University (DCU)'s participation in the Hyperlinking sub-task of the MediaEval 2012 Search and Hyperlinking Task. Our strategy involves combining textual metadata, automatic speech recognition (ASR) transcripts, and visual content analysis to create anchor summaries for each video segment available for linking. Two categories of fusion strategy, score-based and rank-based methods, were used to combine scores from different modalities to produce potential inter-item links.
Article
Published methods for distributed information retrieval generally rely on cooperation from search servers. But most real servers, particularly the tens of thousands available on the Web, are not engineered for such cooperation. This means that the majority of methods proposed, and evaluated in simulated environments of homogeneous cooperating servers, are never applied in practice. This thesis introduces new methods for server selection and results merging. The methods do not require search servers to cooperate, yet are as effective as the best methods which do. Two large experiments evaluate the new methods against many previously published methods. In contrast to previous experiments, they simulate a Web-like environment, where servers employ varied retrieval algorithms and tend not to sub-partition documents from a single source. ...
Article
Multilingual information retrieval (MLIR) provides results that are more comprehensive than those of mono- and cross-lingual retrieval. Methods for MLIR are categorized as: (1) Fusion-based methods that merge results from multiple retrieval runs, and (2) Direct methods that build a unique index for the entire collection. Merging results of individual runs reduces the overall effectiveness, while more effective direct methods suffer from either time complexity and memory overhead, or over-weighting of index terms. In this paper, we propose a direct MLIR approach by using the language modeling framework that includes a novel multilingual language model estimation for documents, and a new way to globally estimate word statistics. These contributions enable ranking documents in multiple languages in one retrieval phase without having the problems of the previous direct methods. Moreover, our approach has the advantage of accommodating multilingual feedback information which helps to prevent query drift, and consequently to improve the performance. Finally, we effectively address the common case of incomplete coverage of translation resources in our proposed estimation methods. Experimental results show that the proposed approach outperforms the previous MLIR approaches.
Book
Full-text available
Data Fusion in Information Retrieval. The technique of data fusion has been used extensively in information retrieval due to the complexity and diversity of tasks involved, such as web and social networks, legal, enterprise, and many others. This book presents both a theoretical and an empirical approach to data fusion. Several typical data fusion algorithms are discussed, analysed and evaluated. A reader will find answers to the following questions, among others: - What are the key factors that affect the performance of data fusion algorithms significantly? - What conditions are favourable to data fusion algorithms? - CombSum and CombMNZ: which one is better, and why? - What is the rationale of using the linear combination method? - How can the best fusion option be found under any given circumstances?
Article
Full-text available
Recently, an increasing number of information retrieval studies have triggered a resurgence of interest in redefining the algorithmic estimation of relevance, which implies a shift from topical to multidimensional relevance assessment. A key underlying aspect that emerged when addressing this concept is the aggregation of the relevance assessments related to each of the considered dimensions. The most commonly adopted forms of aggregation are based on classical weighted means and linear combination schemes to address this issue. Although some initiatives were recently proposed, none was concerned with considering the inherent dependencies and interactions existing among the relevance criteria, as is the case in many real-life applications. In this article, we present a new fuzzy-based operator, called iAggregator, for multidimensional relevance aggregation. Its main originality, beyond its ability to model interactions between different relevance criteria, lies in its generalization of many classical aggregation functions. To validate our proposal, we apply our operator within a tweet search task. Experiments using a standard benchmark, namely the Text REtrieval Conference Microblog track, emphasize the relevance of our contribution when compared with traditional aggregation schemes. In addition, it outperforms state-of-the-art aggregation operators such as the Scoring and the And prioritized operators as well as some representative learning-to-rank algorithms.
Article
Full-text available
The web and its search engines have resulted in a new paradigm, generating new challenges for the IR community which are in turn attracting a growing interest from around the world. The decision by NIST to build a new and larger test collection based on web pages represents a very attractive initiative. This motivated us at TREC-9 to support and participate in the creation of this new corpus, to address the underlying problems of managing large text collections and to evaluate the retrieval effectiveness of hyperlinks. In this paper, we will describe the results of our investigations, which demonstrate that simple raw score merging may show interesting retrieval performances while the hyperlinks used in different search strategies were not able to improve retrieval effectiveness.
Article
A multilingual information retrieval (MLIR) system retrieves relevant information from multiple languages in response to a user query in a single source language. The effectiveness of multilingual information retrieval is measured using Mean Average Precision (MAP). A key characteristic of multilingual information retrieval is that the score list of one language cannot be compared with the score list of another language; MAP does not take this characteristic into account. We propose a new metric, the Normalized Distance Measure (NDM), for measuring the effectiveness of MLIR systems that does consider it. Our analysis shows that the NDM metric gives credit to MLIR systems that retrieve highly relevant multilingual documents.
Article
Information specialists in enterprises regularly use distributed information retrieval (DIR) systems that query a large number of information retrieval (IR) systems, merge the retrieved results, and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, because different servers handle collections of different sizes and have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the DIR system has to decide which servers to query, how long to wait for responses, and which retrieved results to display based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may be in the form of access fees charged by IR servers or user's cost associated with waiting for the servers to respond. We formulate the broker's decision problem as a stochastic mixed-integer program and present analytical solutions for the problem. Using data gathered from FedStats—a system that queries IR engines of several U.S. federal agencies—we demonstrate that the technique can significantly increase the utility from DIR systems. Finally, simulations suggest that the technique can be applied to solve the broker's decision problem under more complex decision environments.
Article
Full-text available
A Multilingual Information Retrieval (MLIR) system retrieves relevant information from multiple languages in response to a user query in a single source language. The effectiveness of an information retrieval system, MLIR systems included, is traditionally measured using metrics such as Mean Average Precision (MAP) and the Average Distance Measure (ADM). A distributed MLIR system requires a merging mechanism to combine results from different languages, and the ADM metric cannot differentiate the effectiveness of such merging mechanisms. In the first phase we propose a new metric, the Normalized Distance Measure (NDM), for measuring the effectiveness of an MLIR system, and we present the characteristic differences between the NDM, ADM and NDPM metrics. In the second phase we show how the effectiveness of merging techniques can be observed using NDM. The first phase of experiments shows that the NDM metric gives credit to MLIR systems that retrieve highly relevant multilingual documents; the second phase shows that the NDM metric can expose differences in the effectiveness of merging techniques that the ADM metric cannot.
Article
Information specialists in enterprises and consumers on the Internet regularly use Distributed Information Retrieval (DIR) systems that query a large number of Information Retrieval (IR) systems, merge the retrieved results and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, since different servers handle collections of different sizes and have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the distributed IR system thus has to decide which servers to query, how long to wait for responses and which retrieved results to display based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may be in the form of access fees charged by IR servers or user's cost associated with waiting for the servers to respond. We formulate the broker's decision problem as a stochastic mixed-integer program. We present closed-form results for the optimal query set and wait time in the special case when the relevance scores and response times of the IR servers are independent and identically distributed. When servers are heterogeneous, we present a simulation-based optimization technique and demonstrate how the optimal query set and wait time may be determined. The technique is computationally efficient and can be used to generate decision rules for source selection and query termination that are relatively easy to implement. We use data gathered from two different contexts - a DIR system that queries IR engines of several US federal agencies and a comparison shopping engine that queries multiple stores for price and product information - to validate our technique. Our research demonstrates that user satisfaction can be considerably improved by modeling user utility and incorporating historical information on performance of the IR servers.
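The stochastic mixed-integer program itself is beyond a short sketch, but the cost-benefit trade-off the broker faces can be illustrated with a simple greedy heuristic; the server statistics, budget, and benefit/cost selection rule below are all illustrative assumptions, not the paper's solution method.

```python
# Hypothetical per-server statistics: (expected relevant documents, access cost).
servers = {"A": (8.0, 1.0), "B": (5.0, 0.5), "C": (2.0, 1.5)}

def greedy_query_set(servers, budget):
    """Pick servers in decreasing benefit/cost order until the budget is spent
    (a toy heuristic standing in for the paper's optimization)."""
    chosen, spent = [], 0.0
    for name, (benefit, cost) in sorted(servers.items(),
                                        key=lambda kv: kv[1][0] / kv[1][1],
                                        reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

print(greedy_query_set(servers, budget=2.0))  # (['B', 'A'], 1.5)
```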
Article
We present a new approach based on neural networks to solve the merging strategy problem for Cross-Lingual Information Retrieval (CLIR). In addition to the language barrier issues in CLIR systems, how to merge ranked lists that contain documents in different languages from several text collections is also critical. We propose a merging strategy based on competitive learning to obtain a single ranking of documents by merging the individual lists of retrieved documents. The main contribution of the paper is to show the effectiveness of the Learning Vector Quantization (LVQ) algorithm in solving the merging problem. In order to investigate the effects of varying the number of codebook vectors, we have carried out several experiments with different values for this parameter. The results demonstrate that the LVQ algorithm is a good alternative merging strategy.
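For readers unfamiliar with LVQ, the following is a minimal sketch of the basic LVQ1 update rule on which such a merging strategy can be built; the feature encoding, learning rate, and codebook initialization are assumptions, not the configuration used in the paper.

```python
def lvq1_train(data, labels, codebooks, cb_labels, lr=0.1, epochs=20):
    """LVQ1: pull the nearest codebook vector toward a sample when their
    labels agree, push it away when they disagree."""
    for _ in range(epochs):
        for x, y in zip(data, labels):
            # Index of the nearest codebook vector (squared Euclidean distance).
            i = min(range(len(codebooks)),
                    key=lambda j: sum((a - c) ** 2 for a, c in zip(x, codebooks[j])))
            sign = 1.0 if cb_labels[i] == y else -1.0
            codebooks[i] = [c + sign * lr * (a - c) for a, c in zip(x, codebooks[i])]
    return codebooks

# Toy document features, e.g. (normalized rank, collection reliability).
data = [(0.9, 0.8), (0.2, 0.3), (0.8, 0.7), (0.1, 0.2)]
labels = [1, 0, 1, 0]                      # 1 = relevant, 0 = not relevant
print(lvq1_train(data, labels, [[0.5, 0.5], [0.4, 0.4]], [1, 0]))
```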
Conference Paper
In this paper we present a new data fusion method for information retrieval which uses the ranking information of the documents in the result lists. Our method is based on modelling the relationship between a document's rank in a result list and its probability of relevance using logarithmic models. The proposed method is more effective than other data fusion methods that also use ranking information, and is as effective as some data fusion methods that rely on reliable scoring information.
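A hedged sketch of the general idea: fit a logarithmic model of the rank-to-probability-of-relevance curve offline, then use it to turn ranks into scores that can be summed across result lists. The coefficients and the particular a + b*ln(rank) form below are illustrative assumptions, one simple member of the logarithmic family the paper refers to.

```python
import math

# Hypothetical coefficients fitted offline from training data.
a, b = 0.6, -0.1

def rank_to_score(rank):
    """Map a 1-based rank to an estimated relevance score, clipped to [0, 1]."""
    return max(0.0, min(1.0, a + b * math.log(rank)))

fused = {}
for result_list in [["d1", "d2", "d3"], ["d2", "d4"]]:
    for rank, doc in enumerate(result_list, start=1):
        fused[doc] = fused.get(doc, 0.0) + rank_to_score(rank)
print(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))
```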
Article
Metasearching of online current news services is a potentially useful Web application of distributed information retrieval techniques. We constructed a realistic current news test collection using the results obtained from 15 current news Web sites (including ABC News, BBC and AllAfrica) in response to 107 topical queries. Results were judged for relevance by independent assessors. Online news services varied considerably both in the usefulness of the results sets they returned and also in the amount of information they provided which could be exploited by a metasearcher. Using the current news test collection we compared a range of different merging methods. We found that a low-cost merging scheme based on a combination of available evidence (title, summary, rank and server usefulness) worked almost as well as merging based on downloading and rescoring the actual news articles.
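A minimal sketch of such a low-cost evidence combination, assuming precomputed title and summary similarities; the weights and the rank normalization below are illustrative, not the scheme tuned in the paper.

```python
def merge_score(title_sim, summary_sim, rank, server_usefulness,
                weights=(0.3, 0.3, 0.2, 0.2), max_rank=100):
    """Combine four pieces of cheap evidence into one merging score
    (weights and rank normalization are illustrative only)."""
    rank_evidence = 1.0 - (rank - 1) / max_rank   # rank 1 -> 1.0, lower is worse
    feats = (title_sim, summary_sim, rank_evidence, server_usefulness)
    return sum(w * f for w, f in zip(weights, feats))

print(merge_score(title_sim=0.8, summary_sim=0.5, rank=3, server_usefulness=0.7))
```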
Conference Paper
Full-text available
Collection fusion is a data fusion problem in which the results of retrieval runs on separate, autonomous document collections must be merged to produce a single, effective result. This paper explores two collection fusion techniques that learn the number of documents to retrieve from each collection using only the ranked lists of documents returned in response to past queries and those documents' relevance judgments. Retrieval experiments using the TREC test collection demonstrate that the effectiveness of the fusion techniques is within 10% of the effectiveness of a run in which the entire set of documents is treated as a single collection.
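The sketch below illustrates the underlying idea in its simplest form: estimate, from past queries, how many relevant documents each collection tends to contribute, and allocate the retrieval cutoff proportionally. The training numbers and the proportional rule are assumptions for illustration; the paper's learned strategies are more sophisticated.

```python
# Hypothetical training data: relevant documents each collection contributed
# to the top ranks of past queries.
history = {
    "q1": {"coll_A": 6, "coll_B": 2, "coll_C": 1},
    "q2": {"coll_A": 4, "coll_B": 5, "coll_C": 0},
}

def allocate(history, total=10):
    """Split the retrieval cutoff across collections in proportion to the
    average number of relevant documents each returned in the past."""
    avg = {}
    for per_query in history.values():
        for coll, rel in per_query.items():
            avg[coll] = avg.get(coll, 0.0) + rel / len(history)
    mass = sum(avg.values()) or 1.0
    # Naive rounding: quotas may miss the exact total by one or two.
    return {coll: round(total * v / mass) for coll, v in avg.items()}

print(allocate(history))
```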
Conference Paper
Full-text available
This research evaluates a model for probabilistic text and document retrieval; the model utilizes the technique of logistic regression to obtain equations which rank documents by probability of relevance as a function of document and query properties. Since the model infers probability of relevance from statistical clues present in the texts of documents and queries, we call it logistic inference. By transforming the distribution of each statistical clue into its standardized distribution (one with mean μ = 0 and standard deviation σ = 1), the method allows one to apply logistic coefficients derived from a training collection to other document collections, with little loss of predictive power. The model is applied to three well-known information retrieval test collections, and the results are compared directly to the particular vector space model of retrieval which uses term-frequency/inverse-document-frequency (tfidf) weighting and the cosine similarity measure. In the comparison, the logistic inference method performs significantly better than (in two collections) or equally well as (in the third collection) the tfidf/cosine vector space model. The differences in performance of the two models were subjected to statistical tests to see if the differences are statistically significant or could have occurred by chance.
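A minimal sketch of the two steps the logistic inference model relies on: standardizing each clue, then applying a logistic function to a weighted sum of the clues. The coefficients would come from regression on a training collection; the values below are hypothetical.

```python
import math

def standardize(values):
    """Rescale one clue's values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / sd for v in values]

def prob_relevance(clues, coeffs, intercept):
    """Logistic inference: P(relevant) from a weighted sum of standardized clues."""
    z = intercept + sum(w * c for w, c in zip(coeffs, clues))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical standardized clues and coefficients for one document/query pair.
print(prob_relevance([1.2, 0.4, -0.3], coeffs=[0.8, 0.5, -0.2], intercept=-1.0))
```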
Article
Full-text available
This article describes and evaluates SavvySearch, a metasearch engine designed to intelligently select and interface with multiple remote search engines. The primary metasearch issue examined is the importance of carefully selecting and ranking remote search engines for user queries. We studied the efficacy of SavvySearch's incrementally acquired metaindex approach to selecting search engines by analyzing the effect of time and experience on performance. We also compared the metaindex approach to the simpler categorical approach and showed how much experience is required to surpass the simple scheme.
Article
A generalized form of the cross‐validation criterion is applied to the choice and assessment of prediction using the data‐analytic concept of a prescription. The examples used to illustrate the application are drawn from the problem areas of univariate estimation, linear regression and analysis of variance.
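In modern terms, the core of cross-validatory assessment is the hold-one-fold-out error estimate sketched below; the k-fold split and squared-error loss are common simplifications, not the generalized criterion of the article.

```python
def k_fold_cv(xs, ys, fit, predict, k=5):
    """Estimate prediction error by refitting with each fold held out in turn."""
    n = len(xs)
    errors = []
    for i in range(k):
        test_idx = set(range(i, n, k))           # every k-th point forms a fold
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j not in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors += [(predict(model, xs[j]) - ys[j]) ** 2 for j in test_idx]
    return sum(errors) / len(errors)

# Assess a constant (mean) predictor on toy data.
xs, ys = list(range(10)), [2 * x + 1 for x in range(10)]
print(k_fold_cv(xs, ys,
                fit=lambda X, Y: sum(Y) / len(Y),
                predict=lambda m, x: m))
```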
Article
Contents: binary response variables; special logistic analyses; some complications; some related approaches; more complex responses. Appendices: theoretical background; choice of explanatory variables in multiple regression; review of computational aspects; further results and exercises.
Book
From the reviews of the First Edition: "An interesting, useful, and well-written book on logistic regression models... Hosmer and Lemeshow have used very little mathematics, have presented difficult concepts heuristically and through illustrative examples, and have included references."—Choice "Well written, clearly organized, and comprehensive... the authors carefully walk the reader through the estimation and interpretation of coefficients from a wide variety of logistic regression models... their careful explication of the quantitative re-expression of coefficients from these various models is excellent."—Contemporary Sociology "An extremely well-written book that will certainly prove an invaluable acquisition to the practicing statistician who finds other literature on analysis of discrete data hard to follow or heavily theoretical."—The Statistician In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step-by-step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples, with extensive data sets available over the Internet.
Article
The Okapi system has been used in a series of experiments on the TREC collections, investigating probabilistic models, relevance feedback, query expansion, and interaction issues. Some new probabilistic models have been developed, resulting in simple weighting functions that take account of document length and within-document and within-query term frequency. All have been shown to be beneficial. Relevance feedback and query expansion are highly beneficial when based on large quantities of relevance data (as in the routing task). Interaction issues are much more difficult to evaluate in the TREC framework, and no benefits have yet been demonstrated from feedback based on small numbers of "relevant" items identified by intermediary searchers.
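The best known of these weighting functions is the BM25 family; the sketch below shows its standard form, with conventional default parameters rather than the exact values used in the Okapi experiments.

```python
import math

def bm25_weight(tf, qtf, df, N, doc_len, avg_len, k1=1.2, b=0.75, k3=8.0):
    """Okapi-style term weight: an idf component, within-document tf with
    document-length normalization, and within-query tf."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    qtf_part = qtf * (k3 + 1) / (qtf + k3)
    return idf * tf_part * qtf_part

print(bm25_weight(tf=3, qtf=1, df=50, N=10000, doc_len=250, avg_len=300))
```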
Conference Paper
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 4, performing runs in the routing, ad-hoc, confused text, interactive, and foreign language environments.
Article
Informetrics deals with the search for regularities in data associated with the production and use of recorded information. Most of the methods used in the past implicitly assume that the variables of importance are quantitative in form. Yet much relevant data is categorical. In this paper we point out the existence of techniques for analyzing such data. Examples of informetric phenomena for which these techniques are important are given, and one, involving the book purchasing pattern of a group of libraries, is studied in detail.
Article
Abstraction, Inductive Learning and Probabilistic Assumptions. Norbert Fuhr, Ulrich Pfeifer (University of Dortmund, Dortmund, Germany). Keywords: logistic regression, probabilistic indexing, probabilistic retrieval, controlled vocabulary. We show that former approaches in probabilistic information retrieval are based on one or two of the three concepts abstraction, inductive learning and probabilistic assumptions, and we propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled vocabulary.
Article
The PIRCS retrieval system has been upgraded in TREC-3 to handle the full English collections of 2 GB in an efficient manner. For ad-hoc retrieval, we use recurrent spreading of activation in our network to implement query learning and expansion based on the best-ranked subdocuments of an initial retrieval. We also augment our standard retrieval algorithm with a soft-Boolean component. For routing, we use learning from signal-rich short documents or subdocument segments. For the optional thresholding experiment, we tried two approaches to transforming retrieval status values (RSVs) so that they could be used to partition documents into retrieved and nonretrieved sets. The first method normalizes RSVs using a query self-retrieval score. The second, which requires training data, uses logistic regression to convert RSVs into estimates of probability of relevance. Overall, our results are highly competitive with those of other participants.
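The second thresholding approach can be sketched in a few lines: a logistic curve maps an RSV to an estimated probability of relevance, and a probability threshold then partitions the documents. The coefficients and threshold below are hypothetical stand-ins for values learned from training data.

```python
import math

# Hypothetical coefficients standing in for values learned by logistic
# regression from training data.
alpha, beta = -4.0, 0.05

def rsv_to_prob(rsv):
    """Map a retrieval status value to an estimated probability of relevance."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * rsv)))

# Retrieve only documents whose estimated probability clears a threshold.
print([rsv for rsv in (20, 80, 140) if rsv_to_prob(rsv) > 0.5])  # [140]
```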
Article
Practical information retrieval systems must manage large volumes of data, often divided into several collections that may be held on separate machines. Techniques for locating matches to queries must therefore consider identification of probable collections as well as identification of documents that are probable answers. Furthermore, the large amounts of data involved motivate the use of compression, but in a dynamic environment compression is problematic, because as new text is added the compression model slowly becomes inappropriate. In this paper we describe solutions to both of these problems. We show that use of centralised blocked indexes can reduce overall query processing costs in a multi-collection environment, and that careful application of text compression techniques allows collections to grow by several orders of magnitude without recompression becoming necessary.
Article
This paper examines the feasibility of merging the results of retrieval runs on separate, autonomous document collections into an effective combined result. In particular, we examine two collection fusion techniques that use the results of past queries to compute the number of documents to retrieve from each of a set of subcollections such that the total number of retrieved documents is equal to N, the number of documents to be returned to the user. The fusion techniques are independent of the particular weighting schemes, similarity measures, and retrieval models used by the component collections. Our official TREC-3 runs are fusion runs in which N = 1000; other runs investigate the effects of varying N. These results show that the precision averaged over the 50 queries is within 10% of the precision of an effective single collection run for a wide range of values of N.
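Once per-collection quotas are known, the merge itself can be as simple as the sketch below, which takes each collection's top documents up to its quota and orders them by local rank; this illustrates the general mechanism only, not the paper's specific fusion strategies.

```python
def merge(ranked_lists, quota):
    """Take each collection's top documents up to its quota and order the
    union by local rank (ties broken by collection name)."""
    merged = []
    for coll, docs in ranked_lists.items():
        merged += [(rank, coll, doc)
                   for rank, doc in enumerate(docs[:quota.get(coll, 0)], start=1)]
    return [doc for _, _, doc in sorted(merged)]

lists = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
print(merge(lists, {"A": 2, "B": 1}))  # ['a1', 'b1', 'a2']
```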
Article
The use of information retrieval systems in networked environments raises a new set of issues that have received little attention. These issues include ranking document collections for relevance to a query, selecting the best set of collections from a ranked list, and merging the document rankings that are returned from a set of collections. This paper describes methods of addressing each issue in the inference network model, discusses their implementation in the INQUERY system, and presents experimental results demonstrating their effectiveness.
Article
A database merging technique is a strategy for combining the results of multiple, independent searches into a single cohesive response. An isolated database merging technique selects the number of documents to be retrieved from each database without using data from the component databases at run-time. In this paper we investigate the effectiveness of two isolated database merging techniques in the context of the TREC-4 database merging task. The results show that on average a merged result contains about one fewer relevant document per query than a comparable single collection run when retrieving up to 100 documents.