Dawei Song

Tianjin University, T’ien-ching-shih, Tianjin Shi, China


Publications (139) · 31.23 Total Impact

  • Source
    ABSTRACT: Typical dimensionality reduction (DR) methods are often data-oriented, focusing on directly reducing the number of random variables (features) while retaining the maximal variations in the high-dimensional data. In unsupervised situations, one of the main limitations of these methods lies in their dependency on the scale of data features. This paper aims to address the problem from a new perspective and considers model-oriented dimensionality reduction in parameter spaces of binary multivariate distributions. Specifically, we propose a general parameter reduction criterion, called the Confident-Information-First (CIF) principle, to maximally preserve confident parameters and rule out less confident ones. Formally, the confidence of each parameter can be assessed by its contribution to the expected Fisher information distance within the geometric manifold over the neighbourhood of the underlying real distribution. We then revisit Boltzmann machines (BM) from a model selection perspective and theoretically show that both the fully visible BM (VBM) and the BM with hidden units can be derived from the general binary multivariate distribution using the CIF principle. This can help us uncover and formalize the essential parts of the target density that BM aims to capture and the non-essential parts that BM should discard. Guided by the theoretical analysis, we develop a sample-specific CIF for model selection of BM that is adaptive to the observed samples. The method is studied in a series of density estimation experiments and is shown to be effective in terms of estimation accuracy.
  •
    ABSTRACT: Query language modeling based on relevance feedback has been widely applied to improve the effectiveness of information retrieval. However, intra-query term dependencies (i.e., the dependencies between different query terms and term combinations) have not yet been sufficiently addressed in the existing approaches. This article aims to investigate this issue within a comprehensive framework, namely the Aspect Query Language Model (AM). We propose to extend the AM with a hidden Markov model (HMM) structure to incorporate the intra-query term dependencies and learn the structure of a novel aspect HMM (AHMM) for query language modeling. In the proposed AHMM, the combinations of query terms are viewed as latent variables representing query aspects. They further form an ergodic HMM, where the dependencies between latent variables (nodes) are modeled as the transitional probabilities. The segmented chunks from the feedback documents are considered as observables of the HMM. Then the AHMM structure is optimized by the HMM, which can estimate the prior of the latent variables and the probability distribution of the observed chunks. Our extensive experiments on three large-scale Text REtrieval Conference (TREC) collections have shown that our method not only significantly outperforms a number of strong baselines in terms of both effectiveness and robustness but also achieves better results than the AM and another state-of-the-art approach, namely the latent concept expansion model. © 2014 Wiley Periodicals, Inc.
    Computational Intelligence 10/2014; DOI:10.1111/coin.12058 · 0.87 Impact Factor
  •
    ABSTRACT: In this paper, we prove that specific early and specific late fusion strategies are interchangeable. In the case of late fusion, we consider not only linear but also nonlinear combinations of scores. Our findings are important from both theoretical and practical (applied) perspectives. The duality of specific fusion strategies also answers the question of why, in the literature, the experimental results for early and late fusion are often similar. The most important aspect of our research is, however, related to the presumed drawbacks of the aforementioned fusion strategies. It is an accepted fact that the main drawback of early fusion is the curse of dimensionality (generation of high-dimensional vectors), whereas the main drawback of late fusion is its inability to capture correlation between feature spaces. Our proof of the interchangeability of specific fusion schemes undermines this belief. Only one of two possibilities can hold: either late fusion is capable of capturing the correlation between feature spaces, or the interaction between the early fusion operators and the similarity measurements decorrelates the feature spaces. Keywords: information and data fusion, early fusion, late fusion, content-based image retrieval, information retrieval, multimedia retrieval, textual representation, visual representation
    The 17th International Conference on Information Fusion (Fusion 2014), Salamanca, Spain; 07/2014
  •
    ABSTRACT: The principle of extreme physical information (EPI) can be used to derive many known laws and distributions in theoretical physics by extremizing the physical information loss K, i.e., the difference between the observed Fisher information I and the intrinsic information bound J of the physical phenomenon being measured. However, for complex cognitive systems of high dimensionality (e.g., human language processing and image recognition), the information bound J can be much larger than I (J >> I), due to insufficient observation, which would lead to serious over-fitting problems in the derivation of cognitive models. Moreover, there is a lack of an established exact invariance principle that gives rise to the bound information in universal cognitive systems. This limits the direct application of EPI. To narrow down the gap between I and J, in this paper, we propose a confident-information-first (CIF) principle to lower the information bound J by preserving confident parameters and ruling out unreliable or noisy parameters in the probability density function being measured. The confidence of each parameter can be assessed by its contribution to the expected Fisher information distance between the physical phenomenon and its observations. In addition, given a specific parametric representation, this contribution can often be directly assessed by the Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then consider the dimensionality reduction in the parameter spaces of binary multivariate distributions. We show that the single-layer Boltzmann machine without hidden units (SBM) can be derived using the CIF principle. An illustrative experiment is conducted to show how the CIF principle improves the density estimation performance.
    Entropy 07/2014; 16(7):3670-3688. DOI:10.3390/e16073670 · 1.56 Impact Factor
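The connection drawn above between a parameter's Fisher information and the inverse variance of any unbiased estimate (via the Cramér-Rao bound) can be checked numerically. A minimal sketch, assuming a single Bernoulli parameter estimated by the sample mean; the values of p and n and the variable names are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 1000, 2000

# Fisher information carried by n Bernoulli(p) observations about p,
# and the resulting Cramer-Rao lower bound on estimator variance.
fisher_per_obs = 1.0 / (p * (1 - p))
cramer_rao_bound = 1.0 / (n * fisher_per_obs)   # = p(1-p)/n

# Empirical variance of the unbiased maximum-likelihood estimate
# (the sample mean) over many repeated experiments.
estimates = rng.binomial(n, p, size=trials) / n

print(cramer_rao_bound)   # ≈ 0.00021
print(estimates.var())    # close to the bound: high Fisher information
                          # corresponds to low estimator variance
```

The sample mean attains the bound here because it is the efficient estimator for a Bernoulli parameter; in general the bound is only a lower limit.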
  •
    ABSTRACT: The estimation of query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective in terms of high mean retrieval performance over all queries, but also stable in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, i.e., the bias–variance tradeoff, which is a fundamental theory in statistics. We formulate the notion of bias–variance regarding retrieval performance and estimation quality of query models. We then investigate several estimated query models, by analyzing when and why the bias–variance tradeoff will occur, and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections have been conducted to systematically evaluate our bias–variance analysis. Our approach and results will potentially form an analysis framework and a novel evaluation strategy for query language modeling.
    Information Processing & Management 01/2014; 50(1):199–217. DOI:10.1016/j.ipm.2013.08.004 · 1.07 Impact Factor
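The effectiveness/stability tradeoff above can be sketched with a generic bias-variance decomposition over per-query average precision (the paper's exact formulation differs; the AP numbers, the ideal-AP target, and the function name below are hypothetical):

```python
import numpy as np

def bias_variance(per_query_ap, ideal_ap=1.0):
    """Bias: shortfall of mean AP from an ideal target.
    Variance: instability of AP across queries.
    (Illustrative decomposition; the paper defines its own formulation.)"""
    ap = np.asarray(per_query_ap, dtype=float)
    return ideal_ap - ap.mean(), ap.var()

# Two hypothetical query models with identical mean effectiveness (0.42)
# but very different stability across queries.
stable   = [0.42, 0.40, 0.44, 0.41, 0.43]
unstable = [0.70, 0.10, 0.65, 0.05, 0.60]

print(bias_variance(stable))
print(bias_variance(unstable))   # same bias, much larger variance
```

Mean effectiveness alone cannot distinguish the two models; the variance term is what separates them.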
  • Source
    ABSTRACT: This paper reports on an approach to the analysis of form (layout and formatting) during genre recognition, recorded using eye tracking. The researchers focused on eight different types of e-mail, such as calls for papers, newsletters and spam, which were chosen to represent different genres. The study involved the collection of oculographic behaviour data, using metrics based on scanpath duration and scanpath length, to highlight the ways in which people view the features of genres. We found that genre analysis based on purpose and form (layout features, etc.) was an effective means of identifying the characteristics of these e-mails. The research, carried out on a group of 24 participants, highlighted their interaction with and interpretation of the e-mail texts and the visual cues or features perceived. In addition, the ocular strategies of scanning and skimming that they employed for processing the texts by block, genre and representation were evaluated.
    Information Processing & Management 01/2014; 50(1):175–198. DOI:10.1016/j.ipm.2013.08.005 · 1.07 Impact Factor
  •
    ABSTRACT: Modern search engines have been moving away from simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features are now integral parts of web search engines. However, generating good query modification suggestions remains a challenging issue. Query log analysis is one of the major strands of work in this direction. Although much research has been performed on query logs collected on the web as a whole, query log analysis to enhance search on smaller and more focused collections has attracted less attention, despite its increasing practical importance. In this article, we report on a systematic study of different query modification methods applied to a substantial query log collected on a local website that already uses an interactive search engine. We conducted experiments in which we asked users to assess the relevance of potential query modification suggestions that have been constructed using a range of log analysis methods and different baseline approaches. The experimental results demonstrate the usefulness of log analysis to extract query modification suggestions. Furthermore, our experiments demonstrate that a more fine-grained approach than grouping search requests into sessions allows for extraction of better refinement terms from query log files.
    Journal of the American Society for Information Science and Technology 10/2013; 64(10):1975–1994. DOI:10.1002/asi.22901 · 2.23 Impact Factor
  •
    ABSTRACT: The road network design problem is to optimize a road network by selecting existing paths to improve, or adding new paths, under certain constraints, e.g., a bound on the weighted sum of modification costs. Due to its multi-objective nature, the road network design problem is often challenging for designers. Empirically, the smaller the diameter of a road network, the more connected and efficient the network is. Based on this observation, we propose a set of constrained convex models for designing road networks with small diameters. Specifically, we theoretically prove that the diameter of the road network, evaluated with respect to the travel times in the network, can be bounded via the algebraic connectivity from spectral graph theory, since the upper and lower bounds on the diameter are inversely proportional to the algebraic connectivity. We can therefore focus on increasing the algebraic connectivity instead of directly reducing the network diameter, under the budget constraints. The resulting formulation is a semi-definite program, whose global solution can be obtained efficiently. We present simulation experiments to show the correctness of our method and compare it with an existing method based on a genetic algorithm.
    Proceedings of the Twenty-Third international joint conference on Artificial Intelligence; 08/2013
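The diameter/connectivity relationship above can be illustrated on a toy graph: the algebraic connectivity is the second-smallest eigenvalue of the graph Laplacian, and adding an edge under a hypothetical budget raises it, tightening the diameter bound. (The paper optimizes this quantity via a semi-definite program, which this sketch does not implement; the 4-node network and unit edge weights are made up.)

```python
import numpy as np

def algebraic_connectivity(adjacency):
    """Second-smallest eigenvalue of the graph Laplacian (the Fiedler value)."""
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    return np.sort(np.linalg.eigvalsh(laplacian))[1]

# Hypothetical 4-node road network as a weighted adjacency matrix:
# a simple path, then the same network after spending budget on one
# extra road that closes the cycle.
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
upgraded = path.copy()
upgraded[0, 3] = upgraded[3, 0] = 1.0   # new edge

print(algebraic_connectivity(path))      # 2 - sqrt(2) ≈ 0.586
print(algebraic_connectivity(upgraded))  # 2.0: higher connectivity,
                                         # hence a smaller diameter bound
```

In the same spirit, the paper's SDP searches over edge-weight improvements to maximize this eigenvalue subject to the budget.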
  •
    ABSTRACT: It has been recognized that, when an information retrieval (IR) system achieves improvement in mean retrieval effectiveness (e.g. mean average precision (MAP)) over all the queries, the performance (e.g., average precision (AP)) of some individual queries could be hurt, resulting in retrieval instability. Some stability/robustness metrics have been proposed. However, they are often defined separately from the mean effectiveness metric. Consequently, there is a lack of a unified formulation of effectiveness, stability and overall retrieval quality (considering both). In this paper, we present a unified formulation based on the bias-variance decomposition. Correspondingly, a novel evaluation methodology is developed to evaluate the effectiveness and stability in an integrated manner. A case study applying the proposed methodology to evaluation of query language modeling illustrates the usefulness and analytical power of our approach.
    Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval; 07/2013
  •
    ABSTRACT: The classical bag-of-words models for information retrieval (IR) fail to capture contextual associations between words. In this article, we propose to investigate pure high-order dependence among a number of words forming an inseparable semantic entity, that is, the high-order dependence that cannot be reduced to the random coincidence of lower-order dependencies. We believe that identifying these pure high-order dependence patterns will lead to a better representation of documents and novel retrieval models. Specifically, two formal definitions of pure dependence—unconditional pure dependence (UPD) and conditional pure dependence (CPD)—are given. The exact decision on UPD and CPD, however, is NP-hard in general. We hence derive and prove sufficient criteria that entail UPD and CPD within the well-principled information geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods for extracting word patterns with pure high-order dependence. Our methods are applied to and extensively evaluated on three typical IR tasks: text classification, and text retrieval with and without query expansion.
    ACM Transactions on Information Systems 07/2013; 31(3). DOI:10.1145/2493175.2493177 · 1.30 Impact Factor
  • Source
    ABSTRACT: 3-D CAD models are an important digital resource in the manufacturing industry. 3-D CAD model retrieval has become a key technology in product lifecycle management enabling the reuse of existing design data. In this paper, we propose a new method to retrieve 3-D CAD models based on 2-D pen-based sketch inputs. Sketching is a common and convenient method for communicating design intent during early stages of product design, e.g., conceptual design. However, converting sketched information into precise 3-D engineering models is cumbersome, and much of this effort can be avoided by reuse of existing data. To achieve this purpose, we present a user-adaptive sketch-based retrieval method in this paper. The contributions of this work are twofold. First, we propose a statistical measure for CAD model retrieval: the measure is based on sketch similarity and accounts for users' drawing habits. Second, for 3-D CAD models in the database, we propose a sketch generation pipeline that represents each 3-D CAD model by a small yet sufficient set of sketches that are perceptually similar to human drawings. User studies and experiments that demonstrate the effectiveness of the proposed method in the design process are presented.
    IEEE Transactions on Automation Science and Engineering 07/2013; 10(3):783-795. DOI:10.1109/TASE.2012.2228481 · 2.16 Impact Factor
  • Source
    ABSTRACT: Typical dimensionality reduction methods focus on directly reducing the number of random variables while retaining maximal variations in the data. In this paper, we consider the dimensionality reduction in parameter spaces of binary multivariate distributions. We propose a general Confident-Information-First (CIF) principle to maximally preserve parameters with confident estimates and rule out unreliable or noisy parameters. Formally, the confidence of a parameter can be assessed by its Fisher information, which establishes a connection with the inverse variance of any unbiased estimate for the parameter via the Cramér-Rao bound. We then revisit Boltzmann machines (BM) and theoretically show that both single-layer BM without hidden units (SBM) and restricted BM (RBM) can be solidly derived using the CIF principle. This can not only help us uncover and formalize the essential parts of the target density that SBM and RBM capture, but also suggest that the deep neural network consisting of several layers of RBM can be seen as the layer-wise application of CIF. Guided by the theoretical analysis, we develop a sample-specific CIF-based contrastive divergence (CD-CIF) algorithm for SBM and a CIF-based iterative projection procedure (IP) for RBM. Both CD-CIF and IP are studied in a series of density estimation experiments.
  • Lei Wang, Dawei Song, Eyad Elyan
    ABSTRACT: Most of the state-of-the-art approaches to Query-by-Example (QBE) video retrieval are based on the Bag-of-visual-Words (BovW) representation of visual content. It, however, ignores the spatial-temporal information, which is important for similarity measurement between videos. Direct incorporation of such information into the video data representation for a large-scale data set is computationally expensive in terms of storage and similarity measurement. It is also static regardless of the change of discriminative power of visual words for different queries. To tackle these limitations, in this paper, we propose to discover Spatial-Temporal Correlations (STC) imposed by the query example to improve the BovW model for video retrieval. The STC, in terms of spatial proximity and relative motion coherence between different visual words, is crucial to identify the discriminative power of the visual words. We develop a novel technique to emphasize the most discriminative visual words for similarity measurement, and incorporate this STC-based approach into the standard inverted index architecture. Our approach is evaluated on the TRECVID2002 and CC_WEB_VIDEO datasets for two typical QBE video retrieval tasks respectively. The experimental results demonstrate that it substantially improves the BovW model as well as a state-of-the-art method that also utilizes spatial-temporal information for QBE video retrieval.
    Proceedings of the 21st ACM international conference on Information and knowledge management; 10/2012
  • Source
    ABSTRACT: This paper reports on our task-based, observational, logged and questionnaire-based study of ocular behaviour pertaining to interaction with the structural features of text in Wikipedia, using eye tracking. We set natural and realistic online Wikipedia search tasks, focusing on which features and strategies (skimming or scanning) were the most important for the participants to complete their tasks. Our research, carried out on a group of 30 participants, highlighted their interactions with the structural areas of Wikipedia articles and the visual cues and features perceived while searching the Wiki text. We collected questionnaire and ocular behaviour (fixation metrics) data to highlight the ways in which people view the features in the articles. We found that our participants extensively interacted with layout features, such as tables, titles, bullet lists, contents lists, information boxes, and references. The eye tracking results showed that participants used the format and layout features and highlighted them as important; these features allowed participants to navigate to useful information consistently and were an effective means of locating relevant information for the completion of their tasks, with some success. This work presents results that contribute to the long-term goals of studying these features for genre and theoretical perception research.
    Proceedings of the 4th Information Interaction in Context Symposium, Nijmegen, the Netherlands; 08/2012
  • Source
    ABSTRACT: A concept hierarchy created from a document collection can be used for query recommendation on Intranets by ranking terms according to the strength of their links to the query within the hierarchy. A major limitation is that this model produces the same recommendations for identical queries, and rebuilding it from scratch periodically can be extremely inefficient due to the high computational costs. We propose to adapt the model by incorporating query refinements from search logs. Our intuition is that the concept hierarchy built from the collection and the search logs provide complementary conceptual views on the same search domain, and their integration should continually improve the effectiveness of recommended terms. Two adaptation approaches using query logs with and without click information are compared. We evaluate the concept hierarchy models (static and adapted versions) built from the Intranet collections of two academic institutions and compare them with a state-of-the-art log-based query recommender, the Query Flow Graph, built from the same logs. Our adaptive model significantly outperforms its static version and the query flow graph when tested over a period of time on data (documents and search logs) from two institutions' Intranets.
    Proceedings of the 35th International ACM SIGIR conference research and development in Information Retrieval (SIGIR'12); 08/2012
  •
    ABSTRACT: In the density estimation task, the Maximum Entropy (Maxent) model can effectively use reliable prior information via nonparametric constraints, that is, linear constraints without empirical parameters. However, reliable prior information is often insufficient, and parametric constraints become necessary but pose considerable implementation complexity. Improper setting of parametric constraints can result in overfitting or underfitting. To alleviate this problem, a generalization of Maxent, under the Tsallis entropy framework, is proposed. The proposed method introduces a convex quadratic constraint for the correction of the (expected) quadratic Tsallis Entropy Bias (TEB). Specifically, we demonstrate that the expected quadratic Tsallis entropy of sampling distributions is smaller than that of the underlying real distribution under the frequentist, Bayesian prior, and Bayesian posterior frameworks, respectively. This expected entropy reduction is exactly the (expected) TEB, which can be expressed by a closed-form formula and acts as a consistent and unbiased correction with an appropriate convergence rate. TEB indicates that the entropy of a specific sampling distribution should be increased accordingly. This entails a quantitative reinterpretation of the Maxent principle. By compensating for TEB and meanwhile forcing the resulting distribution to be close to the sampling distribution, our generalized quadratic Tsallis Entropy Bias Compensation (TEBC) Maxent can be expected to alleviate overfitting and underfitting. We also present a connection between TEB and the Lidstone estimator. As a result, a TEB-Lidstone estimator is developed by analytically identifying the rate of probability correction in Lidstone. Extensive empirical evaluation shows promising performance of both TEBC Maxent and TEB-Lidstone in comparison with various state-of-the-art density estimation methods.
    Computational Intelligence 07/2012; 30(2). DOI:10.1111/j.1467-8640.2012.00443.x · 0.87 Impact Factor
  •
    ABSTRACT: Today, searchers exploring the World Wide Web have come to expect enhanced search interfaces --- query completion and related searches have become standard. Here we propose a Formal Concept Analysis lattice as an underlying domain model to provide a source of query refinements. The initial lattice is constructed using NLP. User clicks on documents, seen as implicit user feedback, are harnessed to adapt it. In this paper, we explore the viability of this adaptation process and the results we present demonstrate its promise and limitations for proposing initial effective refinements when searching the diverse WWW domain.
    Proceedings of the 34th European conference on Advances in Information Retrieval; 04/2012
  • Source
    ABSTRACT: In information retrieval (IR) research, more and more focus has been placed on optimizing a query language model by detecting and estimating the dependencies between the query and the observed terms occurring in the selected relevance feedback documents. In this paper, we propose a novel Aspect Language Modeling framework featuring term association acquisition, document segmentation, query decomposition, and an Aspect Model (AM) for parameter optimization. Through the proposed framework, we advance the theory and practice of applying high-order and context-sensitive term relationships to IR. We first decompose a query into subsets of query terms. Then we segment the relevance feedback documents into chunks using multiple sliding windows. Finally we discover the higher order term associations, that is, the terms in these chunks with high degree of association to the subsets of the query. In this process, we adopt an approach by combining the AM with the Association Rule (AR) mining. In our approach, the AM not only considers the subsets of a query as “hidden” states and estimates their prior distributions, but also evaluates the dependencies between the subsets of a query and the observed terms extracted from the chunks of feedback documents. The AR provides a reasonable initial estimation of the high-order term associations by discovering the associated rules from the document chunks. Experimental results on various TREC collections verify the effectiveness of our approach, which significantly outperforms a baseline language model and two state-of-the-art query language models namely the Relevance Model and the Information Flow model. © 2012 Wiley Periodicals, Inc.
    Computational Intelligence 02/2012; 28(1). DOI:10.1111/j.1467-8640.2012.00407.x · 0.87 Impact Factor
  • Source
    ABSTRACT: This study examines reformulations of queries submitted to a search engine of a university Web site with a focus on (implicitly derived) user satisfaction and the performance of the underlying search engine. Using a search log of a university Web site we examined all reformulations submitted in a 10-week period and studied the relation between the popularity of the reformulation and the performance of the search engine estimated using a number of clickthrough-based measures. Our findings are a step towards building better query recommendation systems and suggest a number of metrics to evaluate query recommendation systems.
    Advances in Information Retrieval (Proceedings of the 34th European Conference on Information Retrieval (ECIR'12)); 01/2012
  •
    ABSTRACT: This paper presents the design and results of a task-based user study, based on Information Foraging Theory, of a novel user interaction framework - uInteract - for content-based image retrieval (CBIR). The framework includes a four-factor user interaction model and an interactive interface. The user study involves three focused evaluations, 12 simulated real-life search tasks with different complexity levels, 12 comparative systems and 50 subjects. Information Foraging Theory is applied to the user study design and the quantitative data analysis. The systematic findings have not only shown how effective and easy to use the uInteract framework is, but also illustrated the value of Information Foraging Theory for interpreting user interaction with CBIR. Keywords: Information Foraging Theory, user interaction, four-factor user interaction model, uInteract, content-based image retrieval
    09/2011: pages 241-251;

Publication Stats

815 Citations
31.23 Total Impact Points


  • 2010–2014
    • Tianjin University
      • • School of Computer Science and Technology
      • • Department of Computer Science
      T’ien-ching-shih, Tianjin Shi, China
    • University of Essex
      • School of Computer Science and Electronic Engineering
      Colchester, ENG, United Kingdom
  • 2013
    • Tianjin Open University
      T’ien-ching-shih, Tianjin Shi, China
    • Shanghai Open University
      Shanghai, Shanghai Shi, China
  • 2008–2012
    • The Robert Gordon University
      • School of Computing Science and Digital Media
      Aberdeen, Scotland, United Kingdom
  • 2006–2012
    • Milton Keynes College
      Milton Keynes, England, United Kingdom
  • 1970–2006
    • University of Queensland 
      • • Distributed Systems Technology Centre
      • • School of Information Technology and Electrical Engineering
      Brisbane, Queensland, Australia
  • 2005
    • The Open University (UK)
      • Knowledge Media Institute
      Milton Keynes, England, United Kingdom
  • 1999
    • The Chinese University of Hong Kong
      • Department of Systems Engineering and Engineering Management
      Hong Kong, Hong Kong