Hui Zhang

Beijing University of Aeronautics and Astronautics (Beihang University), Beijing, China

Publications (28) · 25.06 Total impact

  • ABSTRACT: Information explosion is a critical challenge to the development of modern information systems. In particular, when an information system operates over the Internet, the amount of information on the web has been increasing exponentially and rapidly. Search engines such as Google and Baidu are essential tools for finding information on the Internet; valuable information, however, is still likely to be submerged in the ocean of search results from those tools. By automatically clustering results into groups based on subject, a search engine with a clustering feature allows users to select the most relevant results quickly. In this paper, we propose an online semantics-based method to cluster Chinese web search results. First, we employ the generalised suffix tree to extract the longest common substrings (LCSs) from search snippets. Second, we use HowNet to calculate the similarities of the words derived from the LCSs, and extract the most representative features by constructing the vocabulary chain. Third, we construct a vector of text features and calculate the snippets' semantic similarities. Finally, we improve the Chameleon algorithm to cluster the snippets. Extensive experimental results show that the proposed algorithm outperforms the suffix tree clustering method and other traditional clustering methods.
    Enterprise Information Systems 01/2014; 8(1). · 9.26 Impact Factor
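    A minimal sketch of the LCS-extraction step described above. The paper builds a generalised suffix tree over search snippets; for illustration only, the longest common substring of two snippets is computed here with straightforward dynamic programming, and the snippet strings are invented examples:

    ```python
    def longest_common_substring(a: str, b: str) -> str:
        """Return the longest contiguous substring shared by a and b (DP sketch)."""
        best_len, best_end = 0, 0
        prev = [0] * (len(b) + 1)          # prev[j]: common-suffix length of a[:i-1], b[:j]
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    curr[j] = prev[j - 1] + 1
                    if curr[j] > best_len:
                        best_len, best_end = curr[j], i
            prev = curr
        return a[best_end - best_len:best_end]

    # Toy snippets (invented); the paper works on Chinese search snippets.
    s1 = "cloud computing for bioinformatics data analysis"
    s2 = "data analysis pipelines for cloud computing"
    print(longest_common_substring(s1, s2))  # -> "cloud computing"
    ```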
  • ABSTRACT: Recently the topic model has become more and more popular in fields such as information retrieval and semantic relatedness computing, but its practical application is limited by the scalability of the data: it cannot be executed efficiently on large-scale datasets in a parallel way. In this paper, we introduce an improved Regularized Latent Semantic Indexing (RLSI) with L1/2 regularization and non-negative constraints. This method formalizes the topic model as the problem of minimizing a quadratic loss function regularized by the L1/2 and L2 norms with non-negative constraints. This formulation allows the learning process to be decomposed into a series of mutually independent sub-optimization problems which can be processed in parallel; therefore, it has the ability to handle large-scale data. The non-negative constraints and L1/2 regularization make our model more practical and more conducive to information retrieval and semantic relatedness computing. Extensive experimental results show that our improved model can deal with large-scale text data and, compared with some of the state-of-the-art topic models, is also very effective.
    Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering; 12/2013
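    An illustrative sketch of the kind of objective described above: a quadratic reconstruction loss with an L2 penalty on one factor, an L1/2-style penalty on the other, and non-negativity enforced by clipping. This is a serial projected-gradient toy, not the paper's parallel decomposition into independent sub-problems, and all parameter values are assumptions:

    ```python
    import numpy as np

    def rlsi_step(D, U, V, lam_u=0.1, lam_v=0.01, lr=1e-3):
        """One projected-gradient step on
        ||D - U V||_F^2 + lam_u * ||U||_F^2 + lam_v * sum |V_ij|^(1/2),
        with non-negativity enforced by clipping (illustration only)."""
        R = U @ V - D                              # reconstruction residual
        grad_U = 2 * R @ V.T + 2 * lam_u * U
        # crude (sub)gradient of the L1/2 penalty, assuming V >= 0; eps avoids /0
        grad_V = 2 * U.T @ R + lam_v * 0.5 / np.sqrt(np.abs(V) + 1e-8)
        U = np.clip(U - lr * grad_U, 0.0, None)    # keep factors non-negative
        V = np.clip(V - lr * grad_V, 0.0, None)
        return U, V

    # toy term-document matrix and random non-negative factors (invented data)
    rng = np.random.default_rng(0)
    D = rng.random((50, 20))
    U, V = rng.random((50, 5)), rng.random((5, 20))
    for _ in range(200):
        U, V = rlsi_step(D, U, V)
    print(np.linalg.norm(D - U @ V))  # reconstruction error after a few steps
    ```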
  • ABSTRACT: Much work has been done on feature selection. Existing methods are based on document frequency, such as the Chi-Square Statistic, Information Gain, etc. However, these methods have two shortcomings: one is that they are not reliable for low-frequency terms, and the other is that they only count whether a term occurs in a document and ignore the term frequency. In fact, high-frequency terms within a specific category are often regarded as discriminators. This paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach based on the $t$-test, which is used to measure the difference between the distributions of a term in a specific category and in the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that our new approach is comparable to or slightly better than the state-of-the-art feature selection methods (i.e., $\chi^2$ and IG) in terms of macro-$F_1$ and micro-$F_1$.
    05/2013;
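    A minimal sketch of the idea of scoring a term by comparing its frequency distribution inside a category against the whole corpus. Welch's t-test from SciPy is used as a stand-in; the paper defines its own t-test-based criterion, and the counts below are invented:

    ```python
    import numpy as np
    from scipy import stats

    def t_score(freqs_in_category, freqs_in_corpus):
        """Welch t-statistic comparing a term's per-document frequencies inside
        one category against the entire corpus (illustration only)."""
        t, _ = stats.ttest_ind(freqs_in_category, freqs_in_corpus, equal_var=False)
        return abs(t)

    # toy per-document counts of one term (invented numbers)
    in_cat = np.array([3, 5, 2, 4, 6])        # documents of the target category
    corpus = np.array([0, 1, 0, 3, 5, 2, 0])  # all documents in the corpus
    print(t_score(in_cat, corpus))
    ```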
  • ABSTRACT: Text classification/categorization (TC) is the task of assigning new unlabeled natural language documents to predefined thematic categories. The centroid-based classifier (CC) has been widely used for TC because of its simplicity and efficiency. However, it has also long been criticized for its relatively low classification accuracy compared with state-of-the-art classifiers such as support vector machines (SVMs). In this paper, we find that for CC, using only border instances rather than all instances to construct centroid vectors can yield higher generalization accuracy. Along this line, we propose the Border-Instance-based Iteratively Adjusted Centroid Classifier (IACC_BI), which relies on the border instances found by some routine, e.g. the 1-Nearest-and-1-Furthest-Neighbors strategy, to construct centroid vectors for CC. IACC_BI then iteratively adjusts the initial centroid vectors according to the misclassified training instances. Our extensive experiments on 11 real-world text corpora demonstrate that IACC_BI greatly improves the performance of centroid-based classifiers and obtains classification accuracy competitive with the well-known SVMs, at significantly lower computational cost.
    Neurocomputing. 02/2013; 101:299–308.
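    For context, a plain centroid-based classifier (the baseline that IACC_BI improves on) can be sketched as follows; the border-instance selection and iterative adjustment steps of IACC_BI are not reproduced, and the data is a toy example:

    ```python
    import numpy as np

    def train_centroid_classifier(X, y):
        """Plain centroid-based classifier: one normalized mean vector per class."""
        centroids = {}
        for label in np.unique(y):
            c = X[y == label].mean(axis=0)
            centroids[label] = c / (np.linalg.norm(c) + 1e-12)
        return centroids

    def predict(centroids, x):
        x = x / (np.linalg.norm(x) + 1e-12)
        # assign to the class whose centroid has the highest cosine similarity
        return max(centroids, key=lambda label: float(x @ centroids[label]))

    # toy 2-D "documents" (invented data)
    X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
    y = np.array([0, 0, 1, 1])
    model = train_centroid_classifier(X, y)
    print(predict(model, np.array([0.95, 0.15])))  # -> 0
    ```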
  • Hui Zhang, Zhenan Li, Wenjun Wu
    ABSTRACT: In a data-driven science collaborative framework, access authorization is a vital component for managing the collective data and computing resources shared by researchers from geographically distributed locations. However, traditional virtual-organization-based access control frameworks are not suitable for self-organizing, ad-hoc and opportunistic scientific collaborations, in which scientists need to easily set up group-oriented authorization rules across administrative domains to share their resources through flexible and effective access control. Using the emerging OAuth 2.0 protocol and the XACML framework, this paper introduces a novel OpenSocial-based access control framework to support ad-hoc team formation and user-controlled resource sharing. To verify the effectiveness of our authorization framework, we develop an infant birth-defect data and data-mining resource-sharing application. Our experience demonstrates that the proposed framework is a very promising approach to resource sharing in cross-domain network environments.
    Proceedings of the 2012 Second International Conference on Cloud and Green Computing; 11/2012
  • ABSTRACT: In today's business environment, enterprises are increasingly under pressure to process the vast amount of data produced every day within the enterprise. One approach is to focus on business intelligence (BI) applications and to increase the commercial added value through such business analytics activities. The term weighting scheme, which is used to represent documents as vectors in the term space, is a vital task in enterprise Information Retrieval (IR), text categorisation, text analytics, etc. When determining a term's weight in a document, the traditional TF-IDF scheme considers only the term's occurrence frequency within the document and in the entire set of documents, so some meaningful terms cannot receive an appropriate weight. In this article, we propose a new term weighting scheme called Term Frequency – Function of Document Frequency (TF-FDF) to address this issue. Instead of using a monotonically decreasing function such as Inverse Document Frequency, FDF is a convex function that dynamically adjusts weights according to the significance of the words in a document set. This function can be manually tuned based on the distribution of the most meaningful words that semantically represent the document set. Our experiments show that TF-FDF achieves a higher Normalised Discounted Cumulative Gain in IR than TF-IDF and its variants, improving the accuracy of relevance ranking of the IR results.
    Enterprise Information Systems 11/2012; 6(4):433-444. · 9.26 Impact Factor
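    For reference, the traditional TF-IDF scheme that TF-FDF replaces can be sketched as below. The abstract does not give the exact convex FDF function, so only the standard baseline is shown, on invented toy documents:

    ```python
    import math
    from collections import Counter

    def tf_idf(docs):
        """Classic TF-IDF weighting (the baseline contrasted with TF-FDF above)."""
        N = len(docs)
        df = Counter()                          # document frequency per term
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
        return weights

    # toy tokenized documents (invented examples)
    docs = [["price", "forecast", "model"],
            ["sales", "forecast", "report"],
            ["price", "report", "report"]]
    print(tf_idf(docs)[2])
    ```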
  • ABSTRACT: Cloud computing is increasingly becoming a popular solution for massive data analysis in bioinformatics. To enable scientists to harness the computing power provided by Cloud platforms, we designed Green Pipe, a scalable computational workflow system which runs jobs as MapReduce tasks on virtual Hadoop clusters. This paper introduces a power-aware scheduling algorithm in the workflow engine that optimizes workflow execution in terms of running time and energy consumption. Experimental results demonstrate the performance improvement achieved by Green Pipe.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International; 01/2012
  • Yihua Lou, Wenjun Wu, Hui Zhang
    ABSTRACT: In a large tiled-display environment, it is important to support multi-user interaction with juxtaposed applications for collaborative work. Although a few multi-user interaction systems have been developed, they often require modifications to desktop applications to make them simultaneously accessible to multiple users. In this paper, we propose a novel multi-user interaction system for large tiled-display environments that requires no customization of applications. To enable multiple users to cooperate within one application, three basic interaction strategies are introduced for different situations. Our experimental results show that this system not only enables users to interact with applications as smoothly as they would on their own desktops, but also offers a more collaborative and immersive experience in a large tiled-display environment.
    Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference on; 01/2012
  • ABSTRACT: We will demonstrate a multi-user interaction system that uses Kinect and Wii Remote for manipulating windows in both desktop and wall-sized environments. The system combines the gesture information collected by Kinect with other sensor information, such as acceleration from the Wii Remote, providing more accurate control and a more natural experience for users.
    Multimedia and Expo Workshops (ICMEW), 2012 IEEE International Conference on; 01/2012
  • Hui Zhang, Wenjun Wu, Zhenan Li
    ABSTRACT: In an e-Science data infrastructure, access control is a vital component for managing the collective data and computing resources shared by researchers from geographically distributed locations. However, conventional virtual-organization-based access control frameworks are not suitable for self-organizing, ad-hoc and opportunistic scientific collaborations, in which scientists need to easily set up group-oriented authorization rules across administrative domains. Using the emerging OAuth 2.0 protocol, this paper introduces a novel OpenSocial-based access control framework to support ad-hoc team formation and user-controlled resource sharing. Our experience with the development of the framework in e-Science data infrastructure projects demonstrates that the proposed framework is a very promising approach to resource sharing in cross-domain e-Science environments.
    E-Science (e-Science), 2012 IEEE 8th International Conference on; 01/2012
  • Wenjun Wu, Hui Zhang, ZhenAn Li
    ABSTRACT: In data-driven science projects, researchers distributed across different institutions often wish to team up easily for data and computing resource sharing in order to address challenging scientific problems. Typical VO-based authorization schemes are not suitable for such user-organized scientific collaborations. Using the emerging OAuth protocol, we introduce a novel group authorization scheme to support ad-hoc team formation and user-controlled resource sharing. Integrating this group authorization scheme, we define an OpenSocial-based scientific collaboration framework and develop a science gateway prototype named the Open Life Science Gateway (OLSGW) to verify and refine the framework. Our experience with the development of the OLSGW shows that the OAuth 2.0 based group authorization scheme is a very promising approach to resource sharing in Cloud environments, and that the OpenSocial-based framework enables science gateway developers to create domain-specific collaborative applications in a very flexible way.
    Cluster, Cloud and Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium on; 06/2011
  • ABSTRACT: Open source projects often maintain open bug repositories during development and maintenance, and reporters often point out directly or implicitly the reasons why bugs occur when they submit them. The comments about a bug are very valuable for developers to locate and fix it. Meanwhile, it is very common in large software for programmers to override or overload some methods according to the same logic; if one method causes a bug, other overridden or overloaded methods may cause related or similar bugs. In this paper, we propose and implement a tool, Rebug-Detector, which detects related bugs using bug information and code features. First, it extracts bug features from bug information in bug repositories; second, it locates bug methods in the source code and extracts their code features; third, it calculates the similarity between each overridden or overloaded method and the bug methods; finally, it determines which methods may cause potential related or similar bugs. We evaluate Rebug-Detector on an open source project, Apache Lucene-Java. Our tool detects 61 related bugs in total, including 21 real bugs and 10 suspected bugs, and the analysis takes about 15.5 minutes. The results show that the bug features and code features extracted by our tool are useful for finding real bugs in existing projects.
    Computing Research Repository - CORR. 03/2011;
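    A loose illustration of the similarity step described above: comparing code features of a buggy method with those of an overridden/overloaded candidate. Token-count cosine similarity is used here purely as a stand-in; Rebug-Detector's actual bug and code features are not reproduced, and the token lists are invented:

    ```python
    import math
    from collections import Counter

    def token_cosine(a_tokens, b_tokens):
        """Cosine similarity over code-token counts (illustration only)."""
        a, b = Counter(a_tokens), Counter(b_tokens)
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    buggy = ["reader", "close", "if", "reader", "null"]      # tokens of the buggy method (toy)
    candidate = ["writer", "close", "if", "writer", "null"]  # tokens of an overloaded method (toy)
    print(token_cosine(buggy, candidate))
    ```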
  • ABSTRACT: Spectra-based fault localization (SFL) techniques have brought encouraging results, and a variety of program spectra have been proposed to locate faults. Different types of abnormal behavior may be revealed by different kinds of spectra. Compared to techniques using a single spectrum type, techniques combining multiple types of spectra try to leverage the strengths of the constituent types. However, in the presence of multiple kinds of spectra, how to select an adequate spectrum type and build appropriate models needs further investigation. In this paper, we propose an SFL technique, LOUPE, which uses multiple spectra-specific models. Both control and data dependences are introduced to capture the unusual behavior of faults. In the suspiciousness-modeling stage, in contrast with previous studies, we build a different model to evaluate the suspiciousness of statements for each spectrum type. Finally, since the fault type is unknown in advance, suspiciousness scores are calculated based on the two models. We evaluate LOUPE on the Siemens benchmark, and the experimental results show that our technique is promising.
    Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), TaiChung, Taiwan, March 21 - 24, 2011; 01/2011
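    To make the general SFL idea concrete, the sketch below ranks statements with the classic Ochiai formula over pass/fail coverage counts. This is only a generic illustration; LOUPE's spectra-specific suspiciousness models and its control/data-dependence spectra are not reproduced, and the counts are invented:

    ```python
    import math

    def ochiai(failed_cov, passed_cov, total_failed):
        """Ochiai suspiciousness of one statement, given how many failing and
        passing tests cover it and the total number of failing tests."""
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        return failed_cov / denom if denom else 0.0

    # toy coverage counts (failing, passing) for three statements (invented data)
    coverage = {"s1": (4, 1), "s2": (1, 8), "s3": (0, 5)}
    total_failed = 4
    ranking = sorted(coverage, key=lambda s: ochiai(*coverage[s], total_failed), reverse=True)
    print(ranking)  # statements ordered from most to least suspicious
    ```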
  • ABSTRACT: Cloud computing is increasingly becoming a popular solution for massive data analysis in the life science community. To fully harness the power of Cloud computing, scientists need science gateways to efficiently manage their virtual machines, share Cloud resources, and run high-throughput sequence analysis with bioinformatics software tools. This paper introduces the development and use of the Open Life Science Gateway, which manages computational jobs on top of Hadoop streaming and supports user-customized runtime environments via virtual machine images. Moreover, it helps researchers team up to solve challenging computing problems by sharing Cloud-based data sources and software tools. This gateway has been used to investigate better B-cell epitope prediction.
    IEEE 7th International Conference on E-Science, e-Science 2011, Stockholm, Sweden, December 5-8, 2011; 01/2011
  • Deqing Wang, Hui Zhang
    ABSTRACT: Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers and SVMs. The most widely used term weighting scheme in text categorization, tf.idf, originated in the information retrieval (IR) field, and the intuition behind idf seems less reasonable for text categorization than for IR. In this paper, we introduce inverse category frequency (icf) into the term weighting scheme and propose two novel approaches: tf.icf and icf-based supervised term weighting. The tf.icf scheme adopts icf in place of the idf factor and favors terms occurring in fewer categories, rather than fewer documents. The icf-based approach combines icf and relevance frequency (rf) to weight terms in a supervised way. Our cross-classifier and cross-corpus experiments show that the proposed approaches are superior or comparable to six supervised term weighting schemes and three traditional schemes in terms of macro-F1 and micro-F1.
    Computing Research Repository - CORR. 12/2010;
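    A minimal sketch of one plausible reading of tf.icf: replace the idf factor with an inverse category frequency icf(t) = log(|C| / cf(t)), so terms concentrated in fewer categories get larger weights. The exact formula and the rf-based supervised variant from the paper are not reproduced; the toy documents are invented:

    ```python
    import math
    from collections import Counter

    def tf_icf(docs, labels):
        """tf.icf weighting sketch: tf(t, d) * log(|C| / cf(t)), where cf(t) is
        the number of categories whose documents contain term t."""
        categories = set(labels)
        cf = Counter()
        for c in categories:
            terms_in_c = set(t for doc, l in zip(docs, labels) if l == c for t in doc)
            cf.update(terms_in_c)
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: tf[t] * math.log(len(categories) / cf[t]) for t in tf})
        return weights

    # toy tokenized documents with category labels (invented data)
    docs = [["goal", "match", "report"], ["match", "score"],
            ["stock", "market", "report"], ["market", "price"]]
    labels = ["sports", "sports", "finance", "finance"]
    print(tf_icf(docs, labels)[0])  # "report" occurs in both categories, so its weight is 0
    ```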
  • ABSTRACT: The number of bug reports in complex software increases dramatically. Since bugs are currently triaged manually, bug triage or assignment is a labor-intensive and time-consuming task. Without knowledge of the structure of the software, testers often specify the component of a new bug incorrectly; meanwhile, it is difficult for triagers to determine the component of the bug from its description alone. We find that the components of 28,829 bugs in the Eclipse bug project were specified wrongly and modified at least once, which forces these bugs to be reassigned and delays the process of bug fixing. The average time to fix wrongly-specified bugs is longer than that of correctly-specified ones. To solve the problem automatically, we use historical fixed bug reports as the training corpus and build classifiers based on support vector machines and Naïve Bayes to predict the component of a new bug. The best prediction accuracy reaches 81.21% on our validation corpus from the Eclipse project. On average, our predictive model can save about 54.3 days of bug repair time for triagers and developers. Keywords: bug reports; bug triage; text classification; predictive model
    Computing Research Repository - CORR. 10/2010;
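    A minimal sketch of the component-prediction setup described above, using scikit-learn as an assumed toolkit (the paper only states that SVM and Naïve Bayes classifiers are trained on historical fixed bug reports); the bug summaries below are invented:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # toy training data: bug-report summaries and their (correct) components
    reports = ["NPE in editor when saving file",
               "Debugger hangs on conditional breakpoint",
               "Syntax highlighting broken after update"]
    components = ["UI", "Debug", "UI"]

    # bag-of-words features + Naive Bayes; an SVM (e.g. LinearSVC) can be swapped in
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(reports, components)
    print(model.predict(["editor crashes while saving"]))  # predicted component
    ```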
  • ABSTRACT: Question classification plays a crucial role in a question answering system. Recent research on open-domain question classification mostly concentrates on using machine learning methods to solve this special kind of text classification. This paper presents our research on Chinese question classification using machine learning and gives our approach based on SVM and semantic gram extraction. SVM has been widely used for question classification and has achieved good performance. We use SVM as the classifier and propose a new feature extraction method for Chinese questions, called semantic gram extraction, which is based on word semantics and N-grams. The experimental results show that this feature extraction performs well with SVM and that our approach reaches high classification accuracy.
    Computer Science-Technology and Applications, International Forum on. 12/2009; 1:432-435.
  • Hongping Hu, Hui Zhang
    ABSTRACT: Named entity recognition (NER) is one of the key techniques in natural language processing tasks such as information extraction, text summarization and so on. Chinese NER is more complicated and difficult than in other languages because of the characteristics of the language. This paper investigates Chinese named entity recognition based on CRFs and implements recognition of three main named entity types (person, location, and organization) at two levels: word level and character level. Experiments are conducted to compare the performance of the two models. In the experiments, different training scales and feature sets are used to examine the models' relationship with the training corpus and their ability to make use of different features.
    Computational Intelligence and Security, 2008. CIS '08. International Conference on; 01/2009
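    A small sketch of character-level feature extraction of the kind typically fed to a CRF toolkit for Chinese NER. The actual feature templates and training setup of the paper are not reproduced; the feature names and the example sentence are assumptions:

    ```python
    def char_features(sentence, i):
        """Feature dictionary for the character at position i (illustrative templates)."""
        ch = sentence[i]
        return {
            "char": ch,
            "prev_char": sentence[i - 1] if i > 0 else "<BOS>",
            "next_char": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
            "is_digit": ch.isdigit(),
        }

    sentence = "张三在北京工作"  # toy sentence: "Zhang San works in Beijing"
    features = [char_features(sentence, i) for i in range(len(sentence))]
    print(features[3])  # features for the character 北
    ```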
  • ABSTRACT: With the development of the Internet, there are many documents on the same topic, which contain redundancy. Multi-document summarization is a natural language processing technology that extracts important information from multiple texts about the same topic according to a compression ratio. Sentence selection is an important part of multi-document summarization. In this paper, we design a method for computing Chinese word semantic similarity based on HowNet and Tongyici Cilin, and a method for computing sentence semantic similarity based on N-grams. Using these methods, we evaluate the importance of each candidate sentence by exploiting both its correlation with the query and its global association. We use an improved MMR method to select sentences in order to reduce redundancy. The evaluation and results presented show that the proposed methods are efficient and that the summaries generated are of good quality.
    01/2009;
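    A compact sketch of the MMR sentence-selection idea: greedily pick sentences that are similar to the query but dissimilar to already-selected sentences. The similarity scores are precomputed toy values; the paper's HowNet/Tongyici Cilin and N-gram similarity measures and its specific MMR improvement are not reproduced:

    ```python
    def mmr_select(candidates, query_sim, pair_sim, k=3, lam=0.7):
        """Greedy Maximal Marginal Relevance: balance query relevance against
        redundancy with already-selected sentences (illustration only)."""
        selected, remaining = [], list(range(len(candidates)))
        while remaining and len(selected) < k:
            def score(i):
                redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
                return lam * query_sim[i] - (1 - lam) * redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return [candidates[i] for i in selected]

    # toy example with invented similarity scores
    sents = ["A", "B", "C", "D"]
    q_sim = [0.9, 0.85, 0.3, 0.8]
    p_sim = [[1, .9, .1, .2], [.9, 1, .1, .3], [.1, .1, 1, .1], [.2, .3, .1, 1]]
    print(mmr_select(sents, q_sim, p_sim, k=2))  # -> ['A', 'D']: B is too close to A
    ```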
  • Deqing Wang, Hui Zhang, Gang Zhou
    ABSTRACT: Web pages often contain “clutter” (defined by us as unnecessary images, navigational menus and extraneous ad links) around the body of an article that may distract users from the actual content. How to extract useful and relevant themes from these web pages has therefore become a research focus. This paper proposes a new method for web theme extraction. The method first uses a page segmentation technique to divide a web page into unrelated blocks, then calculates the entropy of each block and of the entire web page, prunes redundant blocks whose entropy exceeds the page-level threshold, and finally exports the remaining blocks as the theme of the web page. Experiments verify that the new method is effective for theme extraction from Chinese web pages.
    Foundations of Intelligent Systems, 18th International Symposium, ISMIS 2009, Prague, Czech Republic, September 14-17, 2009. Proceedings; 01/2009
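    A simplified sketch of the entropy-based pruning rule described above: compute the entropy of each block and of the whole page, and keep only blocks whose entropy does not exceed the page-level value. The page-segmentation step and the paper's exact entropy definition are not reproduced; the token blocks are invented:

    ```python
    import math
    from collections import Counter

    def entropy(tokens):
        """Shannon entropy of a token distribution."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def extract_theme_blocks(blocks):
        """Keep blocks whose entropy does not exceed that of the entire page."""
        page_tokens = [t for block in blocks for t in block]
        threshold = entropy(page_tokens)
        return [block for block in blocks if entropy(block) <= threshold]

    # toy blocks as token lists (invented): a repetitive article block and a
    # navigation block whose tokens are all distinct
    article = ["storm", "coast"] * 6
    navbar = ["home", "news", "sports", "finance", "travel", "login", "contact", "about"]
    theme = extract_theme_blocks([article, navbar])
    print(len(theme))  # -> 1: the high-entropy navigation block is pruned
    ```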