
Haihua Chen
- Doctor of Philosophy
- Assistant Professor at University of North Texas
I am recruiting prospective Ph.D. students in Information, Data, and Computer Science with full financial support!
About
90 Publications
23,358 Reads
897 Citations
Introduction
I am an Assistant Professor in Data Science with a joint appointment in Health Informatics, and the Director of the Intelligent Data Engineering and Analytics Lab at UNT. My team explores effective and efficient methods for accessing, interacting with, and analyzing large, distributed, heterogeneous, and multimedia information in order to build high-performance, reliable intelligent systems. We aim to develop computational models and build real-world applications for important domains.
Additional affiliations
January 2022 - July 2023
Education
August 2017 - May 2022
August 2014 - May 2017
August 2010 - May 2014
Publications (90)
Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV...
Tables and figures are usually used to present information in a structured and visual way in scientific documents. Understanding the tables and figures in scientific documents is significant for a series of downstream tasks, such as academic search, scientific knowledge graphs, and so on. Existing studies mainly focus on detecting figures and table...
Interdisciplinary topic reflects the knowledge exchange and integration between different disciplines. Analyzing its evolutionary path is beneficial for interdisciplinary research in identifying potential cooperative research direction and promoting the cross-integration of different disciplines. However, current studies on the evolution of interdi...
Automated legal text classification is a prominent research topic in the legal field. It lays the foundation for building an intelligent legal system. Current literature focuses on international legal texts, such as Chinese cases, European cases, and Australian cases. Little attention is paid to text classification for U.S. legal texts. Deep learni...
A high-quality corpus is essential for building an effective legal intelligence system. The quality of a corpus includes both the quality of original data and the quality of its corresponding labeling. The major quality dimensions of a legal corpus include comprehensiveness, freshness, and correctness. However, building a comprehensive, correct, an...
While edge computing significantly impacts modern computing systems and applications, it also introduces ethical and social implications. This chapter explores two key topics, privacy and bias, central to these concerns. Edge servers process large volumes of data offloaded from end devices, which raises privacy issues but also offers unique opportu...
This editorial aims to provide authors with a clear understanding of ethical expectations, journal policies, publisher guidelines and best practices to ensure their studies and publications align with recognized ethical standards. To illustrate these principles, this editorial presents hypothetical scenarios to explore key ethical considerations in...
With the advent and progression of Natural Language Processing (NLP) methodologies, the domain of automatic citation function classification has gained popularity and considerable research efforts have been contributed to this task. Automatic citation function classification has a joint computational linguistic and bibliometrics background. However...
Emojis have become ubiquitous in online communication, serving as a universal medium to convey emotions and decorative elements. Their widespread use transcends language and cultural barriers, enhancing understanding and fostering more inclusive interactions. While existing work gained valuable insight into emojis understanding, exploring emojis' c...
Online reviews play a crucial role in influencing seller–customer dynamics. This research evaluates the credibility and consistency of reviews based on volume, length, and content to understand the impacts of incentives on customer review behaviors, how to improve review quality, and decision-making in purchases. The data analysis reveals major fac...
Novelty is a critical characteristic of innovative scientific articles, and accurately identifying novelty can facilitate the early detection of scientific breakthroughs. However, existing methods for measuring novelty have two main limitations: (1) Metadata-based approaches, such as citation analysis, are retrospective and do not alleviate the pre...
Even though significant progress has been made in standardizing document layout analysis, complex layout documents like magazines and newspapers still present challenges. Models trained on standardized documents struggle with these complexities, and the high cost of annotating such documents limits dataset availability. To address this, we propose...
In recent years, considerable work has been done on the application of artificial intelligence (AI) and machine learning (ML) in the realm of cultural heritage. The purpose of this panel is to address the following questions: What is the current understanding and implementation status of AI/ML in GLAM collections? What are the associated concerns a...
ChatGPT has shown promise in assisting qualitative researchers with coding. Previous efforts have primarily focused on datasets derived from interviews and observations, leaving document analysis, another crucial data source, relatively unexplored. In this project, we address the rapidly emerging topic of disinformation regulatory policy as a pilot...
This poster examines the applications of Artificial Intelligence (AI) in oncology. We collect projects from the National Institutes of Health (NIH) that examine how AI applications and techniques were used in cancer research between 2018 and 2024. Results show AI's ability to enhance lung and pancreatic cancer through novel i...
Research resources (RRs) such as data, software, and tools are essential pillars of scientific research. The field of biomedicine, a critical scientific discipline, is witnessing a surge in research publications resulting in the accumulation of a substantial number of RRs. However, these resources are dispersed among various biomedical articles and...
Since the emergence of “deep neural networks (DNNs)”, several deep learning [...]
The Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2024; https://eeke-workshop.github.io/) and the 4th AI + Informetrics (AII2024; https://ai-informetrics.github.io/) was held in Changchun, China and online, co-located with the iConference2024. The two workshop series are designed to activel...
With the development of information and communication technology, project-based learning (PBL) has become an important pedagogical approach. Group leaders are critical in PBL, and prestige influences learner leadership. Regulation affects learners’ prestige, but research on their relationship is lacking. Through content analysis and epistemic netwo...
Question-answering based text summarization can produce personalized and specific summaries; however, the primary challenge is the generation and selection of questions that users expect the summary to answer. Large language models (LLMs) provide an automatic method for generating these questions from the original text. By prompting the LLM to answ...
Evaluating text summarization has been a challenging task in natural language processing (NLP). Automatic metrics which heavily rely on reference summaries are not suitable in many situations, while human evaluation is time-consuming and labor-intensive. To bridge this gap, this paper proposes a novel method based on large language models (LLMs) fo...
Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing da...
Purpose: This study aims to explore the applications of natural language processing (NLP) and data analytics in understanding large-scale digital collections in oral history archives.
Design/methodology/approach: NLP and data analytics were used to analyze the oral interview transcripts of 904 survivors of the Japanese American incarceration camps...
Purpose: This study aims to establish a reliable index to identify interdisciplinary breakthrough innovation effectively. We constructed a new index, the DDiv index, for this purpose.
Design/methodology/approach: The DDiv index incorporates the degree of interdisciplinarity in the breakthrough index. To validate the index, a data set combining the...
In recent years, online review systems have attracted attention as a way to assess seller-customer relations in e-commerce. To address quality concerns about online reviews, especially incentivized ones, this study evaluates credibility and consistency based on reviews’ volume, length, and content to distinguish the impact of incentives on customer rev...
Objective: Medical concept normalization (MCN) aims to map informal medical terms to formal medical concepts, a critical task in building machine learning systems for medical applications. However, most existing studies on MCN primarily focus on models and algorithms, often overlooking the vital role of data quality. This research evaluates MCN per...
Python for Information Professionals: How to Design Practical Applications to Capitalize on the Data Explosion is an introduction to the Python programming language for library and information professionals with little or no prior experience. As opposed to the many Python books available today that focus on the language only from a general sense, t...
Several recently developed neural network models have shown their potential for automated text summarization. However, the evaluation results of these models on summarization of long text are fairly close in almost every major evaluation parameter. None of these models including large language models GPT-3.5 and GPT-4 can well summarize long text w...
The Joint Workshop of the 4th Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023; https://eeke-workshop.github.io/) and the 3rd AI + Informetrics (AII2023; https://ai-informetrics.github.io/) was held at Santa Fe, New Mexico, USA and online, co-located with the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2...
In the era of intelligent applications, Mobile Edge Computing (MEC) is emerging as a promising technology that provides abundant resources for mobile devices. However, establishing a direct connection to the MEC server is not always feasible for certain devices. This paper introduces a novel Device-to-Device (D2D)-assisted system to address this ch...
This study investigated the potential of enhancing the performance of text classification by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. The study conducted experiments on three models - CNN, HiSAN, and BERT - using the Reuters dataset. First, the study evaluated the effectiven...
This research investigates the impact of data quality on the quality of text summarization, using software review summarization as a case study. It answers three research questions: 1. What is the most important quality dimension for measuring the quality of software reviews for the purpose of review summarization? Our answer is the informa...
Interdisciplinary concept association discovery is a fundamental task in interdisciplinary knowledge organization. Unlike general concept association, interdisciplinary concept association mainly manifests in the correlation between fine-grained concept properties, which requires that interdisciplinary concept association discovery be explored thro...
To reduce the conceptual ambiguity in interdisciplinary knowledge organization systems (KOSs) and enhance interdisciplinary KOS management, this paper proposes a framework for interdisciplinary semantic drift (ISD) detection based on the normal cloud model (NCM). In this framework, we first analyze the features of interdisciplinary concepts and pro...
This study reviews existing studies on misinformation. Our purposes are to understand the major research topics that have been investigated by researchers from a variety of disciplines, and to identify important areas for further exploration for library and information science scholars. We conducted automatic descriptive analysis and manual content...
The increasingly mature artificial intelligence technologies, such as big data, deep learning, and natural language processing, provide technical support for research on automatic text understanding and bring development opportunities for innovative measurement of scientific communication. Innovation measurement in scientific communication is a cha...
The ex-ante novelty measurement of scientific literature is an essential tool for academic data mining and scientific communication. It can help researchers and peer experts quickly identify highly creative articles among a large number of papers. This paper proposes a framework for novelty measurement of scientific literature based on contribution...
Purpose
This study aims to evaluate a method of building a biomedical knowledge graph (KG).
Design/methodology/approach
This research first constructs a COVID-19 KG on the COVID-19 Open Research Data Set, covering information over six categories (i.e. disease, drug, gene, species, therapy and symptom). The construction used open-source tools to ex...
Around 120,000 people of Japanese ancestry were forcibly removed to internment camps in the United States during World War II. The Densho Digital Repository curates a collection of oral histories, which mainly includes 904 filmed interviews about the “Japanese American incarceration experience from those who lived it”. This study uses web scraping techniq...
Background
Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemi...
Research contributions, which indicate how a research paper contributes new knowledge or new understanding in contrast to prior research on the topic, are the most valuable type of information for researchers to understand the main content of a paper. However, there is little research using research contributions to identify and recommend valuable...
In this research, the focus is on data-centric AI with a specific concentration on data quality evaluation and improvement for machine learning. We first present a practical framework for data quality evaluation and improvement, using a legal domain as a case study and building a corpus for legal argument mining. We first created an initial corpus...
Interest in assessing research impacts is increasing due to its importance for informing actions and funding allocation decisions. The level of innovation (also called “innovation degree” in the following article), one of the most essential factors that affect scientific literature’s impact, has also received increasing attention. However, current...
Machine learning (ML) has drawn great attention from academics as well as industries during the past decades and continues to achieve impressive human-level performance on nontrivial tasks such as image classification, voice recognition, natural language processing, and autopiloting. Both data and algorithms are critical to ensure the performance,...
Indicative list of anticipated article topics:
• Data quality assessment for machine learning and deep learning (including defining of dimensions, measurement, and evaluation techniques)
• Data quality management in high-stake domains (e.g., legal, medical, cyber security)
• Quality evaluation of knowledge graph and ontology system
• Experimental s...
Citations play a fundamental role in supporting authors’ contribution claims throughout a scientific paper. Labelling citation instances with different function labels is indispensable for understanding a scientific text. A single citation is the linkage between two scientific papers in the citation network. These citations encompass rich native in...
Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions from a large amount of data, yet they ignore the fact that poor data quality has a direct impact on the performance of the intrusion detection systems. This article first summarizes existin...
Observed from our experiment results, we recommend ensuring the performance and reliability of machine learning systems from four aspects: data preparation, data quality evaluation, data quality improvement, and model selection. Data preparation for building a domain-specific machine learning system mainly includes four steps: (1) Selecting a data...
Objective:
We analyzed the COVID-19 Open Research Dataset (CORD-19) to understand leading research institutions, collaborations among institutions, major publication venues, key research concepts, and topics covered by pandemic-related research.
Methods:
We conducted a descriptive analysis of authors' institutions and relationships, automatic co...
COVID‐19 is a pandemic disease affecting billions of people worldwide. Vaccination is one of the most effective approaches to bringing it fully under control. Thanks to coordinated efforts from all over the world, several brands of vaccines targeting COVID‐19 have passed clinical trials and been made available to the public. Growing numbers of people are taking v...
During the COVID‐19 pandemic, racist remarks accompanied by racist hashtags were disseminated via social media. In particular, Asian Americans in the U.S. have suffered from racism and xenophobia, resulting in physical violence and mental harassment in many cases. Despite the major function of social media as an open‐access platform for un...
Class imbalance is a common issue in real-world machine learning datasets. This problem is more obvious in intrusion detection since many attack types only have very few samples. Ignoring the imbalance issue or constructing the machine learning classifier on partial classes will lead to bias in the model performance. Motivated by a recent study tha...
Intrusion detection is an essential task in the cyber threat environment. Machine learning and deep learning techniques have been applied for intrusion detection. However, most of the existing research focuses on the model work but ignores the fact that poor data quality has a direct impact on the performance of a machine learning system. More atte...
Purpose
Researchers frequently encounter the following problems when writing scientific articles: (1) Selecting appropriate citations to support the research idea is challenging. (2) The literature review is not conducted extensively, which leads to working on a research problem that others have well addressed. This study focuses on citation recomm...
Poor data quality has a direct impact on the performance of the machine learning system that is built on the data. As a demonstrated effective approach for data quality improvement, transfer learning has been widely used to improve machine learning quality. However, the “quality improvement” brought by transfer learning was rarely rigorously valida...
Knowledge graphs have become an essential tool for semantic analysis with the development of natural language processing and deep learning. A high-quality knowledge graph is valuable for building a high-performance knowledge-driven application. Despite recent advances in information extraction (IE) techniques, no suitable automated methods can be applie...
This paper describes a semantic vector space model (SeVSM) and an information retrieval system based on the model. The SeVSM aims to improve information retrieval performance for domain-specific systems. In this model, we use an ontology to build the relations between any two keywords to solve the performance deficiency caused by the basic hypothesi...
Clinical case reports are the 'eyewitness' in biomedical literature and provide a valuable, unique, albeit noisy and underutilized type of evidence. The main finding is the reason for writing up a report. Retrieval of case reports based on main findings gives users a convenient way to access this eyewitness evidence. However, user retrieval...
Purpose
The purpose of this paper is to provide an integrated semantic information retrieval (IR) solution based on an ontology-improved vector space model for situations where a digital collection is established or curated. It aims to create a retrieval approach which could return the results by meanings rather than by keywords.
Design/methodolog...
Purpose
Citation contexts have been found useful in many scenarios. However, existing context-based recommendations have ignored the importance of diversity in reducing redundancy and thus cannot cover the broad range of user interests. To address this gap, the paper aims to propose a novel task that can recommend a set of diverse citation c...
The misplacement of books in a library is a common problem, which makes books difficult to locate. We propose a smart bookshelf solution based on RFID technology to allow easy locating of books. In our proposed solution, each book is equipped with an RFID tag while each layer of a bookshelf is equipped with an RFID reader. The bookshelf could then de...
Citation function and citation sentiment are two essential aspects of citation content analysis (CCA), which are useful for influence analysis and the recommendation of scientific publications. However, existing studies mostly use traditional machine learning methods; although deep learning techniques have also been explored, the improvement of the pe...
In the era of big scholarly data, researchers frequently encounter the following problems when writing scientific articles: 1) it's challenging to select appropriate references to support the research idea, and 2) literature review is not conducted extensively, which leads to working on a research problem that has been well addressed by others. Cit...
Purpose
In a people-oriented society, it is necessary to understand the electronic word of mouth (eWOM) of government information services. Government affair microblogs (GAM), which play an important role in orienting online opinions, listening to public voices, and establishing the government's image, are an ideal channel to achieve this goal. In essen...
User requirements for result diversification in image retrieval have been increasing with the explosion of image resources. Result diversification requires that image retrieval systems are made capable of handling semantic gaps between image visual features and semantic concepts, and providing both relevant and diversified image results. Context in...
Knowledge graphs have become much larger and more complex over the past several years due to their wide application in knowledge discovery. Many knowledge graphs were built using automated construction tools and via crowdsourcing. These graphs may contain a significant number of syntactic and semantic errors that greatly affect their quality. A low quality knowledge g...
The Smart and Connected Health (SCH) program at the National Science Foundation (NSF) has been established as a stand-alone solicitation since 2012. This article reviews and analyzes the 100 projects that have been funded since 2012 to understand their characteristics and the research challenges they have addressed in SCH. Descriptive analysis, top...
Precision medicine information retrieval (PMIR) is about matching the most relevant scientific articles to an individual patient for reliable disease treatment. The corresponding Precision Medicine (PM) Track organized by the 2017 Text REtrieval Conference [1] provides a test collection for evaluating the performance of PMIR techniques for finding reli...
Citation contexts of an article refer to sentences or paragraphs that cite that article. Citation contexts are especially useful for recommendation and summarization tasks. However, few studies have recognized the diversity of these citation contexts, thus leading to redundant recommendation lists and abstract [3]. To address this gap, we compared...