
Zhiyuan Chen- PhD
- University of Maryland, Baltimore County
92
Publications
13,627
Reads
1,430
Citations
Current institution
Additional affiliations
August 2004 - present
Publications (92)
A large volume of network trace data are collected by the government, public, and private organizations, and can be analyzed for various purposes such as resolving network problems, improving network performance, and understanding user behavior. However, most organizations are reluctant to share their data with any external experts for analysis bec...
In recent years, there is a lot of interest in modeling students' digital traces in Learning Management System (LMS) to understand students' learning behavior patterns including aspects of meta-cognition and self-regulation, with the ultimate goal to turn those insights into actionable information to support students to improve their learning outco...
Network traces are considered a primary source of information to researchers, who use them to investigate research problems such as identifying user behavior, analyzing network hierarchy, maintaining network security, classifying packet flows, and much more. However, most organizations are reluctant to share their data with a third party or the pub...
Federated RDF systems allow users to retrieve data from multiple independent sources without needing to have all the data in the same triple store. The performance of these systems can be poor for large and geographically distributed RDF data where network transfer costs are high. This article introduces CBTP-OL and CBTP-Nhop, two novel join algori...
With the increasing adoption of Learning Management Systems (LMS) in colleges and universities, research in exploring the interaction data captured by these systems is promising in developing a better learning environment and improving teaching practice. Most of these research efforts focused on course-level variables to predict student performance...
In the age of IoT, collection of activity data has become ubiquitous. Publishing activity data can be quite useful for various purposes such as estimating the level of assistance required by older adults and facilitating early diagnosis and treatment of certain diseases. However, publishing activity data comes with privacy risks: Each dimension, i....
Publishing physical activity data can facilitate reproducible health-care research in several areas such as population health management, behavioral health research, and management of chronic health problems. However, publishing such data also brings high privacy risks related to re-identification which makes anonymization necessary. One of the cha...
Organizations often need to share mission-dependent data in a secure and flexible way. Examples include contact tracing for a contagious disease such as COVID-19, maritime search and rescue operations, or creating a collaborative bid for a contract. In such examples, the ability to access data may need to change dynamically, depending on the situat...
Machine learning models have been widely used in security applications. However, it is well-known that adversaries can adapt their attacks to evade detection. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called randomization is under-explored. In addition, most...
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of time and resources of the data analyst. Adequate tools and techniques must be used for data clean...
Maritime Search and Rescue missions involve complex operations in which multiple entities, playing different roles in dynamic situations, benefit from sharing mission-dependent data. We propose an approach to support situation-aware access control in a federated Data-as-a-Service architecture. We develop an ontology and rules to represent access co...
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of the time and resources of the data analyst. Adequate tools and techniques must be used for data cleaning....
Evolving cybersecurity threats are a persistent challenge for system administrators and security experts as new malwares are continually released. Attackers may look for vulnerabilities in commercial products or execute sophisticated reconnaissance campaigns to understand a target's network and gather information on security products like firewalls and...
Machine learning models have been widely used in security applications. However, it is well-known that adversaries can adapt their attacks to evade detection. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called randomization is under-explored. In addition, most ex...
As it becomes easy and inexpensive to store huge amount of data, concerns about privacy are increasing as well. Although service providers have privacy policies, research shows that users rarely read privacy policies. As a result, there has been little work done on how consumers respond to individual segments of privacy policies, which is important...
Evolving cybersecurity threats are a persistent challenge for system administrators and security experts as new malwares are continually released. Attackers may look for vulnerabilities in commercial products or execute sophisticated reconnaissance campaigns to understand a target's network and gather information on security products like firewalls...
Machine learning models have been widely used in security applications such as intrusion detection, spam filtering, and virus or malware detection. However, it is well-known that adversaries are always trying to adapt their attacks to evade detection. For example, an email spammer may guess what features spam detection models use and modify or remo...
Intrusion Detection Systems (IDS) have been widely used to detect cyber attacks in Cyber-Physical Systems (CPS). However, attackers can often adapt their attacking strategies to evade detection. Many commercial IDS are rule-based systems. This paper analyzes the possible attacking strategies against a widely used rule-based IDS, Snort, using hyper...
This paper proposes a method to anonymize network trace data by utilizing a novel perturbation technique that has strong privacy guarantee and at the same time preserves data utility. The resulting dataset can be used for security analysis, retaining the utility of the original dataset, without revealing sensitive information. Our method utilizes a...
We consider a special case in association rule mining where mining is conducted by a third party over data located at a central location that is updated from several source locations. The data at the central location is at rest while that flowing in through source locations is in motion. We impose some limitations on the source locations, so that t...
Modern scientific and web databases maintain large and heterogeneous data. These real-world database schemas contain over hundreds or even thousands of attributes and relations. Traditional predefined query forms are not able to satisfy various ad-hoc queries from users. This paper proposes DQF, a novel database query form interface, which is able...
It is often necessary for organizations to perform data mining tasks collaboratively without giving up their own data. This necessity has led to the development of privacy preserving distributed data mining. Several protocols exist which deal with data mining methods in a distributed scenario but most of these methods handle a single data mining ta...
Privacy has always been a great concern of patients and medical service providers. As a result of the recent advances in information technology and the government's push for the use of Electronic Health Record (EHR) systems, a large amount of medical data is collected and stored electronically. This data needs to be made available for analysis but...
Users often find that their queries against a database return too many answers, many of them irrelevant. A common solution is to rank the query results. The effectiveness of a ranking function depends on how well it captures users' preferences. However, database systems often do not have the complete information about users' preferences and users'...
Privacy concerns often prevent organizations from sharing data for data mining purposes. There has been a rich literature on privacy preserving data mining techniques that can protect privacy and still allow accurate mining. Many such techniques have some parameters that need to be set correctly to achieve the desired balance between privacy protec...
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, wo...
PURPOSE
One potential problem with the sharing of thin section CT (or MRI) datasets of the head and face for educational or research purposes is the ability to perform a surface reconstruction of the 3D dataset and obtain an image, which closely resembles the physical appearance of a patient. It has been suggested that despite de-identification of...
Association rule mining is an important data mining task applicable across many commercial and scientific domains. There are instances when association analysis must be conducted by a third party over data located at a central point, but updated from several source locations. The source locations may not allow tracking changes. The target location...
Time series are recorded values of an interesting phenomenon such as stock prices, household incomes, or patient heart rates over a period of time. Time series data mining focuses on discovering interesting patterns in such data. This article introduces a wavelet-based time series data analysis to interested readers. It provides a systematic survey...
Association rule mining is an important data mining method that has been studied extensively by the academic community and has been applied in practice. In the context of association rule mining, the state-of-the-art in privacy preserving data mining provides solutions for categorical and Boolean association rules but not for quantitative associati...
The identity of patients must be protected when patient data is shared. The two most commonly used models to protect identity of patients are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment)...
Data mining techniques have been widely used in many research disciplines such as medicine, life sciences, and social sciences to extract useful knowledge (such as mining models) from research data. Research data often needs to be published along with the data mining model for verification or reanalysis. However, the privacy of the published data n...
The ubiquity of the internet not only makes it very convenient for individuals or organizations to share data for data mining or statistical analysis, but also greatly increases the chance of privacy breach. There exist many techniques such as random perturbation to protect the privacy of such data sets. However, perturbation often has negative imp...
In empirical disciplines, data sharing leads to verifiable research and facilitates future research studies. Recent efforts of the PROMISE community contributed to data sharing and reproducible research in software engineering. However, an important portion of data used in empirical software engineering research still remains classified. This situa...
The discovery of software artifacts (files, documents, and datasets) relevant to a change request, can increase software reuse and reduce the cost of software development and maintenance. However, traditional search techniques often fail to provide the relevant documents because they do not consider relationships between software artifacts. We prop...
Database queries are often exploratory and users often find their queries return too many answers, many of them irrelevant. Existing approaches include categorization, ranking, and query refinement. The success of all these approaches depends on the utilization of user preferences. However, most existing work assumes that all users have the same us...
The discovery of relevant software artifacts can increase software reuse and reduce the cost of software development and maintenance. Furthermore, change requests, which are a leading cause of project failures, can be better classified and handled when all relevant artifacts are available to the decision makers. However, traditional full-text and s...
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
This paper describes a methodology of OLAP cube navigation to identify interesting surprises by using a skewness based approach. Three different measures of interestingness of navigation rules are proposed. The navigation rules are examined for their interestingness in terms of their expectedness of skewness from neighborhood rules. A novel Axis Sh...
Helpdesk databases are used to store past interactions between customers and companies to improve customer service quality. One common scenario of using a helpdesk database is to find whether recommendations exist given a new problem from a customer. However, customers often provide incomplete or even inaccurate information. Manually preparin...
There has been relatively little work on privacy preserving techniques for distance based mining. The most widely used ones are additive perturbation methods and orthogonal transform based methods. These methods concentrate on privacy protection in the average case and provide no worst case privacy guarantee. However, the lack of privacy guarantee...
The identity of patients must be protected when patient data are shared. The two most commonly used models to protect identity of patients are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment...
The Cornell Jaguar Project is exploring a variety of issues related to mobility and query processing. One broad theme is to break down the traditional client and server boundaries, leading to ubiquitous query processing. Another theme is to extend database and query processing techniques to small-scale and mobile devices. The project builds on and...
Face recognition technology has received much attention due to its application in defense and crime prevention. In such applications, there is great need to incorporate face recognition technologies onto mobile devices to allow on-the-spot field usage. However, there are four major problems that need to be solved, namely the limited storage and pro...
With the explosive growth of data and its distributed sources, there are increasing needs for secure cooperative data analysis. The issue of data reduction to decrease communication overheads and the issue of preservation of privacy of the shared data are becoming important. However, existing privacy preserving techniques do not work well for dista...
Navigating through multidimensional data cubes is a nontrivial task. Although On-Line Analytical Processing (OLAP) provides the capability to view multidimensional data through rollup, drill-down, and slicing-dicing, it offers minimal guidance to end users in the actual knowledge discovery process. In this article, we address this knowledge discove...
Normative models of e-government typically assert that horizontal (i.e., inter-agency) and vertical (i.e., inter-governmental) integration of data flows and business processes represent the most sophisticated form of e-government, delivering the greatest payoff for both governments and users. This paper concentrates on the integration of data suppo...
Database queries are often exploratory and users often find their queries return too many answers, many of them irrelevant. Existing work either categorizes or ranks the results to help users locate interesting results. The success of both approaches depends on the utilization of user preferences. However, most existing work assumes that all us...
Privacy preserving data mining has become increasingly popular because it allows sharing of privacy-sensitive data for analysis purposes. However, existing techniques such as random perturbation do not fare well for simple yet widely used and efficient Euclidean distance-based mining algorithms. Although original data distributions can be pretty ac...
The small screen size of handheld mobile devices poses an inherent problem in visualizing data: very often it is too difficult and unpleasant to navigate through the plethora of presented information. This paper presents a novel approach to personalized and adaptive content presentation for handheld devices, which has been implemented in a mobile f...
Much of business XML data has accompanying XSD specifications. In many scenarios "shredding" such XML data into a relational storage is a popular paradigm. Optimizing evaluation of XPath queries over such XML data requires paying careful attention to both the logical and physical designs of the relational database where XML data is shredded. None of...
Considering the size of quantitative attribute values and categorical attribute values in databases, the paper presents two privacy-preserving methods for mining quantitative association rules: one based on Boolean association rules, the other based on a partial transform measure. For each approach, the privacy and accuracy are...
Various index structures have been proposed to speed up the evaluation of XML path expressions. However, existing XML path indices suffer from at least one of three limitations: they focus only on indexing the structure (relying on a separate index for node content), they are useful only for simple path expressions such as root-to-leaf paths, or th...
In this paper, we examine the interplay of logical and physical design, and experimentally demonstrate that: (1) solving the logical mapping and the physical design problem independently leads to a suboptimal solution; (2) taking into account the physical design space impacts the space of logical mapping. Specifically, well-known outlining and inli...
The study on database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast growing field, Chinese researchers play more and more active roles. Research papers from Chinese scholars, both in China and abroad,...
In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Se...
Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, motivating the use of data compression techniques in database systems to trade reduced disk I/O against additional CPU overhead for compression and decompression of data. In this thesis, we study how to build compressed database s...
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed form...
Describes efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e. a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structur...
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data struct...
In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substri...
Decision-support applications in emerging environments require that SQL query results or intermediate results be shipped to clients for further analysis and presentation. These clients may use low bandwidth connections or have severe storage restrictions. Consequently, there is a need to compress the results of a query for efficient transfer and cl...
In this paper we identify a major area of research as a topic for next generation data mining. The research effort in the last decade on privacy preserving data mining has resulted in the development of numerous algorithms. However, most of the existing research has not been applied in any particular application context. Hence it is unclear whether...
The compression plan uses information derived from an analysis of the query and the particular query plan used to evaluate it. It also uses schema information as well as statistical information on stored tables. This semantic information enables much higher compression ratios than are achieved using traditional compression algorithms (e.g. using WI...