Zhiyuan Chen

  • PhD
  • University of Maryland, Baltimore County

About

92
Publications
13,627
Reads
1,430
Citations
Current institution
University of Maryland, Baltimore County
Additional affiliations
August 2004 - present
University of Maryland, Baltimore County
Position
  • Professor

Publications (92)
Article
A large volume of network trace data are collected by the government, public, and private organizations, and can be analyzed for various purposes such as resolving network problems, improving network performance, and understanding user behavior. However, most organizations are reluctant to share their data with any external experts for analysis bec...
Preprint
Full-text available
In recent years, there is a lot of interest in modeling students' digital traces in Learning Management System (LMS) to understand students' learning behavior patterns including aspects of meta-cognition and self-regulation, with the ultimate goal to turn those insights into actionable information to support students to improve their learning outco...
Article
Full-text available
Network traces are considered a primary source of information to researchers, who use them to investigate research problems such as identifying user behavior, analyzing network hierarchy, maintaining network security, classifying packet flows, and much more. However, most organizations are reluctant to share their data with a third party or the pub...
Article
Federated RDF systems allow users to retrieve data from multiple independent sources without needing to have all the data in the same triple store. The performance of these systems can be poor for large and geographically distributed RDF data where network transfer costs are high. This article introduces CBTP-OL and CBTP-Nhop, two novel join algori...
Conference Paper
Full-text available
With the increasing adoption of Learning Management Systems (LMS) in colleges and universities, research in exploring the interaction data captured by these systems is promising in developing a better learning environment and improving teaching practice. Most of these research efforts focused on course-level variables to predict student performance...
Article
In the age of IoT, collection of activity data has become ubiquitous. Publishing activity data can be quite useful for various purposes such as estimating the level of assistance required by older adults and facilitating early diagnosis and treatment of certain diseases. However, publishing activity data comes with privacy risks: Each dimension, i....
Chapter
Publishing physical activity data can facilitate reproducible health-care research in several areas such as population health management, behavioral health research, and management of chronic health problems. However, publishing such data also brings high privacy risks related to re-identification which makes anonymization necessary. One of the cha...
Conference Paper
Organizations often need to share mission-dependent data in a secure and flexible way. Examples include contact tracing for a contagious disease such as COVID-19, maritime search and rescue operations, or creating a collaborative bid for a contract. In such examples, the ability to access data may need to change dynamically, depending on the situat...
Article
Full-text available
Publishing physical activity data can facilitate reproducible health-care research in several areas such as population health management, behavioral health research, and management of chronic health problems. However, publishing such data also brings high privacy risks related to re-identification which makes anonymization necessary. One of the cha...
Article
Machine learning models have been widely used in security applications. However, it is well-known that adversaries can adapt their attacks to evade detection. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called randomization is under-explored. In addition, most...
Article
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of time and resources of the data analyst. Adequate tools and techniques must be used for data clean...
Preprint
Full-text available
Publishing physical activity data can facilitate reproducible health-care research in several areas such as population health management, behavioral health research, and management of chronic health problems. However, publishing such data also brings high privacy risks related to re-identification which makes anonymization necessary. One of the cha...
Conference Paper
Maritime Search and Rescue missions involve complex operations in which multiple entities, playing different roles in dynamic situations, benefit from sharing mission-dependent data. We propose an approach to support situation-aware access control in a federated Data-as-a-Service architecture. We develop an ontology and rules to represent access co...
Book
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of the time and resources of the data analyst. Adequate tools and techniques must be used for data cleaning....
Preprint
Full-text available
Evolving cybersecurity threats are a persistent challenge for system administrators and security experts as new malwares are continually released. Attackers may look for vulnerabilities in commercial products or execute sophisticated reconnaissance campaigns to understand a target's network and gather information on security products like firewalls and...
Conference Paper
Full-text available
Machine learning models have been widely used in security applications. However, it is well-known that adversaries can adapt their attacks to evade detection. There has been some work on making machine learning models more robust to such attacks. However, one simple but promising approach called randomization is under-explored. In addition, most ex...
Chapter
Full-text available
As it becomes easy and inexpensive to store huge amounts of data, concerns about privacy are increasing as well. Although service providers have privacy policies, research shows that users rarely read privacy policies. As a result, there has been little work done on how consumers respond to individual segments of privacy policies, which is important...
Conference Paper
Full-text available
Evolving cybersecurity threats are a persistent challenge for system administrators and security experts as new malwares are continually released. Attackers may look for vulnerabilities in commercial products or execute sophisticated reconnaissance campaigns to understand a target's network and gather information on security products like firewalls...
Preprint
Machine learning models have been widely used in security applications such as intrusion detection, spam filtering, and virus or malware detection. However, it is well-known that adversaries are always trying to adapt their attacks to evade detection. For example, an email spammer may guess what features spam detection models use and modify or remo...
Conference Paper
Intrusion Detection Systems (IDS) have been widely used to detect cyber attacks in Cyber-Physical Systems (CPS). However, attackers can often adapt their attacking strategies to evade detection. Many commercial IDS are rule-based systems. This paper analyzes the possible attacking strategies against a widely used rule-based IDS, Snort, using hyper...
Chapter
As it becomes easy and inexpensive to store huge amounts of data, concerns about privacy are increasing as well. Although service providers have privacy policies, research shows that users rarely read privacy policies. As a result, there has been little work done on how consumers respond to individual segments of privacy policies, which is important...
Conference Paper
Full-text available
This paper proposes a method to anonymize network trace data by utilizing a novel perturbation technique that has strong privacy guarantee and at the same time preserves data utility. The resulting dataset can be used for security analysis, retaining the utility of the original dataset, without revealing sensitive information. Our method utilizes a...
Article
Full-text available
As it becomes easy and inexpensive to store huge amounts of data, concerns about privacy are increasing as well. Although service providers have privacy policies, research shows that users rarely read privacy policies. As a result, there has been little work done on how consumers respond to individual segments of privacy policies, which is important...
Article
We consider a special case in association rule mining where mining is conducted by a third party over data located at a central location that is updated from several source locations. The data at the central location is at rest while that flowing in through source locations is in motion. We impose some limitations on the source locations, so that t...
Article
Full-text available
Modern scientific and web databases maintain large and heterogeneous data. These real-world database schemas contain over hundreds or even thousands of attributes and relations. Traditional predefined query forms are not able to satisfy various ad-hoc queries from users. This paper proposes DQF, a novel database query form interface, which is able...
Article
It is often necessary for organizations to perform data mining tasks collaboratively without giving up their own data. This necessity has led to the development of privacy preserving distributed data mining. Several protocols exist which deal with data mining methods in a distributed scenario but most of these methods handle a single data mining ta...
Article
Privacy has always been a great concern of patients and medical service providers. As a result of the recent advances in information technology and the government's push for the use of Electronic Health Record (EHR) systems, a large amount of medical data is collected and stored electronically. This data needs to be made available for analysis but...
Article
Users often find that their queries against a database return too many answers, many of them irrelevant. A common solution is to rank the query results. The effectiveness of a ranking function depends on how well it captures users' preferences. However, database systems often do not have the complete information about users' preferences and users'...
Article
Privacy concerns often prevent organizations from sharing data for data mining purposes. There has been a rich literature on privacy preserving data mining techniques that can protect privacy and still allow accurate mining. Many such techniques have some parameters that need to be set correctly to achieve the desired balance between privacy protec...
Chapter
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, wo...
Chapter
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, wo...
Conference Paper
Purpose: One potential problem with the sharing of thin section CT (or MRI) datasets of the head and face for educational or research purposes is the ability to perform a surface reconstruction of the 3D dataset and obtain an image, which closely resembles the physical appearance of a patient. It has been suggested that despite de-identification of...
Conference Paper
Association rule mining is an important data mining task applicable across many commercial and scientific domains. There are instances when association analysis must be conducted by a third party over data located at a central point, but updated from several source locations. The source locations may not allow tracking changes. The target location...
Article
Full-text available
Time series are recorded values of an interesting phenomenon such as stock prices, household incomes, or patient heart rates over a period of time. Time series data mining focuses on discovering interesting patterns in such data. This article introduces a wavelet-based time series data analysis to interested readers. It provides a systematic survey...
Chapter
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, wo...
Chapter
Association rule mining is an important data mining method that has been studied extensively by the academic community and has been applied in practice. In the context of association rule mining, the state-of-the-art in privacy preserving data mining provides solutions for categorical and Boolean association rules but not for quantitative associati...
Chapter
The identity of patients must be protected when patient data is shared. The two most commonly used models to protect identity of patients are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment)...
Article
Data mining techniques have been widely used in many research disciplines such as medicine, life sciences, and social sciences to extract useful knowledge (such as mining models) from research data. Research data often needs to be published along with the data mining model for verification or reanalysis. However, the privacy of the published data n...
Conference Paper
The ubiquity of the internet not only makes it very convenient for individuals or organizations to share data for data mining or statistical analysis, but also greatly increases the chance of privacy breach. There exist many techniques such as random perturbation to protect the privacy of such data sets. However, perturbation often has negative imp...
Article
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, wo...
Chapter
Association rule mining is an important data mining method that has been studied extensively by the academic community and has been applied in practice. In the context of association rule mining, the state-of-the-art in privacy preserving data mining provides solutions for categorical and Boolean association rules but not for quantitative associati...
Article
Association rule mining is an important data mining method that has been studied extensively by the academic community and has been applied in practice. In the context of association rule mining, the state-of-the-art in privacy preserving data mining provides solutions for categorical and Boolean association rules but not for quantitative associati...
Conference Paper
Full-text available
In empirical disciplines, data sharing leads to verifiable research and facilitates future research studies. Recent efforts of the PROMISE community contributed to data sharing and reproducible research in software engineering. However, an important portion of data used in empirical software engineering research still remains classified. This situa...
Conference Paper
The discovery of software artifacts (files, documents, and datasets) relevant to a change request, can increase software reuse and reduce the cost of software development and maintenance. However, traditional search techniques often fail to provide the relevant documents because they do not consider relationships between software artifacts. We prop...
Article
Full-text available
Database queries are often exploratory and users often find their queries return too many answers, many of them irrelevant. Existing approaches include categorization, ranking, and query refinement. The success of all these approaches depends on the utilization of user preferences. However, most existing work assumes that all users have the same us...
Article
Full-text available
The discovery of relevant software artifacts can increase software reuse and reduce the cost of software development and maintenance. Furthermore, change requests, which are a leading cause of project failures, can be better classified and handled when all relevant artifacts are available to the decision makers. However, traditional full-text and s...
Chapter
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
Article
This paper describes a methodology of OLAP cube navigation to identify interesting surprises by using a skewness based approach. Three different measures of interestingness of navigation rules are proposed. The navigation rules are examined for their interestingness in terms of their expectedness of skewness from neighborhood rules. A novel Axis Sh...
Article
Full-text available
Helpdesk databases are used to store past interactions between customers and companies to improve customer service quality. One common scenario of using helpdesk database is to find whether recommendations exist given a new problem from a customer. However, customers often provide incomplete or even inaccurate information. Manually preparin...
Article
There has been relatively little work on privacy preserving techniques for distance based mining. The most widely used ones are additive perturbation methods and orthogonal transform based methods. These methods concentrate on privacy protection in the average case and provide no worst case privacy guarantee. However, the lack of privacy guarantee...
Article
Full-text available
The identity of patients must be protected when patient data are shared. The two most commonly used models to protect identity of patients are L-diversity and K-anonymity. However, existing work mainly considers data sets with a single sensitive attribute, while patient data often contain multiple sensitive attributes (e.g., diagnosis and treatment...
Article
The Cornell Jaguar Project is exploring a variety of issues related to mobility and query processing. One broad theme is to break down the traditional client and server boundaries, leading to ubiquitous query processing. Another theme is to extend database and query processing techniques to small-scale and mobile devices. The project builds on and...
Article
Full-text available
Face recognition technology has received much attention due to its application in defense and crime prevention. In such applications, there is great need to incorporate face recognition technologies onto mobile devices to allow on-the-spot field usage. However there are four major problems that need to be solved, namely the limited storage and pro...
Article
Full-text available
With the explosive growth of data and its distributed sources, there are increasing needs for secure cooperative data analysis. The issue of data reduction to decrease communication overheads and the issue of preservation of privacy of the shared data are becoming important. However, existing privacy preserving techniques do not work well for dista...
Chapter
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
Chapter
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
Chapter
Navigating through multidimensional data cubes is a nontrivial task. Although On-Line Analytical Processing (OLAP) provides the capability to view multidimensional data through rollup, drill-down, and slicing-dicing, it offers minimal guidance to end users in the actual knowledge discovery process. In this article, we address this knowledge discove...
Article
Normative models of e-government typically assert that horizontal (i.e., inter-agency) and vertical (i.e., inter-governmental) integration of data flows and business processes represent the most sophisticated form of e-government, delivering the greatest payoff for both governments and users. This paper concentrates on the integration of data suppo...
Conference Paper
Full-text available
Database queries are often exploratory and users often find their queries return too many answers, many of them irrelevant. Existing work either categorizes or ranks the results to help users locate interesting results. The success of both approaches depends on the utilization of user preferences. However, most existing work assumes that all us...
Article
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
Chapter
Environmental research and knowledge discovery both require extensive use of data stored in various sources and created in different ways for diverse purposes. We describe a new metadata approach to elicit semantic information from environmental data and implement semantics-based techniques to assist users in integrating, navigating, and mining mul...
Article
Full-text available
Privacy preserving data mining has become increasingly popular because it allows sharing of privacy-sensitive data for analysis purposes. However, existing techniques such as random perturbation do not fare well for simple yet widely used and efficient Euclidean distance-based mining algorithms. Although original data distributions can be pretty ac...
Article
Full-text available
Navigating through multidimensional data cubes is a nontrivial task. Although On-Line Analytical Processing (OLAP) provides the capability to view multidimensional data through rollup, drill-down, and slicing-dicing, it offers minimal guidance to end users in the actual knowledge discovery process. In this article, we address this knowledge discove...
Conference Paper
Full-text available
The small screen size of handheld mobile devices poses an inherent problem in visualizing data: very often it is too difficult and unpleasant to navigate through the plethora of presented information. This paper presents a novel approach to personalized and adaptive content presentation for handheld devices, which has been implemented in a mobile f...
Article
Full-text available
Much of business XML data has accompanying XSD specifications. In many scenarios "shredding" such XML data into a relational storage is a popular paradigm. Optimizing evaluation of XPath queries over such XML data requires paying careful attention to both the logical and physical designs of the relational database where XML data is shredded. None of...
Article
Considering the size of quantitative and categorical attribute values in databases, the paper presents two privacy-preserving methods for mining quantitative association rules: one based on Boolean association rules, the other based on a partial transform measure. For each approach, the privacy and accuracy are...
Conference Paper
Full-text available
Various index structures have been proposed to speed up the evaluation of XML path expressions. However, existing XML path indices suffer from at least one of three limitations: they focus only on indexing the structure (relying on a separate index for node content), they are useful only for simple path expressions such as root-to-leaf paths, or th...
Article
Various index structures have been proposed to speed up the evaluation of XML path expressions. However, existing XML path indices suffer from at least one of three limitations: they focus only on indexing the structure (relying on a separate index for node content), they are useful only for simple path expressions such as root-to-leaf paths, or th...
Conference Paper
Full-text available
In this paper, we examine the interplay of logical and physical design, and experimentally demonstrate that: (1) solving the logical mapping and the physical design problem independently leads to a suboptimal solution; (2) taking into account the physical design space impacts the space of logical mapping. Specifically, well-known outlining and inli...
Article
Full-text available
The study on database technologies, or more generally, the technologies of data and information management, is an important and active research field. Recently, many exciting results have been reported. In this fast growing field, Chinese researchers play more and more active roles. Research papers from Chinese scholars, both in China and abroad,...
Article
In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Se...
Article
Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, motivating the use of data compression techniques in database systems to trade reduced disk I/O against additional CPU overhead for compression and decompression of data. In this thesis, we study how to build compressed database s...
Conference Paper
Full-text available
Over the last decades, improvements in CPU speed have outpaced improvements in main memory and disk access rates by orders of magnitude, enabling the use of data compression techniques to improve the performance of database systems. Previous work describes the benefits of compression for numerical attributes, where data is stored in compressed form...
Conference Paper
Describes efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e. a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structur...
Article
Full-text available
We describe efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e., a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data struct...
Article
Full-text available
In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substri...
Conference Paper
Full-text available
Decision-support applications in emerging environments require that SQL query results or intermediate results be shipped to clients for further analysis and presentation. These clients may use low bandwidth connections or have severe storage restrictions. Consequently, there is a need to compress the results of a query for efficient transfer and cl...
Article
Full-text available
In this paper we identify a major area of research as a topic for next generation data mining. The research effort in the last decade on privacy preserving data mining has resulted in the development of numerous algorithms. However, most of the existing research has not been applied in any particular application context. Hence it is unclear whether...
Article
Full-text available
The compression plan uses information derived from an analysis of the query and the particular query plan used to evaluate it. It also uses schema information as well as statistical information on stored tables. This semantic information enables much higher compression ratios than are achieved using traditional compression algorithms (e.g. using WI...
