
Ben Kao
The University of Hong Kong | HKU · Department of Computer Science
Doctor of Philosophy
Publications: 193
Reads: 32,917
Citations: 6,973
Publications (193)
We explore the use of multi-stage prompting with retrieval-augmented generation (RAG) in generating legal analysis. We break the complex legal problems into small steps, including initial analysis, legal query formulation, legal sources retrieval, legal reasoning, result consolidation, and output generation. The design of this framework emulates th...
Online one-on-one tutoring serves as a supplementary approach to traditional classroom instruction. It has been shown to enhance personalized learning and academic performance. However, the dynamics of dialogic interactions within this educational setting are not fully understood. Thus, we present a computational analysis of dialogic interactions i...
Online one-on-one tutoring serves as a personalized approach to supplement classroom instruction. However, with the growing tutoring market, a single tutor often handles inquiries from students across primary, middle, and high school levels. Consequently, the extent of tutors’ interactions with students of varying grades and their use of tutoring s...
Nan Huo, Reynold Cheng, Ben Kao, [...], Ge Qu
Entity alignment (EA), a crucial task in knowledge graph (KG) research, aims to identify equivalent entities across different KGs to support downstream tasks like KG integration, text-to-SQL, and question-answering systems. Given rich semantic information within KGs, pre-trained language models (PLMs) have shown promise in EA tasks due to their exc...
The Hong Kong Legal Information Institute (HKLII) provides a repository of legal documents in Hong Kong, such as ordinances and historical court judgments. HKLII provides a search facility through which users retrieve relevant documents. We perform statistical analysis on the HKLII access log over a 5-year period, categorizing user search queries and...
Access to legal information is fundamental to access to justice. Yet accessibility refers not only to making legal documents available to the public, but also rendering legal information comprehensible to them. A vexing problem in bringing legal information to the public is how to turn formal legal documents such as legislation and judgments, which...
During the COVID-19 pandemic, educational activities have shifted online, providing opportunities for researchers to analyze interaction data between teachers and students. In this study, we focus on automatically annotating dialog acts in one-on-one tutoring on online platforms. We address the challenge of limited training data, particularly for “...
We study the problem of semantically annotating textual documents that are complex in the sense that the documents are long, feature rich, and domain specific. Due to their complexity, such annotation tasks require trained human workers, which are very expensive in both time and money. We propose CEMA, a method for deploying machine learning to ass...
Graph neural networks (GNNs) have emerged as the state-of-the-art paradigm for collaborative filtering (CF). To improve the representation quality over limited labeled data, contrastive learning has attracted attention in recommendation and has recently benefited graph-based CF models. However, the success of most contrastive methods heavily relies on m...
We study the problem of machine comprehension of court judgments and generation of descriptive tags for judgments. Our approach makes use of a legal taxonomy D, which serves as a dictionary of canonicalized legal concepts. Given a court judgment J, our method identifies the key contents of J and then applies Word2Vec and BERT-based models to select...
The main goal of the Social Technology and Research Laboratory (STAR Lab) at the University of Hong Kong (https://star.hku.hk) is to develop novel IT technologies that serve society. Our team has more than three years of experience in project development, web, app, and game design, photography, and video production. We are interested in Data S...
PageRank is a classic measure that effectively evaluates the importance of nodes in large graphs. It has been applied in numerous applications spanning data mining, Web algorithms, recommendation systems, load balancing, search and connectivity structures identification. Computing PageRank for large graphs is challenging and this has motivated the...
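For readers unfamiliar with the measure, the following is a minimal sketch of the textbook power-iteration form of PageRank on a tiny graph. It is not the method of the paper summarized above, which targets large graphs; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the three-node example graph are all illustrative assumptions.

```python
from typing import Dict, List


def pagerank(out_links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Textbook power iteration. Every link target must also be a key of out_links."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution
    for _ in range(iterations):
        # Each node receives the "teleport" share first.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # Split v's damped rank evenly among its out-neighbours.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank


# Hypothetical example graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

The scores sum to 1 after every iteration, and in this example node `c` ends up ranked above `b` because it collects rank from both `a` and `b`.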
Zhiyong Wu, Wei Bi, Xiang Li, [...], Ben Kao
We propose knowledge internalization (KI), which aims to complement the lexical knowledge into neural dialog models. Instead of further conditioning the knowledge-grounded dialog (KGD) models on externally retrieved knowledge, we seek to integrate knowledge about each input token internally into the model's parameters. To tackle the challenge due t...
Big Railway Data, such as train movement logs and timetables, have become increasingly available. By analyzing these data, insights about train movement and delay can be extracted, allowing train operators to make smarter train management decisions. In this paper, we study the problem of performing long-range analysis on Big Railway Data, such as e...
Heterogeneous Information Networks (HINs) capture complex relations among entities of various kinds and have been used extensively to improve the effectiveness of various data mining tasks, such as in recommender systems. Many existing HIN-based recommendation algorithms utilize hand-crafted meta-paths to extract semantic information from the netwo...
Online legal document libraries, such as WorldLII, are indispensable tools for legal professionals to conduct legal research. We study how topic modeling techniques can be applied to such platforms to facilitate searching of court judgments. Specifically, we improve search effectiveness by matching judgments to queries at semantics level rather tha...
A neural multimodal machine translation (MMT) system is one that aims to perform better translation by extending conventional text-only translation models with multimodal information. Many recent studies report improvements when equipping their models with the multimodal module, despite the controversy of whether such improvements indeed come from...
A heterogeneous information network (HIN) has as vertices objects of different types and as edges the relations between objects, which are also of various types. We study the problem of classifying objects in HINs. Most existing methods perform poorly when given scarce labeled objects as training sets, and methods that improve classification accura...
Judgment prediction is the task of predicting various outcomes of legal cases of which sentencing prediction is one of the most important yet difficult challenges. We study the applicability of machine learning (ML) techniques in predicting prison terms of drug trafficking cases. In particular, we study how legal domain knowledge can be integrated...
An open knowledge base (OKB) is a repository of facts, which are typically represented in the form of ⟨subject; relation; object⟩ triples. The problem of canonicalizing OKB triples is to map different names mentioned in the triples that refer to the same entity into a basic canonical form. We propose the algorithm Multi-Level...
We study the problem of applying spectral clustering to cluster multi-scale data, which is data whose clusters are of various sizes and densities. Traditional spectral clustering techniques discover clusters by processing a similarity matrix that reflects the proximity of objects. For multi-scale data, distance-based similarity is not effective bec...
A heterogeneous information network (HIN) is one whose nodes model objects of different types and whose links model objects’ relationships. To enrich its information, objects in an HIN are typically associated with additional attributes. We call such an HIN an Attributed HIN or AHIN. We study the problem of clustering objects in an AHIN, taking i...
By introducing a small set of additional parameters, a probe learns to solve specific linguistic tasks (e.g., dependency parsing) in a supervised manner using feature representations (e.g., contextualized embeddings). The effectiveness of such probing tasks is taken as evidence that the pre-trained model encodes linguistic knowledge. However, this...
Recently, the impressive accuracy of deep neural networks (DNNs) has created great demands on practical analytics over video data. Although efficient and accurate, the latest video analytics systems have not supported analytics beyond selection and aggregation yet. In data analytics, Top-K is a very important analytical operation that enables analy...
A heterogeneous information network (HIN) is one whose objects are of different types and links between objects could model different object relations. We study how spectral clustering can be effectively applied to HINs. In particular, we focus on how meta-path relations are used to construct an effective similarity matrix based on which spectral c...
Personalized PageRank (PPR) is a classic measure of the relevance among different nodes in a graph, and has been applied in numerous systems, such as Twitter's Who-To-Follow. Existing work on PPR has mainly focused on three general types of queries, namely, single-pair PPR, single-source PPR, and all-pair PPR. However, we observe that there are app...
Novel road-network applications often recommend interesting services or tasks to a moving object (e.g., a vehicle) on its way to a destination. A taxi-sharing system, for instance, suggests a new passenger to a taxi while it is serving another one. The traveling cost is then shared among these passengers. A fundamental query is: given two nodes...
An Open Information Extraction (OIE) system processes textual data to extract assertions, which are structured data typically represented in the form of (subject;relation; object) triples. An Open Knowledge Base (OKB) is a collection of such assertions. We study the problem of canonicalizing an OKB, which is defined as the problem of mapping each n...
Decentralized Web, or DWeb, is envisioned as a promising future of the Web. Being decentralized, there are no dedicated web servers in DWeb; devices that retrieve web contents also serve their cached data to peer devices with straight privacy-preserving mechanisms. The fact that contents in DWeb are distributed, replicated, and decentralized lead t...
We investigate the effectiveness of spectral methods in clustering multi-scale data, which is data whose clusters are of various sizes and densities. We review existing spectral methods that are designed to handle multi-scale data and propose an alternative approach that is orthogonal to existing methods. We put forward the algorithm ROSC, which co...
We study the classical kNN queries on road networks. Existing solutions mostly focus on reducing query processing time. In many applications, however, system throughput is a more important measure. We devise a mathematical model that describes throughput in terms of a number of system characteristics. We show that query time is only one of the many...
Spatial object search is prevalent in map services (e.g., Google Maps). To rent an apartment, for example, one will take into account its nearby facilities, such as supermarkets, hospitals, and subway stations. Traditional keyword search solutions, such as the nearby function in Google Maps, are insufficient in expressing the often complex attribut...
Background: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions...
In many applications, information is best represented as graphs. In a dynamic world, information changes and so the graphs representing the information evolve with time. We propose that historical graph-structured data be maintained for analytical processing. We call a historical evolving graph sequence an EGS. We observe that in many applications,...
A heterogeneous information network (HIN) is one whose nodes model objects of different types and whose links model objects' relationships. In many applications, such as social networks and RDF-based knowledge bases, information can be modeled as HINs. To enrich its information content, objects (as represented by nodes) in an HIN are typically asso...
In this paper, we formulate a novel question on maximum flow queries. Specifically, this problem aims to find which k edges would have the largest impact on a maximum flow query on a network. This problem has important applications in areas like social networks and network planning. We show the inapproximability of the problem and present our heuri...
A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Objects are often associated with properties such as labels. In many applications, such as curated knowledge bases for which object labels are manually given, only a small fraction of the objects are labeled. Studies have shown that transd...
A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. Each variable is instantiated with all...
We proposed Neural Enquirer as a neural network architecture to execute a SQL-like query on a knowledge-base (KB) for answers. Basically, Neural Enquirer finds the distributed representation of a query and then executes it on knowledge-base tables to obtain the answer as one of the values in the tables. Unlike similar efforts in end-to-end training...
A knowledge-based question-answering system (KB-QA) is one that answers natural language questions with information stored in a large-scale knowledge base (KB). Existing KB-QA systems are either powered by curated KBs in which factual knowledge is encoded in entities and relations with well-structured schemas, or by open KBs, which contain assertio...
A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Meta-paths are sequences of object types. They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. We study the problem of classifying objects in an HIN. We propose class-level me...
We address security issues in a cloud database system which employs the DBaaS model — a data owner (DO) exports data to a cloud database service provider (SP). To provide data security, sensitive data is encrypted by the DO before it is uploaded to the SP. Compared to existing secure query processing systems like CryptDB [7] and MONOMI [8], in whic...
Efficient main-memory index structures are crucial to main-memory database systems. Adaptive Radix Tree (ART) is the most recent in-memory index structure. ART is designed to avoid cache misses, leverage SIMD data parallelism, minimize branch misprediction, and have a small memory footprint. When an in-memory index structure like ART has significantly...
Scan and lookup are two core operations in main memory column stores. A scan operation scans a column and returns a result bit vector that indicates which records satisfy a filter. Once a column scan is completed, the result bit vector is converted into a list of record numbers, which is then used to look up values from other columns of interest fo...
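The scan-then-lookup pipeline described in the abstract above can be illustrated with a minimal sketch. It uses plain Python lists in place of a real column store and says nothing about the paper's actual technique; the function names `scan`, `to_record_ids`, and `lookup`, and the two-column example table, are hypothetical.

```python
from typing import Callable, List, Sequence


def scan(column: Sequence[int], predicate: Callable[[int], bool]) -> List[bool]:
    """Scan one column; bit i is True when record i satisfies the filter."""
    return [predicate(value) for value in column]


def to_record_ids(bits: Sequence[bool]) -> List[int]:
    """Convert the result bit vector into a list of record numbers."""
    return [i for i, bit in enumerate(bits) if bit]


def lookup(column: Sequence, record_ids: Sequence[int]) -> list:
    """Fetch values of another column for the selected record numbers."""
    return [column[i] for i in record_ids]


# Hypothetical table stored column by column.
age = [31, 17, 45, 22]
name = ["ann", "bob", "cy", "dee"]

bits = scan(age, lambda v: v >= 22)   # filter on the age column
ids = to_record_ids(bits)             # bit vector -> record numbers
names = lookup(name, ids)             # fetch names of matching records
```

Keeping the three steps as separate functions mirrors the order in which the abstract introduces them: scan, conversion to record numbers, then lookup.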
With the rapid growth of Web 2.0, a variety of content sharing services, such as Flickr, YouTube, Blogger, and TripAdvisor, have become extremely popular over the last decade. On these websites, users have created and shared with each other various kinds of resources, such as photos, video, and travel blogs. The sheer amount of user-generated c...
We address security issues in a cloud database system which employs the DBaaS model. In such a model, a data owner (DO) exports its data to a cloud database service provider (SP). To provide data security, sensitive data is encrypted by the DO before it is uploaded to the SP. Existing encryption schemes, however, are only partially homomorphic in t...
The discounted hitting time (DHT), which is a random-walk similarity measure for graph node pairs, is useful in various applications, including link prediction, collaborative recommendation, and reputation ranking. We examine a novel query, called the multi-way join (or n-way join), on DHT scores. Given a graph and n sets of nodes, the n-way join r...
A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. For example, the template T = (X, Y, Y,...
In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may have different capabilities, some difficult HITs may no...
Order-preserving submatrices (OPSMs) have been shown to be useful in capturing concurrent patterns in data when the relative magnitudes of data items are more important than their exact values. For instance, in analyzing gene expression profiles obtained from microarray experiments, the relative magnitudes are important both because they represent the c...
A social tagging system, such as del.icio.us and Flickr, allows users to annotate resources (e.g., web pages and photos) with text descriptions called tags. Tags have proven to be invaluable information for searching, mining, and recommending resources. In practice, however, not all resources receive the same attention from users. As a result, whil...
Recently, data mining techniques have been applied to mine software execution data in order to identify the program statements that are relevant to program bugs. While empirically effective, we observe that the effectiveness of such techniques can be improved by isolating the interferences between bugs using a "signature-based" approach. Our propos...
Statistical debugging is a technique that mines data obtained from software executions in order to identify the program statements that are relevant to program bugs. Specifically, program predicates are injected into the program during compilation and statistics about those predicates are collected during the program execution. When bugs are found...
Web search queries issued by casual users are often short and with limited expressiveness. Query recommendation is a popular technique employed by search engines to help users refine their queries. Traditional similarity-based methods, however, often result in redundant and monotonic recommendations. We identify five basic requirements of a query r...
In social tagging systems, resources such as images and videos are annotated with descriptive words called tags. It has been shown that tag-based resource searching and retrieval is much more effective than content-based retrieval. With the advances in mobile technology, many resources are also geo-tagged with location information. We observe that...
In typical location-based services (LBS), moving objects (e.g., GPS-enabled mobile phones) report their locations through a wireless network. An LBS server can use the location information to answer various types of continuous queries. Due to hardware limitations, location data reported by the moving objects are often uncertain. In this paper, we s...
Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. In recent years, the concept of Sequence OLAP (S-OLAP) has been proposed. The biggest distinguishing feature of S-OLAP from traditional OLAP is that data sequences managed by an S-OLAP system are characterized by the subsequence/substring...
In a social tagging system, resources (such as photos, video and web pages) are associated with tags. These tags allow the resources to be effectively searched through tag-based keyword matching using traditional IR techniques. We note that in many such systems, tags of a resource are often assigned by a diverse audience of casual users (taggers)....
We study the problem of clustering data objects with location uncertainty. In our model, a data object is represented by an uncertainty region over which a probability density function (pdf) is defined. One method to cluster such uncertain objects is to apply the UK-means algorithm [1], an extension of the traditional K-means algorithm, which assig...
Traditional decision tree classifiers work with data whose values are known and precise. We extend such classifiers to handle data with uncertain information. Value uncertainty arises in many applications during the data collection process. Example sources of uncertainty include measurement/quantization errors, data staleness, and multiple repeated...
Although there has been a considerable body of work on skyline evaluation in multidimensional data with totally ordered attribute domains, there are only a few methods that consider attributes with partially ordered domains. Existing work maps each partially ordered domain to a total order and then adapts algorithms for totally ordered domains to so...
We study the problem of clustering uncertain objects whose locations are described by probability density functions (pdfs). We show that the UK-means algorithm, which generalizes the k-means algorithm to handle uncertain objects, is very inefficient. The inefficiency comes from the fact that UK-means computes expected distances (EDs) betwe...
Recent years have witnessed the emergence of novel database applications in various nontraditional domains, including location-based services, sensor networks, RFID systems, and biological and biometric databases. Traditionally, data mining has been widely used to reveal interesting patterns in the vast amounts of data generated by such applicati...