Beng Chin Ooi's research while affiliated with National University of Singapore and other places

Publications (455)

Article
Every five years, a group of the leading database researchers meet to reflect on their community's impact on the computing industry as well as examine current research challenges.
Preprint
Verifiable ledger databases protect data history against malicious tampering. Existing systems, such as blockchains and certificate transparency, are based on transparency logs -- a simple abstraction allowing users to verify that a log maintained by an untrusted server is append-only. They expose a simple key-value interface. Building a practical...
Article
Full-text available
Background: Metastatic epidural spinal cord compression (MESCC) is a disastrous complication of advanced malignancy. Deep learning (DL) models for automatic MESCC classification on staging CT were developed to aid earlier diagnosis. Methods: This retrospective study included 444 CT staging studies from 185 patients with suspected MESCC who under...
Article
InstaHide is a state-of-the-art mechanism for protecting private training images, by mixing multiple private images and modifying them such that their visual features are indistinguishable to the naked eye. In recent work, however, Carlini et al. show that it is possible to reconstruct private images from the encrypted dataset generated by InstaHid...
Article
Background Lumbar spine MRI studies are widely used for back pain assessment. Interpretation involves grading lumbar spinal stenosis, which is repetitive and time consuming. Deep learning (DL) could provide faster and more consistent interpretation. Purpose To assess the speed and interobserver agreement of radiologists for reporting lumbar spinal...
Preprint
In the Metaverse, the physical space and the virtual space co-exist, and interact simultaneously. While the physical space is virtually enhanced with information, the virtual space is continuously refreshed with real-time, real-world information. To allow users to process and manipulate information seamlessly between the real and digital spaces, no...
Preprint
Full-text available
While state-of-the-art permissioned blockchains can achieve thousands of transactions per second on commodity hardware with x86/64 architecture, their performance when running on different architectures is not clear. The goal of this work is to characterize the performance and cost of permissioned blockchains on different hardware systems, which is...
Article
Full-text available
Background Metastatic epidural spinal cord compression (MESCC) is a devastating complication of advanced cancer. A deep learning (DL) model for automated MESCC classification on MRI could aid earlier diagnosis and referral. Purpose To develop a DL model for automated classification of MESCC on MRI. Materials and Methods Patients with known MESCC...
Preprint
In this paper, we propose a novel semi-supervised learning (SSL) framework named BoostMIS that combines adaptive pseudo labeling and informative active annotation to unleash the potential of medical image SSL models: (1) BoostMIS can adaptively leverage the cluster assumption and consistency regularization of the unlabeled data according to the cur...
Article
With the emergence of hybrid blockchain database systems, we aim to provide an in-depth analysis of the performance and trade-offs among a few representative systems. To achieve this goal, we implement Veritas and BlockchainDB from scratch. For Veritas, we provide two flavors to target the crash fault-tolerant (CFT) and Byzantine fault-tolerant (BF...
Chapter
Multi-modal learning using unpaired labeled data from multiple modalities to boost the performance of deep learning models on each individual modality has attracted a lot of interest in medical image segmentation recently. However, existing unpaired multi-modal learning methods require a considerable amount of labeled data from both modalities to o...
Preprint
Recent years have witnessed a surging interest in Neural Architecture Search (NAS). Various algorithms have been proposed to improve the search efficiency and effectiveness of NAS, i.e., to reduce the search cost and improve the generalization performance of the selected architectures, respectively. However, the search efficiency of these algorithm...
Preprint
Deep learning has achieved great success in a wide spectrum of multimedia applications such as image classification, natural language processing and multimodal data analysis. Recent years have seen the development of many deep learning frameworks that provide a high-level programming interface for users to design models, conduct training and deploy...
Preprint
Relational databases are the de facto standard for storing and querying structured data, and extracting insights from structured data requires advanced analytics. Deep neural networks (DNNs) have achieved super-human prediction performance in particular data types, e.g., images. However, existing DNNs may not produce meaningful results when applied...
Preprint
InstaHide is a state-of-the-art mechanism for protecting private training images in collaborative learning. It works by mixing multiple private images and modifying them in such a way that their visual features are no longer distinguishable to the naked eye, without significantly degrading the accuracy of training. In recent work, however, Carlini...
Article
Background Assessment of lumbar spinal stenosis at MRI is repetitive and time consuming. Deep learning (DL) could improve -productivity and the consistency of reporting. Purpose To develop a DL model for automated detection and classification of lumbar central canal, lateral recess, and neural -foraminal stenosis. Materials and Methods In this retr...
Article
While recent developments of neural network (NN) models have led to a series of record-breaking achievements in many applications, lack of sufficiently good datasets remains a problem for some applications. For such a problem, we can however exploit a large number of unstructured text corpora as external knowledge to complement the training data, a...
Preprint
Alphas are stock prediction models capturing trading signals in a stock market. A set of effective alphas can generate weakly correlated high returns to diversify the risk. Existing alphas can be categorized into two classes: Formulaic alphas are simple algebraic expressions of scalar features, and thus can generalize well and be mined into a weakl...
Preprint
Full-text available
Machine learning (ML) is an important part of modern data science applications. Data scientists today have to manage the end-to-end ML life cycle that includes both model training and model serving, the latter of which is essential, as it makes their works available to end-users. Systems for model serving require high performance, low cost, and eas...
Article
Full-text available
The success of Bitcoin and other cryptocurrencies is drawing significant interest to blockchains. A blockchain system implements a tamper-evident ledger for recording transactions that modify some global states. The system captures the entire evolution history of the states. The management of that history, also known as data provenance or lineage,...
Preprint
Federated learning (FL) is an emerging paradigm for facilitating multiple organizations' data collaboration without revealing their private data to each other. Recently, vertical FL, where the participating organizations hold the same set of samples but with disjoint features and only one organization owns the labels, has received increased attenti...
Preprint
With the ever-increasing adoption of machine learning for data analytics, maintaining a machine learning pipeline is becoming more complex as both the datasets and trained models evolve with time. In a collaborative environment, the changes and updates due to pipeline evolution often cause cumbersome coordination and maintenance work, raising the c...
Preprint
In the last few years, distributed machine learning has been usually executed over heterogeneous networks such as a local area network within a multi-tenant cluster or a wide area network connecting data centers and edge clusters. In these heterogeneous networks, the link speeds among worker nodes vary significantly, making it challenging for state...
Article
The success of Bitcoin and other cryptocurrencies bring enormous interest to blockchains. A blockchain system implements a tamper-evident ledger for recording transactions that modify some global states. The system captures the entire evolution history of the states. The management of that history, also known as data provenance or lineage, has been...
Preprint
Full-text available
Federated learning (FL) is an emerging paradigm that enables multiple organizations to jointly train a model without revealing their private data to each other. This paper studies {\it vertical} federated learning, which tackles the scenarios where (i) collaborating organizations own data of the same set of users but with disjoint features, and (ii...
Preprint
Full-text available
Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler for collaborative coding, has been attested for its success in countless projects around the world. Hence, applying the Git philosophy to general data collaboration beyond coding is motivating. We call it Git for data....
Preprint
In high stakes applications such as healthcare and finance analytics, the interpretability of predictive models is required and necessary for domain practitioners to trust the predictions. Traditional machine learning models, e.g., logistic regression (LR), are easy to interpret in nature. However, many of these models aggregate time-series data wi...
Preprint
Full-text available
Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering for emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the bl...
Preprint
Full-text available
In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. This leads to three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree). Alth...
Article
Approximately every five years, a group of database researchers meet to do a self-assessment of our community, including reflections on our impact on the industry as well as challenges facing our research community. This report summarizes the discussion and conclusions of the 9th such meeting, held during October 9-10, 2018 in Seattle.
Article
Full-text available
With 5G on the verge of being adopted as the next mobile network, there is a need to analyze its impact on the landscape of computing and data management. In this paper, we analyze the impact of 5G on both traditional and emerging technologies and project our view on future research challenges and opportunities. With a predicted increase of 10-100x...
Preprint
Blockchain has come a long way: a system that was initially proposed specifically for cryptocurrencies is now being adapted and adopted as a general-purpose transactional system. A blockchain is also a distributed system, and as such it shares some similarities with distributed database systems. Existing works that compare blockchains and distribut...
Preprint
Research in blockchain systems has mainly focused on improving security and bridging the performance gaps between blockchains and databases. Despite many promising results, we observe a worrying trend that the blockchain landscape is fragmented in which many systems exist in silos. Apart from a handful of general-purpose blockchains, such as Ethere...
Article
Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield great results. These models consume a huge amount of memory and computational operations. However, most of the large-scale industrial applications are often computational budget constrained. In practice, the pea...
Preprint
Full-text available
The fifth-generation (5G) mobile communication technologies are on the way to be adopted as the next standard for mobile networking. It is therefore timely to analyze the impact of 5G on the landscape of computing, in particular, data management and data-driven technologies. With a predicted increase of 10-100$\times$ in bandwidth and 5-10$\times$...
Preprint
Full-text available
With 5G on the verge of being adopted as the next mobile network, there is a need to analyze its impact on the landscape of computing and data management. In this paper, we analyze the impact of 5G on both traditional and emerging technologies and project our view on future research challenges and opportunities. With a predicted increase of 10-100x...
Preprint
Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases a...
Conference Paper
Full-text available
Existing blockchain systems scale poorly because of their distributed consensus protocols. Current attempts at improving blockchain scalability are limited to cryptocurrency. Scaling blockchain systems under general workloads (i.e., non-cryptocurrency applications) remains an open question. This work takes a principled approach to apply sharding to...
Preprint
Full-text available
Motivated by the massive energy usage of blockchain, on the one hand, and by significant performance improvements in low-power, wimpy systems, on the other hand, we perform an in-depth time-energy analysis of blockchain systems on low-power nodes in comparison to high-performance nodes. We use three low-power systems to represent a wide range of th...
Preprint
Recent years have witnessed growing interests in designing efficient neural networks and neural architecture search (NAS). Although remarkable efficiency and accuracy have been achieved, existing expert designed and NAS models neglect that input instances are of varying complexity thus different amount of computation is required. Therefore, inferen...
Article
The success of Bitcoin and other cryptocurrencies bring enormous interest to blockchains. A blockchain system implements a tamper-evident ledger for recording transactions that modify some global states. The system captures entire evolution history of the states. The management of that history, also known as data provenance or lineage, has been stu...
Preprint
Machine-learning-based data-driven applications have become ubiquitous, e.g., health-care analysis and database system optimization. Big training data and large (deep) models are crucial for good performance. Dropout has been widely used as an efficient regularization technique to prevent large models from overfitting. However, many recent works sh...
Preprint
Deep learning models have been used to support analytics beyond simple aggregation, where deeper and wider models have been shown to yield great results. These models consume a huge amount of memory and computational operations. However, most of the large-scale industrial applications are often computational budget constrained. Current solutions ar...
Article
Full-text available
In this era of mobile application and socialization, location-based services (LBSs) have become unprecedentedly popular and emphasized. By submitting its location to the service provider, the mobile terminal can obtain useful service. As an important technology that can provide location-based services, spatial query processing has become a research...
Conference Paper
Effective overload control for large-scale online service system is crucial for protecting the system backend from overload. Conventionally the design of overload control is ad-hoc for individual service. However, service-specific overload control could be detrimental to the overall system due to intricate service dependencies or flawed implementat...
Conference Paper
Embeddings of medical concepts such as medication, procedure and diagnosis codes in Electronic Medical Records (EMRs) are central to healthcare analytics. Previous work on medical concept embedding takes medical concepts and EMRs as words and documents respectively. Nevertheless, such models miss out the temporal nature of EMR data. On the one hand...
Conference Paper
Feature hashing is widely used to process large scale sparse features for learning of predictive models. Collisions inherently happen in the hashing process and hurt the model performance. In this paper, we develop a feature hashing scheme called Cuckoo Feature Hashing(CCFH) based on the principle behind Cuckoo hashing, a hashing scheme designed to...
Article
Recent advancements in high-performance networking interconnect significantly narrow the performance gap between intra-node and inter-node communications, and open up opportunities for distributed memory platforms to enforce cache coherency among distributed nodes. To this end, we propose GAM, an efficient distributed in-memory platform that provid...
Article
With the continuous growth of data size and the use of compression technology, data reduction has great research value and practical significance. Aiming at the shortcomings of the existing semantic compression algorithm, this paper is based on the analysis of ItCompress algorithm, and designs a method of bidirectional order selection based on inte...
Preprint
Embeddings of medical concepts such as medication, procedure and diagnosis codes in Electronic Medical Records (EMRs) are central to healthcare analytics. Previous work on medical concept embedding takes medical concepts and EMRs as words and documents respectively. Nevertheless, such models miss out the temporal nature of EMR data. On the one hand...
Article
Full-text available
DGCC protocol has been shown to achieve good performance on multi-core in-memory system. However, distributed transactions complicate the dependency resolution, and therefore, an effective transaction partitioning strategy is essential to reduce expensive multi-node distributed transactions. During failure recovery, log must be examined from the la...
Preprint
Few-shot learning that trains image classifiers over few labeled examples per category is a challenging task. In this paper, we propose to exploit an additional big dataset with different categories to improve the accuracy of few-shot learning over our target dataset. Our approach is based on the observation that images can be decomposed into objec...
Conference Paper
To unlock the wealth of the healthcare data, we often need to link the real-world text snippets to the referred medical concepts described by the canonical descriptions. However, existing healthcare concept linking methods, such as dictionary-based and simple machine learning methods, are not effective due to the word discrepancy between the text s...
Conference Paper
Full-text available
The tremendous volume of user behavior records generated in various domains provides data analysts new opportunities to mine valuable insights into user behavior. Cohort analysis, which aims to find user behavioral trends hidden in time series, is one of the most commonly used techniques. Since traditional database systems suffer from both operabil...