Koushik Sen’s research while affiliated with Berkeley College and other places


Publications (8)


Figure 2: The intercloud broker architecture.
Figure 3: Market forces creating a more mature and capable version of Sky Computing.
Figure 4: Cloud market share as of October 2021.
Figure 6: Examples of cloud services allowing the user to select the desired Apache Spark version. (a) Databricks. (b) Azure HDInsight.
The Sky Above The Clouds
  • Preprint
  • File available

May 2022 · 437 Reads · 1 Citation

Sarah Chasins · Alvin Cheung · [...]

Technology ecosystems often undergo significant transformations as they mature. For example, telephony, the Internet, and PCs all started with a single provider, but in the United States each is now served by a competitive market that uses comprehensive and universal technology standards to provide compatibility. This white paper presents our view on how the cloud ecosystem, barely over fifteen years old, could evolve as it matures.


Hindsight logging for model training

December 2020 · 13 Reads · 7 Citations

Proceedings of the VLDB Endowment

In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g., 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
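The adaptive checkpointing idea can be illustrated with a small sketch. This is not Flor's implementation: the function name, the pickle-based snapshots, and the budget heuristic (checkpoint only while cumulative checkpoint time stays under a user-specified fraction of elapsed training time, e.g. 7%) are illustrative assumptions.

```python
import pickle
import time

def train_with_adaptive_checkpoints(num_steps, step_fn, state, budget=0.07):
    """Toy adaptive checkpointing: snapshot training state only while
    the cumulative time spent checkpointing stays within `budget`,
    a user-specified fraction of total elapsed training time."""
    checkpoints = {}     # step index -> pickled state snapshot
    ckpt_time = 0.0      # cumulative seconds spent checkpointing
    est_cost = 0.0       # estimated cost of the next checkpoint
    start = time.perf_counter()
    for step in range(num_steps):
        state = step_fn(state)
        elapsed = time.perf_counter() - start
        # Take a checkpoint only if the projected overhead fits the budget.
        if ckpt_time + est_cost <= budget * elapsed:
            t0 = time.perf_counter()
            checkpoints[step] = pickle.dumps(state)
            est_cost = time.perf_counter() - t0
            ckpt_time += est_cost
    return state, checkpoints
```

Because the budget is checked against elapsed time, cheap training steps yield sparse checkpoints while expensive steps leave room for more, which is the spirit of the user-specifiable overhead described in the abstract.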


Hindsight Logging for Model Training

June 2020 · 26 Reads

Due to the long time-lapse between the triggering and detection of a bug in the machine learning lifecycle, model developers favor data-centric logfile analysis over traditional interactive debugging techniques. But when useful execution data is missing from the logs after training, developers have little recourse beyond re-executing training with more logging statements, or guessing. In this paper, we present hindsight logging, a novel technique for efficiently querying ad-hoc execution data, long after model training. The goal of hindsight logging is to enable analysis of past executions as if the logs had been exhaustive. Rather than materialize logs up front, we draw on the idea of physiological database recovery, and adapt it to arbitrary programs. Developers can query the state in past runs of a program by adding arbitrary log statements to their code; a combination of physical and logical recovery is used to quickly produce the output of the new log statements. We implement these ideas in Flor, a record-replay system for hindsight logging in Python. We evaluate Flor's performance on eight different model training workloads from current computer vision and NLP benchmarks. We find that Flor replay achieves near-ideal scale-out and order-of-magnitude speedups in replay, with just 1.47% average runtime overhead from record.
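The replay side can be sketched similarly: resume from the nearest checkpoint at or before the step of interest and re-execute only the remaining steps, running the post-hoc log statement along the way. This is a toy sketch, not Flor's actual record-replay machinery; `hindsight_replay` and its arguments are hypothetical names.

```python
import pickle

def hindsight_replay(checkpoints, step_fn, query_step, log_fn):
    """Replay from the nearest checkpoint at or before `query_step`,
    executing a post-hoc log statement (`log_fn`) at each replayed step.
    Raises ValueError if no checkpoint precedes `query_step`."""
    base = max(s for s in checkpoints if s <= query_step)
    state = pickle.loads(checkpoints[base])
    logs = []
    for step in range(base + 1, query_step + 1):
        state = step_fn(state)                # re-execute the training step
        logs.append(log_fn(step, state))      # the "new" log statement
    return state, logs
```

Because each checkpointed segment is independent, segments can also be replayed in parallel, which is where the near-ideal scale-out reported in the abstract comes from.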


Ansor: Generating High-Performance Tensor Programs for Deep Learning

June 2020 · 357 Reads

High-performance tensor programs are crucial to guarantee efficient execution of deep learning models. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously difficult. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering efforts in developing platform-specific optimization code or fall short in finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a scheduler to simultaneously optimize multiple subgraphs in a set of deep neural networks. Our evaluation shows that Ansor improves the execution performance of deep neural networks on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8×, 2.6×, and 1.7×, respectively.
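The sample-then-evolve loop described above can be sketched abstractly. The helper below is a toy illustration, not Ansor's implementation: the "programs" are stand-in values, and `cost_model` plays the role of the learned cost model that ranks candidates cheaply before the single best candidate is measured for real.

```python
import random

def evolutionary_search(sample_fn, mutate_fn, cost_model, measure,
                        population=16, generations=10, top_k=4, seed=0):
    """Toy Ansor-style loop: sample candidate programs, evolve them
    under a cheap (learned) cost model, then measure the best for real."""
    rng = random.Random(seed)
    pop = [sample_fn(rng) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=cost_model)              # rank by model, not real runs
        parents = pop[:top_k]                 # keep the predicted-best programs
        pop = parents + [mutate_fn(rng.choice(parents), rng)
                         for _ in range(population - top_k)]
    return min(pop, key=measure)              # one real measurement at the end
```

The design point this illustrates: real hardware measurements are expensive, so the model filters thousands of candidates and actual benchmarking is reserved for the final survivors.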



Selecting Fault Revealing Mutants

January 2020 · 297 Reads · 78 Citations

Empirical Software Engineering

Mutant selection refers to the problem of choosing, among a large number of mutants, the (few) ones to be used by the testers. We thus investigate the problem of selecting the fault revealing mutants, i.e., the mutants that are most likely to lead to test cases that uncover unknown program faults. We formulate this problem as the fault revealing mutant selection and the fault revealing mutant prioritization problems. We argue that these problems can be tackled through a set of 'static' program features, and we propose FaRM, a machine learning approach that learns to select fault revealing mutants. Experimental results involving 1,629 real faults show the practical benefits of our approach on both examined problems. Our results show that FaRM achieves a good trade-off between mutation testing application cost and effectiveness (measured in terms of faults revealed). We also show that FaRM outperforms random mutant sampling, which, until now, has been the most effective mutant selection method. In particular, our results show that with respect to mutant selection, our approach reveals 12% to 16% more faults than the random baselines, while, with respect to mutant prioritization, it achieves a higher average percentage of revealed faults, with a median difference of 10% (from the random mutant orderings).
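A minimal sketch of the prioritization step, assuming a logistic model over static mutant features. The feature names and weights below are invented stand-ins, not FaRM's trained model.

```python
import math

def rank_mutants(mutants, weights, bias=0.0):
    """Toy FaRM-style prioritization: score each mutant with a logistic
    model over its static features and rank by predicted probability
    of being fault revealing."""
    def score(features):
        z = bias + sum(weights.get(k, 0.0) * v for k, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))   # logistic link
    return sorted(mutants, key=lambda m: score(m["features"]), reverse=True)
```

Testers would then exercise mutants from the top of the ranking until their budget runs out, which is exactly the prioritization setting the abstract evaluates against random orderings.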


EIGER: automated IOC generation for accurate and interpretable endpoint malware detection

December 2019 · 248 Reads · 26 Citations

A malware signature that includes behavioral artifacts, namely an Indicator of Compromise (IOC), plays an important role in security operations such as endpoint detection and incident response. While building IOCs enables us to detect malware efficiently and perform incident analysis in a timely manner, the process has not yet been fully automated. To address this issue, there are two lines of promising approaches: regular-expression-based signature generation and machine learning. However, each has a limitation in accuracy or interpretability, respectively. In this paper, we propose EIGER, a method to generate interpretable, and yet accurate, IOCs from given malware traces. The key idea of EIGER is enumerate-then-optimize. That is, we enumerate representations of potential artifacts as IOC candidates. Then, we optimize the combination of these candidates to maximize the two essential properties, accuracy and interpretability, towards the generation of reliable IOCs. Through experiments using 162K malware samples collected over five months, we evaluated the accuracy of EIGER-generated IOCs. We achieved a high True Positive Rate (TPR) of 91.98% and a very low False Positive Rate (FPR) of 0.97%. Interestingly, EIGER achieved an FPR of less than 1% even on a completely different dataset. Furthermore, we evaluated the interpretability of the IOCs generated by EIGER through a user study in which we recruited 15 professional security analysts working at a security operations center. The results allow us to conclude that our IOCs are as interpretable as manually generated ones. These results demonstrate that EIGER is practical and deployable in real-world security operations.
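The enumerate-then-optimize idea can be sketched as a greedy set-cover-style loop. This is an illustrative simplification, not EIGER's optimization procedure: candidates are boolean predicates over traces, and the loop ORs predicates together to maximize TPR while capping FPR and the number of terms (a crude proxy for interpretability).

```python
def generate_ioc(candidates, malware_traces, benign_traces,
                 max_terms=3, max_fpr=0.01):
    """Toy enumerate-then-optimize: greedily build a disjunction of
    candidate artifact predicates, maximizing malware coverage (TPR)
    while keeping the false-positive rate under `max_fpr`."""
    def rates(sig):
        tp = sum(any(p(t) for p in sig) for t in malware_traces)
        fp = sum(any(p(t) for p in sig) for t in benign_traces)
        return tp / len(malware_traces), fp / len(benign_traces)

    ioc = []
    while len(ioc) < max_terms:
        best, best_tpr = None, rates(ioc)[0] if ioc else 0.0
        for cand in candidates:
            if cand in ioc:
                continue
            tpr, fpr = rates(ioc + [cand])
            # accept only candidates that raise coverage within the FPR cap
            if fpr <= max_fpr and tpr > best_tpr:
                best, best_tpr = cand, tpr
        if best is None:
            break          # no remaining candidate improves the signature
        ioc.append(best)
    return ioc
```

The `max_terms` cap is what keeps the output readable to an analyst: a short disjunction of concrete artifacts rather than an opaque model.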


Retrieval on source code: a neural code search

June 2018 · 285 Reads · 185 Citations

Searching over large code corpora can be a powerful productivity tool for both beginner and experienced developers because it helps them quickly find examples of code related to their intent. Code search becomes even more attractive if developers could express their intent in natural language, similar to the interaction that Stack Overflow supports. In this paper, we investigate the use of natural language processing and information retrieval techniques to carry out natural language search directly over source code, i.e. without having a curated Q&A forum such as Stack Overflow at hand. Our experiments using a benchmark suite derived from Stack Overflow and GitHub repositories show promising results. We find that while a basic word-embedding based search procedure works acceptably, better results can be obtained by adding a layer of supervision, as well as by a customized ranking strategy.
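The basic word-embedding search baseline can be sketched in a few lines, assuming a lookup table of per-token vectors (toy stand-ins for the embeddings trained in practice): embed the query and each snippet by averaging token vectors, then rank snippets by cosine similarity.

```python
import math

def embed(tokens, vectors):
    """Average the word vectors of known tokens (unknown tokens skipped)."""
    dims = len(next(iter(vectors.values())))
    acc, n = [0.0] * dims, 0
    for tok in tokens:
        if tok in vectors:
            acc = [a + b for a, b in zip(acc, vectors[tok])]
            n += 1
    return [a / n for a in acc] if n else acc

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query_tokens, corpus, vectors):
    """Rank code snippets by cosine similarity to the query embedding."""
    q = embed(query_tokens, vectors)
    return sorted(corpus,
                  key=lambda snip: cosine(q, embed(snip["tokens"], vectors)),
                  reverse=True)
```

The abstract's point is that this unsupervised baseline already works acceptably; the supervised variant replaces the shared embedding space with one learned from query-code pairs.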

Citations (4)


... This section provides background on Flor and FlorDB, highlighting their evolution and key features in managing the machine learning lifecycle. Flor, originally designed as a record-replay system for model training [4], offers two main features: i) low-overhead adaptive checkpointing, minimizing computational resource use during model training, and ii) low-latency replay from checkpoints, leveraging memoization and parallelism for speedups. Flor's record-replay added support for hindsight logging, allowing for post-hoc and on-demand querying of effectively unbounded context. ...

Reference:

Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle
Hindsight logging for model training
  • Citing Article
  • December 2020

Proceedings of the VLDB Endowment

... The manual process of verifying false positives can be a daunting exercise; therefore, ML algorithms must achieve high performance and minimize false positives. The validation of false positives can also be incorporated into ML algorithms, as emphasized by Otsuki et al. [55]. ...

EIGER: automated IOC generation for accurate and interpretable endpoint malware detection
  • Citing Conference Paper
  • December 2019

... The code search problem is mainly studied in the software engineering field to improve software developers' productivity by facilitating the reuse of existing codes. Existing code search methods can be roughly classified into two groups: traditional Information Retrieval (IR) based [24] and ML based [25][26][27]. Traditional IR-based code search methods usually treat code as text and use statistical ranking algorithms such as TF-IDF [19] and BM25 [12] to compute the similarity scores between queries and code snippets. ML-based methods map both codes and queries to real-valued vectors and measure the relevancy using cosine similarity between vector representations [28]. ...

Retrieval on source code: a neural code search
  • Citing Conference Paper
  • June 2018

... As such, our latent mutant prediction method falls in the category of the mutant selection methods that have been studied by the mutation testing literature. Specifically, previous work has focused on predicting or identifying killable mutants [31] and subsuming mutants [12]. Our key difference from those methods is that we target a subset of mutants that is adding value to the developers rather than the entire set of mutants. ...

Selecting Fault Revealing Mutants

Empirical Software Engineering