Frederic Sala

Stanford University | SU · Department of Computer Science

Ph.D.

45 Publications
2,273 Reads
683 Citations
Additional affiliations
August 2011 - February 2015
University of California, Los Angeles
Position
  • PhD Student

Publications (45)
Preprint
Foundation models offer an exciting new paradigm for constructing models with out-of-the-box embeddings and a few labeled examples. However, it is not clear how to best apply foundation models without labeled data. A potential approach is to fuse foundation models with weak supervision frameworks, which use weak label sources -- pre-trained models,...
Preprint
Labeling data for modern machine learning is expensive and time-consuming. Latent variable models can be used to infer labels from weaker, easier-to-acquire sources operating on unlabeled data. Such models can also be trained using labeled data, presenting a key question: should a user invest in few labeled or many unlabeled points? We answer this...
Preprint
Our goal is to enable machine learning systems to be trained interactively. This requires models that perform well and train quickly, without large amounts of hand-labeled data. We take a step forward in this direction by borrowing from weak supervision (WS), wherein models can be trained with noisy sources of signal instead of hand-labeled data. B...
Preprint
Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. Howev...
Preprint
A popular way to estimate the causal effect of a variable x on y from observational data is to use an instrumental variable (IV): a third variable z that affects y only through x. The more strongly z is associated with x, the more reliable the estimate is, but such strong IVs are difficult to find. Instead, practitioners combine more commonly avail...
Preprint
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations. Instead, it generates probabilistic training labels by estimating the accuracies of multiple noisy labeling sources (e.g., heuristics, crowd workers). Existing approaches use latent variable estimation to model the noisy sources, b...
Preprint
Since manually labeling training data is slow and expensive, recent industrial and scientific research efforts have turned to weaker or noisier forms of supervision sources. However, existing weak supervision approaches fail to model multi-resolution sources for sequential data, like video, that can assign labels to individual elements or collectio...
Preprint
Mendelian Randomization (MR) is an important causal inference method primarily used in biomedical research. This work applies contemporary techniques in machine learning to improve the robustness and power of traditional MR tools. By denoising and combining candidate genetic variants through techniques from unsupervised probabilistic graphical mode...
Article
As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlat...
Article
A common way to protect data stored in DRAM and related memory systems is through the use of an error-correcting code such as the extended Hamming code. Traditionally, these error-correcting codes provide equal protection guarantees to all messages. In this work, we focus on unequal message protection (UMP), in which a subset of messages is deemed...
Preprint
Labeling training data is a key bottleneck in the modern machine learning pipeline. Recent weak supervision approaches combine labels from multiple noisy sources by estimating their accuracies without access to ground truth labels; however, estimating the dependencies among these sources is a critical challenge. We focus on a robust PCA-based algor...
Article
In this work, we investigate the problem of constructing codes capable of correcting two deletions. In particular, we construct a code that requires approximately 8 log_2 n + O(log_2 log_2 n) bits of redundancy, where n denotes the length of the code. To the best of the authors’ knowledge, this represents the best known construction in that...
Preprint
As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlat...
Article
Hyperbolic embeddings offer excellent quality with few dimensions when embedding hierarchical data structures like synonym or type hierarchies. Given a tree, we give a combinatorial construction that embeds the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, our combinatorial embedding obtains a mean...
Article
Two-dimensional magnetic recording (TDMR) is an emerging storage technology that aims to achieve areal densities on the order of 10 Tb/in², mainly driven by innovative channels engineering with minimal changes to existing head/media designs within...
Article
After being trained, classifiers must often operate on data that has been corrupted by noise. In this paper, we consider the impact of such noise on the features of binary classifiers. Inspired by tools for classifier robustness, we introduce the same classification probability (SCP) to measure the resulting distortion on the classifier outputs. We...
Article
Recent studies on noisy decoding for LDPC codes rely on the assumption that the noise in each component is independent and perpetual. This work examines a noisy decoding model that generalizes this approach: the noise is due to multistate channels where the channel states are governed by queuelike processes. This model is inspired by errors in deco...
Conference Paper
Two important recent trends are the proliferation of learning algorithms along with the massive increase of data stored on unreliable storage mediums. These trends impact each other; noisy data can have an undesirable effect on the results provided by learning algorithms. Although traditional tools exist to improve the reliability of data storage d...
Article
Developing efficient algorithms to synchronize between different versions of files is an important problem with numerous applications. We consider the interactive synchronization protocol introduced by Yazdi and Dolecek, based on an earlier synchronization algorithm by Venkataramanan et al. Unlike preceding synchronization algorithms, Yazdi and Dol...
Chapter
In this chapter, we discuss advanced error-correcting code techniques. In particular, we focus on two complementary strategies, asymmetric algebraic codes and non-binary low-density parity-check (LDPC) codes. Both of these techniques are inspired by traditional coding theory; however, in both cases, we depart from classical approaches and develop n...
Article
This work studies problems in data reconstruction, an important area with numerous applications. In particular, we examine the reconstruction of binary and nonbinary sequences from synchronization (insertion/deletion-correcting) codes. These sequences have been corrupted by a fixed number of symbol insertions (larger than the minimum edit distance...
Article
Non-volatile memories (NVMs) have emerged as the primary replacement of hard-disk drives for a variety of storage applications, including personal electronics, mobile computing, intelligent vehicles, enterprise storage, data warehousing, and data-intensive computing systems. Channel coding schemes are a necessary tool for ensuring target reliabilit...
Article
In this paper, we summarize recent results and contributions from the NSF Expedition on Variability-Aware Software, a five-year, multi-university effort to tackle the problem of hardware variations and their implications and opportunities in software. The Expedition has made contributions in characterization and online monitoring of variations (partic...
Article
We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. In our setup, Y is an edited version of X; edits are insertions and deletions that are potentially numerous. We previously proposed an order-wise optimal synchronization protocol for reconstructing file X...
Article
Flash, already one of the dominant forms of data storage for mobile consumer devices, such as smartphones and media players, is experiencing explosive growth in cloud and enterprise applications. Flash devices offer very high access speeds, low power consumption, and physical resiliency. Our goal in this article is to provide a high-level overview...
Article
The development of good codes which are capable of correcting more than a single deletion remains an elusive task. Recent papers, such as that by Kulkarni and Kiyavash [3], instead focus on the more tractable problem of deriving upper bounds on the cardinalities of such codes. In the present work, we develop Gilbert-Varshamov-type lower bounds on t...
Article
In this work, the model introduced by Gabrys is extended to account for the presence of unreliable memory cells. Leveraging data analysis on errors taking place in a TLC Flash device, we show that memory cells can be broadly categorized into reliable and unreliable cells, where the latter are much more likely to be in error. Our approach programs u...
Conference Paper
Codes based on multiset permutations, or multipermutations, have attracted recent attention due to their applications to non-volatile memories. Most of the literature studying multipermutations is focused on codes capable of correcting errors in the Kendall tau and Ulam metrics. In this work, we make a first effort towards studying synchronization...
Conference Paper
Motivated by the rank modulation scheme for flash memories, we consider an information representation system with relative values (permutations) and study codes for correcting deletions. In contrast to the case of a deletion in a regular (with absolute values) representation system, a deletion in this new paradigm results in a new permutation over...
Conference Paper
Efficient synchronization of remote copies of files that have experienced insertions and deletions is an important problem with many applications including data storage, file sharing, online editing, and cloud computing. Suppose that user A is the owner of an original file X, and user B is the owner of the edited file Y that is obtained from X thro...
Conference Paper
Rank modulation schemes for non-volatile memories (NVMs) represent information by the relative rankings of cell charge levels. This approach has several benefits; in particular, the scheme resolves the “write-asymmetry” limitation that NVMs suffer from. However, cell writing is still affected by a common NVM problem: inter-cell coupling, which can...
Article
In non-volatile memories, reading stored data is typically done through the use of predetermined fixed thresholds. However, due to problems commonly affecting such memories, including voltage drift, overwriting, and inter-cell coupling, fixed threshold usage often results in significant asymmetric errors. To combat these problems, Zhou, Jiang, and...
Conference Paper
Synchronization channels, which can remove codeword symbols or introduce extraneous symbols, pose additional difficulties when compared to the commonly-studied substitution channel. A traditional problem in this area is to count the number of sequences formed when deleting a fixed number of symbols from a sequence. This work contains our first effo...
Article
In this article we consider a class of Cayley graphs that are generated by certain 3-cycles on the alternating group An. These graphs are generalizations of the alternating group graph AGn. We look at the case when the 3-cycles form a “tree-like structure,” and analyze its fault resiliency. We present a number of structural theorems and prove that...
