Roman Vershynin’s research while affiliated with University of California, Irvine and other places


Publications (152)


Hamiltonicity of sparse pseudorandom graphs
  • Article

March 2025 · 6 Reads · Combinatorics, Probability and Computing

Dingjia Mao · Roman Vershynin

We show that every $(n,d,\lambda)$-graph contains a Hamilton cycle for sufficiently large $n$, assuming that $d \geq \log^{6} n$ and $\lambda \leq cd$, where $c = \frac{1}{70000}$. This significantly improves a recent result of Glock, Correia, and Sudakov, who obtained a similar result for $d$ that grows polynomially with $n$. The proof is based on a new result regarding the second largest eigenvalue of the adjacency matrix of a subgraph induced by a random subset of vertices, combined with a recent result on connecting designated pairs of vertices by vertex-disjoint paths in $(n,d,\lambda)$-graphs. We believe that the former result is of independent interest and will have further applications.
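For readers outside the area: an $(n,d,\lambda)$-graph is a $d$-regular graph on $n$ vertices whose adjacency-matrix eigenvalues, other than the trivial top eigenvalue $d$, are at most $\lambda$ in absolute value. The short sketch below is my own illustration, not code from the paper; `networkx` and the helper name are my choices. Note that the constant $c = 1/70000$ only becomes meaningful at very large degree, so the second check will typically fail for graphs small enough to simulate.

```python
# Illustration only: compute (n, d, lambda) for a random d-regular graph and
# test the hypotheses of the Hamiltonicity result quoted above.
import numpy as np
import networkx as nx

def second_eigenvalue(n: int, d: int, seed: int = 0) -> float:
    """Largest absolute adjacency eigenvalue besides the trivial top one."""
    G = nx.random_regular_graph(d, n, seed=seed)
    eigs = np.sort(np.abs(np.linalg.eigvalsh(nx.to_numpy_array(G))))[::-1]
    return eigs[1]          # eigs[0] == d for a connected d-regular graph

n, d = 2000, 20
lam = second_eigenvalue(n, d)
print(f"lambda ≈ {lam:.2f} (compare 2*sqrt(d-1) ≈ {2*np.sqrt(d-1):.2f})")
print("d >= log^6 n      :", d >= np.log(n) ** 6)
print("lambda <= d/70000 :", lam <= d / 70000)   # fails at simulatable sizes
```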


Random matrices acting on sets: Independent columns
  • Preprint
  • File available

February 2025 · 14 Reads

We study random matrices with independent subgaussian columns. Assuming each column has a fixed Euclidean norm, we establish conditions under which such matrices act as near-isometries when restricted to a given subset of their domain. We show that, with high probability, the maximum distortion caused by such a matrix is proportional to the Gaussian complexity of the subset, scaled by the subgaussian norm of the matrix columns. This linear dependence on the subgaussian norm is a new phenomenon, as random matrices with independent rows or independent entries typically exhibit superlinear dependence. As a consequence, normalizing the columns of random sparse matrices leads to stronger embedding guarantees.
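A minimal numerical sketch of the phenomenon described above, under my own assumptions (sparse Gaussian-Bernoulli entries, a finite test set on the sphere, columns rescaled to norm $\sqrt{m}$); it is not the paper's code and the helper name is mine. The maximum deviation of $\|Ax\|/\sqrt{m}$ from 1 over the test set is the empirical analogue of the distortion controlled by the Gaussian complexity in the result above.

```python
# Sketch: a random matrix with independent, norm-normalized sparse columns
# acting as an approximate isometry on a finite subset of the sphere.
import numpy as np

rng = np.random.default_rng(0)

def sparse_matrix_with_normalized_columns(m: int, n: int, density: float = 0.05):
    """Random m x n matrix with sparse entries, each column rescaled to norm sqrt(m)."""
    A = rng.normal(size=(m, n)) * (rng.random(size=(m, n)) < density)
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0                      # guard against empty columns
    return A * (np.sqrt(m) / norms)

m, n, k = 200, 1000, 50
A = sparse_matrix_with_normalized_columns(m, n)
T = rng.normal(size=(k, n))                      # a finite subset of the domain
T /= np.linalg.norm(T, axis=1, keepdims=True)    # put the test points on the sphere

distortion = np.abs(np.linalg.norm(T @ A.T, axis=1) / np.sqrt(m) - 1.0)
print("max relative distortion on T:", distortion.max())
```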


Differentially private low-dimensional synthetic data from high-dimensional datasets

January 2025 · 4 Reads · 2 Citations · Information and Inference: A Journal of the IMA

Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis, our analysis of private PCA works without assuming the spectral gap for the covariance matrix.
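The private PCA procedure itself is not reproduced here. As a generic, hedged illustration of the broader idea of privatizing a covariance matrix before extracting its top eigenvectors, one can perturb the empirical second-moment matrix with symmetric Gaussian noise (the standard Gaussian mechanism); the noise level `sigma` below is a placeholder and is not calibrated to any particular $(\varepsilon, \delta)$, unlike the paper's analysis.

```python
# Generic sketch of "noisy covariance -> top-k eigenvectors" private PCA.
# This is NOT the paper's algorithm; `sigma` must be calibrated to the privacy
# budget and the data's sensitivity, which we skip here.
import numpy as np

def noisy_pca(X: np.ndarray, k: int, sigma: float, rng=np.random.default_rng(0)):
    """Return k approximate principal directions from a noise-perturbed covariance."""
    n, d = X.shape
    C = X.T @ X / n                              # empirical second-moment matrix
    E = rng.normal(scale=sigma, size=(d, d))
    C_noisy = C + (E + E.T) / 2                  # symmetric Gaussian perturbation
    eigvals, eigvecs = np.linalg.eigh(C_noisy)
    return eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k eigenvectors

# Example: project 100-dimensional data (rows scaled into the unit ball) to 5 dims.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 100))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
V = noisy_pca(X, k=5, sigma=0.01)
X_low = X @ V                                    # low-dimensional representation
print(X_low.shape)
```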


Covariance loss, Szemerédi regularity, and differential privacy

January 2025 · 3 Reads · Proceedings of the American Mathematical Society

We show how randomized rounding based on Grothendieck's identity can be used to prove a nearly tight bound on the covariance loss, the amount of covariance that is lost by taking conditional expectation. This result yields a new type of weak Szemerédi regularity lemma for positive semidefinite matrices and kernels. Moreover, it can be used to construct differentially private synthetic data.
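To make the term concrete: by the law of total covariance, replacing a random vector by its conditional expectation on a partition can only shrink the covariance, and the covariance loss is exactly the amount shrunk. The toy computation below is my own, not from the paper, and uses a quartile-based partition of correlated Gaussian data.

```python
# Toy illustration of covariance loss: replace each data point by the mean of
# its cell in a partition and compare covariance matrices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3)) @ rng.normal(size=(3, 3))        # correlated data
# Partition the sample into 4 cells by quartiles of the first coordinate.
labels = np.digitize(X[:, 0], np.quantile(X[:, 0], [0.25, 0.5, 0.75]))

# Conditional expectation E[X | cell]: every point replaced by its cell's mean.
cond_mean = np.zeros_like(X)
for c in np.unique(labels):
    cond_mean[labels == c] = X[labels == c].mean(axis=0)

loss = np.cov(X, rowvar=False) - np.cov(cond_mean, rowvar=False)
print("covariance loss (operator norm):", np.linalg.norm(loss, 2))
# Up to floating-point error, the loss matrix is positive semidefinite
# (law of total covariance).
print("smallest eigenvalue:", np.linalg.eigvalsh(loss).min())
```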


Jackson's inequality on the hypercube

October 2024 · 38 Reads

We investigate the best constant $J(n,d)$ such that Jackson's inequality $\inf_{\deg(g) \leq d} \|f - g\|_{\infty} \leq J(n,d)\, s(f)$ holds for all functions $f$ on the hypercube $\{0,1\}^n$, where $s(f)$ denotes the sensitivity of $f$. We show that the quantity $J(n, 0.499n)$ is bounded below by an absolute positive constant, independent of $n$. This complements Wagner's theorem, which establishes that $J(n,d) \leq 1$. As a first application, we show that the reverse Bernstein inequality fails in the tail space $L^{1}_{\geq 0.499n}$, improving over previously known counterexamples in $L^{1}_{\geq C \log\log(n)}$. As a second application, we show that there exists a function $f : \{0,1\}^n \to [-1,1]$ whose sensitivity $s(f)$ remains constant, independent of $n$, while the approximate degree grows linearly with $n$. This result implies that the sensitivity theorem $s(f) \geq \Omega(\deg(f)^C)$ fails in the strongest sense for bounded real-valued functions, even when $\deg(f)$ is relaxed to the approximate degree. We also show that in the regime $d = (1-\delta)n$, the bound $J(n,d) \leq C \min\{\delta, \max\{\delta^2, n^{-2/3}\}\}$ holds. Moreover, when restricted to symmetric real-valued functions, we obtain $J_{\mathrm{symmetric}}(n,d) \leq C/d$, and the decay $1/d$ is sharp. Finally, we present results for a subspace approximation problem: we show that there exists a subspace $E$ of dimension $2^{n-1}$ such that $\inf_{g \in E} \|f - g\|_{\infty} \leq s(f)/n$ holds for all $f$.
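For very small $n$, the quantities in this abstract can be computed exactly. The hedged sketch below is my own; it assumes the Boolean-function convention that $s(f)$ counts sensitive coordinates at a worst-case input, and finds the best sup-norm approximation error by degree-$d$ polynomials via a linear program, so the ratio of the two printed numbers lower-bounds $J(n,d)$ for that particular $f$.

```python
# Hedged sketch (small n only): exact best sup-norm approximation of a Boolean
# function by degree-<=d polynomials (via an LP), plus its sensitivity.
from itertools import combinations, product
import numpy as np
from scipy.optimize import linprog

def best_degree_d_error(f_vals, X, n, d):
    """min over deg<=d polynomials g of max_x |f(x) - g(x)| (exact, via an LP)."""
    monomials = [S for k in range(d + 1) for S in combinations(range(n), k)]
    Phi = np.array([[float(np.prod(x[list(S)])) for S in monomials] for x in X])
    m = len(monomials)
    # variables: (coefficients c, error t); minimize t subject to |f - Phi c| <= t
    A_ub = np.block([[ Phi, -np.ones((len(X), 1))],
                     [-Phi, -np.ones((len(X), 1))]])
    b_ub = np.concatenate([f_vals, -f_vals])
    res = linprog(np.r_[np.zeros(m), 1.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * m + [(0, None)], method="highs")
    return res.fun

def sensitivity(f, n):
    """max over x of the number of coordinates whose flip changes f(x)."""
    best = 0
    for x in product([0, 1], repeat=n):
        x = np.array(x)
        flips = sum(f(np.where(np.arange(n) == i, 1 - x, x)) != f(x) for i in range(n))
        best = max(best, flips)
    return best

n, d = 4, 2
maj = lambda x: int(x.sum() > n / 2)             # majority on 4 bits, as an example
X = [np.array(x) for x in product([0, 1], repeat=n)]
f_vals = np.array([maj(x) for x in X], dtype=float)
print("sup-norm error:", best_degree_d_error(f_vals, X, n, d))
print("sensitivity   :", sensitivity(maj, n))
```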


Can we spot a fake?

October 2024 · 11 Reads

The problem of detecting fake data inspires the following seemingly simple mathematical question. Sample a data point $X$ from the standard normal distribution in $\mathbb{R}^n$. An adversary observes $X$ and corrupts it by adding a vector $rt$, where they can choose any vector $t$ from a fixed set $T$ of the adversary's "tricks", and where $r > 0$ is a fixed radius. The adversary's choice of $t = t(X)$ may depend on the true data $X$. The adversary wants to hide the corruption by making the fake data $X + rt$ statistically indistinguishable from the real data $X$. What is the largest radius $r = r(T)$ for which the adversary can create an undetectable fake? We show that for highly symmetric sets $T$, the detectability radius $r(T)$ is approximately twice the scaled Gaussian width of $T$. The upper bound actually holds for arbitrary sets $T$ and generalizes to arbitrary, non-Gaussian distributions of the real data $X$. The lower bound may fail for sets $T$ that are not highly symmetric, but we conjecture that this problem can be solved by considering the focused version of the Gaussian width of $T$, which concentrates on the most important directions of $T$.
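The Gaussian width referred to above is $w(T) = \mathbb{E}\, \sup_{t \in T} \langle g, t \rangle$ with $g$ standard normal; the paper's "scaled" variant may be normalized differently, so the Monte Carlo estimate below is only meant to illustrate the quantity driving the detectability radius. The set of "tricks" here, the standard basis, is my own toy choice.

```python
# Monte Carlo estimate (illustration only) of the Gaussian width
# w(T) = E sup_{t in T} <g, t> for a finite set T.
import numpy as np

def gaussian_width(T: np.ndarray, trials: int = 2000, rng=np.random.default_rng(0)):
    """T: array of shape (num_points, n). Returns a Monte Carlo estimate of w(T)."""
    G = rng.normal(size=(trials, T.shape[1]))
    return (G @ T.T).max(axis=1).mean()

n = 200
T = np.eye(n)                                   # adversary's "tricks": basis vectors
print("w(T) ≈", gaussian_width(T))              # about sqrt(2 log n) for this T
print("sqrt(2 log n) =", np.sqrt(2 * np.log(n)))
```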


Differentially Private Synthetic High-dimensional Tabular Stream

August 2024 · 4 Reads

While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. Our algorithm satisfies differential privacy for the entire input stream (continual differential privacy) and can be used for high-dimensional tabular data. Furthermore, we show the utility of our method via experiments on real-world datasets. The proposed algorithm builds upon a popular select, measure, fit, and iterate paradigm (used by offline synthetic data generation algorithms) and private counters for streams.
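One ingredient named above, private counters for streams, has a standard textbook instantiation: the binary-tree (or "binary mechanism") counter, which releases all running counts with polylogarithmic noise. The sketch below implements that generic mechanism for a fixed horizon; it is not the paper's construction, and the noise calibration assumes the usual sensitivity-1 counting setting.

```python
# Generic binary-tree counter for continual release (not the paper's algorithm).
# Each dyadic block of the stream gets one Laplace-noised partial sum; the running
# count at time t is assembled from O(log T) such blocks.
import numpy as np

class BinaryTreeCounter:
    def __init__(self, horizon: int, epsilon: float, rng=np.random.default_rng(0)):
        self.levels = int(np.ceil(np.log2(horizon))) + 1
        self.scale = self.levels / epsilon      # each item touches <= `levels` nodes
        self.partial = [0.0] * self.levels      # exact partial sums per level
        self.noisy = [0.0] * self.levels        # their noisy releases
        self.t = 0
        self.rng = rng

    def update(self, x: float) -> float:
        """Ingest x_t and return the private running count of x_1 + ... + x_t."""
        self.t += 1
        i = (self.t & -self.t).bit_length() - 1          # lowest set bit of t
        # merge the levels below i into level i's block, then reset them
        self.partial[i] = x + sum(self.partial[:i])
        for j in range(i):
            self.partial[j], self.noisy[j] = 0.0, 0.0
        self.noisy[i] = self.partial[i] + self.rng.laplace(scale=self.scale)
        # running count = noisy sums of the currently "open" dyadic blocks
        return sum(self.noisy[j] for j in range(self.levels) if (self.t >> j) & 1)

counter = BinaryTreeCounter(horizon=1024, epsilon=1.0)
stream = np.random.default_rng(1).integers(0, 2, size=1000)
private_counts = [counter.update(int(x)) for x in stream]
print("true count:", stream.sum(), " private count:", round(private_counts[-1], 1))
```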


Figure captions: Faber–Schauder system · Haar system · The depth-first search tour demonstrates that the TSP of a tree equals twice the sum of the lengths of its edges · Chaining: construction of a spanning tree of a metric space · The map $F$ folds an interval $[0, \mathrm{TSP}(M)]$ into a Hamiltonian path (a "space-filling curve") of the metric space $T$

Private measures, random walks, and synthetic data

April 2024 · 30 Reads · 12 Citations · Probability Theory and Related Fields

Differential privacy is a mathematical concept that provides an information-theoretic security guarantee. While differential privacy has emerged as a de facto standard for guaranteeing privacy in data sharing, the known mechanisms to achieve it come with some serious limitations. Utility guarantees are usually provided only for a fixed, a priori specified set of queries. Moreover, there are no utility guarantees for more complex, but very common, machine learning tasks such as clustering or classification. In this paper we overcome some of these limitations. Working with metric privacy, a powerful generalization of differential privacy, we develop a polynomial-time algorithm that creates a private measure from a data set. This private measure allows us to efficiently construct private synthetic data that are accurate for a wide range of statistical analysis tools. Moreover, we prove an asymptotically sharp min-max result for private measures and synthetic data in general compact metric spaces, for any fixed privacy budget $\varepsilon$ bounded away from zero. A key ingredient in our construction is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin logarithmically slowly.
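Metric privacy, the notion the paper works with, is usually stated as follows (this is the standard definition of $d_{\mathcal{X}}$-privacy; the paper's formulation may differ in details): a randomized mechanism $M$ is metrically private with respect to a metric $d$ if

\[
  \Pr\big[M(x) \in S\big] \;\le\; e^{\varepsilon\, d(x, x')}\, \Pr\big[M(x') \in S\big]
  \qquad \text{for all inputs } x, x' \text{ and all measurable sets } S.
\]

Standard $\varepsilon$-differential privacy is recovered when $d$ is the Hamming distance between datasets.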


Online Differentially Private Synthetic Data Generation

January 2024 · 2 Reads · 1 Citation · IEEE Transactions on Privacy

We present a polynomial-time algorithm for online differentially private synthetic data generation. For a data stream within the hypercube $[0,1]^d$ and an infinite time horizon, we develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(\log(t)\, t^{-1/d})$ for $d \geq 2$ and $O(\log^{4.5}(t)\, t^{-1})$ for $d = 1$ in the 1-Wasserstein distance. This result extends the previous work on the continual release model for counting queries to Lipschitz queries. Compared to the offline case, where the entire dataset is available at once [8], [36], our approach requires only an extra polylog factor in the accuracy bound.
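The accuracy guarantee above is stated in the 1-Wasserstein distance; for one-dimensional samples this metric is cheap to compute, as in the illustration below (not the paper's evaluation code; the "synthetic" data here is just jittered real data for demonstration).

```python
# 1-Wasserstein distance between an original and a synthetic 1-D dataset,
# illustrating the accuracy metric used above.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
data = rng.uniform(size=1000)                      # a stream prefix in [0, 1]
synthetic = np.clip(data + rng.normal(scale=0.05, size=data.size), 0, 1)
print("W1(data, synthetic) ≈", wasserstein_distance(data, synthetic))
```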


Differentially private low-dimensional representation of high-dimensional data

May 2023 · 94 Reads

Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Different from the standard perturbation analysis using the Davis-Kahan theorem, our analysis of private PCA works without assuming the spectral gap for the sample covariance matrix.


Citations (71)


... Our approach adapts recent advancements in private synthetic data generation (in a non-streaming setting) that utilize a hierarchical decomposition to partition the sample space [12,13]. At a high level, these methods work in the following way: ...

Reference:

Private Synthetic Data Generation in Small Memory
Differentially private low-dimensional synthetic data from high-dimensional datasets
  • Citing Article
  • January 2025

Information and Inference: A Journal of the IMA

... Moreover, online algorithms for differentially private synthetic data generation are being developed that promise near-optimal accuracy over time, supporting adaptive data generation in dynamic environments (He et al., 2024). These advancements indicate potential pathways to ensure that data generation processes can remain both useful and fair, addressing the critical balance between maintaining utility and achieving fairness and privacy. ...

Online Differentially Private Synthetic Data Generation

IEEE Transactions on Privacy

... Recent work has discussed the risk of leaking information with typical synthetic data generation [6]. More recently, methods have been developed to generate differentially private synthetic data [50,51], in which case the disclosure risk is clear, via the choice of privacy budget. The second drawback of releasing synthetic data is that making valid inferences with synthetic data requires clear communication from the data steward to the public of how the data can be analyzed. ...

Private measures, random walks, and synthetic data

Probability Theory and Related Fields

... At its core, OT creates a probabilistic map between the unannotated observations (for example, Slide-seq beads) and the means of the classes in the reference representation (for example, mean cell-type profiles) according to their relative similarity. This approach was previously used to map ancestor and descendant cells during differentiation 16 , recover the spatial organization of cells 13,[19][20][21] and associate measurements across modalities (for example, single-cell ATAC-seq and scRNA-seq) [22][23][24] . Through the OT framework, users can set the similarity metric, constrain or bias the by dropout or contamination of ambient RNA, can mask the true identity of an individual cell. ...

AVIDA: An alternating method for visualizing and integrating data
  • Citing Article
  • March 2023

Journal of Computational Science

Kathryn Dover · Anna Ma · [...] · Roman Vershynin

... Point attention is calculated by creating a graph between points, using nearest neighbors or other serialization techniques in order to emulate a rolling window, such as the one present in a traditional attention model [5]. The attention is then calculated and aggregated over the neighborhood of this graph, for example, if there is a source node ( , ) and a destination node are calculated with respect to the destination for each edge. ...

The Quarks of Attention: Structure and Capacity of Neural Attention Building Blocks
  • Citing Article
  • March 2023

Artificial Intelligence

... These approaches collectively enable the creation of synthetic datasets that reflect real-world patterns and maintain their utility for various applications. 4) Privacy-Preserving Measures: Privacy considerations are integrated into the synthetic data generation approach [118]. Techniques such as differential privacy, homomorphic encryption, and data perturbation are employed to safeguard individual privacy while maintaining the utility of the synthetic data. ...

Privacy of Synthetic Data: A Statistical Framework
  • Citing Article
  • January 2022

IEEE Transactions on Information Theory

... Synthetic data offer no mathematical guarantee of privacy per se 96 but can be paired with techniques such as differential privacy to train AI models 97 . A limitation of differential privacy synthetic data is that it may be difficult to preserve correlations between all variables in the original data, although preserving some correlations is achievable 98 . Other limitations include the types of data that can be synthesised, and whether they are realistic. ...

Covariance’s Loss is Privacy’s Gain: Computationally Efficient, Private and Accurate Synthetic Data

Foundations of Computational Mathematics

... Private synthetic data generation. There has been much progress in recent years on methods for differentially private data synthesis and generation [3,7,13,56,68,84,96,8]. Some of the best-performing methods follow the Select-Measure-Project paradigm [78,57]; these algorithms first select highly representative queries to evaluate on the data, measure those queries in a differentially private manner (often with standard additive noise mechanisms), and then project the private measurements onto a parametric distribution, which can then be used to generate arbitrarily many new datapoints. ...

Private Sampling: A Noiseless Approach for Generating Differentially Private Synthetic Data
  • Citing Article
  • September 2022

SIAM Journal on Mathematics of Data Science

... where $N_k$ is an $\epsilon_k$-net for the unit sphere $S^{d-1} = \{u \in \mathbb{R}^d : \|u\| = 1\}$, for some sequence $\epsilon_k > 0$ with $\lim_{k\to\infty} \epsilon_k = 0$. The perspective encapsulated by (aCFW) has been successfully employed with several families of random matrices and has rendered, for instance, almost sure limits for the smallest eigenvalues of random matrices (these are notoriously more difficult to understand than the largest eigenvalues: Tikhomirov [10] and Livshyts [7] define nets of the unit sphere based on the sizes of the largest entries or on sparsity; see the definitions of peaky and compressible/incompressible vectors in [10] and [7], respectively), as well as tail bounds for the operator norms of matrices with independent entries (see, for instance, Chapter 4 in Vershynin [12]). A special case of suprema of random processes (i.e., collections of random variables that are not necessarily independent) for which there exist elegant and powerful results is the Gaussian family: these are normal random variables $(X_t)_{t \in T}$, whose indices lie in some set $T$, and the most analyzed statistic for such collections of random variables is $\sup_{t \in T} X_t$. ...

The smallest singular value of inhomogeneous square random matrices
  • Citing Article
  • April 2021

The Annals of Probability