Roman Vershynin’s research while affiliated with University of California, Irvine and other places
We show that every $(n,d,\lambda)$-graph contains a Hamilton cycle for sufficiently large $n$, under conditions on $d$ and $\lambda$ that allow $d$ to grow subpolynomially in $n$, where $\lambda$ denotes the largest absolute value of a nontrivial eigenvalue of the adjacency matrix. This significantly improves a recent result of Glock, Correia, and Sudakov, who obtained a similar result for $d$ that grows polynomially with $n$. The proof is based on a new result regarding the second largest eigenvalue of the adjacency matrix of a subgraph induced by a random subset of vertices, combined with a recent result on connecting designated pairs of vertices by vertex-disjoint paths in $(n,d,\lambda)$-graphs. We believe that the former result is of independent interest and will have further applications.
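For reference, the standard (Alon) notion of pseudorandomness used above is the following; the paper's precise quantitative assumptions on $d$ and $\lambda$ are not reproduced here.

```latex
% An (n,d,\lambda)-graph is a d-regular graph on n vertices whose adjacency
% eigenvalues d = \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n satisfy
\[
  \max\{\, |\lambda_2|,\ |\lambda_n| \,\} \;\le\; \lambda .
\]
```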
We study random matrices with independent subgaussian columns. Assuming each column has a fixed Euclidean norm, we establish conditions under which such matrices act as near-isometries when restricted to a given subset of their domain. We show that, with high probability, the maximum distortion caused by such a matrix is proportional to the Gaussian complexity of the subset, scaled by the subgaussian norm of the matrix columns. This linear dependence on the subgaussian norm is a new phenomenon, as random matrices with independent rows or independent entries typically exhibit superlinear dependence. As a consequence, normalizing the columns of random sparse matrices leads to stronger embedding guarantees.
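As a rough numerical illustration of the phenomenon (not the paper's construction), the sketch below builds a sparse random matrix with independent columns of fixed unit Euclidean norm and measures its worst-case distortion over a finite test set; all parameter choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, N = 1000, 200, 10, 100   # ambient dim, target dim, nonzeros per column, |T|

# Sparse matrix with independent columns of fixed unit Euclidean norm:
# each column has exactly k nonzero entries equal to +-1/sqrt(k).
A = np.zeros((m, n))
for j in range(n):
    rows = rng.choice(m, size=k, replace=False)
    A[rows, j] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

# A finite test set T of random unit vectors in R^n.
T = rng.standard_normal((N, n))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# Maximum distortion of the embedding x -> Ax over T; zero-mean independent
# unit-norm columns give E||Ax||^2 = ||x||^2, so this should be small.
distortion = np.abs(np.linalg.norm(T @ A.T, axis=1) - 1.0)
print(f"max distortion over T: {distortion.max():.4f}")
```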
We show how randomized rounding based on Grothendieck's identity can be used to prove a nearly tight bound on the covariance loss: the amount of covariance that is lost by taking conditional expectation. This result yields a new type of weak Szemerédi regularity lemma for positive semidefinite matrices and kernels. Moreover, it can be used to construct differentially private synthetic data.
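The identity behind the rounding scheme is classical: for unit vectors $u, v$ and a standard Gaussian vector $g$, one has $\mathbb{E}[\operatorname{sign}\langle g,u\rangle \operatorname{sign}\langle g,v\rangle] = \tfrac{2}{\pi}\arcsin\langle u,v\rangle$. A quick Monte Carlo check (illustrative only, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, trials, rho = 50, 200_000, 0.6

# Two unit vectors with inner product rho.
u = np.zeros(d); u[0] = 1.0
v = np.zeros(d); v[0] = rho; v[1] = np.sqrt(1.0 - rho**2)

# Round the Gaussian projections to signs and compare the correlation
# with Grothendieck's identity (2/pi) * arcsin(rho).
g = rng.standard_normal((trials, d))
empirical = np.mean(np.sign(g @ u) * np.sign(g @ v))
print(f"empirical: {empirical:.4f}   (2/pi)*arcsin(rho): {2/np.pi*np.arcsin(rho):.4f}")
```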
We investigate the best constant J(n,d) for which Jackson's inequality holds for all functions f on the hypercube $\{-1,1\}^n$, where s(f) denotes the sensitivity of f. We show that this quantity is bounded below by an absolute positive constant, independent of n. This complements Wagner's theorem, which provides the corresponding upper bound. As a first application, we show that the reverse Bernstein inequality fails in the tail space, improving over previously known counterexamples. As a second application, we show that there exists a function f whose sensitivity s(f) remains constant, independent of n, while the approximate degree grows linearly with n. This result implies that the sensitivity theorem fails in the strongest sense for bounded real-valued functions, even when the degree is relaxed to the approximate degree. We also show that the bound holds in a complementary regime of the parameters. Moreover, when restricted to symmetric real-valued functions, we obtain a bound with decay 1/d, and the decay 1/d is sharp. Finally, we present results for a subspace approximation problem: we show that there exists a subspace E of suitably small dimension for which the corresponding approximation bound holds for all f.
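For Boolean-valued f, the sensitivity referenced above is the standard quantity below; the paper's real-valued setting generalizes this, and that generalization is not reproduced here.

```latex
% Sensitivity of f : \{-1,1\}^n \to \{-1,1\}: the maximum, over inputs x,
% of the number of coordinates whose flip changes the value of f,
\[
  s(f) \;=\; \max_{x \in \{-1,1\}^n} \#\{\, i \in [n] : f(x^{\oplus i}) \neq f(x) \,\},
\]
% where x^{\oplus i} denotes x with its i-th coordinate flipped.
```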
The problem of detecting fake data inspires the following seemingly simple mathematical question. Sample a data point X from the standard normal distribution in $\mathbb{R}^n$. An adversary observes X and corrupts it by adding a vector rt, where they can choose any vector t from a fixed set T of the adversary's "tricks", and where $r \ge 0$ is a fixed radius. The adversary's choice of t=t(X) may depend on the true data X. The adversary wants to hide the corruption by making the fake data X+rt statistically indistinguishable from the real data X. What is the largest radius r=r(T) for which the adversary can create an undetectable fake? We show that for highly symmetric sets T, the detectability radius r(T) is approximately twice the scaled Gaussian width of T. The upper bound actually holds for arbitrary sets T and generalizes to arbitrary, non-Gaussian distributions of real data X. The lower bound may fail for sets T that are not highly symmetric, but we conjecture that this problem can be solved by considering the focused version of the Gaussian width of T, which focuses on the most important directions of T.
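For context, the Gaussian width referenced above is the standard quantity below; the exact scaling in the paper's "scaled" and "focused" variants is not reproduced here.

```latex
% Gaussian width of T \subset \mathbb{R}^n, with g a standard Gaussian vector
\[
  w(T) \;=\; \mathbb{E}\, \sup_{t \in T} \langle g, t \rangle,
  \qquad g \sim N(0, I_n).
\]
```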
While differentially private synthetic data generation has been explored extensively in the literature, how to update this data in the future if the underlying private data changes is much less understood. We propose an algorithmic framework for streaming data that generates multiple synthetic datasets over time, tracking changes in the underlying private data. Our algorithm satisfies differential privacy for the entire input stream (continual differential privacy) and can be used for high-dimensional tabular data. Furthermore, we show the utility of our method via experiments on real-world datasets. The proposed algorithm builds upon a popular select, measure, fit, and iterate paradigm (used by offline synthetic data generation algorithms) and private counters for streams.
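The continual-privacy building block mentioned above can be as simple as the classical binary (tree) mechanism for counting streams. The sketch below is that textbook mechanism, not necessarily the exact counter the authors use; the epsilon split across tree levels is a standard simplification.

```python
import numpy as np

def binary_mechanism(stream, eps, rng):
    """Continual counting via the classical binary (tree) mechanism.

    Releases a private running sum after every element of a {0,1} stream.
    Each element affects at most L ~ log2(T) tree nodes, so adding
    Laplace(L/eps) noise per node gives eps-DP for the entire stream.
    """
    T = len(stream)
    L = max(1, int(np.ceil(np.log2(T + 1))))
    scale = L / eps                       # per-node Laplace scale
    alpha = [0.0] * (L + 1)               # exact dyadic-block sums per level
    alpha_hat = [0.0] * (L + 1)           # noisy dyadic-block sums per level
    outputs = []
    for t in range(1, T + 1):
        i = (t & -t).bit_length() - 1     # number of trailing zeros of t
        # Merge all finished lower-level blocks plus the new element.
        alpha[i] = sum(alpha[j] for j in range(i)) + stream[t - 1]
        for j in range(i):
            alpha[j] = 0.0
            alpha_hat[j] = 0.0
        alpha_hat[i] = alpha[i] + rng.laplace(scale=scale)
        # The running count is the sum of noisy blocks for the set bits of t.
        outputs.append(sum(alpha_hat[j] for j in range(L + 1) if t >> j & 1))
    return outputs

rng = np.random.default_rng(0)
stream = (rng.random(1000) < 0.3).astype(int)
est = binary_mechanism(stream, eps=1.0, rng=rng)
print(f"private count: {est[-1]:.1f}   true count: {stream.sum()}")
```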
Differential privacy is a mathematical concept that provides an information-theoretic security guarantee. While differential privacy has emerged as a de facto standard for guaranteeing privacy in data sharing, the known mechanisms to achieve it come with some serious limitations. Utility guarantees are usually provided only for a fixed, a priori specified set of queries. Moreover, there are no utility guarantees for more complex—but very common—machine learning tasks such as clustering or classification. In this paper we overcome some of these limitations. Working with metric privacy, a powerful generalization of differential privacy, we develop a polynomial-time algorithm that creates a private measure from a data set. This private measure allows us to efficiently construct private synthetic data that are accurate for a wide range of statistical analysis tools. Moreover, we prove an asymptotically sharp min-max result for private measures and synthetic data in general compact metric spaces, for any fixed privacy budget ε bounded away from zero. A key ingredient in our construction is a new superregular random walk, whose joint distribution of steps is as regular as that of independent random variables, yet which deviates from the origin logarithmically slowly.
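The paper's construction (via the superregular random walk) is substantially more refined; purely as a baseline for intuition, here is the simplest private measure on a discretized space: a perturbed histogram, clipped and renormalized. All names and parameters are illustrative assumptions.

```python
import numpy as np

def perturbed_histogram_measure(data, n_bins, eps, rng):
    """Baseline eps-DP private measure on [0, 1] (NOT the paper's algorithm).

    Swapping one data point changes the histogram by at most 2 in L1 norm,
    so Laplace noise with scale 2/eps per bin gives eps-differential privacy.
    """
    counts, _ = np.histogram(data, bins=n_bins, range=(0.0, 1.0))
    noisy = counts + rng.laplace(scale=2.0 / eps, size=n_bins)
    noisy = np.clip(noisy, 0.0, None)     # project onto nonnegative vectors
    total = noisy.sum()
    if total <= 0:                        # degenerate case: fall back to uniform
        return np.full(n_bins, 1.0 / n_bins)
    return noisy / total                  # renormalize to a probability measure

rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=5000)
mu = perturbed_histogram_measure(data, n_bins=64, eps=1.0, rng=rng)
print(mu.sum(), mu.max())
```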
We present a polynomial-time algorithm for online differentially private synthetic data generation. For a data stream within the hypercube $[0,1]^d$ and an infinite time horizon, we develop an online algorithm that generates a differentially private synthetic dataset at each time $t$. This algorithm achieves a near-optimal accuracy bound of $O(t^{-1/d}\,\mathrm{polylog}(t))$ for $d \ge 2$ and $O(t^{-1}\,\mathrm{polylog}(t))$ for $d = 1$ in the 1-Wasserstein distance. This result extends the previous work on the continual release model for counting queries to Lipschitz queries. Compared to the offline case, where the entire dataset is available at once [8], [36], our approach requires only an extra polylog factor in the accuracy bound.
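A one-line reason why accuracy in the 1-Wasserstein distance controls all Lipschitz queries simultaneously is Kantorovich-Rubinstein duality:

```latex
% The 1-Wasserstein distance equals the worst-case error over 1-Lipschitz queries
\[
  W_1(\mu, \nu) \;=\; \sup_{\mathrm{Lip}(f) \le 1}
  \Big( \int f \, d\mu \;-\; \int f \, d\nu \Big).
\]
```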
Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. However, when the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. In this paper, we propose a differentially private algorithm to generate low-dimensional synthetic data efficiently from a high-dimensional dataset with a utility guarantee with respect to the Wasserstein distance. A key step of our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound that circumvents the curse of dimensionality. Unlike the standard perturbation analysis based on the Davis-Kahan theorem, our analysis of private PCA works without assuming the spectral gap for the sample covariance matrix.
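A minimal sketch of one standard route to private PCA, in the spirit of "Analyze Gauss" (symmetric Gaussian perturbation of the empirical covariance); the paper's procedure and its accuracy analysis differ, and the noise level sigma below is a placeholder that would need to be calibrated to the privacy budget and a norm bound on the data.

```python
import numpy as np

def private_pca(X, k, sigma, rng):
    """Sketch: top-k private principal components via a noisy covariance.

    Adds a symmetric Gaussian noise matrix to the empirical covariance;
    sigma must be calibrated to the privacy budget and the row-norm bound
    of X (calibration omitted in this sketch).
    """
    n, d = X.shape
    cov = X.T @ X / n
    E = rng.normal(scale=sigma, size=(d, d))
    E = (E + E.T) / np.sqrt(2)                    # symmetrize the noise
    evals, evecs = np.linalg.eigh(cov + E)
    return evecs[:, np.argsort(evals)[::-1][:k]]  # top-k eigenvectors

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20)) @ np.diag(np.linspace(2, 0.1, 20))
V = private_pca(X, k=3, sigma=0.05, rng=rng)
print(V.shape)  # (20, 3)
```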
... Our approach adapts recent advancements in private synthetic data generation (in a non-streaming setting) that utilize a hierarchical decomposition to partition the sample space [12,13]. At a high level, these methods work in the following way: ...
... Moreover, online algorithms for differentially private synthetic data generation are being developed that promise near-optimal accuracy over time, supporting adaptive data generation in dynamic environments (He et al., 2024). These advancements indicate potential pathways to ensure that data generation processes can remain both useful and fair, addressing the critical balance between maintaining utility and achieving fairness and privacy. ...
... Recent work has discussed the risk of leaking information with typical synthetic data generation [6]. More recently, methods have been developed to generate differentially private synthetic data [50,51], in which case the disclosure risk is clear, via the choice of privacy budget. The second drawback of releasing synthetic data is that making valid inferences with synthetic data requires clear communication from the data steward to the public of how the data can be analyzed. ...
... At its core, OT creates a probabilistic map between the unannotated observations (for example, Slide-seq beads) and the means of the classes in the reference representation (for example, mean cell-type profiles) according to their relative similarity. This approach was previously used to map ancestor and descendant cells during differentiation [16], recover the spatial organization of cells [13], [19], [20], [21] and associate measurements across modalities (for example, single-cell ATAC-seq and scRNA-seq) [22], [23], [24]. Through the OT framework, users can set the similarity metric and constrain or bias the mapping. ... Dropout or contamination of ambient RNA can mask the true identity of an individual cell. ...
... Point attention is calculated by creating a graph between points, using nearest neighbors or other serialization techniques in order to emulate a rolling window, such as the one present in a traditional attention model [5]. The attention is then calculated and aggregated over the neighborhood of this graph: for each edge, attention between a source node and a destination node is calculated with respect to the destination. ...
... These approaches collectively enable the creation of synthetic datasets that reflect real-world patterns and maintain their utility for various applications. 4) Privacy-Preserving Measures: Privacy considerations are integrated into the synthetic data generation approach [118]. Techniques such as differential privacy, homomorphic encryption, and data perturbation are employed to safeguard individual privacy while maintaining the utility of the synthetic data. ...
... Synthetic data offer no mathematical guarantee of privacy per se [96] but can be paired with techniques such as differential privacy to train AI models [97]. A limitation of differentially private synthetic data is that it may be difficult to preserve correlations between all variables in the original data, although preserving some correlations is achievable [98]. Other limitations include the types of data that can be synthesised, and whether they are realistic. ...
... Private synthetic data generation. There has been much progress in recent years on methods for differentially private data synthesis and generation [3,7,13,56,68,84,96,8]. Some of the best-performing methods follow the Select-Measure-Project paradigm [78,57]; these algorithms first select highly representative queries to evaluate on the data, measure those queries in a differentially private manner (often with standard additive noise mechanisms), and then project the private measurements onto a parametric distribution, which can then be used to generate arbitrarily many new datapoints. ...
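A toy instance of the three steps described in the snippet above, selecting the one-way marginals of binary data as the queries, measuring them with Laplace noise, and projecting onto a product-Bernoulli model; everything here is an illustrative simplification of the paradigm, not any cited system.

```python
import numpy as np

def select_measure_project(X, eps, rng):
    """Toy Select-Measure-Project: 1-way marginals -> product Bernoulli."""
    n, d = X.shape
    # Select: the d one-way marginals (mean of each binary column).
    # Measure: each marginal privately; splitting eps across d queries,
    # each mean has sensitivity 1/n, so the Laplace scale is d / (n * eps).
    noisy = X.mean(axis=0) + rng.laplace(scale=d / (n * eps), size=d)
    # Project: clip to [0, 1] to obtain a valid product-Bernoulli model.
    p = np.clip(noisy, 0.0, 1.0)
    # Generate: sample arbitrarily many synthetic rows from the model.
    return (rng.random((n, d)) < p).astype(int)

rng = np.random.default_rng(0)
X = (rng.random((2000, 5)) < [0.1, 0.3, 0.5, 0.7, 0.9]).astype(int)
synth = select_measure_project(X, eps=1.0, rng=rng)
print(X.mean(axis=0).round(2), synth.mean(axis=0).round(2))
```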
... This important case has been extensively studied in computer science in general and in machine learning in particular (see e.g. [1,5,8,11]), as well as in other applications of probability theory, though mainly when the $X_i$'s are independent. ...
... where $N_k$ is an $\epsilon_k$-net for the unit sphere $S^{d-1} = \{u \in \mathbb{R}^d : \|u\| = 1\}$, for some sequence $\epsilon_k > 0$ with $\lim_{k \to \infty} \epsilon_k = 0$. The perspective encapsulated by (aCFW) has been successfully employed with several families of random matrices and has yielded, for instance, almost sure limits for the smallest eigenvalues of random matrices (these are notoriously more difficult to understand than the largest eigenvalues; Tikhomirov [10] and Livshyts [7] define nets of the unit sphere based on the sizes of the largest entries or on sparsity: see the definitions of peaky and compressible/incompressible vectors in [10] and [7], respectively), as well as tail bounds for the operator norms of matrices with independent entries (see, for instance, chapter 4 in Vershynin [12]). A special case of suprema of random processes (i.e., collections of random variables that are not necessarily independent) for which there exist elegant and powerful results is the Gaussian family: these are normal random variables $(X_t)_{t \in T}$, whose indices lie in some set $T$, and the most analyzed statistic for such collections of random variables is $\sup_{t \in T} X_t$. ...