Institute for System Programming, Russian Academy of Sciences
Recent publications
The influence of the viscosity on a wave attractor flow has been previously studied, particularly in relation to the widening of the hydrodynamical structures. In this work, we simulate an attractor flow with a peculiar bottom shape that includes an underwater hill. During the simulation, we discovered a side structure appearing beyond the wave attractor. We determined that the appearance of this structure is connected to viscosity. In this article, we consider the behavior of this newly found structure. Additionally, we discuss the challenges of energy accumulation and the estimation of the Reynolds number, which is a non-trivial problem in the context of wave attractor flows.
The rapid development of machine learning and deep learning has introduced increasingly complex optimization challenges that must be addressed. Indeed, training modern, advanced models has become difficult to implement without leveraging multiple computing nodes in a distributed environment. Distributed optimization is also fundamental to emerging fields such as federated learning. Specifically, there is a need to organize the training process to minimize the time lost due to communication. A widely used and extensively researched technique to mitigate the communication bottleneck involves performing local training before communication. This approach is the focus of our paper. Concurrently, adaptive methods that incorporate scaling, notably led by Adam, have gained significant popularity in recent years. Therefore, this paper aims to merge the local training technique with the adaptive approach to develop efficient distributed learning methods. We consider the classical Local SGD method and enhance it with a scaling feature. A crucial aspect is that the scaling is described generically, allowing us to analyze various approaches, including Adam, RMSProp, and OASIS, in a unified manner. In addition to the theoretical analysis, we validate the performance of our methods in practice by training a neural network. Bibliography: 49 titles.
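The combination of local training and generic scaling described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy quadratic objective, the RMSProp-style rule `rmsprop_rule`, the choice to average the scaling state at each communication round, and all hyperparameters.

```python
import random

def rmsprop_rule(v, g, beta=0.9):
    # Hypothetical scaling rule: exponential moving average of squared gradients.
    return [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, g)]

def loss(x, curv):
    # Toy objective f(x) = 0.5 * sum_j curv[j] * x[j]^2 (assumed, not from the paper).
    return 0.5 * sum(c * xi * xi for c, xi in zip(curv, x))

def scaled_local_sgd(curv, x0, scale_rule=rmsprop_rule, workers=4, rounds=40,
                     local_steps=5, lr=0.1, sigma=0.1, eps=1e-8, seed=0):
    rng = random.Random(seed)
    dim = len(x0)
    xs = [list(x0) for _ in range(workers)]          # local iterates
    vs = [[0.0] * dim for _ in range(workers)]       # local scaling states
    x_avg = list(x0)
    for _round in range(rounds):
        for m in range(workers):
            for _step in range(local_steps):
                # Stochastic gradient of the toy quadratic with additive noise.
                g = [c * xi + rng.gauss(0.0, sigma) for c, xi in zip(curv, xs[m])]
                vs[m] = scale_rule(vs[m], g)
                xs[m] = [xi - lr * gi / (vi ** 0.5 + eps)
                         for xi, gi, vi in zip(xs[m], g, vs[m])]
        # Communication round: average iterates and scaling states.
        x_avg = [sum(x[j] for x in xs) / workers for j in range(dim)]
        v_avg = [sum(v[j] for v in vs) / workers for j in range(dim)]
        xs = [list(x_avg) for _ in range(workers)]
        vs = [list(v_avg) for _ in range(workers)]
    return x_avg
```

Because the scaling rule is passed in as a function, a different rule with the same signature can be swapped in without touching the training loop, which mirrors the generic treatment of scaling mentioned in the abstract.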
Modern trends in machine learning demand ever greater generalization ability from models, which drives growth in both model size and training set size. Such tasks are already difficult to solve on a single device. This is why distributed and federated learning approaches are becoming more popular every day. Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy. One of the best-known approaches to reducing communication costs is to exploit the similarity of local data. Both Hessian similarity and homogeneous gradients have been studied in the literature, but separately. In this paper we combine both of these assumptions in analyzing a new method that incorporates the ideas of data similarity and client sampling. Moreover, to address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method. The theory is confirmed by training on real datasets. Bibliography: 45 titles.
The general theory of greedy approximation with respect to arbitrary dictionaries is well developed in the case of real Banach spaces. Recently, some of the results proved for the Weak Chebyshev Greedy Algorithm (WCGA) in the case of real Banach spaces were extended to the case of complex Banach spaces. In this paper we extend some of the results known in the real case for greedy algorithms other than the WCGA to the case of complex Banach spaces. Bibliography: 25 titles.
Distributed optimization plays an important role in modern large-scale machine learning and data processing systems by optimizing the utilization of computational resources. One of the classical and popular approaches is Local Stochastic Gradient Descent (Local SGD), characterized by multiple local updates before averaging, which is particularly useful in distributed environments to reduce communication bottlenecks and improve scalability. A typical feature of this method is the dependence on the frequency of communications. But in the case of a quadratic target function with homogeneous data distribution over all devices, the influence of the frequency of communications vanishes. As a natural consequence, subsequent studies include the assumption of a Lipschitz Hessian, as this indicates the similarity of the optimized function to a quadratic one to a certain extent. However, in order to extend the completeness of Local SGD theory and unlock its potential, in this paper we abandon the Lipschitz Hessian assumption by introducing a new concept of approximate quadraticity. This assumption gives a new perspective on problems that have near quadratic properties. In addition, existing theoretical analyses of Local SGD often assume a bounded variance. We, in turn, consider the unbounded noise condition, which allows us to broaden the class of problems under study. Bibliography: 36 titles.
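The remark that communication frequency stops mattering for a quadratic objective with homogeneous data can be checked numerically. The sketch below assumes a one-dimensional quadratic f(x) = a*x^2/2 - b*x, additive Gaussian gradient noise, and hypothetical parameter values; none of these come from the paper. Since the gradient is linear in x, the average of the local iterates follows the same recursion no matter when the workers synchronize.

```python
import random

def local_sgd_quadratic(local_steps, total_steps=60, workers=4,
                        a=2.0, b=1.0, lr=0.05, sigma=0.5, seed=0):
    # One reproducible noise stream per worker, identical across runs,
    # so runs with different local_steps see exactly the same noise.
    rngs = [random.Random(seed + m) for m in range(workers)]
    xs = [5.0] * workers
    for t in range(1, total_steps + 1):
        # Local step: noisy gradient of f(x) = 0.5 * a * x^2 - b * x.
        xs = [x - lr * (a * x - b + rng.gauss(0.0, sigma))
              for x, rng in zip(xs, rngs)]
        if t % local_steps == 0:
            # Communication: replace every local iterate with the average.
            avg = sum(xs) / workers
            xs = [avg] * workers
    return sum(xs) / workers

every_step = local_sgd_quadratic(local_steps=1)
every_tenth = local_sgd_quadratic(local_steps=10)
```

Under these assumptions `every_step` and `every_tenth` agree up to floating-point rounding. For a non-quadratic objective the gradient is no longer linear and the two runs would generally differ.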
Background: In clinical practice, various methods are used to identify ALK gene rearrangements in tumor samples, ranging from “classic” techniques, such as IHC, FISH, and RT-qPCR, to more advanced highly multiplexed approaches, such as NanoString technology and NGS panels. Each of these methods has its own advantages and disadvantages, but they share the drawback of detecting only a restricted (although sometimes quite extensive) set of preselected biomarkers. At the same time, whole transcriptome sequencing (WTS, RNAseq) can, in principle, be used to detect gene fusions while simultaneously analyzing an incomparably wide range of tumor characteristics. However, WTS is not widely used in practice due to purely analytical limitations and the high complexity of bioinformatic analysis, which requires considerable expertise. In particular, methods to detect gene fusions in RNAseq data rely on the identification of chimeric reads. However, the typically low number of true fusion reads in RNAseq limits its sensitivity. In a previous study, we observed asymmetry in the RNAseq exon coverage of the 3′ partners of some fusion transcripts. In this study, we conducted a comprehensive evaluation of the accuracy of ALK fusion detection through an analysis of differences in the coverage of its tyrosine kinase exons. Methods: A total of 906 human cancer biosamples were subjected to analysis using experimental RNAseq data, with the objective of determining the extent of asymmetry in ALK coverage. A total of 50 samples were analyzed, comprising 13 samples with predicted ALK fusions and 37 samples without predicted ALK fusions. These samples were assessed by targeted sequencing with two NGS panels that were specifically designed to detect fusion transcripts (the TruSight RNA Fusion Panel and the OncoFu Elite panel). Results: ALK fusions were confirmed in 11 out of the 13 predicted cases, with an overall accuracy of 96% (sensitivity 100%, specificity 94.9%). Two discordant cases exhibited low ALK coverage depth, which could be addressed algorithmically to enhance the accuracy of the results. It was also important to consider read strand specificity due to the presence of antisense transcripts involving parts of ALK. In a limited patient sample undergoing ALK-targeted therapy, the algorithm successfully predicted treatment efficacy. Conclusions: RNAseq exon coverage analysis can effectively detect ALK rearrangements.
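The reported accuracy figures can be reproduced from the counts given in the abstract. The confusion-matrix entries below are inferred rather than quoted: 11 confirmed out of 13 predicted fusions gives 11 true positives and 2 false positives, and a sensitivity of 100% implies that all 37 samples without a predicted fusion were true negatives. The function name is ours.

```python
def confusion_metrics(tp, fp, tn, fn):
    # Standard definitions of accuracy, sensitivity and specificity.
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts inferred from the abstract (50 samples validated by targeted sequencing).
metrics = confusion_metrics(tp=11, fp=2, tn=37, fn=0)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

With these counts the accuracy is 48/50, the sensitivity is 11/11 and the specificity is 37/39, which round to the percentages stated in the Results.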
The paper presents a review of the current state of subgradient and accelerated convex optimization methods, including cases with noise and with access to various kinds of information about the objective function (function value, gradient, stochastic gradient, higher derivatives). For nonconvex problems, the Polyak–Lojasiewicz condition is considered and a review of the main results is given. The behavior of numerical methods in the presence of a sharp minimum is also considered. The aim of this review is to show how the works of B.T. Polyak (1935–2023) on gradient optimization methods and related topics have influenced the modern development of numerical optimization methods.
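For readers unfamiliar with it, the Polyak–Lojasiewicz condition mentioned in this abstract is usually stated as below, together with the classical linear rate it gives for gradient descent. The notation (f, f^*, \mu, L) is ours and is not taken from the review.

```latex
% Assumptions: f is differentiable, \nabla f is L-Lipschitz, and f^* = \inf_x f(x) > -\infty.
% Polyak-Lojasiewicz (PL) condition with constant \mu > 0:
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
  \qquad \text{for all } x .
\]
% Gradient descent with step size 1/L, x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k),
% then converges linearly in function value, without any convexity assumption:
\[
  f(x_k) - f^{*} \;\le\; \Bigl(1 - \tfrac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{*}\bigr).
\]
```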
Real-world data is not stationary, and thus models must be monitored in production. One way to be confident in a model’s performance is regular testing. If labels are not available, the task of minimizing the labeling cost can be formulated. In this work, we investigate and develop various ways to construct a minimum test set for a given trained model, such that the accuracy of the model calculated on the chosen subset is as close to the real one as possible. We focus on the white-box scenario and propose a novel approach that uses neuron coverage as an observable functional to maximize in order to minimize the number of samples. We evaluate the proposed approach and compare it to Bayesian methods and stratification algorithms, which are the main approaches to this task in the literature. The developed method shows approximately the same level of performance but has a number of advantages over its competitors. It is deterministic, thus eliminating the dispersion of the results. Also, this method can give a hint on the optimal budget.
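The idea of choosing test samples by maximizing neuron coverage can be sketched as a greedy set-cover-style selection. This is a simplified illustration, not the authors' algorithm: the function name, the representation of each sample as the set of neuron ids it activates, and the stopping rule are all assumptions.

```python
def greedy_coverage_subset(activated, budget=None):
    """Pick samples greedily so that each adds the most newly covered neurons.

    activated[i] is the set of neuron ids that sample i activates
    (how activation is thresholded is left out of this sketch).
    """
    covered = set()
    chosen = []
    remaining = list(range(len(activated)))
    while remaining and (budget is None or len(chosen) < budget):
        # max() returns the first maximal element, so ties go to the lowest index.
        best = max(remaining, key=lambda i: len(activated[i] - covered))
        if not activated[best] - covered:
            break  # coverage has saturated; further samples add nothing
        chosen.append(best)
        covered |= activated[best]
        remaining.remove(best)
    return chosen, covered

samples = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}]
chosen, covered = greedy_coverage_subset(samples)
```

The selection is deterministic, and the point at which coverage saturates gives a natural stopping size, which loosely corresponds to the budget hint mentioned in the abstract.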
Graph neural networks (GNNs) have shown great promise in a variety of tasks involving graph data, including recommendation systems. However, as GNNs become more widely adopted in practical applications, concerns have arisen about their vulnerability to adversarial attacks. These attacks can lead to biased recommendations, potentially causing economic losses and safety risks. In this work, we consider an industrial application of recommendation systems for transport logistics and study their vulnerability to membership inference attacks. The dataset represents real train flows in Russia, published in the ETIS project. Experiments with three popular GNN architectures show that all of them can be successfully attacked even when the adversary has minimal background knowledge. Specifically, an attacker with access to only 1-2% of the actual data can successfully train their own GNN model to infer the membership of a shipper-consignee association in the training set with an accuracy over 94%. Our study also confirms that overfitting is the primary factor that influences the attack performance of recommendation systems.
Ensuring the security and reliability of machine learning frameworks is crucial for building trustworthy AI-based systems. Fuzzing, a popular technique in the secure software development lifecycle (SSDLC), can be used to develop secure and robust software. Popular machine learning frameworks such as PyTorch and TensorFlow are complex and written in multiple programming languages including C/C++ and Python. We propose a dynamic analysis pipeline for Python projects using the Sydr-Fuzz toolset. Our pipeline includes fuzzing, corpus minimization, crash triaging, and coverage collection. Crash triaging and severity estimation are important steps to ensure that the most critical vulnerabilities are addressed promptly. Furthermore, the proposed pipeline is integrated in GitLab CI. To identify the most vulnerable parts of the machine learning frameworks, we analyze their potential attack surfaces and develop fuzz targets for PyTorch, TensorFlow, and related projects such as h5py. Applying our dynamic analysis pipeline to these targets, we were able to discover 3 new bugs and propose fixes for them.
An important limitation of existing adversarial attacks on real-world object detectors lies in their threat model: adversarial patch-based methods often produce suspicious images, while image generation approaches do not restrict the attacker’s ability to modify the original scene. We design a threat model where the attacker modifies individual image segments and is required to produce realistic images. We also develop and evaluate a white-box attack that utilizes generative adversarial nets and diffusion models as a generator of malicious images. Our attack is able to produce high-fidelity images as measured by the Fréchet inception distance (FID) and reduces the mAP of the Faster R-CNN model by over 0.2 on the Cityscapes and COCO-Stuff datasets. A PyTorch implementation of our attack is available at https://github.com/DariaShel/gan-attack.
Introduction. The classification of the Northwest dialect of the Mari language is currently not entirely clear. According to the Encyclopedia of the Mari El Republic, the “Northwest Mari language exhibits some features of both Hill Mari and Meadow-Eastern Mari languages.” However, G. Bereczki believed that the Northwest dialect belongs to the Hill Mari language. The aim of the study is to assess the linguistic similarity of the language of the first Gospel created by Priest S. Bobrovsky (1821) and of the 20th-century dialect, according to the data of I. G. Ivanov and G. M. Tuzharov, to the Hill and Meadow Mari languages from the perspective of glottochronology. Materials and Methods. To assess the proximity of the Northwest Mari dialect of the 18th–19th centuries to the Meadow and Hill Mari dialects, phonetic isoglosses and glottochronology were analysed. For this purpose, a glossed corpus of the first Gospel was uploaded to LingvoDoc, from which a concordance was created. Proprietary LingvoDoc programs were applied to it for comparative-historical analysis. Results and Discussion. It was found that in the 19th century, the first Northwest Mari Gospel showed a lexical similarity of 97% with Meadow Mari and 94% with Hill Mari. In the dictionary compiled in the 20th century, there was already significantly less similarity with both Meadow and Hill languages, but the same pattern persisted: lexically, the Northwest dialect in the 19th century and in the 20th century was closer to Meadow Mari. Kreknin’s and Platunov’s 18th-century dictionary practically matched the Meadow Mari book “Beginnings of Christian Doctrine...” created in the early 19th century, with only one diagnostic position showing double reflection. In the Gospel translated by Priest S. Bobrovsky, three positions already showed reflection typical of the Hill Mari grammar by A. Albinsky. Conclusion. Therefore, it becomes clear that initially, in the 18th century, Northwest Mari differed little from the Meadow Mari language, while by the 19th century it had become much closer to the Hill Mari dialect. The materials of the article may be useful for specialists in the Mari and Finno-Ugric languages.
In this paper we discuss the problems arising from the active introduction of artificial intelligence (AI) technologies into medicine and other humanities, as well as the causes of these problems and the steps being taken worldwide to solve them. We focus on the methods and tools for developing trusted AI technologies, which are being created within the ISP RAS Trusted AI Research Center. We present the results of interdisciplinary projects carried out in the Center and suggest a number of solutions to speed up the development of humanitarian AI technologies. The paper expands on the report given at the General Assembly of the RAS on March 12, 2024.
In this study, we analysed biological pathway diversity among Europeans and Northern Americans of European origin, the groups of people that share a common genetic ancestry but live in different geographic regions. We used a novel complex approach for analysing genomic data: we studied the total effects of multiple weak selection signals, accumulated from independent SNPs within a pathway. We found significant differences between immunity-related biological pathways from the two groups. All identified pathways included genes belonging to the major histocompatibility complex (MHC) system, which plays an important role in adaptive immune responses. We suggest that the ways of evolution were different for the MHC-I and MHC-II gene groups at least in Europeans and Americans of European origin. We hypothesise that the observed variability between the two populations was triggered by selection pressures due to the different pathogen landscapes and pathogen loads on the two continents. Our findings can be important for epidemic prevention and control, as well as for analysing processes related to allergies, organ transplantation, and autoimmune diseases.
The paper presents an improved approach for modeling multicomponent gas mixtures based on quasi‐gasdynamic equations. The proposed numerical algorithm is implemented as the reactingQGDFoam solver on the open‐source OpenFOAM platform. The following problems have been considered for validation: Riemann problems, the backward-facing step problem, the interaction of a shock wave with a heavy and a light gas bubble, and the unsteady underexpanded hydrogen jet flow in air. The stability and convergence parameters of the proposed numerical algorithm are determined. The simulation results are found to be in agreement with analytical solutions and experimental data.
In this paper we discuss the problem of creating trusted artificial intelligence (AI) technologies. Modern AI is based on machine learning and neural networks and is vulnerable to biases and errors. Efforts are being made to establish standards for the development of trusted AI technologies, but they have not yet succeeded. Trust in AI technologies can only be achieved with the appropriate scientific and technological base and corresponding tools and techniques for countering attacks. We present the results of the ISP RAS Trusted AI Research Center and propose a work model that can ensure technological independence and long-term sustainable development in this area.
The widespread adoption of cloud computing necessitates privacy-preserving techniques that allow information to be processed without disclosure. This paper proposes a method to increase the accuracy and performance of privacy-preserving Convolutional Neural Networks with Homomorphic Encryption (CNN-HE) using Self-Learning Activation Functions (SLAF). SLAFs are polynomials with trainable coefficients that are updated during training, together with synaptic weights, for each polynomial independently to learn task-specific and CNN-specific features. We prove theoretically that SLAFs can approximate any continuous activation function to within a desired error that depends on the SLAF degree. Two CNN-HE models are proposed: CNN-HE-SLAF and CNN-HE-SLAF-R. In the first model, all activation functions are replaced by SLAFs, and the CNN is trained to find weights and coefficients. In the second one, the CNN is trained with the original activation, then the weights are fixed, the activation is replaced by a SLAF, and the CNN is briefly retrained to adapt the SLAF coefficients. We show that such self-learning can achieve the same accuracy (99.38%) as a non-polynomial ReLU over non-homomorphic CNNs, and yields higher accuracy (99.21%) and higher performance (6.26 times faster) than the state-of-the-art CNN-HE CryptoNets on the MNIST optical character recognition benchmark dataset.
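A polynomial activation with trainable coefficients can be illustrated in isolation. The sketch below fits a degree-2 polynomial to ReLU on the interval [-1, 1] by plain gradient descent on the mean squared error. In the paper the coefficients are trained jointly with the network weights, so this standalone fit, the interval, the degree, and the hyperparameters are all assumptions made for illustration.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def poly(coeffs, x):
    # Polynomial activation c0 + c1*x + c2*x^2 + ... (only additions and
    # multiplications, which is what homomorphic encryption can evaluate).
    return sum(c * x ** k for k, c in enumerate(coeffs))

def mse(coeffs, xs):
    return sum((poly(coeffs, x) - relu(x)) ** 2 for x in xs) / len(xs)

def fit_polynomial_activation(degree=2, steps=2000, lr=0.1, points=41):
    xs = [-1.0 + 2.0 * i / (points - 1) for i in range(points)]
    coeffs = [0.0] * (degree + 1)          # trainable coefficients
    for _ in range(steps):
        grads = [0.0] * (degree + 1)
        for x in xs:
            err = poly(coeffs, x) - relu(x)
            for k in range(degree + 1):
                grads[k] += 2.0 * err * x ** k / len(xs)
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs, xs

coeffs, xs = fit_polynomial_activation()
```

After fitting, `mse(coeffs, xs)` is far below the error of the all-zero starting polynomial, and raising `degree` would tighten the fit further at the cost of more multiplications per activation.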
5 members
Sergey Sergeevich Gulin
  • Functional Languages
Kirill Lukyanov
  • Research Center for Trusted Artificial Intelligence
Savva Mitrofanov
  • Programming Technologies
Information
Address
Moscow, Russia