Nikolay Butakov's research while affiliated with ITMO University and other places

Publications (47)

Chapter
Knowledge Base Question-Answering (KBQA) systems have a number of benefits over Question-Answering over Text (QAT) systems: the ability to process complex, multi-hop questions and the absence of any need for texts containing the answers. However, developing a KBQA system requires a knowledge base (KB) and a training dataset. This fact is a sign...
Conference Paper
Full-text available
Topic modeling is a popular unsupervised method for processing text corpora to obtain interpretable knowledge about the data. However, there is a gap in automatic quality measurement between existing metrics, human evaluation and performance on the target tasks. That is a big challenge for automatic hyperparameter tuning methods as they heavily rely on the...
Article
Topic modelling is a popular unsupervised method for text processing that provides interpretable document representation. One of the most high-level approaches is additively regularized topic models (ARTM). This method features better quality than other methods due to its flexibility and advanced regularization abilities. However, it is challenging...
Chapter
Nowadays, question answering systems have applications in many areas and deep learning models now show the best results, especially in the task of obtaining an answer based on a question and some text. To train such models, a large number of question-answer-text triplets (QAT) are required. We offer a weak-supervision approach that allows recoverin...
Article
The real estate business has a lot of risks, and in order to minimize them, you need a lot of information from different sources. Systems based on natural language processing can help customers find this information more easily: question answering, information retrieval, etc. The existing method of question answering requires data aligned with poss...
Conference Paper
Communities of various specialists are using public groups on platforms like Telegram or Slack to discuss specific domain-oriented topics, for instance, Python programming language, Clickhouse database management system or even film making peculiarities. Conversations in such chats often have a form of questions and answers: someone is looking for...
Conference Paper
Full-text available
Online advertising is one of the most widespread ways to reach and increase a target audience for those selling products. Usually having a form of a banner, advertising engages users into visiting a corresponding webpage. Professional generation of banners requires creative and writing skills and a basic understanding of target products. The great...
Conference Paper
Topic modelling is a popular unsupervised method for text processing which provides interpretable document representation. One of the most high-level approaches, considering its capability of imitating the behaviour of various methods such as LDA or PLSA, is based on additive regularization technique. However, due to its flexibility and advanced re...
Chapter
The tasks of aspect identification and term extraction remain challenging in natural language processing. While supervised methods achieve high scores, it is hard to use them in real-world applications due to the lack of labelled datasets. Unsupervised approaches outperform these methods on several tasks, but it is still a challenge to extract both...
Preprint
Full-text available
The tasks of aspect identification and term extraction remain challenging in natural language processing. While supervised methods achieve high scores, it is hard to use them in real-world applications due to the lack of labelled datasets. Unsupervised approaches outperform these methods on several tasks, but it is still a challenge to extract both...
Chapter
It is common practice nowadays to use multiple social networks for different social roles. Despite this, these networks differ in content type, communication and style of speech. If we intend to understand human behaviour as a key feature for recommender systems, banking risk assessments or sociological research, it is better to a...
Preprint
It is common practice nowadays to use multiple social networks for different social roles. Despite this, these networks differ in content type, communication and style of speech. If we intend to understand human behaviour as a key feature for recommender systems, banking risk assessments or sociological research, it is better to a...
Article
Full-text available
Many companies want to use chatbot systems as smart assistants that support human specialists, especially newcomers, with automatic consulting. Implementing a really useful smart assistant for a specific domain requires a knowledge base for this domain, which often exists only in the form of text documentation and manuals. Lac...
Article
Full-text available
This article describes the results of a study on processing and analysing semi-structured data from Russian court decisions (almost 30 million) using a distributed cluster-computing framework and machine learning. Spark was used for data processing and decision trees were used for analysis. The results of the automation of data collection and structurin...
Chapter
Full-text available
Today, advanced research is based on complex simulations that require a lot of computational resources, which are usually organized in a very complicated way from a technical point of view. This means that a scientist from physics, biology or even sociology has to struggle with all the technical issues involved in building distributed multi-scale applic...
Article
Full-text available
Data provided by social media is becoming increasingly important material for analysis by social scientists, market analysts, and other stakeholders. The diversity of interests leads to the emergence of a variety of crawling techniques and programming solutions. Nevertheless, these solutions lack the flexibility to satisfy the requirements of different u...
Conference Paper
Modern cities are subject to various threats such as terrorist attacks or natural disasters. An effective response to them requires fast delivery of information from as close as possible to the source of events. Online social networks can play the role of a monitoring system for such events, with their users acting as sensors. But to exploit such a sys...
Chapter
Volunteering and community reaction is known to be an essential part of response to critical events. Rapid evolution and emergence of the new means of communication allowed even further expansion of these practices via the medium of the social networks. A new category of volunteers emerged – those that are not in the proximity to the area of emerge...
Poster
Full-text available
Conference poster for paper "Kalyuzhnaya, A. V., Nikitin, N. O., Butakov, N., & Nasonov, D. (2018, June). Precedent-Based Approach for the Identification of Deviant Behavior in Social Media. In International Conference on Computational Science (pp. 846-852). Springer, Cham. "
Chapter
The current paper is devoted to a problem of deviant users’ identification in social media. For this purpose, each user of social media source should be described through a profile that aggregates open information about him/her within the special structure. Aggregated user profiles are formally described in terms of multivariate random process. The...
Article
Full-text available
Due to the rapid development of social networks and their increasing availability from almost any device and platform, they have become an essential part of daily life for most users all over the world. This leads to the natural assumption that the way a user prefers to live offline affects the content of his/her internet profile. Having...
Article
Full-text available
Sentiment is an important feature of natural language. It is used to understand the semantics of texts and the opinions of people. Many practical applications require extracting sentiment from texts: advertising analytics, interactive chat bots, opinion mining. Today, different supervised techniques are used to extract sentiment from texts w...
Article
Full-text available
Over the past few years, Russia has seen, on the one hand, growing interest in the popularization of scientific knowledge and, on the other, an increase in the number of people who believe in the validity of homeopathy and feel distrust or even fear of genetically modified foods. This means that the need to disseminate scientific knowledge is highly...
Conference Paper
The Multiscale Modelling and Simulation approach is a powerful methodological way to identify sub-models and classify their interaction. The execution order and interaction of computational modules are described in the form of a workflow. This workflow can be executed as a single HPC cluster job if there is a middleware which schedules modules executi...
Conference Paper
The ‘VKontakte’ social network is increasingly becoming a useful source of data for studying many aspects of Russian social life. This research paper uses this source to investigate one of the most crucial issues for many young Russians – how successful are Russia’s leading universities in the employability rates of their graduates on completion of...
Article
The development of an efficient Early Warning System (EWS) is essential for the prediction and prevention of imminent natural hazards. In addition to providing a computationally intensive infrastructure with extensive data transfer, high-execution reliability and hard-deadline satisfaction are important requirements of EWS scenario processing. This...
Article
Full-text available
In the past few decades of the development of models of information spread through complex networks it became evident that robust assessment and simulation of such processes seems to be feasible when reorganization of the network is neglected. Social networking sites are a particular case of such complication. Moreover, there are few distinctive fe...
Article
Full-text available
Social sensing is increasingly becoming a viable addition to the urban monitoring toolkit for practitioners and decision-makers. It seems to be more flexible and cost-effective as compared to dedicated monitoring systems based on instrumental sensors and surveillance cameras. However, benefitting from these advantages requires deploying fine-tuned...
Article
Full-text available
Simulation of agent-based models has several problems related to scalability and the accuracy of motion reproduction. An increase in the number of agents leads to additional computations, and hence the program run time also increases. This problem can be solved using distributed simulation and distributed computational environments such as clust...
Article
Urgent computing capabilities for early warning systems and decision support systems are vital in situations that require execution be completed before a specified deadline. The cost of missing the deadline in such situations can be unacceptable, while providing insufficient results can mean an ineffective solution that may come at a very high cost...
Article
Full-text available
The importance of online social networks (OSN) and their data leads to the need to collect this data for different purposes. Restrictions imposed by various OSNs prevent obtaining this data in the required volume and time. Sharing credentials by many users in combination with different user needs and their request types can solve this problem, but...
Article
Full-text available
Currently, social networks are a medium designed to share and spread information and opinions among users. To spread information effectively, the characteristics of information sources and their positions in the network have to be properly adjusted. Sources play a crucial role as they initiate and support the spreading process. Adjusting sources has a ce...
Article
Full-text available
Efficient scheduling is an essential part of processing complex scientific applications in distributed computational environments. The computational complexity comes both from environment heterogeneity and from the application structure, which is usually represented as a workflow containing different linked tasks. A lot of well-known techniques...
Article
Full-text available
In this work the framework for detectors layout optimization based on a multi-agent simulation is proposed. Its main intention is to provide a decision support team with a tool for automatic design of social threat detection systems for public crowded places. Containing a number of distributed detectors, this system performs detection and an identi...
Conference Paper
Full-text available
Today, technological progress drives the scientific community to take on more and more complex issues related to computational organization in distributed heterogeneous environments, which usually include cloud computing systems, grids, clusters, PCs and even mobile phones. In such environments, traditionally, one of the most frequently used mechanisms...
Conference Paper
Data visualization traditionally is the most powerful tool for demonstration and analysis of scientific results and mathematical models in particular. In this paper we introduce the graphical framework for citation graph clustering. Furthermore, we discuss ways to detect factors responsible for scientific groups formation. Two datasets of scientifi...
Article
Typical patterns of using scientific workflows include their periodical executions using a fixed set of computational resources. Using the statistics from multiple runs, one can accurately estimate task execution and communication times to apply static scheduling algorithms. Several workflows with known estimates could be combined into a set to imp...
Article
Full-text available
Typical patterns of using scientific workflow management systems (SWMS) include periodical executions of prebuilt workflows with precisely known estimates of tasks' execution times. Combining such workflows into sets could sufficiently improve resulting schedules in terms of fairness and meeting users' constraints. In this paper, we propose a clust...
Article
Full-text available
Investigations into the development of efficient early warning systems (EWS) are essential for the prediction of and warning about upcoming natural hazards. Besides the provision of communication- and computationally intensive infrastructure, high resource reliability and a hard deadline option are required for EWS scenario processing in order to get guaranteed info...
Chapter
The optimal workflow scheduling is one of the most important issues in heterogeneous distributed computational environment. Existing heuristic and evolutionary scheduling algorithms have their advantages and disadvantages. In this work we propose a hybrid algorithm based on Heterogeneous Earliest Finish Time heuristic and genetic algorithm that com...
Article
State-of-the-art distributed computational environments require increasingly flexible and efficient workflow scheduling procedures in order to satisfy the increasing requirements of the scientific community. In this paper, we present a novel, nature-inspired scheduling approach based on the leveraging of inherited populations in order to increase...

Citations

... Though even with the development of neural models for specific task-solving topic models are still used not only for an understanding of contents of large corpora but also for other problems, such as information retrieval [17], downstream document classification [18], or sentiment extraction [19]. Resulting distributions over documents can be used to find the subsets of homogeneous data, which can be utilized to improve the performance of neural networks by fine-tuning the general model on domain-specific data [20]. ...
... We prepared a set of topic models with additive regularization from various optimization generations with the help of the evolutionary approach described in [32]. The main idea of the method is to effectively optimize the hyperparameters of topic models with additive regularization paying attention to the iterative improvement of the models. ...
... For decoding, the global representation and globally scoped local representation are used together along with a SoftMax activation function. One of the models that tried to build on top of the learnings by ABAE model is the CMAM [9] model that introduces a novel convolutional multi-attention mechanism which helps us find the aspects in the form of (aspect, term) pairs. It can perform better than the ABAE model in terms of F1 score for some of the aspects. ...
... Literature reviewed against the proposed method of Sindhi documents retrieval system for students is divided into multiple directions such as information retrieval, indexing of documents, topics, modeling, topic searching, data mining, and pattern discovery. A scheme was presented in [10] for IR that was completely based on text information. The proposed approach is based on ensemble and data-driven techniques, which eventually highlight the combinations of various default sets. ...
... and court decisions analysis (Metsker et al., 2019). A recent monograph (Frolova & Ermakova, 2022) states that prediction of court decisions is one of the captivating legal technologies that increasingly attracts the attention of legal practitioners and NLP researchers alike. ...
... Second, some private details, such as age and gender, are hard to get due to the limited access of some APIs. Some works [1,56] tackle this limitation by using face detection APIs, which predict age and gender from the profile photo. However, such research is still restricted and may not meet our expectations, since the majority of suicidal people do not post real photos and usually choose dark photos. ...
... Das et al. [31] applied a robust and effective algorithm that adaptively tunes the batch size for promoting the performance of Spark Streaming. Petrov et al. [32] proposed a robust and adaptive performance model for Spark Streaming to achieve the goal of allocating resources dynamically and reducing the total cost. ...
... The proposed mechanism is based on a sentiment graph for categorizing the polarity of each term. Another sentiment analysis method based on topic modeling is presented in [44] where each sentiment is classified based on its topic. Each topic is manually labeled using a dictionary that measures the false positive rate for each topic. ...
... Still, the automatic evaluation is not well aligned with human evaluation [24], [28] and the problem of the lack of a good metric for measuring the quality of topic models is intensified, including with the emergence and active development of neural topic models [6]- [8], which have the ability to optimize significantly the provided quality metric, but at the same time get worse results based on the results of human perception [24]. ...
... and data from official site "Our SPb". With the help of crawler system [6] posts from selected groups of housing estates, city districts and urban activists were collected, which resulted in approximately 430 thousand texts. Nearly 4000 posts on Saint-Petersburg with district metadata were extracted from "Yandex.local". ...