Vadim Zaytsev

Raincode Labs

PhD


83
Publications
5,657
Reads
410
Citations
Additional affiliations
December 2010 - November 2013
Centrum Wiskunde & Informatica
Position
  • GrammarLab: Foundations for a Grammar Laboratory
May 2008 - November 2010
Universität Koblenz-Landau
Position
  • Software Language Processing Suite
January 2004 - April 2008
Victoria University Melbourne
Position
  • Language-Parametric Program Restructuring


Publications (83)
Conference Paper
We propose a tool and underlying technique that uses semi-parsing to extract control flow graphs from legacy source code (written in COBOL). Obtaining such control flow graphs can be useful in the industrial setting of legacy modernisation, to quickly demonstrate to code owners that modernisation engineers did not break their business logic. They...
Presentation
Full-text available
Refactoring is a common step in the process of modernising software. This task is often delegated to experts, e.g. when dealing with complex legacy software. An example of such experts is the company Raincode Labs, a company that provides services in the realm of legacy modernisation. When working on code critical to a business, it is important to...
Poster
Full-text available
When working with a complex process, it is difficult to get a clear idea of how exactly changes to the input can impact the output. Visualising how the steps of such a process evolve with input can help understanding and/or boost confidence in the produced result. We took a suitable existing industrial process, and created a visualisation for it usi...
Conference Paper
Full-text available
Large code refactoring projects can consist of hundreds of refactoring rules that are applied iteratively to make code easier to maintain. Visualising the refactoring process can help engineers and stakeholders understand how chains of refactorings were applied and to gain more confidence in the produced result. An apparently suitable existing visu...
Conference Paper
Full-text available
Software developers repeatedly perform similar but non-identical changes to a system's source code. Such groups of similar systematic code changes are performed for various reasons: adapting code to a changed API, migrating to a different library, refactoring to improve code quality, performing routine code maintenance tasks, fixing multiple manifes...
Conference Paper
Full-text available
Software written in legacy programming languages is notoriously ubiquitous and often comprises business-critical portions of codebases and portfolios. Some of these languages, like COBOL, mature, grow, and acquire modern tooling that makes maintenance activities more bearable. Others, like many fourth generation languages (4GLs), stagnate and becom...
Conference Paper
Full-text available
Most software engineers interact with some form of code differencing every day, either directly or indirectly. Yet, many existing algorithms and tools used in that context have not significantly evolved since the basic Unix diff utility. As a consequence, many specific characteristics and semantics of code are not taken into account in the process,...
Conference Paper
Full-text available
In an ongoing industry-university collaboration we are developing a language-parametric framework for mining code idioms in legacy systems. This modular framework has a pipeline architecture and a language-parametric meta representation of the artefacts used by each of its 5 components: source code importer, mining preprocessor, pattern miner, patt...
Conference Paper
Full-text available
Discovering regularities in source code is of great interest to software engineers, both in academia and in industry, as regularities can provide useful information to help in a variety of tasks such as code comprehension, code refactoring, and fault localisation. However, traditional pattern mining algorithms often find too many patterns of little...
Conference Paper
Full-text available
Event-based parsing is a largely unexplored problem. Despite several hugely popular event-based parsers like SAX, there is very little research on the ways grammar engineers can be given explicit control over handling input tokens, and the consequences of exposing this control. Tool support is also underwhelming, with no language workbenches and ve...
Conference Paper
Full-text available
Typically in modernisation projects any concerns for code quality are silenced until the end of the migration, to simplify an already complex process. Yet, we claim from experience that prioritising quality above many other issues has many benefits. In this experience report, we discuss a modernisation project of mBank, a big Polish bank, where bad...
Conference Paper
Full-text available
Compiler construction is one of the oldest areas of software engineering, yet despite its maturity it has underdeveloped sides such as compiler testing. There exist many disparate methods for testing parsers, optimisers and other components, but no unified methodology that is consumable by practitioners from a book to be directly applied to fulfil the...
Conference Paper
Most modern software languages enjoy relatively free and relaxed concrete syntax, with significant flexibility of formatting of the program/model/sheet text. Yet, in the dark legacy corners of software engineering there are still languages with a strict fixed column-based structure — the compromises of times long gone, attempting to combine some hu...
Conference Paper
Full-text available
One of the contemporary methods of tackling complexity and obscurity of information systems is megamodelling: creating explicit models that express relations between software artefacts, languages and transformations. Such megamodels are capable of encapsulating architectural knowledge of a system while retaining the ability to "zoom in" and provi...
Conference Paper
Any grammar engineer can tell a good grammar from a bad one, but there is no commonly accepted taxonomy of indicators of required grammar refactorings. One of the consequences of this lack of general smell taxonomy is the scarcity of tools to assess and improve the quality of grammars. By combining two lines of research — on smell detection and on...
Article
Full-text available
Most modern software languages enjoy relatively free and relaxed concrete syntax, with significant flexibility of formatting of the program/model/sheet text. Yet, in the dark legacy corners of software engineering there are still languages with a strict fixed column-based structure — the compromises of times long gone, attempting to combine some hu...
Conference Paper
Full-text available
Legacy software systems were often written not just in programming languages typically associated with legacy, such as COBOL, JOVIAL and PL/I, but also in decommissioned or deprecated 4GLs. Writing compilers and other migration and renovation tools for such languages is an active business that requires substantial effort but has proven to be a succ...
Conference Paper
Full-text available
Article
The context of this work is specification, detection and ultimately removal of detectable harmful patterns in source code that are associated with defects in design and implementation of software. In particular, we investigate five code smells and four antipatterns previously defined in papers and books. Our inquiry is about detecting those in sour...
Book
This tutorial volume includes the revised and extended tutorials (briefings) held at the 5th International Summer School on Grand Timely Topics in Software Engineering, GTTSE 2015, in Braga, Portugal, in August 2015. GTTSE 2015 applied a broader scope to include additional areas of software analysis, empirical research, modularity, and product line...
Conference Paper
Full-text available
Coding conventions are lexical, syntactic or semantic restrictions enforced on top of a software language for the sake of consistency within the source base. Specifying coding conventions is currently an open problem in software language engineering, addressed in practice by resorting to natural language descriptions which complicate conformance ve...
Conference Paper
IBM's High Level Assembler (HLASM) is a low level programming language for z/Architecture mainframe computers. Many legacy codebases contain large subsets written in HLASM for various reasons, and such components usually had to be manually rewritten in COBOL or PL/I before a migration to a modern framework could take place. Now, the Raincode ASM370...
Conference Paper
Full-text available
Software language identification techniques are applicable to many situations from universal IDE support to legacy code analysis. Most widely used heuristics are based on software artefact metadata such as file extensions or on grammar-based text analysis such as keyword search. In this paper we propose to use statistical language models from the n...
Article
SPPF (shared packed parse forest) is the best known graph representation of a parse forest (family of related parse trees) used in parsing with ambiguous/conjunctive grammars. Systematic general purpose transformations of SPPFs have never been investigated and are considered to be an open problem in software language engineering. In this paper, we...
Article
Full-text available
Relating formal grammars is a hard problem that balances between language equivalence (which is known to be undecidable) and grammar identity (which is trivial). In this paper, we investigate several milestones between those two extremes and propose a methodology for inconsistency management in grammar engineering. While conventional grammar conver...
Article
Full-text available
SPPF (shared packed parse forest) is the best known graph representation of a parse forest (family of related parse trees) used in parsing with ambiguous/conjunctive grammars. Systematic general purpose transformations of SPPFs have never been investigated and are considered to be an open problem in software language engineering. In this paper, we...
Article
SPPF (shared packed parse forest) is the best known graph representation of a parse forest (family of related parse trees) used in parsing with ambiguous/conjunctive grammars. Systematic general purpose transformations of SPPFs have never been investigated and are considered to be an open problem in software language engineering. In this paper, we...
Conference Paper
Full-text available
At the end of the 2014 MODELS Educators Symposium, a panel discussed the topic of the use of MOOCs (Massive Open Online Courses) in model-driven engineering education with the audience. Currently, there are no MOOCs that target software modeling. The panel argued if it would be worthwhile to investigate the idea of using MOOCs for modeling courses...
Conference Paper
Full-text available
Having multiple representations of the same instance is common in software language engineering: models can be visualised as graphs, edited as text, serialised as XML. When mappings between such representations are considered, terms “parsing” and “unparsing” are often used with incompatible meanings and varying sets of underlying assumptions. We in...
Article
In this paper we describe composition of a corpus of grammars in a broad sense in order to enable reuse of knowledge accumulated in the field of grammarware engineering. The Grammar Zoo displays the results of grammar hunting for big grammars of mainstream languages, as well as collecting grammars of smaller DSLs and extracting grammatical knowledg...
Article
In this paper, we study controlled adaptability of metamodel transformations. We consider one of the most rigid metamodel evolution formalisms - automated grammar transformation with operator suites, where a transformation script is built in such a way that it is essentially meant to be applicable only to one designated input grammar fragment. We p...
Conference Paper
Full-text available
There exist many techniques for imprecise manipulation of source code (robust parsing, error repair, lexical analysis, etc.), mostly relying on heuristic-based tolerance. Such techniques are rarely fully formalised and quite often idiosyncratic, which makes them very hard to compare with respect to their applicability, tolerance level and general us...
Article
We all use software modelling in some sense, often without using this term. We also tend to use increasingly sophisticated software languages to express our design and implementation intentions towards the machine and towards our peers. We also occasionally engage in metamodelling as a process of shaping the language of interest, and in megamodelli...
Article
Full-text available
The evolution of a software language (whether modelled by a grammar or a schema or a metamodel) is not limited to development of new versions and dialects. An important dimension of a software language evolution is maturing in the sense of improving the quality of its definition. In this paper, we present a maturity model used within the Grammar Zo...
Article
We should test people the same way we test software. During the last decades, the field of software testing has matured into a solid sector of software engineering with a wide variety of available techniques, empirically supported usefulness and applicability claims, a sophisticated ontology of terms and approaches, as well as an arsenal of tools....
Article
There are many software languages which are not exposed protocols, exchange formats, interfaces and storage formats, and are only used for intermediate representation, runtime data manipulation and tool-specific serialisation. Yet, they can be important for technology comprehension, since such internal implementation details may have indirect impac...
Article
Full-text available
Code duplication in a program can make understanding and maintenance difficult. The problem can be reduced by detecting duplicated code, refactoring it into a separate procedure, and replacing all the clones by appropriate calls to the new procedure. In this paper, we report on a confirmatory replication of a tool that was used to detect such refac...
Article
Full-text available
Software Language Engineering (SLE) has emerged as a field in computer science research and software engineering, but it has yet to become entrenched as part of the standard curriculum at universities. Many places have a compiler construction (CC) course and a programming languages (PL) course, but these are not aimed at training students in typica...
Article
Grammars in a broad sense (specifications of structural commitments) are complex artefacts that define software languages. Assessing and improving their quality in an automated, non-idiosyncratic manner is an unsolved problem which we face in an especially acute form in the case of mass maintenance of hundreds of heterogeneous grammars (parser spec...
Conference Paper
Micropatterns and nanopatterns have been previously demonstrated to be useful techniques for object-oriented program comprehension. In this paper, we use a similar approach for identifying structurally similar fragments in grammars in a broad sense (contracts for commitment to structure), in particular parser specifications, metamodels and data mod...
Conference Paper
The OOPSLE workshop is a discussion-oriented and collaborative forum for formulating and addressing open, unsolved and unsolvable problems in software language engineering (SLE), which is a research domain of systematic, disciplined and measurable approaches to development, evolution and maintenance of artificial languages used in software dev...
Article
The classic approach to grammar manipulation is based on instant processing of grammar edits, which limits the kinds of grammar evolution scenarios that can be expressed with it. Treating transformation preconditions as guards poses limitations on concurrent changes of the same grammar, on reuse of evolution scripts, on expressing optionally execut...
Article
Megamodels may be difficult to understand because they reside at a high level of abstraction and they are graph-like structures that do not immediately provide means of order and decomposition as needed for successive examination and comprehension. To improve megamodel comprehension, we introduce modeling features for the recreation, in fact, renar...
Article
Grammars in a broad sense (specifications of structural commitments) are complex artefacts that define software languages. Assessing and improving their quality in an automated, non-idiosyncratic manner is an unsolved problem which we face in an especially acute form in the case of mass maintenance of hundreds of heterogeneous grammars (parser spec...
Article
This document is a case study in aggressive self-archiving. It collects all initiatives undertaken by its author in 2012, including unpublished ones, explains their relevance and relation with one another. Discussed topics include guided convergence of formal grammars in a broad sense, programmable grammar transformation operator suites, metasyntac...
Conference Paper
We study the use of megamodels (models of linguistic architecture) for presenting software language engineering scenarios. Megamodels and techniques similar to them are frequently found in situations when a linguistic architecture needs to be understood without the implicit knowledge that was originally present, and in situations when such knowledg...
Conference Paper
Full-text available
In this paper, we study controlled adaptability of metamodel transformations. We consider one of the most rigid metamodel transformation formalisms - automated grammar transformation with operator suites, where a transformation script is built in such a way that it is essentially meant to be applicable only to one designated input grammar fragmen...
Article
Full-text available
This report is meant to be used as auxiliary material for the guided grammar convergence technique proposed earlier as problem-specific improvement in the topic of convergence of grammars. It contains a narrated MegaL megamodel, as well as full results of the guided grammar convergence experiment on the Factorial Language, with details about each g...
Article
Automation of grammar recovery is an important research area that received attention over the last decade and a half. Given the abundance of available documentation for software languages that is only going to keep increasing in the future, there is need for reliable extraction techniques that allow grammar engineers to derive useful information fr...
Conference Paper
Reusing existing grammar knowledge residing in standards, specifications and manuals for programming languages faces several challenges. One of the most significant of them is the diversity of syntactic notations: without loss of generality, we can state that every single language document uses its own notation, which is more often than not, a dia...
Article
Full-text available
Currently existing syntactic definitions employ many different notations (usually dialects of EBNF) with slight deviations among them, which prevent efficient automated processing. When changes in such notation are required either due to maintenance activities such as correction or evolution, or because a grammar collection is written in a differen...
Article
Vadim Zaytsev. Guided Grammar Convergence. Draft submitted to The Eighth European Conference on Modelling Foundations and Applications (ECMFA 2012), July 2012. Submitted, pending reviews.
Article
Vadim Zaytsev. Notation-Parametric Grammar Recovery. To appear in The Proceedings of the 12th International Workshop on Language Descriptions, Tools, and Applications (LDTA 2012), March 2012
Article
Bernd Fischer, Ralf Lämmel, Vadim Zaytsev. Comparison of Context-free Grammars Based on Parsing Generated Test Data. In Uwe Aßmann, Anthony Sloane, editors, Post-proceedings of the Fourth International Conference on Software Language Engineering (SLE 2011), LNCS 6940, pages 324–343, Springer, Heidelberg, 2012.
Article
Vadim Zaytsev. BNF WAS HERE: What Have We Done About the Unnecessary Diversity of Notation for Syntactic Definitions. Accepted at The Technical Track on Programming Languages of the 27th ACM Symposium on Applied Computing (SAC 2012), March 2012. To appear.
Article
Full-text available
The paper describes in detail the recovery effort of one of the official MediaWiki grammars. Over two hundred grammar transformation steps are reported and annotated, leading to delivery of a level 2 grammar, semi-automatically extracted from a community created semi-formal text using at least five different syntactic notations, several non-enforce...
Conference Paper
There exist a number of software engineering scenarios that essentially involve equivalence or correspondence assertions for some of the context-free grammars in the scenarios. For instance, when applying grammar transformations during parser development, be it for the sake of disambiguation or grammar-class compliance, one would like to preserve t...
Conference Paper
Full-text available
We have analyzed a substantial number of language documentation artifacts, including language standards, language specifications, language reference manuals, as well as internal documents of standardization bodies. We have reverse-engineered their intended internal structure, and compared the results. The Language Document Format (LDF) was develop...
Article
Full-text available
Grammar convergence is a method that helps discover relationships between different grammars of the same language or different language versions. The key element of the method is the operational, transformation-based representation of those relationships. Given input grammars for convergence, they are transformed until they are structurally equa...
Conference Paper
We describe a completed effort to recover the relationships between all the grammars that occur in the different versions of the Java Language Specification (JLS). The relationships are represented as grammar transformations that capture all accidental or intended differences between the JLS grammars. This process is mechanized and it is driven by...
Conference Paper
The process of grammar convergence involves grammar extraction and transformation for structural equivalence and contains a range of technical challenges. These need to be addressed in order for the method to deliver useful results. The paper describes a DSL and the infrastructure behind it that automates the convergence process, hides negligible b...
Conference Paper
Full-text available
Grammar convergence is a lightweight verification method for establishing and maintaining the correspondence between grammar knowledge ingrained in all kinds of software artifacts, e.g., object models, XML schemas, parser descriptions, or language documents. The central idea is to extract grammars from diverse software artifacts, and to trans...
Article
Full-text available
The ISO programming language standards are valuable documents that describe the syntax and semantics of mainstream languages. New features are proposed after thorough reviews by the standardization committees, leading to change documents that describe which modifications have to be enforced in the language standard document in order to actually a...
Article
Full-text available
Grammar consistency checking. Many software languages (and programming languages, in particular) are described simultaneously by multiple grammars that are found in different software artifacts. For instance, one grammar may reside in a language specification; another grammar may be encoded in a parser specification; yet another grammar may be...
Article
Introduction. The most used programming language nowadays is COBOL. At the Free University in Amsterdam we have done numerous transformations on COBOL, parsed and transformed millions of lines of code. COBOL is standardised, but vendors usually deviate from the standard, making their own dialects. In order to parse code, we need a working grammar,...
Article
The Java Language Specification (JLS) is an industrial standard that is critical to the Java platform. Each of the 3 versions of the JLS contains 2 different grammars: a "more readable" one, and a "more implementable" one. The JLS does not describe the correspondences between the 6 grammars in any systematic, detailed manner. We have found that...
Article
Using a mechanized process, we have reverse-engineered the relationships between the 6 different grammars that are contained in 3 versions of the Java Language Specification (JLS). To this end, we have extracted the grammars from (the HTML representation of) the JLS, and we have systematically resolved or captured all accidental or intended dif...
Article
A grammar of a programming language is needed for re-engineering source code. However, the grammar is not always readily available and should be reverse engineered from a language manual or other documents of similar nature by using automatic and semi-automatic recovery methods. These techniques are also needed for modern languages such as C#. T...
Article
Another view on code obfuscation is presented below: we address it not as a technique that opposes reverse engineering, but as a restructuring transformation useful in different areas as well. Besides, we consider code obfuscation as one of the techniques that can be researched and implemented in a programming language parametric way. The paper e...
Article
Grammarware engineering is a discipline for uniformly approaching analysis and manipulation of structured definitions (grammars). There are many known scenarios of software language evolution, transformation, testing, etc., whose usefulness is not disputed, yet limited by the lack of visualization techniques. In this short paper, several exampl...
Article
The paper concerns the problem of applying software metrics to very large heterogeneous codebases. It is described how to create such a codebase and how to apply previously proposed benchmarks such as cyclomatic complexity and function point analysis to it. Big companies such as ones doing insurance or banking tend to maintain big IT portfolios c...


Projects (2)
Project
In software engineering, code differencing techniques have not evolved much from the initial diff algorithm created in 1976 by Hunt and McIlroy, whereas the discipline of computer science has evolved with parsing and clone detection techniques or more semantic and model-based approaches. We want to explore these techniques to discover how they could be applied to the discipline of differencing to create concrete industrially useful tools. For that purpose, we will rely on the help of our industrial partner, which has concrete industrial experience as well as knowledge about processes that are currently still performed manually and require automation. This industrial partner is Raincode Labs, which has many years of experience in the domain of software evolution and migration, and which can provide us with real-life industrial use cases and data along with its experience.