Łukasz Bolikowski
University of Warsaw (UW) · Interdisciplinary Centre for Mathematical and Computational Modelling
PhD in Computer Science
About
42 Publications · 8,624 Reads
525 Citations
Introduction
Head of the Data Analysis Laboratory at ICM, University of Warsaw. His current research interests focus on (big and small) data analysis and on new models of scholarly communication.
Holds a PhD in Computer Science from the Polish Academy of Sciences and has a solid mathematical background. Over 10 years of experience at ICM UW, working as a researcher, analyst, and software developer in numerous interdisciplinary projects.
Also an independent expert for Polish and European research funding agencies.
Additional affiliations
November 2002 - present: Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw
Education
March 2011: PhD in Computer Science, Polish Academy of Sciences
June 2004
Publications (42)
OpenAIRE is the European Union initiative for an Open Access Infrastructure for Research in support of open scholarly communication and access to the research output of European funded projects and open access content from a network of institutional and disciplinary repositories. This article outlines the curation activities conducted in the OpenAI...
In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. As a first source of labels, Wikipedia is employed; a second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on the dataset...
The field of digital document content analysis includes many important tasks, for example page segmentation or zone classification. It is impossible to build effective solutions for such problems and evaluate their performance without a reliable test set that contains both input documents and expected results of segmentation and classification. In...
CERMINE is a comprehensive open source system for extracting structured metadata from scientific articles in a born-digital form. Among other information, CERMINE is able to extract authors and affiliations of a given publication, establish relations between them and present extracted metadata in a structured, machine-readable form. Affiliations ex...
CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in a born-digital form. The system is based on a modular workflow, whose loosely coupled architecture allows for individual component evaluation and adjustment, enables effortless improvements and replacements of independent parts of the algori...
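To make the idea of such a loosely coupled workflow concrete, here is a minimal sketch of a component pipeline in the same spirit; the run_pipeline helper and the two toy steps are illustrative assumptions, not CERMINE's actual modules:

```python
from typing import Callable, Dict, List

def run_pipeline(document: str, steps: List[Callable]) -> Dict:
    """Loosely coupled extraction pipeline: each step reads the document
    plus the metadata gathered so far and contributes its own fields, so
    components can be evaluated, tuned, or swapped independently."""
    metadata: Dict = {}
    for step in steps:
        metadata.update(step(document, metadata))
    return metadata

# Illustrative components (not CERMINE's actual modules):
def extract_title(doc: str, meta: Dict) -> Dict:
    return {"title": doc.splitlines()[0]}

def extract_authors(doc: str, meta: Dict) -> Dict:
    return {"authors": [l[3:] for l in doc.splitlines() if l.startswith("By ")]}

doc = "A Sample Article\nBy A. Author\nAbstract: ..."
print(run_pipeline(doc, [extract_title, extract_authors]))
```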
CERMINE is a comprehensive open source system for extracting structured metadata and references from born-digital scientific literature. Among other information, the system is able to extract information related to the context the article was written in, such as the authors and their affiliations, the relations between them or references to other a...
Minting persistent identifiers and managing their metadata is typically governed by a single organization. Such a single point of failure poses a risk to longevity and long-term preservation of identifiers. In this paper we address the risk by proposing a radically different approach, in which minting and management of persistent identifiers is dis...
Contemporary scholarly communication is undergoing a paradigm shift, which in some ways echoes the one from the start of the Digital Era, when publications moved to a digital form. The most prominent example of such requirements is provided by the European Commission with the Open Access mandates for publications and data in Horizon 2020. Finally, r...
The Information Inference Framework presented in this paper provides a general-purpose suite of tools enabling the definition and execution of flexible and reliable data processing workflows whose nodes offer application-specific processing capabilities. The IIF is designed for the purpose of processing big data, and it is implemented on top of Apa...
Scientific literature analysis improves knowledge propagation and plays a key role in understanding and assessment of scholarly communication in the scientific world. In recent years many tools and services for analysing the content of scientific articles have been developed. One of the most important tasks in this research area is understanding the ro...
Most commonly, the first part of record deduplication is blocking. During this phase, roughly similar entities are grouped into blocks, where more exact clustering is performed. We present a blocking method for citation matching based on hash functions. A blocking workflow implemented in Apache Hadoop is outlined. A few hash functions are proposed an...
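A minimal sketch of what hash-based blocking can look like, assuming toy citation records with first_author and year fields; the block_key function below is an illustrative stand-in, not one of the hash functions proposed in the paper:

```python
import hashlib
from collections import defaultdict

def block_key(citation):
    """Hypothetical blocking key: the first three letters of the first
    author's surname plus the publication year, hashed into a short
    bucket identifier."""
    raw = citation["first_author"][:3].lower() + str(citation["year"])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()[:8]

def make_blocks(citations):
    """Group roughly similar citations into blocks; exact matching then
    runs only within each (much smaller) block."""
    blocks = defaultdict(list)
    for c in citations:
        blocks[block_key(c)].append(c)
    return blocks

refs = [
    {"first_author": "Kowalski", "year": 2013, "title": "Paper A"},
    {"first_author": "Kowalski", "year": 2013, "title": "Paper A (preprint)"},
    {"first_author": "Smith", "year": 2010, "title": "Paper B"},
]
for key, group in make_blocks(refs).items():
    print(key, [r["title"] for r in group])
```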
In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. As a first source of labels, Wikipedia is employed. A second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documen...
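As a rough illustration of the two label sources, the toy sketch below tags a document both with candidate phrases that appear in a (stand-in) set of Wikipedia titles and with the document's own most frequent phrases; the bigram-based candidate extraction and the tiny WIKIPEDIA_TITLES set are simplifying assumptions, as the paper works with proper noun phrases and the full Wikipedia title inventory:

```python
import re
from collections import Counter

# Toy stand-in for the set of Wikipedia article titles used as labels.
WIKIPEDIA_TITLES = {"machine learning", "citation analysis", "digital library"}

def candidate_phrases(text):
    """Crude candidate extraction: lowercase word bigrams. The paper
    uses proper noun-phrase extraction; this is only an illustration."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(pair) for pair in zip(words, words[1:])]

def tag(abstract):
    """Tag a document with (a) candidates that match Wikipedia titles
    and (b) the most frequent candidate phrases themselves."""
    candidates = Counter(candidate_phrases(abstract))
    wiki_labels = [c for c in candidates if c in WIKIPEDIA_TITLES]
    corpus_labels = [c for c, _ in candidates.most_common(3)]
    return wiki_labels, corpus_labels

print(tag("Machine learning methods for citation analysis in a digital library."))
```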
CERMINE is a comprehensive open source system for extracting metadata and parsed bibliographic references from scientific articles in born-digital form. The system is based on a modular workflow, whose architecture allows for single step training and evaluation, enables effortless modifications and replacements of individual components and simplifi...
Conducting research in an efficient, repeatable, evaluable, but also convenient (in terms of development) way has always been a challenge. To satisfy those requirements in the long term and simultaneously minimize the costs of the software engineering process, one has to follow a certain set of guidelines. This article describes such guidelines based o...
Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization and citation matching of scientific publications. The size of the input data places these algorithms in the range of big...
Our goal is to investigate the influence of the geometry and topology of the domain $\Omega$ on the solutions of the phase transition and other diffusion-driven phenomena in $\Omega$, modeled e.g. by the Allen-Cahn, Cahn-Hilliard, and reaction-diffusion equations. We present FEM numerical schemes for the Allen-Cahn and Cahn-Hilliard equations based...
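For orientation, a standard statement of the Allen-Cahn problem and the weak form on which FEM schemes are typically built; the scaling below is the textbook one and may differ from the paper's:

```latex
% Allen-Cahn equation with homogeneous Neumann boundary conditions
% (textbook scaling; the paper's formulation may differ):
\begin{align}
  u_t &= \varepsilon^2 \Delta u + u - u^3 && \text{in } \Omega,\\
  \partial_\nu u &= 0 && \text{on } \partial\Omega.
\end{align}
% Weak form underlying an FEM discretization: find u(t) \in H^1(\Omega)
% such that for all test functions v \in H^1(\Omega)
\begin{equation}
  \int_\Omega u_t \, v \, dx
  = -\varepsilon^2 \int_\Omega \nabla u \cdot \nabla v \, dx
  + \int_\Omega (u - u^3)\, v \, dx.
\end{equation}
```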
During the process of citation matching, links from bibliography entries to referenced publications are created. Such links are indicators of topical similarity between linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matc...
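A toy sketch of the core linking step, assuming references are compared as normalized strings via character-trigram similarity; the representation and the 0.5 threshold are illustrative assumptions, not the paper's method:

```python
def trigrams(s: str) -> set:
    """Character trigrams of a lowercased, padded string."""
    s = "  " + s.lower() + "  "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a: str, b: str) -> float:
    """Trigram-set overlap between two reference strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def match_citation(entry: str, records: list, threshold: float = 0.5):
    """Link a raw bibliography entry to the best-scoring record,
    or to nothing if no record clears the threshold."""
    best = max(records, key=lambda r: jaccard(entry, r))
    return best if jaccard(entry, best) >= threshold else None

records = ["Kowalski J., A framework for content analysis, 2013"]
print(match_citation("J. Kowalski, A framework for content analysis (2013)", records))
```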
The SYNAT platform, powered by the YADDA2 architecture, has been extended with the Author Disambiguation Framework and the Query Framework. The former framework clusters occurrences of contributor names into identities of authors; the latter answers queries about authors and the documents written by them. This paper presents an outline of the disambiguation...
Bibliographic references between scholarly publications contain valuable information for researchers and developers involved with digital repositories. They indicate topical similarity between linked texts and the impact of the referenced document, and they improve navigation in user interfaces of digital libraries. Consequently, several approaches to...
Mathematical publications are often labelled with Mathematical Subject Classification codes. These codes are grouped in a tree-like hierarchy created by experts. In this paper we posit that this hierarchy is highly correlated with the content of publications. Following this assumption, we try to reconstruct the MSC tree based on our publications corpor...
At CeON ICM UW we possess a large collection of scholarly documents that we store and process using the MapReduce paradigm. One of the main challenges is to design a simple but effective data model that fits various data access patterns and allows us to perform diverse analyses efficiently. In this paper, we will describe the organization...
One of the common problems when dealing with digital libraries is the lack of classification codes in some of the documents. In the following publication we deal with this problem in the multi-label, hierarchical case of the Mathematics Subject Classification System. We develop modifications of the ML-KNN algorithm and show how they improve results given by the...
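A minimal sketch of plain multi-label k-NN, to show the kind of algorithm being modified; ML-KNN proper replaces the simple vote threshold below with per-label Bayesian posteriors, and the paper's hierarchical modifications are not reproduced here:

```python
from collections import Counter

def knn_labels(query_vec, train, k=5, min_votes=2):
    """Assign every label that appears on at least `min_votes` of the
    k nearest training documents (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: dist(query_vec, item[0]))[:k]
    votes = Counter(lab for _, labels in nearest for lab in labels)
    return [lab for lab, v in votes.items() if v >= min_votes]

# Toy feature vectors labelled with MSC-style codes:
train = [
    ([1.0, 0.0], {"35K57"}),            # reaction-diffusion equations
    ([0.9, 0.1], {"35K57", "65M60"}),   # ... and finite element methods
    ([0.0, 1.0], {"68T50"}),            # natural language processing
]
print(knn_labels([0.95, 0.05], train, k=2, min_votes=1))
```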
ThePrimate® is an HPW [High Power Web] platform which allows dynamic classification of digital resources from European databases (including those coming from Europeana), the easy retrieval of data, reuse of items and methods based on enhanced tools, and a deeper layer of attached information. The platform is strictly connected to Filamento®, an origina...
OpenAIRE and OpenAIREplus (Open Access Infrastructure for Research in Europe) are EC-funded projects (Dec 2009 - May 2014) whose goals are to realize, enhance, and operate the Open Access European scholarly communication data infrastructure. This paper describes the high-level architecture and functionalities of that infrastructure, including servi...
YADDA2 is an open software platform which facilitates the creation of digital library applications. It consists of versatile building blocks providing, among other things: storage, relational and full-text indexing, process management, and asynchronous communication. Its loosely-coupled service-oriented architecture enables deployment of highly-scalable, dis...
Author name disambiguation makes it possible to distinguish between two or more authors sharing the same name. In a previous paper, we proposed a name disambiguation framework in which, for each author name in each article, we build a context consisting of classification codes, bibliographic references, co-authors, etc. Then, by pairwise comparison of con...
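A toy sketch of one plausible pairwise context comparison, with contexts as dictionaries of feature sets and a weighted Jaccard overlap; the feature names and weights are illustrative assumptions, not the paper's actual similarity function:

```python
def similarity(ctx_a, ctx_b, weights=None):
    """Pairwise context similarity as a weighted sum of per-feature
    Jaccard overlaps; features missing from both contexts contribute 0."""
    weights = weights or {"coauthors": 0.5, "codes": 0.3, "references": 0.2}
    score = 0.0
    for feature, w in weights.items():
        a, b = ctx_a.get(feature, set()), ctx_b.get(feature, set())
        if a or b:
            score += w * len(a & b) / len(a | b)
    return score

ctx1 = {"coauthors": {"J. Kowalski"}, "codes": {"68T50"}, "references": {"doi:10.1000/x"}}
ctx2 = {"coauthors": {"J. Kowalski"}, "codes": {"68P20"}, "references": {"doi:10.1000/x"}}
print(similarity(ctx1, ctx2))  # 0.5*1.0 + 0.3*0.0 + 0.2*1.0 = 0.7
```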
We present a comprehensive system for extracting metadata from scholarly articles. In our approach the entire document is inspected, including headers and footers of all the pages as well as bibliographic references. The system is based on a modular workflow which allows for evaluation, unit testing and replacement of individual components. The wor...
In this paper we propose a flexible, modular framework for author name disambiguation. Our solution consists of a core, which orchestrates the disambiguation process, and replaceable modules performing concrete tasks. The approach is suitable for distributed computing; in particular, it maps well to the MapReduce framework. We describe each compone...
In this work-in-progress report we propose a workflow for metadata extraction from articles in digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and outline future work.
We ask the question of patterns' stability for the reaction-diffusion equation with Neumann boundary conditions in an irregular domain in $\mathbb{R}^N$, $N \geq 2$, the model example being two convex regions connected by a small 'hole' in their boundaries. By patterns we mean solutions having an interface, i.e. a transition layer between two constants. It is wel...
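For orientation, the problem class in standard notation (the paper's exact scaling may differ):

```latex
% Reaction-diffusion equation with homogeneous Neumann boundary
% conditions on an irregular domain \Omega \subset \mathbb{R}^N:
\begin{align}
  u_t &= \varepsilon^2 \Delta u + f(u) && \text{in } \Omega,\\
  \partial_\nu u &= 0 && \text{on } \partial\Omega,
\end{align}
% with a bistable nonlinearity such as f(u) = u - u^3, whose stable
% zeros u = \pm 1 are the two constant states between which a
% pattern's interface forms.
```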
We have investigated the problem of clustering documents according to their semantics, given incomplete and incoherent hints reflecting the documents’ affinities. The problem has been rigorously defined using graph theory in set-theoretic notation. We have proved the problem to be NP-hard, and proposed five heuristic algorithms which deal with the...
The interlanguage links in Wikipedia connect pages on the same subject written in different languages. In theory, each connected component should be a clique and cover one topic. However, incoherent edits and obvious mistakes result in topic coalescence, yielding a non-trivial topology that is studied in this paper. We show that the component size...
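The underlying computation can be sketched in a few lines: treat interlanguage links as undirected edges and extract connected components, which in a perfectly coherent link graph would each be a clique covering one topic. The 'lang:Title' node encoding below is an illustrative assumption:

```python
from collections import defaultdict

def connected_components(edges):
    """Connected components of the undirected interlanguage link graph,
    found by depth-first search from each unvisited page."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(graph[n] - comp)
        seen |= comp
        components.append(comp)
    return components

links = [("en:Cat", "pl:Kot"), ("pl:Kot", "de:Katze"), ("en:Dog", "pl:Pies")]
print(connected_components(links))
```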
The YADDA framework facilitates information exchange between digital document repositories. YADDAWeb, its web-based interface, provides browse and search functionalities. Content providers use the DeskLight application to add or modify metadata and content. Internally, YADDA contains flexible repository aggregation mechanisms, multiple hierarchy support an...
In this paper we propose two new metrics on the space of phylogenetic trees. The problem of determining how distant two trees are from each other is crucial because many different methods have been proposed for reconstructing phylogenetic trees from data. These techniques (in fact often heuristics) applied to the same data result in different tree...
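For contrast with the proposed metrics, here is a sketch of the classical Robinson-Foulds distance, the usual baseline for comparing phylogenetic trees; the nested-tuple tree encoding is an illustrative assumption, and the paper's two new metrics are not reproduced here:

```python
def rf_distance(t1, t2, leaf_set):
    """Robinson-Foulds distance: the number of non-trivial leaf
    bipartitions (splits) present in one tree but not the other."""
    def splits(tree):
        found = set()
        def rec(t):
            if isinstance(t, str):          # a leaf
                return frozenset([t])
            below = frozenset().union(*map(rec, t))
            if 1 < len(below) < len(leaf_set):
                found.add(frozenset([below, leaf_set - below]))
            return below
        rec(tree)
        return found
    return len(splits(t1) ^ splits(t2))

leaf_set = frozenset("ABCDE")
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2, leaf_set))  # 2: the AB|CDE and AC|BDE splits differ
```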
We have implemented a parallel version of the Position-Specific Iterated Smith-Waterman algorithm using UPC. The core Smith-Waterman implementation is vectorized for the Cray X1E, but it is possible to use an FPGA core instead. The quality of results and performance are discussed.
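For reference, a plain (serial, non-position-specific) Smith-Waterman scoring kernel; the scoring parameters are illustrative defaults, and the UPC parallelization and Cray vectorization from the paper are not attempted here:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b, computed
    with the standard dynamic-programming recurrence (scores clamped
    at zero so alignments can restart anywhere)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))
```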