Jakub Piskorski

Polish Academy of Sciences | PAN

About

116
Publications
15,930
Reads
1,897
Citations
Citations since 2017
15 Research Items
623 Citations

Publications (116)
Conference Paper
Full-text available
NLP-based methods have the potential to automate the monitoring of climate-change denial narratives in online sources, for which there is a growing need. Here, we report on preliminary experiments on exploiting Data Augmentation techniques for improving climate change denial classification. We focus on a selection of both known techniqu...
Preprint
Full-text available
This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of work...
Article
We describe a simple IR approach for linking news about events, detected by an event extraction system, to messages from Twitter (tweets). In particular, we explore several methods for creating event-specific queries for Twitter and provide a quantitative and qualitative evaluation of the relevance and usefulness of the information obtained from th...
Conference Paper
Full-text available
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language...
Article
Full-text available
Over the last two decades, various security authorities around the world have acknowledged the importance of exploiting the ever-growing amount of information published on the web on various types of events for early detection of certain threats, situation monitoring and risk analysis. Since the information related to a particular real-world event might...
Conference Paper
In this paper, we present a number of experiments on the construction of fine-grained and out-of-context multi-word entity classification models. These models exploit a large BabelNet-derived multilingual Named Entity corpus of 49 languages from 7 different scripts, which is also presented in this work. In particular, we compare SVM-based character...
Conference Paper
Full-text available
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams pa...
Conference Paper
Nowadays, an ever-growing amount of information is being transferred through web-based social media. In particular, Twitter has emerged as an important social medium providing the most up-to-date information and comments on current events and topics of any kind. This has led to a continuous growth of interest among various security-related organizations in...
Book
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges du...
Article
This chapter presents a number of techniques for multilingual event extraction, whose main task is to accurately and efficiently detect key information about security-related events from electronic news media and summarize it in the form of database-like structures. Gathering such information over time is an important task for developing global news...
Book
Full-text available
Information extraction (IE) and text summarization (TS) are powerful technologies for finding relevant pieces of information in text and presenting them to the user in condensed form. The ongoing information explosion makes IE and TS critical for successful functioning within the information society. These technologies face particular challenges d...
Chapter
In this chapter we present a brief overview of Information Extraction, which is an area of natural language processing that deals with finding factual information in free text. In formal terms, facts are structured objects, such as database records. Such a record may capture a real-world entity with its attributes mentioned in text, or a real-world...
Conference Paper
Full-text available
This paper explores a linguistically-oriented method for assigning fine-grained geotagging information to border security-related events reported in online news. It is based on a Cognitive Linguistics model of the semantics of the Locative Prepositional Phrases (LPP) containing place names in news text. We first propose a corpus annotation standard...
Conference Paper
This paper describes a staged effort to address the issue of Border Control information sharing across EU Member States. Specifically, the first step of the adopted solution is described; it consists of the deployment of a distributed network managing a commonly agreed set of messages. The paper is accompanied by a demo of the Information Sharing A...
Conference Paper
We describe a simple IR approach for linking news about events, detected by an event extraction system, to messages from Twitter (tweets). In particular, we explore several methods for creating event-specific queries for Twitter and provide a quantitative and qualitative evaluation of the relevance and usefulness of the information obtained from the...
Article
We present the named entity annotation subtask of a project aiming at creating the National Corpus of Polish. We summarize the annotation requirements defined for this corpus, and we discuss how existing lexical resources and grammars for named entity recognition for Polish have been adapted to meet those requirements. We show detailed results of t...
Conference Paper
Nowadays, many influential security-related facts are reported multiple times by different sources and in different languages. Therefore, in recent years, research on advancing event extraction technology has shifted from classical single-document extraction toward cross-document information aggregation and fact validation. However, relatively...
Chapter
This article presents a real-time and multilingual news event extraction system developed at the Joint Research Centre of the European Commission. It is capable of accurately and efficiently extracting violent and natural disaster events from online news. In particular, a linguistically relatively lightweight approach is deployed, in which clustere...
Conference Paper
Full-text available
An ever-growing amount of information relevant for early detection of certain threats can be extracted from on-line news. This has led to the emergence of news mining tools that help analysts digest the overflow of information and extract valuable knowledge from on-line news sources. This paper gives an overview of the fully operational Real-time Ne...
Chapter
This chapter gives an overview of tools developed for Frontex, the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union, to facilitate the process of extracting structured information on events related to border security from on-line news articles, with a particular focus o...
Conference Paper
Full-text available
Nowadays, many influential facts are reported multiple times by different sources and in different languages. This paper presents the results of an experiment on deploying cross-lingual information fusion techniques for refining the results of a large-scale multilingual news event extraction system. An evaluation on a test corpus consisting of 618...
Conference Paper
This presentation gives an overview of an effort to construct OSINT (Open-Source Intelligence) tools for Frontex, the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union, to facilitate automating the process of extracting structured knowledge from on-line news articles on...
Conference Paper
This paper presents an endeavor aiming at construction of a real-time event extraction system for border security-related intelligence gathering from online news. First, the background and motivation behind the presented work is given. Next, the paper describes the event extraction processing chain, the specifics of the domain, i.e., illegal migrat...
Conference Paper
Full-text available
We present initial results in the named entity annotation subtask of a project aiming at creating the National Corpus of Polish. We summarize the annotation requirements defined for this corpus, and we discuss how existing lexical resources and grammars for Polish named entities have been adapted to meet those requirements. We show first results of t...
Conference Paper
Full-text available
We describe a methodology for building event extraction systems. The approach is based on multilingual domain-specific grammars and exploits weakly supervised machine learning algorithms for lexical acquisition. We report on the process of adapting an already existing event extraction system for the domain of conflicts and crises to the Portuguese...
Article
Summary form only given. This talk gives an overview of an effort to deploy news event extraction technology for border security intelligence gathering and real-time situation monitoring for Frontex, the European Agency for the Management of Operational Cooperation at the External Borders of the Member States of the European Union. In particular...
Conference Paper
This paper gives an overview of an ongoing effort to construct tools for automating the process of extracting structured information about border-security-related events from on-line news. The paper describes our overall approach to the problem, the system architecture and event information access and moderation. Keywords: event extraction from on-l...
Article
Full-text available
We describe a multilingual methodology for adapting an event extraction system to new languages. The methodology is based on highly multilingual domain-specific grammars and exploits weakly supervised machine learning algorithms for lexical acquisition. We adapted an already existing event extraction system for the domain of conflicts and crises to...
Article
Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was on English and a few other major languages. This article reports on knowledge-poor methods for tacklin...
Book
These proceedings contain the final versions of the papers presented at the 7th International Workshop on Finite-State Methods and Natural Language Processing, FSMNLP 2008. The workshop was held in Ispra, Italy, on September 11–12, 2008. The event was the seventh instance in the series of FSMNLP workshops, and the third that was arranged as a stand...
Conference Paper
In the era of proliferation of electronic news media and an ever-growing demand for prompt and concise information, natural language text processing technologies which map free texts into structured data format are becoming paramount. Recently, we have witnessed an emergence of publicly accessible news aggregation systems for facilitating navigatio...
Conference Paper
Full-text available
This paper presents a real-time and multilingual news event extraction system developed at the Joint Research Centre of the European Commission. It is capable of accurately and efficiently extracting violent and natural disaster events from online news. In particular, a linguistically relatively lightweight approach is deployed, in which clustered...
Conference Paper
Full-text available
This paper presents a real-time news event extraction system developed by the Joint Research Centre of the European Commission. It is capable of accurately and efficiently extracting violent and disaster events from online news without using much linguistic sophistication. In particular, in our linguistically relatively lightweight approach to even...
Chapter
This chapter presents on-going efforts at the Joint-Research Center of the European Commission for automating event extraction from news articles collected through the Internet with the Europe Media Monitor system. Event extraction builds on techniques developed over several years in the fields of information extraction, whose basic goal is to deri...
Article
Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques for person name disambiguation in large web document collections has been reported, where the main focus was on English and a few other major languages. This paper reports on knowledge-poor methods for tackling...
Chapter
Full-text available
High ranking of a Web site in search engines can be directly correlated to high revenues. This amplifies the phenomenon of Web spamming which can be defined as preparing or manipulating any features of Web documents or hosts to mislead search engines' ranking algorithms to gain an undeservedly high position in search results. Web spam remarkably de...
Conference Paper
Full-text available
We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when comb...
Conference Paper
This paper reports on our experience of adapting a real-world live event extraction system based on a cascade of finite-state extraction grammars to the processing of a new language, namely Italian. The real-time event extraction processing chain and the pattern specification language are briefly presented. The major part of the paper focuses on th...
Conference Paper
Full-text available
This paper presents a fully operational real-time event extraction system which is capable of accurately and efficiently extracting violent and natural disaster events from a vast amount of online news articles per day in different languages. Due to the requirement that the system must be multilingual and easily extendable, it is based on a shall...
Conference Paper
This paper presents the results of recent experiments on the application of string distance metrics to the problem of named entity lemmatisation in Polish. It extends our work in [1] by introducing new results for organisation names. Furthermore, the results presented here and in [2,3] centering around the same topic were used to make a comparative...
Conference Paper
Full-text available
The emergence of information extraction (IE) oriented pattern engines has been observed during the last decade. Most of them rely heavily on finite-state devices. This paper introduces EXPRESS – a new extraction pattern engine, whose rules are regular expressions over flat feature structures. The underlying pattern language can be seen as a blend of...
Chapter
Finite-state automata are the state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers – a particular kind of dictionary. We replace common substructures in the automaton by unique copies. To find them, we treat a transition vector as a string, an...
Conference Paper
Full-text available
The paper presents two techniques for lemmatization of Polish person names. First, we apply a rule-based approach which relies on linguistic information and heuristics. Then, we investigate an alternative knowledge-poor method which employs string distance measures. We provide an evaluation of the adopted techniques using a set of newspaper texts.
Chapter
This chapter reports on the creation of a corpus of Polish free-text documents, tagged with name mentions of CIS-relevant entities, which constitutes a core resource for development and evaluation of information extraction components used within a cadastre framework. Unstructured information in the form of free text documents is not very useful for any...
Conference Paper
This paper describes ongoing work to construct a knowledge base of politically motivated violent events (PMVE) consisting of a domain ontology and of instance data accessible via a browser and visualization interface. The instance data is semi-automatically extracted from news articles gathered by the online news monitoring system of the European C...
Conference Paper
This paper presents NEXUS, an event extraction system developed at the Joint Research Center of the European Commission and used for populating violent incident knowledge bases. It automatically extracts security-related facts from on-line news articles. In particular, the paper focuses on a novel bootstrapping algorithm for weakly supervised acqu...
Conference Paper
String distance metrics have been widely used in various applications concerning processing of textual data. This paper reports on the exploration of their usability for tackling the reference matching task and for the automatic correction of misspelled search engine queries, in the context of highly inflective languages, in particular focusing on...
Article
Full-text available
The paper presents a collection of resources developed for Information Extraction (IE) from Polish texts. In particular, we mention two IE platforms adapted to Polish and several IE applications built on top of one of them: named entity recognition, creation of terminology lexicons, and data extraction from medical texts.
Article
Full-text available
This paper presents the results of numerous experiments on the usability of well-established string distance metrics, and some new variants thereof, for various name matching tasks in Polish.
Conference Paper
Full-text available
These days, we are witnessing a growing trend of exploiting lightweight linguistic analysis for converting the vast amount of raw textual data into structured knowledge. Although a considerable number of monolingual and task-oriented NLP systems have been presented, relatively few general-purpose architectures exist, e.g., GATE [1] or ELLOGON...
Article
Full-text available
This paper reports on an endeavour to create basic linguistic resources for geo-referencing of Polish free-text documents. We have defined a fine-grained named entity hierarchy, produced an exhaustive gazetteer, and developed named-entity grammars for Polish. Additionally, an annotated corpus for the cadastral domain was prepared for evaluation p...
Conference Paper
Full-text available
Development of modern Cadastral Information Systems (CIS) requires deployment of tools for automatic estimation of real estates' value which is influenced by a number of factors. After differentiation of the factors, appropriate information on certain locations needs to be acquired. Since most up-to-date information is transmitted mainly as free-tex...
Conference Paper
Full-text available
This paper compares two storage models for gazetteers, namely the standard one based on numbered indexing automata associated with an auxiliary storage device, and a pure finite-state model, the latter being superior in terms of space and time complexity.
Conference Paper
This paper describes compact storage models for gazetteers using state-of-the-art finite-state technology. In particular, we compare the standard method based on numbered indexing automata associated with an auxiliary storage device, against a pure finite-state representation, the latter being superior in terms of space and time complexity, when ap...
Article
This article reports on some experiments on automatic classification of Polish newspaper articles. In particular, we explore two alternative approaches, one based on the deployment of linguistic features and a second involving purely language-independent character-level n-gram modelling. Extensive evaluation results are presented. Interestingly, both the...
Conference Paper
Full-text available
Automatic content extraction from unrestricted textual data constitutes a core technology for semantic web services. Intelligent content extraction must furthermore address the peculiarities of the medium, i.e., must analyze natural language to a certain depth, in order to go beyond the realm of pure keyword-based approaches. This demo presents SProU...
Conference Paper
Full-text available
We present several extensions to the shallow text processor SProUT, viz., (1) a fast imperfect unifiability test, (2) a special form of sets plus a polymorphic lazy and destructive unification operation, (3) a cheap form of negation, (4) a weak unidirectional form of coreferences, (5) optional context-free stages in the regular shallow cascade, (6)...
Conference Paper
In this paper, we present an environment designed for extraction of medical data from mammogram reports. We process data collected from various Polish health care providers and transform them into attribute-value structures, according to a simplified mammographic ontology. We use a general purpose information extraction (IE) platform, SProUT, enrich...
Conference Paper
Although considerable work on named-entity recognition for English and a few other major languages exists, research on this topic with regard to Slavonic languages has been almost neglected. In this paper, we present an attempt towards constructing a named-entity recognition system for Polish on top of SProUT, a novel multi-lingual NLP platform, we di...
Conference Paper
Although considerable work on named-entity recognition for a few major languages exists, research on this topic in the context of Slavonic languages has been almost neglected. This paper presents a rule-based named-entity recognition system for Polish built on top of SProUT, a novel multi-lingual NLP platform. We pinpoint the encountered difficulties...
Conference Paper
Full-text available
The paper presents a method for intelligent automatic processing of medical reports. First, we extract single pieces of information using SProUT (a general-purpose Information Extraction platform), and then, externally merge the results in order to obtain a detailed formalised description of the reports.
Conference Paper
Full-text available
The aim of this article is to present the initial results of adapting SProUT, a multi-lingual Natural Language Processing platform developed at DFKI, Germany, to the processing of Polish. The article describes some of the problems posed by the integration of Morfeusz, an external morphological analyzer for Polish, and various solutions to the probl...
Article
Full-text available
In this paper, we present an environment designed for extraction of medical data from mammographic reports. We process data collected from various Polish health care providers and transform them into attribute-value structures, according to a simplified mammographic ontology. We use a general purpose information extraction (IE) platform, SProUT,...
Article
Full-text available
In this paper, we present a rule-based named-entity recognition system for Polish built on top of SProUT, a novel multi-lingual NLP platform. We pinpoint the encountered difficulties and present some evaluation results.
Chapter
Nowadays, knowledge relevant to business of any kind is mainly transmitted through free-text documents. Latest trends in information technology such as Information Extraction (IE) provide dramatic improvements in conversion of the overflow of raw textual information into valuable and structured data. This chapter gives a comprehensive introduction...
Conference Paper
Full-text available
We present an effort for the development of multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organi...
Chapter
The objective of this chapter is an investigation of the applicability of information extraction techniques in real-world business applications dealing with textual data since business relevant data is mainly transmitted through free-text documents. In particular, we give an overview of the informat