Maarten Marx’s research while affiliated with University of Amsterdam and other places

Publications (199)


Figure captions (from the article below): examples of border annotations in the VGG Image Annotator tool, with annotation lines tightly following redaction borders; examples of the redaction types in the dataset (the codes in the redactions are not type dependent; color redactions appear in different colors, gray redactions in different shades of gray); redaction of a signature; inline and multiline redactions identified by the Edact-Ray on scans method; border redactions incorrectly fused by the OCR+Morphology model but correctly separated by the Mask R-CNN model (green indicates correct, red false, and yellow missed predictions).

Redacted text detection using neural image segmentation methods
  • Article
  • Full-text available

January 2025

·

5 Reads

International Journal on Document Analysis and Recognition (IJDAR)

Ruben van Heusden

·

Kaj Meijer

·

Maarten Marx

The redaction of sensitive information in documents is common practice in specific types of organizations. This happens for example in court proceedings or in documents released under the Freedom of Information Act (FOIA). The ability to automatically detect when information has been redacted has several practical applications, such as gathering statistics on the amount of redaction present in documents, enabling a critical view of redaction practices. It can also be used to further investigate redactions and whether the techniques used provide sufficient anonymization. The task is particularly challenging because of the large variety of redaction methods and techniques, from software for automatic redaction to manual redaction by pen. Any detection system must be robust to a large variety of inputs, as it will be run on many documents that might not even contain redactions. In this study, we evaluate two neural methods for the task, namely a Mask R-CNN model and a Mask2Former model, and compare them to a rule-based model based on optical character recognition and morphological operations. The best performing model, Mask R-CNN, has a recall of .94 with a precision of .96 on a challenging dataset containing several redaction types. Adding many pages without redactions barely lowers this score (precision drops to .90, recall drops to .92). The Mask2Former model is most robust to inputs without redactions, producing the fewest false positives of all models.
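
The reported numbers come from instance-segmentation models applied to page images. As an illustration only, the following is a minimal inference sketch with an off-the-shelf torchvision Mask R-CNN, assuming a checkpoint fine-tuned for a single "redaction" class (the file name redaction_maskrcnn.pt is hypothetical); this is not the authors' released code or weights.

    import torch
    import torchvision
    from PIL import Image
    from torchvision.transforms.functional import to_tensor

    # Two classes: background + "redaction"; the weights file is a hypothetical
    # fine-tuned checkpoint, not something shipped with the paper.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
    model.load_state_dict(torch.load("redaction_maskrcnn.pt"))
    model.eval()

    page = to_tensor(Image.open("page.png").convert("RGB"))
    with torch.no_grad():
        pred = model([page])[0]           # dict with 'boxes', 'scores', 'masks'

    keep = pred["scores"] > 0.5           # confidence threshold
    masks = pred["masks"][keep, 0] > 0.5  # binarise the soft masks
    boxes = pred["boxes"][keep]
    print(f"{len(boxes)} predicted redactions on this page")

Region-level precision and recall, as reported above, then follow from matching such predicted regions against manually annotated ones.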




Bcubed revisited: elements like me

May 2024

·

30 Reads

Discover Computing

BCubed is a mathematically clean, elegant and intuitively well-behaved external performance metric for clustering tasks. BCubed compares a predicted clustering to a known ground truth clustering through element-wise precision and recall scores. For each element, the predicted and ground truth clusters containing the element are compared, and the mean over all elements is taken. We argue that BCubed overestimates performance, for the intuitive reason that the clustering gets credit for putting an element into its own cluster. This is repaired, and we investigate the repaired version, called "Elements Like Me (ELM)". We extensively evaluate ELM from both a theoretical and an empirical perspective, and conclude that it retains all of the positive properties of BCubed and yields the minimum score of zero when it should. Synthetic experiments show that ELM can produce different rankings of predicted clusterings than BCubed, and that the ELM scores are distributed with a lower mean and a larger variance than BCubed.
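
For concreteness, here is a small sketch of element-wise BCubed together with one natural reading of the repair (an element does not count as evidence for itself); the exact ELM definition and its properties are in the paper, so the exclude_self variant below is an illustration only.

    from statistics import mean

    def bcubed(pred, gold, exclude_self=False):
        """pred and gold map each element to a cluster id."""
        precisions, recalls = [], []
        for e in gold:
            pred_cluster = {x for x in pred if pred[x] == pred[e]}
            gold_cluster = {x for x in gold if gold[x] == gold[e]}
            if exclude_self:                  # illustrative "like me" repair
                pred_cluster.discard(e)
                gold_cluster.discard(e)
            agree = pred_cluster & gold_cluster
            precisions.append(len(agree) / len(pred_cluster) if pred_cluster else 0.0)
            recalls.append(len(agree) / len(gold_cluster) if gold_cluster else 0.0)
        return mean(precisions), mean(recalls)

    gold = {"a": 1, "b": 1, "c": 1}               # one gold cluster of three elements
    pred = {"a": 1, "b": 2, "c": 3}               # prediction puts every element alone
    print(bcubed(pred, gold))                     # (1.0, 0.33...): precision rewards singletons
    print(bcubed(pred, gold, exclude_self=True))  # (0.0, 0.0): repaired score is zero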


Detection of Redacted Text in Legal Documents

September 2023

·

27 Reads

·

2 Citations

Lecture Notes in Computer Science

We present a technique for automatically detecting redacted text in legal documents, using a combination of Optical Character Recognition (OCR) and morphological operations from the Computer Vision domain, allowing us to detect a wide variety of different types of redaction blocks with little to no training data. As this is a segmentation task, we evaluate our technique using the Panoptic Quality methodology, with the algorithm obtaining F1 scores of 0.79, 0.86 and 0.76 on black, colored and outlined redaction blocks respectively, and an F1 score of 0.62 for gray blocks. The total running time of the algorithm is two seconds on average, measured on a thousand pages from a government supplier, with roughly 98% of this time being used by Tesseract and the conversion from PDF to PNG, and 2% by the detection algorithm. Detecting text redaction at scale thus is feasible, allowing a more or less objective measurement of this practice. The redacted text detection code and the manually labelled dataset created for evaluation are released via GitHub.
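
A compressed sketch of the two-step idea (mask out OCRed text, then keep large dark connected components) is shown below, assuming pytesseract and OpenCV; the threshold, kernel size and minimum area are illustrative values rather than the paper's parameters, and this sketch handles only dark redaction bars, not all four block types.

    import cv2
    import pytesseract

    def candidate_redaction_boxes(page_png, min_area=2000):
        gray = cv2.cvtColor(cv2.imread(page_png), cv2.COLOR_BGR2GRAY)

        # Whiten every recognised word so ordinary print is not mistaken for ink.
        ocr = pytesseract.image_to_data(gray, output_type=pytesseract.Output.DICT)
        for x, y, w, h, txt in zip(ocr["left"], ocr["top"], ocr["width"],
                                   ocr["height"], ocr["text"]):
            if txt.strip():
                gray[y:y + h, x:x + w] = 255

        # Remaining dark pixels are potential redaction ink.
        _, mask = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY_INV)

        # Morphological closing fuses a redaction bar into one connected blob.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)

        n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        return [tuple(stats[i][:4]) for i in range(1, n)   # label 0 is the background
                if stats[i][cv2.CC_STAT_AREA] >= min_area]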


Making PDFs Accessible for Visually Impaired Users (and Findable for Everybody Else)

September 2023

·

16 Reads

·

2 Citations

Lecture Notes in Computer Science

We treat documents released under the Dutch Freedom of Information Act as FAIR scientific data and find that they are neither findable nor accessible, due to text malformations caused by redaction software. Our aim is to repair these documents. We propose a simple but strong heuristic for detecting wrongly OCRed text segments, and we then repair only these OCR mistakes by prompting a large language model. This makes the documents better findable through full-text search, but the repaired PDFs still do not adhere to accessibility standards. Converting them into HTML documents, keeping all essential layout and markup, makes them not only accessible to the visually impaired but also reduces their size by up to two orders of magnitude. The cost of repairing in this way is roughly one dollar for the 17K pages in our corpus, which is very little compared to the large gains in information quality.
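
The pipeline can be pictured roughly as below; the garble heuristic shown (ratio of plausible word tokens) and the prompt wording are illustrative placeholders, not the paper's heuristic or prompt, and llm stands for any text-completion callable.

    import re

    def looks_garbled(segment, threshold=0.5):
        """Crude stand-in for the detection heuristic: too few plausible word tokens."""
        tokens = re.findall(r"\w+", segment.lower())
        if not tokens:
            return False
        plausible = sum(t.isalpha() and len(t) > 1 for t in tokens)
        return plausible / len(tokens) < threshold

    def repair(segment, llm):
        if not looks_garbled(segment):
            return segment                # leave correctly OCRed text untouched
        prompt = ("The following Dutch text was damaged by OCR and redaction "
                  "software. Return only the corrected text:\n" + segment)
        return llm(prompt)                # one cheap LLM call per flagged segment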


Using Deep-Learned Vector Representations for Page Stream Segmentation by Agglomerative Clustering

May 2023

·

44 Reads

·

2 Citations

Algorithms

Page stream segmentation (PSS) is the task of retrieving the boundaries that separate source documents given a consecutive stream of documents (for example, sequentially scanned PDF files). The task has recently gained more interest as a result of the digitization efforts of various companies and organizations, as they move towards having all their documents available online for improved searchability and accessibility for users. The current state-of-the-art approach is neural start-of-document page classification on representations of the text and/or images of pages, using models such as Visual Geometry Group-16 (VGG-16) and BERT to classify individual pages. We instead view PSS as a clustering task, hypothesizing that pages from one document are similar to each other and different from pages in other documents, something that is difficult to incorporate in the current approaches. We compare the segmentation performance of an agglomerative clustering method with a binary classification model based on images on a new publicly available dataset, and experiment with using either pretrained or finetuned image vectors as inputs to the model. To adapt the clustering method to PSS, we propose the switch method to alleviate the effects of pages of the same class having a high similarity, and report an improvement in the scores using this method. Unfortunately, neither clustering with pretrained embeddings nor clustering with finetuned embeddings outperformed start-of-document page classification for PSS. However, clustering with either pretrained or finetuned representations is substantially more effective than the baseline, with finetuned embeddings outperforming pretrained embeddings. Finally, having the number of documents K as part of the input, in our use case a realistic assumption, has a surprisingly significant positive effect. In contrast to earlier papers, we evaluate PSS with the overlap-weighted partial match F1 score, developed as Panoptic Quality in the computer vision domain, a metric that is particularly well suited to PSS as it measures the quality of the document segmentation.
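
Read as a clustering problem with K known, PSS can be sketched as constrained agglomerative clustering of page embeddings in which only neighbouring pages may be merged, so every cluster is a contiguous run of pages. The contiguity constraint and the cosine/average-linkage choices below are assumptions of this sketch, not the paper's exact setup or its switch method.

    import numpy as np
    from scipy.sparse import diags
    from sklearn.cluster import AgglomerativeClustering

    def segment_stream(page_vectors, k):
        """page_vectors: (n_pages, dim) image or text embeddings; k: number of documents."""
        n = len(page_vectors)
        # Link each page only to its direct neighbours, so clusters stay contiguous.
        connectivity = diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1])
        labels = AgglomerativeClustering(
            n_clusters=k,
            connectivity=connectivity,
            linkage="average",
            metric="cosine",              # "affinity" in scikit-learn < 1.2
        ).fit_predict(page_vectors)
        # A document boundary sits wherever the cluster label changes.
        return np.flatnonzero(np.diff(labels)) + 1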


Enticing Local Governments to Produce FAIR Freedom of Information Act Dossiers

March 2023

·

4 Reads

Lecture Notes in Computer Science

Government transparency is central to a democratic society, and governments at all levels are increasingly required to publish records and data either proactively or upon so-called Freedom of Information (FIA) requests. However, public bodies that are required by law to publish many of their documents turn out to have great difficulty doing so, and what they do publish is often in a format that still breaches the requirements of the law, which stipulates principles comparable to the FAIR data principles. Hence, this demo addresses a timely problem: the FAIR publication of FIA dossiers, which has been obligatory in The Netherlands since May 1st, 2022.

Keywords: IR data collection · FAIR data · Government records · Transparency


Neural Coreference Resolution for Dutch Parliamentary Documents with the DutchParliament Dataset

February 2023

·

44 Reads

·

2 Citations

Data

The task of coreference resolution concerns the clustering of words and phrases referring to the same entity in text, either in the same document or across multiple documents. The task is challenging, as it involves elements of named entity recognition and reading comprehension, among others. In this paper, we introduce DutchParliament, a new Dutch coreference resolution dataset obtained through the manual annotation of 74 government debates, expanded with a domain-specific class. In contrast to existing datasets, which are often composed of news articles, blogs or other documents, the debates in DutchParliament are transcriptions of speech, and therefore offer a unique structure and way of referencing compared to other datasets. By constructing and releasing this dataset, we hope to facilitate research on coreference resolution in niche domains with different characteristics than traditional datasets. The DutchParliament dataset was compared to SoNaR-1 and RiddleCoref, two existing Dutch coreference resolution corpora, to highlight its particularities and differences from existing datasets. Furthermore, two coreference resolution models for Dutch, the rule-based DutchCoref model and the neural e2eDutch model, were evaluated on DutchParliament to examine their performance. The characteristics of the DutchParliament dataset turn out to be quite different from those of the other two datasets, although the performance of the e2eDutch model does not seem to be significantly affected by this. Finally, experiments were conducted that utilize the metadata present in the DutchParliament corpus to improve the performance of the e2eDutch model. The results indicate that adding available metadata about speakers has a beneficial effect on the performance of the model, although adding the gender of speakers seems to have a limited effect.


The most general manner to injectively align true and predicted segments

December 2022

·

8 Reads

Kirillov et al. (2019) develop a metric, called Panoptic Quality (PQ), to evaluate image segmentation methods. The metric is based on a confusion table and compares a predicted to a ground truth segmentation. The only non-straightforward part in this comparison is to align the segments in the two segmentations. A metric only works well if that alignment is a partial bijection. Kirillov et al. (2019) list 3 desirable properties for a definition of alignment: it should be simple, interpretable and effectively computable. There are many definitions guaranteeing a partial bijection and these 3 properties. We present the weakest: one that is both sufficient and necessary to guarantee that the alignment is a partial bijection. This new condition is effectively computable and natural. It simply says that the number of correctly predicted elements (in image segmentation, the pixels) should be larger than the number of missed elements, and larger than the number of spurious elements. This is strictly weaker than the proposal in Kirillov et al. (2019). In formulas, instead of |TP| > |FN| + |FP|, the weaker condition requires that |TP| > |FN| and |TP| > |FP|. We evaluate the new alignment condition theoretically and empirically.
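
The two conditions are easy to check directly; the sketch below, with made-up per-pair counts, shows a segment pair that the weaker condition aligns but that the stronger |TP| > |FN| + |FP| rule (equivalent to IoU > 0.5) rejects.

    def stronger_aligned(tp, fn, fp):
        # Kirillov et al. (2019): |TP| > |FN| + |FP|, i.e. IoU > 0.5
        return tp > fn + fp

    def weakest_aligned(tp, fn, fp):
        # The condition proposed here: |TP| > |FN| and |TP| > |FP|
        return tp > fn and tp > fp

    # Illustrative counts for one predicted/ground-truth segment pair.
    tp, fn, fp = 10, 6, 6
    assert weakest_aligned(tp, fn, fp) and not stronger_aligned(tp, fn, fp)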


Citations (75)


... Additionally, the Dutch government's habit of merging documents into large, undifferentiated PDFs necessitated Page Stream Segmentation techniques to delineate original document boundaries [15]. Several procedures were implemented to address metadata scarcity, extensive document classification, and information extraction [14,16,17]. ...

Reference:

Enhancing Access Across Europe for Documents Published According to Freedom of Information Act: Applying Woogle Design and Technique to Estonian Public Information Act Document
Detection of Redacted Text in Legal Documents
  • Citing Chapter
  • September 2023

Lecture Notes in Computer Science

... Better and more advanced built-in support to enhance accessibility makes it easier for content creators to produce accessible materials efficiently. AI can, for example, convert inaccessible PDF documents into text that complies with accessibility requirements [8]. ...

Making PDFs Accessible for Visually Impaired Users (and Findable for Everybody Else)
  • Citing Chapter
  • September 2023

Lecture Notes in Computer Science

... For example, Yao et al. (2018) used the combined LSTM model to predict the geomagnetic index, and Chen et al. (2019) proposed the Deep Convolutional Generative Adversarial Network (DCGAN) for the repair of line-TEC images. The clustering algorithm based on deep learning has become a hot spot for unsupervised data clustering, referred to as deep clustering (Guo et al., 2017), and has been used in burn mapping (Radman et al., 2023), network topology relationships (Ni and Jiang, 2023) and PSS (Busch et al., 2023). In this paper, we distinguish the mixed echo data from the aspect of echo clustering and classification, and ignore the limitation of external conditions from the structural characteristics of the internal data; it is the first attempt to apply the deep learning clustering algorithm to SuperDARN target echo clustering. ...

Using Deep-Learned Vector Representations for Page Stream Segmentation by Agglomerative Clustering

Algorithms

... Mention detection is often defined as a sequence labeling or classification problem and clustering predictions are performed by assigning scores to mention pairs that indicate whether the mentions are coreferent or not. Deep learningbased methods eliminate the need of feature engineering by learning required features and underlying relations between entity mentions directly from the text [14,15]. With recent advancements in neural architectures (e.g., transformers), the widespread availability of large-scale language models, and the representational power of word embeddings, these approaches significantly improve upon previous state-of-the-art results in coreference resolution. ...

Neural Coreference Resolution for Dutch Parliamentary Documents with the DutchParliament Dataset

Data

... Recall is then calculated as the ratio of shared edges to the size of the reference edge set, and precision is the ratio of shared edges to the size of the evaluated edge set. As with B-cubed metrics, this can be compared to a dummy annotator that considers that no character rhymes with any other: its score against Baxter would be 0.85. Identifying that B-cubed metrics tend to produce artificially high scores, Van Heusden et al. (2022) propose to only include clusters of more than one element in the computation. This does indeed produce lower scores, but the problem of comparability of the results remains. ...

BCubed Revisited: Elements Like Me
  • Citing Conference Paper
  • August 2022

... The tact maxim is based on minimizing self-benefit and maximizing the benefit of others in conversation. The generosity maxim expects participants in the conversation to show respect to others (Ardiati, 2023; Haristiani et al., 2023; Marsili, 2021; Schueler & Marx, 2023). The approbation maxim emphasizes that participants should […] The exploration of politeness maxims has been extensively conducted in various studies. ...

Speech acts in the Dutch COVID-19 Press Conferences

Language Resources and Evaluation

... An example of a large-scale project in political communication is the ParlaMint corpora of parliamentary proceedings (Erjavec et al., 2023). While they differ from sports commentators' commentaries in that parliamentary data have typically no copyright or personal data issues (Erjavec & Pančur, 2019), the ParlaMint corpora still represent a relevant reference point for comparison. ...

The ParlaMint corpora of parliamentary proceedings

Language Resources and Evaluation

... Since token-level evaluation can be misleadingly high for the intended task, as missing tokens could result in significant misinterpretation, it is essential to accurately capture entire PICO entities. We computed the macro-average precision, recall, and F1 score using seqeval [52], a well-tested tool often deployed in numerous NLP studies for system evaluation [53]. The 95% confidence interval of the performance was estimated based on the bootstrapped test samples. ...

The Automatic Detection of Dataset Names in Scientific Articles

Data
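
Following up on the seqeval mention in the citing snippet above: a tiny usage sketch of entity-level scoring with seqeval on toy BIO tags (the "Dataset" label and the sequences are made up), showing how a truncated entity counts as fully wrong rather than partially right.

    from seqeval.metrics import f1_score, precision_score, recall_score

    y_true = [["O", "B-Dataset", "I-Dataset", "O"],
              ["B-Dataset", "O", "O"]]
    y_pred = [["O", "B-Dataset", "O", "O"],       # entity truncated -> counted as wrong
              ["B-Dataset", "O", "O"]]

    print(precision_score(y_true, y_pred, average="macro"),
          recall_score(y_true, y_pred, average="macro"),
          f1_score(y_true, y_pred, average="macro"))   # 0.5 0.5 0.5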

... Traditionally, potentially effective features are extracted and naively combined together to create a feature list, with most studies exploring new features but joining all into a single list [21][22][23]. Little attention is given to how these features should be grouped, such as constructing multiple feature lists instead of relying on a single, consolidated list. ...

Learning to rank for multi-label text classification: Combining different sources of information

Natural Language Engineering

... LTR was proposed in the context of ad hoc information retrieval in which the goal is to create a ranking model that ranks documents with respect to queries. The LTR approach has been used for constructing a ranking model to rank classes with respect to a given document and select the most probable classes for the document as its labels (Yang and Gopal 2012;Ju, Moschitti, and Johansson 2013;Fauzan and Khodra 2014;Azarbonyad and Marx 2019). Yang and Gopal (2012) mapped MLTC to the ad hoc retrieval problem and used LTR for learning a ranking model. ...

How Many Labels? Determining the Number of Labels in Multi-Label Text Classification
  • Citing Chapter
  • August 2019

Lecture Notes in Computer Science