... At the same time, using sparse images and focusing on common urban objects, which typically have a plain and standardized appearance, makes establishing object matches across the input images unreliable. Given the success of Graph Neural Networks (GNNs) in addressing geometrical reasoning problems [17,22,49], we frame MfM as a graph problem, assigning a node to each detection and attempting to regress its location in the global map. In this formulation, we use same-map edges to force the network to preserve the topology of each local map, and same-class edges to account for all possible detection matches, instead of explicitly matching the input detections. ...
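To make the formulation above concrete, here is a minimal sketch of such a graph construction (the function and field names are hypothetical, not the authors' code): every detection becomes a node, same-map edges link detections from the same image so its local topology can be preserved, and same-class edges link detections of the same semantic class across images as candidate matches.

```python
# Hypothetical sketch of the graph construction described above; data layout is an assumption.
from itertools import combinations

def build_mfm_graph(detections):
    """detections: list of dicts with keys 'image_id', 'class_id', 'xy_local'."""
    nodes = list(range(len(detections)))
    same_map_edges, same_class_edges = [], []
    for i, j in combinations(nodes, 2):
        di, dj = detections[i], detections[j]
        if di["image_id"] == dj["image_id"]:
            same_map_edges.append((i, j))      # preserve local-map topology
        elif di["class_id"] == dj["class_id"]:
            same_class_edges.append((i, j))    # candidate cross-image match
    return nodes, same_map_edges, same_class_edges

# Example: two local maps, each observing a traffic light, one also observing a bench.
dets = [
    {"image_id": 0, "class_id": "traffic_light", "xy_local": (1.0, 3.0)},
    {"image_id": 0, "class_id": "bench",         "xy_local": (-2.0, 5.0)},
    {"image_id": 1, "class_id": "traffic_light", "xy_local": (0.5, 4.0)},
]
print(build_mfm_graph(dets))
```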
World-wide detailed 2D maps require enormous collective efforts. OpenStreetMap is the result of 11 million registered users manually annotating the GPS location of over 1.75 billion entries, including distinctive landmarks and common urban objects. At the same time, manual annotations can include errors and are slow to update, limiting the map's accuracy. Maps from Motion (MfM) is a step towards automating this time-consuming map-making procedure by computing 2D maps of semantic objects directly from a collection of uncalibrated multi-view images. From each image, we extract a set of object detections and estimate their spatial arrangement in a top-down local map centered in the reference frame of the camera that captured the image. Aligning these local maps is not a trivial problem, since they provide incomplete, noisy fragments of the scene, and matching detections across them is unreliable because of repeated patterns and the limited appearance variability of urban objects. We address this with a novel graph-based framework that encodes the spatial and semantic distribution of the objects detected in each image, and learns how to combine them to predict the objects' poses in a global reference system, while taking into account all possible detection matches and preserving the topology observed in each image. Despite the complexity of the problem, our best model achieves global 2D registration with an average accuracy within 4 meters (i.e., below GPS accuracy) even on sparse sequences with strong viewpoint change, on which COLMAP has an 80% failure rate. We provide extensive evaluation on synthetic and real-world data, showing how the method obtains a solution even in scenarios where standard optimization techniques fail.
The success of the Vision Transformer (ViT) in various computer vision tasks has promoted the ever-increasing prevalence of this convolution-free network. The fact that ViT works on image patches makes it potentially relevant to the problem of jigsaw puzzle solving, a classical self-supervised task that aims to reorder shuffled sequential image patches back to their original form. Solving jigsaw puzzles has been shown to be helpful for diverse tasks using Convolutional Neural Networks (CNNs), such as feature representation learning, domain generalization, and fine-grained classification. In this paper, we explore solving the jigsaw puzzle as a self-supervised auxiliary loss in ViT for image classification, named Jigsaw-ViT. We show two modifications that make Jigsaw-ViT superior to standard ViT: discarding positional embeddings and masking patches randomly. Despite its simplicity, we find that the proposed Jigsaw-ViT improves both generalization and robustness over the standard ViT, which usually involves a trade-off. Numerical experiments verify that adding the jigsaw puzzle branch provides better generalization on large-scale image classification on ImageNet. Moreover, the auxiliary loss also improves robustness against noisy labels on Animal-10N, Food-101N, and Clothing1M, as well as against adversarial examples. Our implementation is available at https://yingyichen-cyy.github.io/Jigsaw-ViT.
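As a rough illustration of the auxiliary branch described above, the following PyTorch sketch (module names, sizes, and the masking ratio are assumptions, not the authors' implementation) shuffles patch tokens, drops a random subset, omits positional embeddings, and predicts each remaining token's original patch index with a cross-entropy loss that would be added to the classification loss.

```python
import torch
import torch.nn.functional as F

def jigsaw_loss(patch_tokens, position_head, mask_ratio=0.5):
    """patch_tokens: (B, N, D) embeddings taken *before* positional embeddings are added."""
    B, N, D = patch_tokens.shape
    perm = torch.stack([torch.randperm(N) for _ in range(B)])                 # (B, N)
    shuffled = torch.gather(patch_tokens, 1, perm.unsqueeze(-1).expand(-1, -1, D))
    keep = int(N * (1.0 - mask_ratio))
    shuffled, target = shuffled[:, :keep], perm[:, :keep]                     # keep a random subset
    logits = position_head(shuffled)                                          # (B, keep, N)
    return F.cross_entropy(logits.reshape(-1, N), target.reshape(-1))

# Usage sketch: total_loss = cls_loss + eta * jigsaw_loss(tokens, head)
head = torch.nn.Linear(192, 196)          # predicts one of N = 196 patch positions
tokens = torch.randn(4, 196, 192)
print(jigsaw_loss(tokens, head).item())
```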
The reconstruction of shredded documents consists of coherently arranging fragments of paper (shreds) to recover the original document(s). A great challenge in computational reconstruction is to properly evaluate the compatibility between the shreds. While traditional pixel-based approaches are not robust to real shredding, more sophisticated solutions significantly compromise time performance. The solution presented in this work extends our previous deep learning method for single-page reconstruction to a more realistic and complex scenario: the reconstruction of several mixed shredded documents at once. In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem. The model is trained in a self-supervised manner on samples extracted from simulated-shredded documents, which obviates manual annotation. Experimental results on three datasets, including a new collection of 100 strip-shredded documents produced for this work, have shown that the proposed method outperforms competing ones in complex scenarios, achieving accuracy above 90%.
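A hedged sketch of how such self-supervised training pairs could be generated from simulated shredding is shown below (strip width and layout are assumptions): adjacent strips of a page form "valid" pairs and non-adjacent strips form "invalid" ones, giving labels for the two-class compatibility model without any manual annotation.

```python
import numpy as np

def make_pairs(page, strip_width=64):
    """page: (H, W, 3) array of a simulated-shredded document."""
    strips = [page[:, i:i + strip_width]
              for i in range(0, page.shape[1] - strip_width + 1, strip_width)]
    pairs, labels = [], []
    for i in range(len(strips)):
        for j in range(len(strips)):
            if i == j:
                continue
            pairs.append(np.concatenate([strips[i], strips[j]], axis=1))
            labels.append(1 if j == i + 1 else 0)   # 1 = truly adjacent (valid)
    return pairs, labels

page = np.random.rand(256, 64 * 5, 3)
pairs, labels = make_pairs(page)
print(len(pairs), sum(labels))   # 20 candidate pairs, 4 of them valid
```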
We present an attention-based ranking framework for learning to order sentences given a paragraph. Our framework is built on a bidirectional sentence encoder and a self-attention-based transformer network to obtain an input-order-invariant representation of paragraphs. Moreover, it allows seamless training using a variety of ranking-based loss functions, such as pointwise, pairwise, and listwise ranking. We apply our framework to two tasks: Sentence Ordering and Order Discrimination. Our framework outperforms various state-of-the-art methods on these tasks on a variety of evaluation metrics. We also show that it achieves better results when using pairwise and listwise ranking losses, rather than the pointwise ranking loss, which suggests that incorporating relative positions of two or more sentences in the loss function contributes to better learning.
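For instance, a pairwise ranking loss of the kind mentioned above can be sketched as follows (a generic margin formulation, not necessarily the paper's exact loss): the model scores each sentence, and every ordered pair is penalized unless the sentence that should come first scores higher by a margin.

```python
import torch

def pairwise_order_loss(scores, gold_positions, margin=1.0):
    """scores: (N,) predicted scores; gold_positions: (N,) true positions (0 = first)."""
    loss, pairs = scores.new_zeros(()), 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if gold_positions[i] < gold_positions[j]:      # sentence i should come before j
                loss = loss + torch.clamp(margin - (scores[i] - scores[j]), min=0.0)
                pairs += 1
    return loss / max(pairs, 1)

scores = torch.tensor([0.2, 1.5, -0.3], requires_grad=True)
gold = torch.tensor([1, 0, 2])            # true order: sentence 1, then 0, then 2
print(pairwise_order_loss(scores, gold))
```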
This paper proposes a series of new approaches to improve the Generative Adversarial Network (GAN) for conditional image synthesis; we name the proposed model "ArtGAN". One of the key innovations of ArtGAN is that the gradient of the loss function w.r.t. the label (randomly assigned to each generated image) is back-propagated from the categorical discriminator to the generator. With the feedback from the label information, the generator is able to learn more efficiently and generate images of better quality. Inspired by recent works, an autoencoder is incorporated into the categorical discriminator for additional complementary information. Last but not least, we introduce a novel strategy to improve the image quality. In the experiments, we evaluate ArtGAN on CIFAR-10 and STL-10 via ablation studies. The empirical results show that our proposed model outperforms state-of-the-art results on CIFAR-10 in terms of Inception score. Qualitatively, we demonstrate that ArtGAN is able to generate plausible-looking images on Oxford-102 and CUB-200, as well as to draw realistic artworks based on style, artist, and genre. The source code and models are available at: https://github.com/cs-chan/ArtGAN.
Point clouds are an important type of geometric data structure. Due to their irregular format, most researchers transform such data into regular 3D voxel grids or collections of images. This, however, renders the data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par with or even better than the state of the art. Theoretically, we provide an analysis towards understanding what the network has learnt and why it is robust with respect to input perturbation and corruption.
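The permutation invariance mentioned above can be illustrated with a minimal sketch (layer sizes are illustrative, not the PointNet architecture): a shared MLP is applied to every point independently and a symmetric max-pooling over the set yields a global feature that is unchanged when the points are reordered.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, feat_dim))

    def forward(self, points):                 # points: (B, N, 3)
        per_point = self.mlp(points)           # same weights applied to every point
        return per_point.max(dim=1).values     # symmetric pooling over the set

enc = TinyPointEncoder()
pts = torch.randn(2, 1024, 3)
shuffled = pts[:, torch.randperm(1024)]
print(torch.allclose(enc(pts), enc(shuffled)))   # True: point order does not matter
```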
In the square jigsaw puzzle problem one is required to reconstruct the complete image from a set of non-overlapping, unordered, square puzzle parts. Here we propose a fully automatic solver for this problem, where unlike some previous work, it assumes no clues regarding parts' location and requires no prior knowledge about the original image or its simplified (e.g., lower resolution) versions. To do so, we introduce a greedy solver which combines both informed piece placement and rearrangement of puzzle segments to find the final solution. Among our other contributions are new compatibility metrics which better predict the chances of two given parts to be neighbors, and a novel estimation measure which evaluates the quality of puzzle solutions without the need for ground-truth information. Incorporating these contributions, our approach facilitates solutions that surpass state-of-the-art solvers on puzzles of size larger than ever attempted before.
Individual differences in spatial skill emerge prior to kindergarten entry. However, little is known about the early experiences that may contribute to these differences. The current study examined the relation between children's early puzzle play and their spatial skill. Children and parents (n = 53) were observed at home for 90 min every 4 months (6 times) between 2 and 4 years of age (26 to 46 months). When children were 4 years 6 months old, they completed a spatial task involving mental transformations of 2-dimensional shapes. Children who were observed playing with puzzles performed better on this task than those who did not, controlling for parent education, income, and overall parent word types. Moreover, among those children who played with puzzles, frequency of puzzle play predicted performance on the spatial transformation task. Although the frequency of puzzle play did not differ for boys and girls, the quality of puzzle play (a composite of puzzle difficulty, parent engagement, and parent spatial language) was higher for boys than for girls. In addition, variation in puzzle play quality predicted performance on the spatial transformation task for girls but not for boys. Implications of these findings as well as future directions for research on the role of puzzle play in the development of spatial skill are discussed.
In this paper, we propose a method to both extract and classify symbols in floorplan images. This method relies on recent developments in Graph Neural Networks (GNNs). In the proposed approach, floorplan images are first converted into Region Adjacency Graphs (RAGs). In order to achieve both classification and extraction, two different GNNs are used. The first aims at classifying each node of the graph, while the second targets the extraction of clusters corresponding to symbols. In both cases, the model is able to take edge features into account. Each model is first evaluated independently, and then both tasks are combined, which speeds up obtaining the results.
Graph neural networks (GNNs) and the label propagation algorithm (LPA) are both message passing algorithms, which have achieved superior performance in semi-supervised classification. A GNN performs feature propagation through a neural network to make predictions, while LPA propagates labels across the graph adjacency matrix to obtain results. However, there is still no effective way to directly combine these two kinds of algorithms. To address this issue, we propose a novel Unified Message Passing Model (UniMP) that can incorporate feature and label propagation at both training and inference time. First, UniMP adopts a Graph Transformer network, taking feature embeddings and label embeddings as input information for propagation. Second, to train the network without overfitting to self-loop input label information, UniMP introduces a masked label prediction strategy, in which some percentage of the input label information is masked at random and then predicted. UniMP conceptually unifies feature propagation and label propagation and is empirically powerful. It obtains new state-of-the-art semi-supervised classification results on the Open Graph Benchmark (OGB).
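The masked label prediction strategy can be sketched as follows (function and variable names are placeholders, not the UniMP code): during each training step, a random fraction of the observed labels is withheld from the input and used only as prediction targets, while the remaining labels are fed to the model as additional input features.

```python
import torch

def split_labels_for_training(train_idx, mask_rate=0.3):
    """train_idx: indices of labeled nodes. Returns (input_idx, target_idx)."""
    perm = train_idx[torch.randperm(len(train_idx))]
    n_masked = int(len(perm) * mask_rate)
    target_idx = perm[:n_masked]     # labels the model must predict
    input_idx = perm[n_masked:]      # labels fed to the model as input features
    return input_idx, target_idx

train_idx = torch.arange(100)
inp, tgt = split_labels_for_training(train_idx)
print(len(inp), len(tgt))            # e.g. 70 30
```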
This paper focuses on the re-assembly of an archaeological artifact, given images of its fragments. This problem can be considered as a special challenging case of puzzle solving. The restricted case of re-assembly of a natural image from square pieces has been investigated extensively and was shown to be a difficult problem in its own right. Likewise, the case of matching “clean” 2D polygons/splines based solely on their geometric properties has been studied. But what if these ideal conditions do not hold? This is the problem addressed in the paper. Three unique characteristics of archaeological fragments make puzzle solving extremely difficult: (1) The fragments are of general shape; (2) They are abraded, especially at the boundaries (where the strongest cues for matching should exist); and (3) The domain of valid transformations between the pieces is continuous. The key contribution of this paper is a fully-automatic and general algorithm that addresses puzzle solving in this intriguing domain. We show that our approach manages to correctly reassemble dozens of broken artifacts and frescoes.
Restoring artworks that are seriously damaged or completely destroyed is a challenging task. In particular, the reconstruction of frescoes has to deal with problems such as very small fragments, irregular shapes, and missing pieces. Several attempts have been made to develop new techniques for helping restorers in the matching process, from traditional image processing methods to more recent deep learning approaches. However, as often happens in the Cultural Heritage field, the availability of labeled data to test new strategies is limited, and publicly available datasets contain only a few samples. For this reason, in this paper we introduce DAFNE, a large dataset that includes hundreds of thousands of images of fresco fragments artificially generated to guarantee high variability in terms of shapes and dimensions. The fragments have been obtained from 62 images of famous frescoes by various artists and from different historical periods, in order to cover different artistic styles, subjects, and colors.
Modeling discourse coherence is an important problem in natural language generation and understanding. Sentence ordering, the goal of which is to organize a set of sentences into a coherent text, is a commonly used task to learn and evaluate such models. In this paper, we propose a novel hierarchical attention network that captures word clues and dependencies between sentences to address this problem. Our model outperforms prior methods and achieves state-of-the-art performance on several datasets in different domains. Furthermore, our experiments demonstrate that the model performs very well even when noisy sentences are added to the set, which shows its robustness and effectiveness. Visualization analysis and a case study show that our model captures the structure and patterns of coherent texts not only through simple word clues but also through the consecution of sentences in context.
Representations of sets are challenging to learn because operations on sets should be permutation-invariant. To this end, we propose a Permutation-Optimisation module that learns how to permute a set end-to-end. The permuted set can be further processed to learn a permutation-invariant representation of that set, avoiding a bottleneck in traditional set models. We demonstrate our model's ability to learn permutations and set representations with either explicit or implicit supervision on four datasets, on which we achieve state-of-the-art results: number sorting, image mosaics, classification from image mosaics, and visual question answering.
Modeling the structure of coherent texts is a task of great importance in NLP. The task of organizing a given set of sentences into a coherent order has been commonly used to build and evaluate models that understand such structure. In this work we propose an end-to-end neural approach based on the recently proposed set-to-sequence mapping framework to address the sentence ordering problem. Our model achieves state-of-the-art performance in the order discrimination task on two datasets widely used in the literature. We also consider a new and interesting task of ordering abstracts from conference papers and research proposals, and demonstrate strong performance against recent methods. Visualizing the sentence representations learned by the model shows that the model has captured high-level logical structure in these paragraphs. The model also learns rich semantic sentence representations by learning to order texts, performing comparably to recent unsupervised representation learning methods in the sentence similarity and paraphrase detection tasks.
This paper introduces new types of square-piece jigsaw puzzles: those for which the orientation of each jigsaw piece is unknown. We propose a tree-based reassembly that greedily merges components while respecting the geometric constraints of the puzzle problem. The algorithm has state-of-the-art performance for puzzle assembly, whether or not the orientation of the pieces is known. Our algorithm makes fewer assumptions than past work, and success is shown even when pieces from multiple puzzles are mixed together. For solving puzzles where jigsaw piece location is known but orientation is unknown, we propose a pairwise MRF where each node represents a jigsaw piece's orientation. Other contributions of the paper include an improved measure (MGC) for quantifying the compatibility of potential jigsaw piece matches based on expecting smoothness in gradient distributions across boundaries.
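A hedged sketch in the spirit of such a gradient-based compatibility measure is given below (the regularization and exact normalization are assumptions, not necessarily the MGC definition): gradients near one piece's boundary define a distribution, and the gradient that a candidate neighbor would create across the seam is scored by its Mahalanobis distance under that distribution.

```python
import numpy as np

def left_right_cost(A, B, eps=1e-6):
    """A, B: (H, W, 3) float arrays; cost of placing B immediately to the right of A."""
    within = A[:, -1, :] - A[:, -2, :]                 # gradients inside A, near its right edge
    mu = within.mean(axis=0)
    cov = np.cov(within, rowvar=False) + eps * np.eye(3)
    cov_inv = np.linalg.inv(cov)
    across = B[:, 0, :] - A[:, -1, :]                  # gradients across the A|B seam
    diff = across - mu
    return float(np.einsum("ij,jk,ik->", diff, cov_inv, diff))

A = np.random.rand(28, 28, 3)
B = np.random.rand(28, 28, 3)
print(left_right_cost(A, B), left_right_cost(B, A))    # asymmetric by design
```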
Nonograms, also known as Japanese puzzles, are a specific type of logic drawing puzzles. The challenge is to fill a grid with black and white pixels in such a way that a given description for each row and column, indicating the lengths of consecutive segments of black pixels, is adhered to. Although the Nonograms in puzzle books can usually be solved by hand, the general problem of solving Nonograms is NP-hard. In this paper, we propose a reasoning framework that can be used to determine the value of certain pixels in the puzzle, given a partial filling. Constraints obtained from relaxations of the Nonogram problem are combined into a 2-Satisfiability (2-SAT) problem, which is used to deduce pixel values in the Nonogram solution. By iterating this procedure, starting from an empty grid, it is often possible to solve the puzzle completely. All the computations involved in the solution process can be performed in polynomial time. Our experimental results demonstrate that the approach is capable of solving a variety of Nonograms that cannot be solved by simple logic reasoning within individual rows and columns, without resorting to branching operations. In addition, we present statistical results on the solvability of Nonograms, obtained by applying our method to a large number of Nonograms.
Comparing scene, pattern or object models to structures in images or determining the correspondence between two point sets are examples of attributed graph matching. In this paper we show how such problems can be posed as one of inference over hidden Markov random fields. We review some well known inference methods studied over past decades and show how the Junction Tree framework from Graphical Models leads to algorithms that outperform traditional relaxation-based ones.
We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
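The core diagonal update behind such adaptive subgradient methods can be written in a few lines (the step size and epsilon below are illustrative): each coordinate accumulates its squared gradients, so rarely updated coordinates keep a comparatively larger effective step size.

```python
import numpy as np

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    accum += grad ** 2                          # per-coordinate history of squared gradients
    x -= lr * grad / (np.sqrt(accum) + eps)     # coordinate-wise adaptive step
    return x, accum

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x, accum = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    x, accum = adagrad_step(x, x.copy(), accum)
print(x)   # moves toward the minimum at the origin
```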
We explore the problem of reconstructing an image from a bag of square, non-overlapping image patches, the jigsaw puzzle problem. Completing jigsaw puzzles is challenging and requires expertise even for humans, and is known to be NP-complete. We depart from previous methods that treat the problem as a constraint satisfaction problem and develop a graphical model to solve it. Each patch location is a node and each patch is a label at nodes in the graph. A graphical model requires a pairwise compatibility term, which measures an affinity between two neighboring patches, and a local evidence term, which we lack. This paper discusses ways to obtain these terms for the jigsaw puzzle problem. We evaluate several patch compatibility metrics, including the natural image statistics measure, and experimentally show that the dissimilarity-based compatibility - measuring the sum-of-squared color difference along the abutting boundary - gives the best results. We compare two forms of local evidence for the graphical model: a sparse-and-accurate evidence and a dense-and-noisy evidence. We show that the sparse-and-accurate evidence, fixing as few as 4 - 6 patches at their correct locations, is enough to reconstruct images consisting of over 400 patches. To the best of our knowledge, this is the largest puzzle solved in the literature. We also show that one can coarsely estimate the low resolution image from a bag of patches, suggesting that a bag of image patches encodes some geometric information about the original image.
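The dissimilarity-based compatibility described above reduces to a very short computation; a numpy sketch is shown below (array shapes are assumptions): the sum of squared color differences along the abutting boundary when one patch is placed directly to the right of another.

```python
import numpy as np

def ssd_right_compatibility(A, B):
    """A, B: (H, W, 3) float arrays; lower values mean a more plausible seam."""
    boundary_diff = A[:, -1, :] - B[:, 0, :]       # A's last column vs. B's first column
    return float(np.sum(boundary_diff ** 2))

A = np.random.rand(32, 32, 3)
B = np.random.rand(32, 32, 3)
print(ssd_right_compatibility(A, B))
```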
Tan, M., Le, Q.V.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 6105-6114 (2019)
Zheng, H., et al.: Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders (2022)
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 6840-6851 (2020)
Paschalidou, D., et al.: ATISS: Autoregressive transformers for indoor scene synthesis. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)