Serge Belongie’s scientific contributions


Publications (8)


The iMet Collection 2019 Challenge Dataset
  • Preprint

June 2019 · 37 Reads

Chenyang Zhang · Christine Kaeser-Chen · Grace Vesom · [...] · Serge Belongie

Existing computer vision technologies in artwork recognition focus mainly on instance retrieval or coarse-grained attribute classification. In this work, we present a novel dataset for fine-grained artwork attribute recognition. The images in the dataset are professional photographs of classic artworks from the Metropolitan Museum of Art, and annotations are curated and verified by world-class museum experts. In addition, we present the iMet Collection 2019 Challenge as part of the FGVC6 workshop. Through the competition, we aim to spur the enthusiasm of the fine-grained visual recognition research community and advance the state of the art in digital curation of museum collections.
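
Fine-grained attribute recognition over a collection like this is commonly posed as multi-label classification, since a single artwork carries several culture and tag attributes at once. The sketch below is our own illustrative framing, not the challenge baseline; the backbone choice and the 1,103-attribute vocabulary size (per the 2019 label set) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_ATTRIBUTES = 1103  # iMet 2019 attribute vocabulary size (assumed here)

# Multi-label head: one independent binary (sigmoid) decision per attribute.
backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_ATTRIBUTES)
criterion = nn.BCEWithLogitsLoss()

images = torch.randn(4, 3, 224, 224)      # dummy batch
targets = torch.zeros(4, NUM_ATTRIBUTES)  # multi-hot attribute labels
loss = criterion(backbone(images), targets)
```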


Figure 4: Samples of annotated images in DOTA. Three samples are shown per category, except six for large-vehicle.
Figure 5: Statistics of instances in DOTA. AR denotes aspect ratio. (a) AR of horizontal bounding boxes. (b) AR of oriented bounding boxes. (c) Histogram of the number of annotated instances per image.
Figure 6: Visualization of test results on DOTA using a well-trained Faster R-CNN. Top and bottom rows show HBB and OBB results, respectively, in cases of orientation, large aspect ratio, and density.
DOTA: A Large-scale Dataset for Object Detection in Aerial Images
  • Preprint
  • File available

November 2017 · 2,162 Reads · 1 Citation

Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to carry over to aerial imagery, not only because of the huge variation in the scale, orientation, and shape of object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect 2,806 aerial images from different sensors and platforms. Each image is about 4000-by-4000 pixels in size and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using 15 common object categories. The fully annotated DOTA images contain 188,282 instances, each labeled with an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA well represents real Earth Vision applications and is quite challenging. Data are available at http://captain.whu.edu.cn/DOTAweb or https://captain-whu.github.io/DOTA/ .
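
The 8-d.o.f. quadrilateral annotation subsumes the usual axis-aligned box: the horizontal bounding box (HBB) used in one of the paper's evaluation tasks can be derived by taking coordinate-wise extrema. A minimal sketch of that reduction (our own helper, not the dataset's tooling):

```python
def quad_to_hbb(quad):
    """Reduce an 8-d.o.f. quadrilateral (x1, y1, ..., x4, y4) to the
    axis-aligned horizontal bounding box (xmin, ymin, xmax, ymax)."""
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs), max(ys)

print(quad_to_hbb([10, 5, 40, 8, 38, 30, 8, 27]))  # -> (8, 5, 40, 30)
```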


Fig. 3: Summary of PR curves of the top-10 submissions. Each curve represents a team. Best viewed in color.
Fig. 4: Example detections from the submissions. Green polygons are correct detections; yellow ones are false detections.
Fig. 5: Examples of recognition. Red characters are incorrectly recognized.
ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW-17)

August 2017 · 1,103 Reads · 17 Citations

Chinese is the most widely used language in the world. Algorithms that read Chinese text in natural images facilitate applications of various kinds. Despite the large potential value, past datasets and competitions have focused primarily on English, which bears very different characteristics from Chinese. This report introduces RCTW, a new competition focused on Chinese text reading. The competition features a large-scale dataset with over 12,000 annotated images. Two tasks, namely text localization and end-to-end recognition, are set up. The competition took place from January 20 to May 31, 2017, and received 23 valid submissions from 19 teams. This report includes the dataset description, task definitions, evaluation protocols, and result summaries and analysis. Through this competition, we call for more future research on the Chinese text reading problem.



Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization

March 2017 · 460 Reads · 3,047 Citations

Gatys et al. recently introduced a neural algorithm that renders a content image in the style of another image, achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which limits its practical application. Fast approximations with feed-forward neural networks have been proposed to speed up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed set of styles and cannot adapt to arbitrary new styles. In this paper, we present a simple yet effective approach that for the first time enables arbitrary style transfer in real-time. At the heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the content features with those of the style features. Our method achieves speed comparable to the fastest existing approach, without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as content-style trade-off, style interpolation, color & spatial controls, all using a single feed-forward neural network.
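
The abstract fully specifies the AdaIN operation: normalize the content features per channel, then rescale and shift them with the style features' statistics. A minimal PyTorch sketch, assuming (N, C, H, W) feature maps; the epsilon handling and use of standard deviation are our own choices rather than the authors' released code:

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: align the per-channel mean and
    variance of the content features with those of the style features.
    Both inputs are feature maps of shape (N, C, H, W)."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    # Normalize content statistics away, then re-style with style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```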


Figure 1. Illustration of SegLink. The first row shows an image with two words of different scales and orientations. (a) Segments (yellow oriented boxes) detected on the image. (b) Links (green lines) detected between pairs of adjacent segments. (c) Segments connected by links are combined into whole words. (d-f) The SegLink strategy is able to detect text in arbitrary orientation and long lines of text in non-Latin scripts. 
Table 1. Results on ICDAR 2015 Incidental Text
Table 2. Results on MSRA-TD500
Detecting Oriented Text in Natural Images by Linking Segments

March 2017 · 1,351 Reads · 437 Citations

Most state-of-the-art text detection methods are specific to horizontal text in Latin scripts and are not fast enough for real-time applications. We introduce Segment Linking (SegLink), an oriented text detection method. The main idea is to decompose text into two locally detectable elements, namely segments and links. A segment is an oriented bounding box that covers a part of a word or text line; a link connects two adjacent segments, indicating that they belong to the same word or line. Both elements are detected densely at multiple scales by an end-to-end trained, fully convolutional neural network. Final detections are combinations of segments connected by links. Compared with previous methods, our method improves along the dimensions of accuracy, speed, and ease of training. It achieves an f-measure of 75.0% on the standard ICDAR 2015 Incidental (Challenge 4) benchmark, outperforming the previous best by a large margin. It runs at over 20 FPS on 512x512 input images. In addition, our method is able to detect non-Latin text in long lines.
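
The final combination step is essentially connected components over a graph whose nodes are segments and whose edges are predicted links. A minimal sketch with union-find, under simplified assumptions (integer segment ids, links as index pairs); the paper's implementation additionally fuses each group's oriented boxes into a single word or line geometry:

```python
def find(parent, i):
    """Find the root of segment i, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def group_segments(num_segments, links):
    """links: iterable of (i, j) index pairs joining adjacent segments.
    Returns lists of segment ids, one list per connected word/line."""
    parent = list(range(num_segments))
    for i, j in links:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[ri] = rj
    groups = {}
    for i in range(num_segments):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

print(group_segments(5, [(0, 1), (1, 2), (3, 4)]))  # -> [[0, 1, 2], [3, 4]]
```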


Residual Networks Behave Like Ensembles of Relatively Shallow Networks

May 2016 · 2,085 Reads · 1,118 Citations

Advances in Neural Information Processing Systems

In this work, we introduce a novel interpretation of residual networks showing they are exponential ensembles. This observation is supported by a large-scale lesion study that demonstrates they behave just like ensembles at test time. Subsequently, we perform an analysis showing these ensembles mostly consist of networks that are each relatively shallow. For example, contrary to our expectations, most of the gradient in a residual network with 110 layers comes from an ensemble of very short networks, i.e., only 10-34 layers deep. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble. Ultimately, residual networks do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network - rather, they avoid the problem simply by ensembling many short networks together. This insight reveals that depth is still an open research question and invites the exploration of the related notion of multiplicity.
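
The multiplicity claim follows from unrolling the residual recurrence: with n blocks of the form y = x + f(x), expanding the sum yields 2^n distinct paths, and the number of residual modules a path traverses is binomially distributed. A toy calculation (ours, not the paper's lesion study) reproduces the headline numbers:

```python
# Unrolling n residual blocks yields 2**n paths; the number of modules a
# path traverses is Binomial(n, 1/2), so most paths are far shorter than
# the full depth.
from math import comb

n = 54  # a 110-layer ResNet has 54 two-layer residual blocks
total_paths = 2 ** n
frac_short = sum(comb(n, k) for k in range(10, 35)) / total_paths
print(f"{total_paths:.2e} paths; {frac_short:.1%} traverse 10-34 modules")
```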


Analyzing sedentary behavior in life-logging images

January 2015 · 40 Reads · 7 Citations

We describe a study that aims to understand physical activity and sedentary behavior in free-living settings. We employed a wearable camera to record 3 to 5 days of imaging data from 40 participants, resulting in over 360,000 images. These images were then fully annotated by experienced staff following a rigorous coding protocol. We designed a deep-learning-based classifier by adapting a model originally trained on ImageNet [1], and added a spatio-temporal pyramid on top. Our results show that the proposed method outperforms state-of-the-art visual classification methods on our dataset. Across different individuals, our system achieves more than 90% average accuracy for frequent labels and more than 80% for rare labels.
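
The spatio-temporal pyramid is not detailed in this abstract; one plausible reading of its temporal half is pyramid-style pooling of per-image CNN features over nested sub-windows. The sketch below is purely illustrative, with assumed feature shapes and pooling choices:

```python
import numpy as np

def temporal_pyramid(features: np.ndarray, levels: int = 2) -> np.ndarray:
    """features: (T, D) array of per-image CNN features for a time window.
    Mean-pool over the full window, then over 2, 4, ... sub-windows,
    and concatenate the pooled vectors."""
    pooled = []
    for level in range(levels + 1):
        for chunk in np.array_split(features, 2 ** level, axis=0):
            pooled.append(chunk.mean(axis=0))
    return np.concatenate(pooled)

feats = np.random.rand(8, 4096)       # e.g., 8 frames of fc7 features
print(temporal_pyramid(feats).shape)  # (1 + 2 + 4) * 4096 = (28672,)
```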

Citations (7)


... In particular, Faster R-CNN (Ren et al., 2015) proposes the region proposal network (RPN) to localize candidate objects instead of traditional sliding-window search, and achieves state-of-the-art accuracy on several datasets. However, these existing state-of-the-art detectors cannot be directly applied to detect vehicles in aerial images, due to the different characteristics of ground-view and aerial-view images (Xia et al., 2017). The appearance of the vehicles is monotone, as shown in Figure 1. ...

Reference:

Vehicle Detection in Aerial Images
DOTA: A Large-scale Dataset for Object Detection in Aerial Images

... Subsequent advancements, such as TextBoxes [10] and its extension TextBoxes++ [11,12], adapted the convolutional neural network (CNN) architecture to better detect long text lines by changing the convolution kernel size. SegLink [13] improved upon this by predicting text candidate boxes and merging them through an area link algorithm, allowing for the detection of text lines at various angles. ...

Detecting Oriented Text in Natural Images by Linking Segments
  • Citing Conference Paper
  • July 2017

... Section 3.1 provides a comprehensive overview of Mini-InternVL. Then, Section 3.2 details InternViT-300M, a lightweight vision model developed through knowledge distillation, which inherits the strengths of a powerful vision encoder. Finally, Section 3.3 describes a transfer learning framework designed to enhance the model's adaptation to downstream tasks. [The excerpt interleaves a flattened table of training datasets, including OCR sources such as CTW [74], LSVT [76], RCTW-17 [78], ReCTs [79], and COCO-Text [83].] ...

ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW-17)

... Here, f (·) is the style transfer operation. Specifically, we use the real-time style transfer approach, AdaIN [69], pretrained using microsoft common objects in context (MS-COCO) [70] and WikiArt [71]. Note that none of these datasets overlap with the datasets we use for continual learning (see Section V-A for the details about our datasets). ...

Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization
  • Citing Article
  • March 2017

... More clearly, text regions always form a long rectangle. For the same rational, TextBoxes [12] used long convolutional filters and SegLink [27] applied segment and link concepts to detect and connect multiple boxes for text detection. These methods take the property that text is usually a long rectangular region into account for enhancing text detection performance. ...

Detecting Oriented Text in Natural Images by Linking Segments

... The layer skip attack is designed to bypass the layers where MLPs have been optimized for activation redirection. We hypothesize that this attack is effective due to the ensemble nature of transformer architectures (Veit et al., 2016; Chen et al., 2024), where the final prediction can theoretically be interpreted as an ensemble of diverse computational paths. Empirically, a number of recent works use layer skipping to accelerate inference speed (Chen et al., 2020; Fan et al., 2020; Elhoushi et al., 2024). ...

Residual Networks Behave Like Ensembles of Relatively Shallow Networks

Advances in Neural Information Processing Systems

... Table 4 indicates that we analyzed how often each type of data has been captured. For example, 56 out of 86 publications captured at least one kind of individual data (65%). Also, Table 4 shows the absolute and relative frequency of inferences for each construct domain in relation to the overall number of publications (e.g., 15%, 13 out of 86 publications, made an inference relevant to the task context). ...

Analyzing sedentary behavior in life-logging images
  • Citing Article
  • January 2015