Xianrui Liu’s research while affiliated with Shanghai University of Engineering Science and other places


Publications (3)


A cell can be viewed as a directed acyclic graph (DAG), where each intermediate node is a latent representation and each directed edge (i, j) represents the application of an operation o between two nodes; different edge colors denote different operations
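Under the DARTS-style formulation this caption alludes to, each edge's output during search is a softmax-weighted mixture of all candidate operations. A minimal sketch, with toy operations as illustrative stand-ins rather than the paper's implementation:

```python
import numpy as np

def mixed_op(x, alpha, ops):
    """DARTS-style mixed operation on one DAG edge: the edge's output
    is a softmax(alpha)-weighted sum of every candidate operation
    applied to the input node (ops here are toy stand-ins)."""
    w = np.exp(alpha - alpha.max())   # numerically stable softmax
    w = w / w.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops))
```

With equal architecture weights, the edge simply averages its candidate operations; as one weight grows, the edge's behavior approaches that single operation.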
The macro-structure of the network contains 8 cells, with reduction cells placed at the 3rd, 5th, and 7th positions
Triplet loss is defined over a 3-tuple (a, p, n), where a is an anchor sample, p is a positive sample with the same ID as a, and n is a negative sample. Before learning, the anchor a and positive p may lie far apart in the feature space, while the negative n may lie close to a. Training with the triplet loss reduces the distance between a and p and increases the distance between a and n, so the features of images with the same ID gradually form clusters in the feature space
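The pull/push behavior described above can be sketched as follows; the margin value of 0.3 is an illustrative choice, not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: pull the positive toward the anchor and
    push the negative at least `margin` farther away than the positive.
    The margin value here is an illustrative assumption."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is exactly the clustering behavior the caption describes.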
An instance of the architectures searched by the DARTS method in our experiments, which contains many skip-connection operations
The architecture parameters α are associated with the edges of the DAG. At each decision epoch, we determine the candidate pool of edges according to the topological order and choose one edge (i⁺, j⁺) from the candidates using the Decision Criterion. We then make a greedy optimal choice for that edge, fixing its operation by replacing the mixed operation ô^(i⁺,j⁺) with o^(i⁺,j⁺) = argmax_{o ∈ O} α_o^(i⁺,j⁺). After making the greedy decision, we prune the unchosen weights from w and remove α^(i⁺,j⁺) from α
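The decision step can be sketched as follows. The operation list and the simplified selection rule (picking the edge whose best operation scores highest, rather than the paper's full Decision Criterion of edge importance, certainty, and stability) are illustrative assumptions:

```python
import numpy as np

# Hypothetical candidate-operation set; the real search space may differ.
OPS = ["skip_connect", "sep_conv_3x3", "max_pool_3x3", "zero"]

def greedy_decide(alpha, candidates):
    """One greedy decision epoch: choose an edge from the candidate
    pool (here by its highest single operation score, a stand-in for
    the paper's Decision Criterion), fix its operation by argmax over
    alpha, and prune that edge's alpha from the search."""
    edge = max(candidates, key=lambda e: alpha[e].max())
    op = OPS[int(np.argmax(alpha[edge]))]
    del alpha[edge]  # remove alpha^{(i+, j+)} from alpha
    return edge, op
```

Each epoch thereby commits one edge permanently, shrinking both α and the weight set w as the search proceeds in topological order.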


TGAS-ReID: Efficient architecture search for person re-identification via greedy decisions with topological order

July 2022 · 43 Reads · 3 Citations · Applied Intelligence

Shengbo Chen · Kai Jiang · Xianrui Liu · [...]

Person Re-Identification (Re-ID) technology is developing rapidly due to the successful application of deep convolutional neural networks. However, the prevailing Re-ID models are usually built upon manually designed backbones. In this paper, we propose TGAS-ReID, which automatically designs convolutional network backbones for Re-ID to replace backbones originally designed for classification, such as ResNet and VGG. To search for a cell structure in Re-ID tasks, greedy decisions are made during the search instead of deriving the architecture only after comprehensive training. In other words, at each decision epoch, according to the topological order, we first determine the candidate pool of edges to progressively reduce the coupling of the internal nodes of the DAG. An edge is then selected based on edge importance, edge certainty, and selection stability. We then make a greedy optimal choice for the selected edge and prune the relevant parameters. To further improve the backbone's capability to represent features, we introduce the triplet loss with batch hard mining as the retrieval loss. Extensive experiments demonstrate that the searched backbone structure reaches a performance level close to previous work with a 20.8% shorter search time. The proposed method also prevents the final CNN from suffering the well-known performance collapse by avoiding aggregation of skip-connections.
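The triplet loss with batch hard mining mentioned in the abstract is commonly implemented as follows; this NumPy sketch and its margin value are illustrative assumptions, not the authors' code:

```python
import numpy as np

def batch_hard_triplet_loss(features, ids, margin=0.3):
    """Batch-hard mining: for each anchor in the batch, take its
    farthest same-ID sample as the positive and its nearest
    different-ID sample as the negative, then average the hinge
    losses (the margin value is an illustrative choice)."""
    n = len(features)
    # Pairwise Euclidean distance matrix via broadcasting.
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    same = ids[:, None] == ids[None, :]
    losses = []
    for i in range(n):
        pos = dists[i][same[i] & (np.arange(n) != i)]  # same ID, not self
        neg = dists[i][~same[i]]                       # different ID
        if len(pos) and len(neg):
            losses.append(max(0.0, pos.max() - neg.min() + margin))
    return float(np.mean(losses))
```

Mining the hardest pairs inside each batch gives a stronger training signal than sampling random triplets, which is why it is a popular retrieval loss for Re-ID backbones.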


A Sparse Transformer-Based Approach for Image Captioning

September 2020 · 739 Reads · 8 Citations · IEEE Access

Image captioning is the task of providing a natural language description for an image. It has attracted significant attention from both the computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performance. However, it is difficult to model knowledge of the relationships between input image region pairs in the encoder. Furthermore, words in the decoder have little knowledge of their correlation to specific image regions. In this paper, a novel deep encoder-decoder model is proposed for image captioning, developed on the sparse Transformer framework. The encoder adopts a multi-level representation of image features based on self-attention to exploit both low-level and high-level features; correlations between image region pairs are naturally modeled, since the self-attention operation can be seen as a way of encoding pairwise relationships. The decoder improves the concentration of multi-head self-attention on the global context by explicitly selecting the most relevant segments in each row of the attention matrix. This helps the model focus on the most contributing image regions and generate more accurate words in context. Experiments demonstrate that our model outperforms previous methods and achieves higher performance on the MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.
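The row-wise selection in the decoder can be illustrated with top-k sparse attention over a single score row. This NumPy sketch is an assumption about the general technique, not the released implementation:

```python
import numpy as np

def topk_row_attention(scores, k):
    """Sparse attention over one row of the attention score matrix:
    keep only the k highest-scoring segments, softmax over them, and
    zero out the rest (an illustrative version of explicit row-wise
    segment selection)."""
    out = np.zeros_like(scores, dtype=float)
    keep = np.argsort(scores)[-k:]             # indices of the top-k scores
    exp = np.exp(scores[keep] - scores[keep].max())
    out[keep] = exp / exp.sum()                # softmax over the kept scores
    return out
```

Zeroing the low-scoring entries concentrates the attention distribution on the few most relevant segments, which is the effect the abstract attributes to the sparse decoder.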


Video Synopsis Based on Attention Mechanism and Local Transparent Processing

May 2020 · 354 Reads · 10 Citations · IEEE Access

The increased number of video cameras has led to explosive growth in the amount of captured video, especially with millions of surveillance cameras operating 24 hours a day. Video browsing and retrieval are time consuming, and video synopsis is one of the most effective ways to browse and index such video, enabling the review of hours of footage in just minutes. However, generating a video synopsis that preserves the essential activities of the original video remains costly, labor-intensive, and time-intensive. This paper proposes an approach to generating video synopsis with complete foregrounds and clearer trajectories of moving objects. First, one-stage CNN-based object detection is employed for object extraction and classification. Then, attention-RetinaNet is integrated with a Local Transparency-Handling Collision (LTHC) algorithm, which optimizes trajectory combination and makes the trajectories of moving objects clearer. Finally, experiments show that the useful video information is fully retained in the resulting video: detection accuracy is improved by 4.87% and the compression ratio reaches 4.94, although the reduction in detection time is not obvious.

Citations (3)


... Their work emphasizes the importance of learning shared information between different modalities in existing representation learning methods, which primarily aim to improve feature extraction. Chen et al. [50] introduced a NAS approach for person reID task. This method involved searching for an optimal cell structure by making greedy decisions during the search process. ...

Reference:

MNASreID: grasshopper optimization based neural architecture search for motorcycle re-identification
TGAS-ReID: Efficient architecture search for person re-identification via greedy decisions with topological order

Applied Intelligence

... Image captioning is the process of characterizing visual content with words and phrases [1]. Much like other domains within Machine Learning (ML), the significant advancements in image captioning can be ascribed to recent developments in deep learning, especially graphical processing units, quicker hardware, more refined datasets and improved algorithms [2]. ...

A Sparse Transformer-Based Approach for Image Captioning

IEEE Access

... Offline algorithms begin by extracting all the tubes from the input video. Subsequently, adopting a global optimization strategy, they determine the start time of each tube [10], [13]- [19]. Pritch et al. [13] were among the first to devise a global energy function to compute the tubes' start times. ...

Video Synopsis Based on Attention Mechanism and Local Transparent Processing

IEEE Access