Article

Deep-learning-based recognition of symbols and texts at an industrially applicable level from images of high-density piping and instrumentation diagrams


Abstract

Piping and instrumentation diagrams (P&IDs) are commonly used in the process industry as a transfer medium for the fundamental design of a plant and for detailed design, purchasing, procurement, construction, and commissioning decisions. The present study proposes a method for symbol and text recognition for P&ID images using deep-learning technology. Our proposed method consists of P&ID image pre-processing, symbol and text recognition, and the storage of the recognition results. We consider the recognition of symbols of different sizes and shape complexities in high-density P&ID images in a manner that is applicable to the process industry. We also standardize the training dataset structure and symbol taxonomy to optimize the developed deep neural network. A training dataset is created based on diagrams provided by a local Korean company. After training the model with this dataset, a recognition test produced relatively good results, with a precision and recall of 0.9718 and 0.9827 for symbols and 0.9386 and 0.9175 for text, respectively.
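The precision and recall figures quoted above follow the standard detection-metric definitions; as a minimal sketch (the counts below are made up for illustration and are not the paper's data):

```python
def precision_recall(tp, fp, fn):
    """Detection metrics: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts for a symbol-detection run (illustration only).
p, r = precision_recall(tp=90, fp=10, fn=5)
```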


... For instance, deep learning methods were used for symbol digitisation in other types of engineering diagrams [15,16]. This was considered a difficult task for multiple reasons, including the numerous symbols present in each diagram [15], relatively small symbol sizes [15,16] and use of non-standard symbols [17,18]. ...
... Although construction drawings are more complex than other types of engineering drawings, methods for their digitisation have received considerably less attention than those for other engineering drawing types, such as Piping and Instrumentation Diagrams (P&IDs) [10,15–17,19–21]. One reason that construction drawings are more complex is that they are typically composed of multiple drawing layers. ...
... The literature on symbol digitisation in engineering drawings covers a range of drawing types, with a particular focus on P&IDs [10,15–17,19–21]. For instance, Elyan et al. [15] created a YOLO-based method to detect symbols in P&IDs. ...
Article
Full-text available
Construction drawings are frequently stored in undigitised formats and consequently, their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff, where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention; however, construction drawings have received considerably less interest than other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.
... Piping and instrumentation diagrams (P&IDs) are technical drawings that are used to operate and maintain process systems [1]. They show the piping and related components and how they are interconnected [2,3]. ...
... For a completed project, P&IDs are used as a reference for understanding the layout and operation of the process system, which can be helpful when performing maintenance or repairs [3,5]. For ongoing projects, procurement teams use P&IDs to identify the components and quantities needed for the project [1]. This information is critical for preparing bills of quantities, placing a purchase order, developing a work schedule, or performing resource allocation. ...
... Specialized authoring programs are used to create P&IDs, but due to contractual reasons and intellectual property concerns, they are often shared as rasterized images or PDFs [1,6,7]. Additionally, for many existing facilities, P&IDs were often created manually and are stored as PDFs of scanned paper drawings [3,4,8]. ...
Article
Current computer vision methods for symbol detection in piping and instrumentation diagrams (P&IDs) face limitations due to the manual data annotation resources they require. This paper introduces a versatile two-stage symbol detection pipeline that optimizes efficiency by (1) labeling only data samples with minimal cumulative informational redundancy, (2) restricting annotation to the minimal effective training dataset size, and (3) expanding the training dataset using pseudo-labels. In Stage-1, the method performs generic symbol detection, while Stage-2 focuses on symbol differentiation through metric learning. To enhance robustness and generalizability, the model is trained on a diverse dataset collected from both industry sources and web scraping. The achieved Top-1 accuracy is 85.39%, with a Top-5 accuracy of 95.19% on a test dataset containing 102 symbol classes. These results suggest the potential for a shift from resource-intensive supervised learning approaches to a more efficient semi-supervised paradigm.
... With recent advancements in deep learning, several cases have been reported wherein a CNN-based deep learning technology has been applied for diagram recognition (Fu & Kara, 2011; Rahul et al., 2019; Yu et al., 2019; Yun et al., 2020; Kim et al., 2021b). However, the study by Fu and Kara (2011) applied this technology to a general and simple logic diagram. ...
... In the study by Yun et al. (2020), symbols in P&IDs were detected using an R-CNN; however, the recognition of texts or lines was not considered. Kim et al. (2021b) proposed a recognition method wherein symbols and texts are recognized in an extremely dense and complex image P&ID using a deep neural network. Moon et al. (2021) developed a deep learning-based method to detect various line signs and flow arrows in an image P&ID. ...
... Previous studies on the recognition of objects in P&IDs using a deep neural network faced limitations in the digitization of image P&IDs currently used in the field, as listed in Table 1. First, most studies, except for those by Kim et al. (2021b) and Moon et al. (2021), recognized approximately 10 symbols only and failed to consider lines and symbols of various shapes and sizes that are currently used in industries. Second, previous studies did not recognize all object types (symbols, texts, and lines) that are present in most P&IDs. ...
Article
Full-text available
This study proposes an end-to-end digitization method for converting piping and instrumentation diagrams (P&IDs) in the image format to digital P&IDs. Automating this process is an important concern in the process plant industry because presently image P&IDs are manually converted into digital P&IDs. The proposed method comprises object recognition within the P&ID images, topology reconstruction of recognized objects, and digital P&ID generation. A dataset comprising 75,031 symbol, 10,073 text, and 90,054 line data was constructed to train the deep neural networks used for recognizing symbols, text, and lines. Topology reconstruction and digital P&ID generation were developed based on traditional rule-based approaches. Five test P&IDs were digitalized in the experiments. The experimental results for recognizing symbols, text, and lines showed good precision and recall performance, with averages of 96.65%/96.40%, 90.65%/92.16%, and 95.25%/87.91%, respectively. The topology reconstruction results showed an average precision of 99.56% and recall of 96.07%. The digitization was completed in less than three and a half hours (8488.2 s on average) for five test P&IDs.
... To date, several studies [33–35] have reported the recognition of various types of diagrams, such as electrical diagrams, engineering diagrams, logic diagrams, and piping and instrumentation diagrams (P&IDs). With recent developments in deep learning algorithms, there has been active research on the application of CNN-based deep learning methods to diagram recognition [36–40]. In [36], a simple logic diagram with general application targets was recognized. ...
... In [39], R-CNN was employed to recognize symbols in P&IDs, but there was no discussion on the recognition of texts or lines. In [40], a method for the recognition of various types of symbols and texts was presented for high-density P&IDs. Similar studies related to line recognition in P&ID include [25,38]. ...
... In the proposed line recognition method, the results of recognizing symbols and texts included in P&ID are used as input. Therefore, using the method in [40], the symbols and texts included in the test P&IDs were recognized prior to the experiment. Table 2 shows the results of recognition performance for each test P&ID. ...
Article
Full-text available
As part of research on technology for automatic conversion of image-format piping and instrumentation diagram (P&ID) into digital P&ID, the present study proposes a method for recognizing various types of lines and flow arrows in image-format P&ID. The proposed method consists of three steps. In the first step of preprocessing, the outer border and title box in the diagram are removed. In the second step of detection, continuous lines are detected, and then line signs and flow arrows indicating the flow direction are detected. In the third step of post-processing, using the results of line sign detection, continuous lines that require changing of the line type are determined, and the line types are adjusted accordingly. Then, the recognized lines are merged with flow arrows. For verification of the proposed method, a prototype system was used to conduct an experiment of line recognition. For the nine test P&IDs, the average precision and recall were 96.14% and 89.59%, respectively, showing high recognition performance.
... In the work of Kim et al. [16], the interpretation of EDs is conducted on high-resolution images through specialized models for symbol detection and text recognition after preprocessing the images. Organizing the symbols into label classes makes object identification a more clearly defined task. ...
... Specialized models such as the Character Region Awareness For Text detection (CRAFT) model proposed by Baek et al. [17] and the Convolutional Recurrent Neural Network (CRNN) presented by Shi, Bai, and Yao [18] stand out for text recognition. Accordingly, Kim et al. [16] showed that achieving better performance in the analysis of EDs requires dedicated techniques for text detection, such as CRAFT and CRNN, together with other techniques for object identification, such as GFL. Even with a large number of classes, this approach results in a complete and robust solution for interpreting EDs. ...
Chapter
Relay-based Railways Interlocking Systems (RRIS) carry out critical functions to control stations. Despite being based on old and hard-to-maintain electro-mechanical technology, RRIS are still pervasive. A powerful CAD modeling and analysis approach based on symbolic logic has been recently proposed to support the re-engineering of relay diagrams into more maintainable computer-based technologies. However, the legacy engineering drawings that need to be digitized consist of large, hand-drawn diagrams dating back several decades. Manually transforming such diagrams into the format of the CAD tool is labor-intensive and error-prone, effectively a bottleneck in the reverse-engineering process. In this paper, we tackle the problem of automatic digitalization of RRIS schematics into the corresponding CAD format with an integrative Artificial Intelligence approach. Deep learning-based methods, segment detection, and clustering techniques for the automated digitalization of engineering schematics are used to detect and classify the single elements of the diagram. These elementary elements can then be aggregated into more complex objects leveraging the domain ontology. First results of the method’s capability of automatically reconstructing the engineering schematics are presented.
... In addition, various studies on digital transformation have been conducted to increase work efficiency [29–32]. These require the digitization of unstructured analog data, such as image-format drawings [25,33–36]. ...
... In recent years, many studies have leveraged deep learning (DL) technology for symbol recognition in image-format drawings [25,33–36]. Generally, they train the DL model to recognize symbols by using a symbol image dataset. ...
Article
Full-text available
With the advancement of deep learning (DL), researchers and engineers in the marine industry are exploring the application of DL technologies to their specific applications. In general, the accuracy of inference using DL technologies is significantly dependent on the number of training datasets. Unfortunately, people in marine science and engineering environments are often reluctant to share their documents (e.g., P&IDs) with third-party manufacturers or public clouds to protect their proprietary information. Despite this, the demand for object detection using DL technologies in image-formatted files (e.g., jpg, png, or pdf format) is steadily growing. In this paper, we propose a new mechanism, called a no-training object picker (NoOP), which efficiently recognizes all objects (e.g., lines, tags, and symbols) in image-formatted P&ID documents. Notably, it can recognize objects without any training dataset, thus reducing the time and effort required for the training and collection of unpublished datasets. To clearly present the effectiveness of NoOP, we evaluated NoOP using a real P&ID document. As a result, we confirmed that all objects in the image-formatted P&ID file are successfully detected in a short time (only 7.11 s on average).
... In contrast to traditional Bayesian methods, DL models offer superior intelligent modeling efficiencies and are skewing reliability studies (and applications) on industrial cyber-physical systems (ICPSs) as a consequence of the increasing complexity of processes and systems and the inherent necessity to model them [5]. Process efficiencies, equipment condition monitoring, spatiotemporal forecasting, and many other solutions have been recently improved as a result of this shift to DL-based support [21–25]; however, some issues remain, including over-fitting and interpretability issues, optimal hyperparameter selection/optimization, standardized weight initialization paradigm, and discovering the optimal decision criteria between power consumption and performance [4,26]. In spite of this, given the necessity of providing accurate real-time solutions for ICPS components, and especially considering the growing need for uncertainty modeling, sensor data discrepancies, and dynamic operating conditions, DL techniques remain preferable even at the expense of computational power. ...
... In spite of this, given the necessity of providing accurate real-time solutions for ICPS components, and especially considering the growing need for uncertainty modeling, sensor data discrepancies, and dynamic operating conditions, DL techniques remain preferable even at the expense of computational power. In this light, numerous DL-based algorithms have been developed over the years, including, but not limited to, RNNs and echo state networks for time-series forecasting [21–23], CNNs for discriminative modeling/diagnostics [5,24], and multi-purpose DNNs [5,11]. Most of these algorithms are stand-alone models that come with their own shortcomings and may be component-specific and/or application-specific. Moreover, the task presented herein clearly points to CNNs and MLPs as possible solutions, considering that their architectures are fundamentally designed for discriminative modeling and/or classification purposes (even though MLPs can also model regression/forecasting problems accurately). ...
Article
Full-text available
Despite the increasing digitalization of equipment diagnostic/condition monitoring systems, it remains a challenge to accurately harness discriminant information from multiple sensors with unique spectral (and transient) behaviors. High-precision systems such as automatic regrinding in-line equipment provide intelligent regrinding of micro drill bits; however, immediate monitoring of the grinder during the grinding process has become necessary because ignoring it directly affects the drill bit’s life and the equipment’s overall utility. Vibration signals from the frame and the high-speed grinding wheels reflect the different health stages of the grinding wheel and can be exploited for intelligent condition monitoring. The spectral isolation technique as a preprocessing tool ensures that only the critical spectral segments of the inputs are retained for improved diagnostic accuracy at reduced computational costs. This study explores artificial intelligence-based models for learning the discriminant spectral information stored in the vibration signals and considers the accuracy and cost implications of spectral isolation of the critical spectral segments of the signals for accurate equipment monitoring. Results from one-dimensional convolutional neural networks (1D-CNN) and multi-layer perceptron (MLP) neural networks reveal that spectral isolation offers higher condition monitoring accuracy at reduced computational costs. Experimental results using different 1D-CNN and MLP architectures reveal 4.6% and 7.5% improved diagnostic accuracy for the 1D-CNNs and MLPs, at about 1.3% and 5.71% reduced computational costs, respectively.
... When comparing techniques used to analyze EDs, several authors use convolutional neural networks (CNNs), which are promising given their ability to deal with non-linear information and big data. As presented by Kang et al. (2019), other approaches can assist in this analysis, such as the sliding-window method and aspect-ratio calculation, or, as in Kim et al. (2021), generalized focal loss (GFL). In , they showed how promising it is to divide object detection tasks into stages depending on their class. ...
Preprint
Full-text available
Engineering drawings of railway interlocking systems are often legacy documents, since the railway networks were built many years ago. Most of these drawings remain archived as handwritten sheets and need to be digitalized to support continued updating and safety checks. This digitalization task is challenging, as it requires major manual labor, and standard machine learning methods may not perform satisfactorily because the drawings can be noisy and have poor sharpness. Considering these challenges, this paper proposes to solve the problem with a hybrid method that combines machine learning models, clustering techniques, computer vision, and rule-based methods. A fine-tuned deep learning model is applied to identify symbols, letters, numbers, and specified objects. The lines representing electrical connections are determined using a combination of the probabilistic Hough transform and clustering techniques. The identified letters are joined into labels by applying rule-based methods, and electrical connections are attached to symbols in a graph structure. A readable output is created for a drawing interface using the edges of the graph structure and the positions of the detected objects. The method proposed in this paper can be applied to other engineering drawings and is a generalizable solution to the challenge of digitizing engineering schemes.
... Cropping the appropriate region of interest (ROI) that includes the text of interest is the primary step in preprocessing for optical character recognition (OCR). We extract textual information and their locations by applying EasyOCR [50]. EasyOCR is an open-source Python tool used to extract text from images. ...
Preprint
Full-text available
83% of the world's population owns a smartphone today. The use of smartphones as personal assistants is also emerging. This article proposes a new video dataset suitable for few-shot or zero-shot learning. The dataset contains videos of handheld products captured by visually impaired (VI) people using a handheld smartphone. With the ultimate goal of improving assistive technology for the VI, the dataset is designed to facilitate question-answering based on both textual and visual features. One of the objectives of such video analytics is to develop assistive technology for visually impaired people for day-to-day activity management and to provide an independent shopping experience. This article highlights the limitations of existing deep learning-based approaches when applied to the dataset, suggesting that they pose novel challenges for computer vision researchers. We propose a zero-shot VQA approach for the problem. Despite the current approaches' poor performance, they foster a training-free zero-shot approach, providing a visual question-answering baseline as a foundation for future research. We believe the dataset provides new challenges and will attract many computer vision researchers. This dataset will be made available.
... LWDP requires additional support from other text recognition technologies as its performance can degrade based on environmental conditions. Kim et al. [39] proposed a method for recognizing symbols and text in a P&ID image by employing deep learning technology. Symbols are recognized based on the GFL method, while text is recognized through the EasyOCR framework. ...
Article
Full-text available
The engineering sector is undergoing digital transformation (DT) alongside shifts in labor patterns. This study concentrates on piping design within plant engineering, aiming to develop a system for optimal piping route design using artificial intelligence (AI) technology. The objective is to overcome limitations related to time and costs in traditional manual piping design processes. The ultimate aim is to contribute to the digitalization of engineering processes and improve project performance. Initially, digital image processing was utilized to digitize piping and instrument diagram (P&ID) data and establish a line topology set (LTS). Subsequently, three-dimensional (3D) modeling digital tools were employed to create a user-friendly system environment that visually represents piping information. Dijkstra’s algorithm was implemented to determine the optimal piping route, considering various priorities during the design process. Finally, an interference avoidance algorithm was used to prevent clashes among piping, equipment, and structures. Hence, an auto-routing system (ARS), equipped with a logical algorithm and 3D environment for optimal piping design, was developed. To evaluate the effectiveness of the proposed model, a comparison was made between the bill of materials (BoM) from Company D’s chemical plant project and the BoM extracted from the ARS. The performance evaluation revealed that the accuracy in matching pipe weight and length was 105.7% and 84.9%, respectively. Additionally, the accuracy in matching the weight and quantity of fittings was found to be 99.7% and 83.9%, respectively. These findings indicate that current digitalized design technology does not ensure 100% accurate designs. Nevertheless, the results can still serve as a valuable reference for attaining optimal piping design. 
This study’s outcomes are anticipated to enhance work efficiency through DT in the engineering piping design sector and contribute to the sustainable growth of companies.
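The abstract above applies Dijkstra's algorithm to find optimal piping routes. A minimal, generic sketch of the algorithm on a weighted graph (the toy graph and weights below are hypothetical, not from the ARS system; a real routing weight might encode pipe length or cost):

```python
import heapq

def dijkstra(graph, source):
    """Return the minimal distance from source to every reachable node.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Toy routing graph: A-B-C-D chain with a costlier direct A-C edge.
g = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)], "C": [("D", 1)]}
```

Running `dijkstra(g, "A")` prefers the A-B-C-D chain (total cost 4) over the direct but heavier edges.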
... Kim et al. [6] proposed a method using deep learning to detect symbols and text in dense image P&IDs. In this method, two networks based on generalized focal loss (GFL) [21] with ResNet50 [22] as the backbone were used to detect small and large symbols effectively during symbol recognition. ...
Article
Full-text available
In recent years, with the use of computer-aided design software for most plant projects, intelligent piping and instrumentation diagrams (P&IDs) have become the default format for P&IDs. However, most of the previous P&IDs are often in image format, and it is necessary to convert them to intelligent P&IDs. In this study, we solve the problem of classifying the functional types of lines in the process of converting image P&IDs into intelligent P&IDs. The challenge is to classify whether each line in a P&ID is a piping or signal line. To solve this, the objects connected to the line, not the type of line, need to be considered. First, the connection relationships between symbols and lines in a P&ID are represented as a graph. The problem is then modeled and solved as a node classification problem using a graph neural network based on the mean aggregator convolutional layer of GraphSAGE. In addition, a dataset was generated from 19 real-world P&ID drawings to train the graph neural network and the optimal model was selected through hyperparameter tuning. The implementation and experiments of the proposed method demonstrate an accuracy of 99.53%.
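The mean aggregator of GraphSAGE mentioned above can be sketched without a deep learning framework: each layer averages a node's neighbor feature vectors and combines the result with the node's own vector before the learned linear map. The toy P&ID graph and feature values below are hypothetical:

```python
def mean_aggregate(features, neighbors, node):
    """One GraphSAGE-style mean-aggregation step (before the learned weights):
    average the neighbor feature vectors and concatenate with the node's own."""
    nbrs = neighbors[node]
    dim = len(features[node])
    if nbrs:
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)]
    else:
        mean = [0.0] * dim  # isolated node: no neighbor information
    return features[node] + mean  # concatenated vector of length 2 * dim

# Toy graph: one line node connected to two symbol nodes.
feats = {"line1": [1.0, 0.0], "sym1": [0.0, 1.0], "sym2": [2.0, 1.0]}
adj = {"line1": ["sym1", "sym2"], "sym1": ["line1"], "sym2": ["line1"]}
```

In the full model, this concatenated vector would pass through a learned weight matrix and nonlinearity per layer.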
... In [30], a DCNN and graph convolutional network (GCN) were combined to extract and recognize geological map symbols. In [31], deep learning technology was applied to piping and instrumentation diagrams (P&IDs) to recognize symbols and text at an industrially applicable level. In such optical character recognition methods, the foreground and background differ considerably and mostly have a regular shape. ...
Article
Full-text available
Point symbols on a scanned topographic map (STM) provide crucial geographic information. However, point symbol recognition entails high complexity and uncertainty owing to the stickiness of map elements and singularity of symbol structures. Therefore, extracting point symbols from STMs is challenging. Currently, point symbol recognition is performed primarily through pattern recognition methods that have low accuracy and efficiency. To address this problem, we investigated the potential of a deep learning-based method for point symbol recognition and proposed a deep convolutional neural network (DCNN)-based model for this task. We created point symbol datasets from different sources for training and prediction models. Within this framework, atrous spatial pyramid pooling (ASPP) was adopted to handle the recognition difficulty owing to the differences between point symbols and natural objects. To increase the positioning accuracy, the k-means++ clustering method was used to generate anchor boxes that were more suitable for our point symbol datasets. Additionally, to improve the generalization ability of the model, we designed two data augmentation methods to adapt to symbol recognition. Experiments demonstrated that the deep learning method considerably improved the recognition accuracy and efficiency compared with classical algorithms. The introduction of ASPP in the object detection algorithm resulted in higher mean average precision and intersection over union values, indicating a higher recognition accuracy. It is also demonstrated that data augmentation methods can alleviate the cross-domain problem and improve the rotation robustness. This study contributes to the development of algorithms and the evaluation of geographic elements extracted from STMs.
... Industrial applications of OCR. The study by Kim et al. [12] proposes a method for recognizing symbols and text in Piping and Instrumentation Diagrams (P&IDs) within the process industry. The need for digitizing P&IDs arises because many P&IDs, especially those for older plants, exist only in image form. ...
Conference Paper
In this study, an OCR system based on deep learning techniques was deployed to digitize scanned agricultural regulatory documents comprising certificates and labels. Recognition of the certificates and labels is challenging because they are scanned images of hard-copy forms, and the layout and size of the text as well as the languages vary between countries (due to diverse regulatory requirements). We evaluated and compared various state-of-the-art deep learning-based text detection and recognition models as well as a packaged OCR library, Tesseract. We then adopted a two-stage approach comprising text detection using Character Region Awareness For Text (CRAFT) followed by recognition using the OCR branch of a multi-lingual text recognition algorithm, E2E-MLT. A sliding-window text matcher is used to enhance the extraction of the required information, such as trade names, active ingredients, and crops. Initial evaluation revealed that the system performs well, with a high accuracy of 91.9% for the recognition of trade names in certificates and labels, and the system is currently deployed for use in the Philippines, one of our collaborator's sites.
Keywords: Deep learning; Text detection; Optical character recognition; Regulatory document
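The sliding-window text matcher mentioned above is not specified in detail; one plausible minimal reading slides a fixed-size word window over the OCR tokens and keeps the best fuzzy match. The token values, example trade name, and threshold below are assumptions for illustration:

```python
from difflib import SequenceMatcher

def sliding_window_match(tokens, target, threshold=0.8):
    """Slide a window of len(target.split()) words over OCR tokens and return
    the window whose fuzzy similarity to target is highest, if it meets threshold."""
    n = len(target.split())
    best, best_score = None, threshold
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        score = SequenceMatcher(None, window.lower(), target.lower()).ratio()
        if score >= best_score:
            best, best_score = window, score
    return best

# Hypothetical OCR output from a label, including a typical misread ("Nome").
ocr = ["Trade", "Nome:", "AgriGuard", "500", "EC", "Active", "Ingredient"]
```

The fuzzy threshold tolerates OCR misreads while still rejecting unrelated windows.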
... Against the traditional methods, which rely on hand-crafted feature engineering, these DL methods are fully automated in their architecture to perform both feature engineering and classification (and regression) tasks efficiently. This transition to DL-based models has greatly improved process efficiencies [7], equipment condition monitoring [4], spatiotemporal forecasting [8], and a host of many other solutions [5,9–11]; however, they are faced with challenges including over-fitting, interpretability, optimal hyperparameter selection (and optimization), standardized weight initialization paradigm, and finding the optimal decision criteria between power consumption and performance [1,12]. Nonetheless, considering the need for accurate real-time solutions for ICPS components, especially with the growing need for uncertainty modeling, sensor data discrepancies, dynamic environmental and operating conditions, etc., DL methods remain preferable even at the cost of computational power. ...
Article
Full-text available
This paper develops a novel hybrid feature learner and classifier for vibration-based fault detection and isolation (FDI) of industrial equipment. The trained model extracts high-level discriminative features from vibration signals and predicts equipment state. Against the limitations of traditional machine learning (ML)-based classifiers, the convolutional neural network (CNN) and deep neural network (DNN) are not only superior for real-time applications, but they also come with other benefits including ease-of-use, automated feature learning, and higher predictive accuracies. This study proposes a hybrid DNN and one-dimensional CNN diagnostics model (D-dCNN) which automatically extracts high-level discriminative features from vibration signals for FDI. Via Softmax averaging at the output layer, the model mitigates the limitations of the standalone classifiers. A diagnostic case study demonstrates the efficiency of the model with a significant accuracy of 92% (F1 score) and extensive comparative empirical validations.
Article
Full-text available
This paper presents a review of deep learning on engineering drawings and diagrams. These are typically complex diagrams that contain a large number of different shapes, such as text annotations, symbols, and connectivity information (largely lines). Digitising these diagrams essentially means the automatic recognition of all these shapes. Initial digitisation methods were based on traditional approaches, which proved to be challenging as these methods rely heavily on hand-crafted features and heuristics. In the past five years, however, there has been a significant increase in the number of deep learning-based methods proposed for engineering diagram digitalisation. We present a comprehensive and critical evaluation of existing literature that has used deep learning-based methods to automatically process and analyse engineering drawings. Key aspects of the digitisation process, such as symbol recognition, text extraction, and connectivity information detection, are presented and thoroughly discussed. The review is presented in the context of a wide range of applications across different industry sectors, such as the Oil and Gas, Architectural, and Mechanical sectors, amongst others. The paper also outlines several key challenges, namely the lack of datasets, data annotation, evaluation and class imbalance. Finally, the latest developments in digitalising engineering drawings are summarised, conclusions are drawn, and future interesting research directions to accelerate research and development in this area are outlined.
Article
Full-text available
Because of the rapid growth in technology breakthroughs, including multimedia and cell phones, Telugu character recognition (TCR) has recently become a popular study area. It is still necessary to construct automated and intelligent online TCR models, even if many studies have focused on offline TCR models. The construction and validation of a Telugu character dataset using an Inception- and ResNet-based model are presented. The collection of 645 letters in the dataset includes 18 Achus, 38 Hallus, 35 Othulu, 34×16 Guninthamulu, and 10 Ankelu. The proposed technique aims to efficiently recognize and identify distinctive Telugu characters online. The model's main pre-processing steps include normalization, smoothing, and interpolation. Improved recognition performance can be attained by using stochastic gradient descent (SGD) to optimize the model's hyperparameters.
Article
Symbol detection methods for Piping and Instrumentation Diagram (P &ID) have been continuously developed over the past few decades, evolving from traditional methods to convolutional neural networks (CNN). This study aims to compare the performance of tag classification and detection in terms of both accuracy and time consumption between traditional methods, a designed-from-scratch CNN, and ResNet50 transfer learning using the same dataset. The results show that ResNet50 transfer learning achieves the highest F1 at 85.9%, but takes the longest execution time at 175.7 s per diagram. Meanwhile, most of the errors in the traditional method and the designed-from-scratch CNN were false positives in tag detection for the diagram description on the right pane. However, after applying a rule to crop the diagram picture before classification and detection, the designed-from-scratch CNN exhibits the best performance, achieving the highest F1 at 89.6%, with a running time of 118.4 s per diagram, which is comparable to the traditional method.
Article
Digitizing image-format piping and instrumentation diagrams (P&IDs) consists of a step for detecting the information objects that constitute P&IDs, a step for identifying connection relationships between the detected objects, and a step for creating digital P&IDs. This paper presents a P&ID line object extraction method that uses an improved continuous line detection algorithm to extract the information objects that constitute P&IDs. The improved continuous line detection algorithm reduces the time spent performing line extraction by edge detection that employs a differential filter. It is also used to detect continuous lines in the vertical, horizontal, and diagonal directions. Additionally, it processes diagonal continuous lines after performing image differentiation to handle short continuous lines, which are a major cause of misdetection when detecting diagonal continuous lines. The P&ID line object extraction method that incorporates this algorithm consists of three steps. First, the preprocessing step removes the diagram's outline borders and heading areas. Second, the detection step detects continuous lines and then detects the special signs that are needed to distinguish different types of lines. Third, the postprocessing step uses the detected line signs to identify detected continuous lines that must be converted to other types of lines, and their types are changed. Finally, the lines and the flow arrow detection information are merged. To verify the proposed method, an image-format P&ID line extraction system prototype was implemented, and line extraction tests were conducted. In nine test P&IDs, the overall average precision and recall were 95.26% and 91.25%, respectively, demonstrating good line extraction performance.
Article
Advances in deep convolutional neural networks led to breakthroughs in many computer vision applications. In chemical engineering, a number of tools have been developed for the digitization of Process and Instrumentation Diagrams. However, there is no framework for the digitization of process flow diagrams (PFDs). PFDs are difficult to digitize because of the large variability in the data, e.g., there are multiple ways to depict unit operations in PFDs. We propose a two-step framework for digitizing PFDs: (i) unit operations are detected using deep learning powered object detection model, (ii) the connectivities between unit operations are detected using a pixel-based search algorithm. To ensure robustness, we collect and label over 1,000 PFDs from diversified sources including various scientific journals and books. To cope with the high intra-class variability in the data, we define 47 distinct classes that account for different drawing styles of unit operations. Our algorithm delivers accurate and robust results on an independent test set. We report promising results for line and unit operation detection with an Average Precision at 50 percent (AP50) of 88% and an Average Precision (AP) of 68% for the detection of unit operations.
Article
The development of machine learning and deep learning has provided solutions for predicting microbiota responses to environmental change based on microbial high-throughput sequencing. However, few studies have specifically clarified the performance and practicality of the two types of binary classification models to find a better algorithm for microbiota data analysis. Here, for the first time, we evaluated the performance, accuracy, and running time of binary classification models built by three machine learning methods - random forest (RF), support vector machine (SVM), and logistic regression (LR) - and one deep learning method - back propagation neural network (BPNN). The built models were based on microbiota datasets that removed low-quality variables and solved the class imbalance problem. Additionally, we optimized the models by tuning. Our study demonstrated that dataset pre-processing is a necessary step in model construction. Among these 4 binary classification models, BPNN and RF were the most suitable methods for constructing microbiota binary classification models. Using these 4 models to predict multiple microbial datasets, BPNN showed the highest accuracy and the most robust performance, while the RF method ranked second. We also constructed optimal models by adjusting the epochs of BPNN and the n_estimators of RF six times. This evaluation of model performance provides a road map for the application of artificial intelligence to assessing microbial ecology.
Article
It is important to understand expiration dates; however, this is challenging for machines. Most previous methods recognize expiration dates only under limited conditions. To address this problem, a generalized framework for detecting and understanding expiration dates is proposed. This framework handles challenging cases and distinguishes 13 different date formats. Unlike previous methods, a neural network-based date parser is adopted in the framework to understand the meaning of an expiration date by identifying the day, month, and year. The experimental results demonstrate that the proposed framework achieves 97.74% recognition accuracy for expiration dates in various formats and challenging cases. Since there was no publicly available dataset of expiration dates, a novel dataset collection named ExpDate was created and released.
Article
Full-text available
Piping and instrument diagrams (P&IDs) are a key component of the process industry; they contain information about the plant, including the instruments, lines, valves, and control logic. However, the complexity of these diagrams makes it difficult to extract the information automatically. In this study, we implement an object-detection method to recognize graphical symbols in P&IDs. The framework consists of three parts: region proposal, data annotation, and classification. Sequential image processing is applied as the region proposal step for P&IDs. After obtaining the proposed regions, the unsupervised learning methods k-means and deep adaptive clustering are applied to decompose the detected dummy symbols and assign negative classes to them. By training a convolutional network, it becomes possible to classify the proposed regions and extract the symbolic information. The results indicate that the proposed framework delivers a superior symbol-recognition performance through dummy detection.
Article
Full-text available
A piping and instrumentation diagram (P&ID) is a key drawing widely used in the energy industry. In a digital P&ID, all included objects are classified and made amenable to computerized data management. However, despite being widespread, a large number of P&IDs still in use throughout the process (plant design, procurement, construction, and commissioning) remain in image format owing to difficulties associated with contractual relationships and software systems. In this study, we propose a method that uses deep learning techniques to recognize and extract important information from the objects in image-format P&IDs. We define the training data structure required for developing a deep learning model for P&ID recognition. The proposed method consists of preprocessing and recognition stages. In the preprocessing stage, diagram alignment, outer border removal, and title box removal are performed. In the recognition stage, symbols, characters, lines, and tables are detected. A new deep learning model for symbol detection is defined using AlexNet. We also employ the connectionist text proposal network (CTPN) for character detection, and traditional image processing techniques for P&ID line and table detection. In experiments in which two test P&IDs were recognized according to the proposed method, recognition accuracies for symbols, characters, and lines were found to be 91.6%, 83.1%, and 90.6% on average, respectively.
Article
Full-text available
In the Fourth Industrial Revolution, artificial intelligence technology and big data science are emerging rapidly. To apply these information technologies to the engineering industries, it is essential to digitize the data that are currently archived in image or hard-copy format. For previously created design drawings, the consistency between the design products is reduced in the digitization process, and the accuracy and reliability of estimates of the equipment and materials from the digitized drawings are remarkably low. In this paper, we propose a method and system for automatically recognizing and extracting design information from imaged piping and instrumentation diagram (P&ID) drawings and automatically generating digitized drawings based on the extracted data by using digital image processing techniques such as template matching and the sliding window method. First, the symbols are recognized by template matching, extracted from the imaged P&ID drawing, and registered automatically in the database. Then, lines and text are recognized and extracted from the imaged P&ID drawing using the sliding window method and aspect ratio calculation, respectively. The extracted symbols for equipment and lines are associated with the attributes of the closest text and are stored in the database in a neutral format. This is mapped with the predefined intelligent P&ID information and transformed to commercial P&ID tool formats with the associated information stored. As illustrated through the validation case studies, with the intelligent digitized drawings generated by the above automatic conversion system, the consistency of the design product is maintained, and the problems engineering companies experience with the traditional, manual P&ID input method, such as time consumption, missing items, and misspellings, are solved through the final fine-tuning validation process.
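The template-matching step described above can be illustrated with a toy sum-of-squared-differences matcher. The tiny binary "drawing" and 2×2 symbol are invented for this sketch; practical systems usually run normalized cross-correlation over much larger rasters and many symbol templates.

```python
def match_template(image, template):
    # Slide the template over every window of the image and score each
    # window by the sum of squared differences (SSD); lower is better.
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = float("inf"), None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            ssd = sum(
                (image[y + j][x + i] - template[j][i]) ** 2
                for j in range(th) for i in range(tw)
            )
            if ssd < best:
                best, best_pos = ssd, (y, x)
    return best_pos, best

# Toy binary image with a 2x2 symbol embedded at row 1, column 2.
img = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
sym = [[1, 1], [1, 0]]
pos, score = match_template(img, sym)
```

An SSD of zero at the returned position indicates an exact match, which is why template matching works well for symbols drawn from a fixed CAD library but degrades on rotated or non-standard symbols.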
Article
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
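The connectivity pattern above is easy to sketch: an L-layer dense block has L(L+1)/2 direct connections, and each layer consumes the concatenation of all earlier outputs. The list-based `dense_block` below is a minimal stand-in for the real convolutional layers, just to make the wiring concrete.

```python
def dense_connections(num_layers):
    # An L-layer dense block has L(L+1)/2 direct connections,
    # versus L for a conventional feed-forward stack.
    return num_layers * (num_layers + 1) // 2

def dense_block(x, layers):
    # Minimal sketch of dense connectivity: every layer sees the
    # concatenation of all preceding feature lists. `layers` is a
    # list of callables mapping a flat feature list to a new one.
    features = [x]
    for layer in layers:
        concatenated = [v for f in features for v in f]
        features.append(layer(concatenated))
    return features

# Two toy "layers": one sums its inputs, one takes the maximum.
out = dense_block([1.0, 2.0], [lambda f: [sum(f)], lambda f: [max(f)]])
```

Because later layers receive earlier feature maps directly, gradients also flow back along those short paths, which is the mechanism behind the vanishing-gradient claim in the abstract.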
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single-stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
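The default boxes SSD discretizes over can be generated with a few lines of arithmetic. The feature-map size, scale, and aspect ratios below are illustrative values, not the paper's exact configuration.

```python
def default_boxes(feature_map_size, scale, aspect_ratios):
    # Generate SSD-style default boxes as (cx, cy, w, h) in relative
    # [0, 1] coordinates: one box per aspect ratio at each cell centre.
    boxes = []
    n = feature_map_size
    for row in range(n):
        for col in range(n):
            cx, cy = (col + 0.5) / n, (row + 0.5) / n
            for ar in aspect_ratios:
                w = scale * ar ** 0.5   # wider for ar > 1
                h = scale / ar ** 0.5   # taller for ar < 1
                boxes.append((cx, cy, w, h))
    return boxes

# A 4x4 feature map with three aspect ratios yields 48 default boxes.
boxes = default_boxes(feature_map_size=4, scale=0.2,
                      aspect_ratios=[1.0, 2.0, 0.5])
```

SSD repeats this over several feature maps at different resolutions, so small-scale maps cover large objects and vice versa; the network then regresses offsets from each default box rather than absolute coordinates.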
Conference Paper
Full-text available
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information in images, making it powerful for detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post-filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient at 0.14 s/image, using the very deep VGG16 model [27]. An online demo is available at http://textdet.com/.
Article
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objects of interest and produces adjustments to the prior to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals, such as R-CNN and MultiBox, because it completely discards the proposal generation step and encapsulates all the computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the ILSVRC DET and PASCAL VOC datasets confirm that SSD has comparable performance with methods that utilize an additional object proposal step and yet is 100-1000x faster. Compared to other single-stage methods, SSD has similar or better performance, while providing a unified framework for both training and inference.
Article
Full-text available
Feature extraction and representation is a crucial step for multimedia processing. How to extract ideal features that can reflect the intrinsic content of the images as complete as possible is still a challenging problem in computer vision. However, very little research has paid attention to this problem in the last decades. So in this paper, we focus our review on the latest development in image feature extraction and provide a comprehensive survey on image feature representation techniques. In particular, we analyze the effectiveness of the fusion of global and local features in automatic image annotation and content based image retrieval community, including some classic models and their illustrations in the literature. Finally, we summarize this paper with some important conclusions and point out the future potential research directions.
Article
Full-text available
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling, and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text, and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which verifies its generality.
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
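The 11-point interpolated average precision used in early VOC evaluations can be computed directly from a ranked list of true/false-positive flags. The flags and ground-truth count below are invented for illustration; real evaluation first matches detections to ground truth at an IoU threshold.

```python
def voc_ap_11point(is_tp, num_gt):
    # 11-point interpolated AP, as in early Pascal VOC evaluations.
    # `is_tp` lists confidence-ranked detections as True (matched a
    # ground-truth box) or False (false positive).
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_tp:
        tp, fp = tp + flag, fp + (not flag)
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        # Interpolate: max precision at any recall >= r (0 if unreached).
        p = max((p_ for p_, r_ in zip(precisions, recalls) if r_ >= r),
                default=0.0)
        ap += p / 11
    return ap

# Four ranked detections against four ground-truth objects.
ap = voc_ap_11point([True, True, False, True], num_gt=4)
```

The interpolation smooths the sawtooth precision-recall curve, so a late true positive still lifts the AP at all recall points it reaches.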
Article
Deep learning is a popular direction in computer vision and digital image processing. It is widely utilized in many fields, such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace. With the extensive use of deep learning techniques, classification and object detection algorithms have been rapidly developed. In recent years, with the introduction of the concept of "unmanned retail", object detection and image classification play a central role in unmanned retail applications. However, open source datasets for traditional classification and object detection have not yet been optimized for the application scenarios of unmanned retail; currently, no classification or object detection datasets exist that focus solely on unmanned retail. Therefore, in order to promote unmanned retail applications using deep learning-based classification and object detection, we collected more than 30,000 images of unmanned retail containers using a refrigerator affixed with different cameras under both static and dynamic recognition environments. These images were categorized into 10 kinds of beverages. After manual labeling, the dataset images contained 155,153 instances, each of which was annotated with a bounding box. We performed extensive experiments on this dataset using 10 state-of-the-art deep learning-based models. Experimental results indicate the great potential of using these deep learning-based models for real-world smart unmanned vending machines (UVMs).
Article
The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
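The focal loss itself is a one-line modification of cross entropy, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). A minimal binary-case sketch with the paper's commonly used defaults (gamma = 2, alpha = 0.25):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Binary focal loss: down-weights well-classified examples via
    # the (1 - p_t)**gamma modulating factor. p is the predicted
    # foreground probability; y is the label in {0, 1}.
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

# A well-classified easy negative contributes far less than a hard one.
easy = focal_loss(0.01, y=0)   # p_t = 0.99: nearly zero loss
hard = focal_loss(0.90, y=0)   # p_t = 0.10: large loss
```

With gamma = 0 the expression reduces to alpha-weighted cross entropy; increasing gamma is what lets a dense detector tolerate the flood of easy background anchors the abstract describes.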
Article
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector, trained with low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade with increasing the IoU thresholds. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at https://github.com/zhaoweicai/cascade-rcnn.
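The IoU threshold that separates positives from negatives is simple to compute. A sketch with corner-format boxes; the coordinates and the 0.3/0.5 thresholds are invented for illustration (Cascade R-CNN's stages use 0.5, 0.6, and 0.7):

```python
def iou(box_a, box_b):
    # Intersection over union for boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt, threshold):
    # A proposal counts as a positive training example only if its
    # IoU with the ground-truth box meets the stage's threshold.
    return iou(proposal, gt) >= threshold

# Two 10x10 boxes overlapping by half their width: IoU = 50/150 = 1/3.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Raising the threshold shrinks the positive set, which is exactly the overfitting pressure the cascade's stage-by-stage resampling is designed to relieve.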
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However, RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long-range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
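The residual reformulation is just y = F(x) + x: the stacked layers learn the residual F rather than the full mapping, so identity is trivially recoverable. A minimal list-based sketch in which the `layer` callables stand in for the real convolutional residual branch:

```python
def residual_block(x, layer):
    # y = F(x) + x: the layer learns the residual F with reference to
    # the input, and the skip connection adds the input back in.
    fx = layer(x)
    return [a + b for a, b in zip(fx, x)]

# With a zero residual function, the block is an exact identity,
# which is why extra residual layers cannot hurt representational power.
identity_out = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))

# A non-trivial residual: F scales the input by 0.5, so y = 1.5 * x.
out = residual_block([1.0, 2.0], lambda v: [w * 0.5 for w in v])
```

The additive skip also gives gradients a direct path back to earlier layers, which is the optimization argument behind training 100+-layer networks.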
Article
We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3% points mAP.
Article
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
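The "dropout" regularization mentioned above can be sketched in a few lines. This is the modern "inverted" variant, which rescales at training time rather than at test time (a detail that differs from the paper's original formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_drop, train=True):
    """Inverted dropout: zero each unit with probability p_drop during
    training and rescale survivors so the expected activation is unchanged.
    At test time the layer is the identity."""
    if not train:
        return x
    mask = rng.random(x.shape) >= p_drop  # True = keep the unit
    return x * mask / (1.0 - p_drop)

# Each unit is either dropped (0.0) or scaled up (1/0.5 = 2.0),
# so the mean activation stays near the original value of 1.0.
x = np.ones(10000)
y = dropout(x, 0.5)
```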
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates \emph{deep recurrent neural networks}, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
We propose a novel approach to annotating weakly labelled data. In contrast to many existing approaches that perform annotation by seeking clusters of self-similar exemplars (minimising intra-class variance), we perform image annotation by selecting exemplars that have never occurred before in the much larger, and strongly annotated, negative training set (maximising inter-class variance). Compared to existing methods, our approach is fast, robust, and obtains state of the art results on two challenging data-sets: voc2007 (all poses) and the msr2 action data-set, where we obtain a 10% increase. Moreover, this use of negative mining complements existing methods that seek to minimise the intra-class variance, and can be readily integrated with many of them.
Article
In the first part of the paper we consider the problem of dynamically apportioning resources among a set of options in a worst-case on-line framework. The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting. We show that the multiplicative weight-update Littlestone-Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems. We show how the resulting learning algorithm can be applied to a variety of problems, including gambling, multiple-outcome prediction, repeated games, and prediction of points in ℝⁿ. In the second part of the paper we apply the multiplicative weight-update technique to derive a new boosting algorithm. This boosting algorithm does not require any prior knowledge about the performance of the weak learning algorithm. We also study generalizations of the new boosting algorithm to the problem of learning functions whose range, rather than being binary, is an arbitrary finite set or a bounded segment of the real line.
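The multiplicative weight-update rule at the heart of this framework can be sketched as follows; the learning rate and the loss vector are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def hedge_update(weights, losses, eta=0.5):
    """One round of the multiplicative weight-update rule: each option's
    weight shrinks exponentially in the loss it just incurred, and the
    weights are renormalized to a probability distribution."""
    w = weights * np.exp(-eta * np.asarray(losses, dtype=float))
    return w / w.sum()

# Three options; option 0 loses every round, so its share of the
# distribution decays geometrically while the others stay equal.
w = np.ones(3) / 3
for _ in range(20):
    w = hedge_update(w, [1.0, 0.0, 0.0])
```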
Article
This paper presents a recursive morphological operation developed in order to perform efficient shape representation. This operation uses a structuring element as a geometrical primitive to evaluate the shape of an object. It results in a set of loci of the translated structuring elements that are included in the object but which do not overlap. The analysis of its computational complexity shows that it is usually less time-consuming than morphological erosion. By using this operation, an object decomposition algorithm is then developed for shape representation. It decomposes an object into a union of simple and non-overlapping object components. The object is represented by the sizes and loci of its object components. This representation is information preserving, shift and scale invariant, and non-redundant. It has been compared with skeletons, morphological decomposition, chain codes, and quadtrees in terms of compression ability and image processing facility. Experimental results show that it is very compact, especially if information loss is allowed. Because of the non-overlap between object components, many image processing tasks can be easily performed by directly using this shape representation.
Article
A new method is proposed for curve detection. For a curve with n parameters, instead of transforming one pixel into a hypersurface of the n-D parameter space as the HT and its variants do, we randomly pick n pixels and map them into one point in the parameter space. In comparison with the HT and its variants, our new method has the advantages of small storage, high speed, infinite parameter space and arbitrarily high resolution. The preliminary experiments have shown that the new method is quite effective.
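For a circle (n = 3 parameters), the mapping described above, from n randomly picked pixels to a single point in parameter space, reduces to computing a circumcircle. A minimal NumPy sketch of that core step (the function name is ours):

```python
import numpy as np

def circle_from_3_points(p1, p2, p3):
    """Map three pixels to one point (cx, cy, r) in parameter space,
    in contrast to the classical HT, which maps one pixel to an entire
    hypersurface. Equating squared distances from the unknown center
    to the three points yields a 2x2 linear system for (cx, cy)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a = np.array([[x2 - x1, y2 - y1],
                  [x3 - x1, y3 - y1]], dtype=float)
    b = 0.5 * np.array([x2**2 - x1**2 + y2**2 - y1**2,
                        x3**2 - x1**2 + y3**2 - y1**2])
    cx, cy = np.linalg.solve(a, b)   # fails only for collinear points
    r = np.hypot(x1 - cx, y1 - cy)
    return cx, cy, r

# Three points on the unit circle centered at (2, 3); in the full
# method this sampling is repeated and parameter points accumulated.
cx, cy, r = circle_from_3_points((3, 3), (2, 4), (1, 3))
```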
Conference Paper
In electric generation plants, facility administration is being computerized for secure and efficient maintenance, which requires inputting many existing plant diagrams (so-called piping & instrument diagrams) into computer systems. Recognizing these plant diagrams is difficult because of their drawing quality: the drawn elements, namely symbols, lines, and characters, often touch each other. We therefore developed an automatic recognition system for plant diagrams that can input such diagrams at a recognition rate of over 80%. Together with an efficient verification process, the total time to input a diagram is reduced to under 30% of that required for conventional CAD-system input.
Article
An automatic understanding system using the techniques of image processing, pattern recognition, and artificial intelligence has been developed for electronic circuit diagrams. Part of the system is presented to extract three categories of essential components: circuit symbols, characters, and connection lines. Each essential component consists of a set of picture segments which are appropriately detected by a segment tracking algorithm. A heuristic piecewise linear approximation algorithm is proposed to approximate picture segments for primitive recognition. On the basis of topological context, a one-pass manner called the relational best search method applies a depth first search technique uniting a set of specified rules during the traversal of a circuit diagram. This method combines the constituents of each circuit symbol or character into a cluster. All the clusters together with the remaining components are extracted and grouped into the three categories as soon as the traversal is finished. A variety of electronic circuit diagrams have been used for testing the component extractor. So far, the present extractor has shown favorable results.
Article
We present a computational recognition approach to convert network-like, image-based engineering diagrams into engineering models with which computations of interest, such as CAD modeling, simulation, information retrieval and semantic-aware editing, are enabled. The proposed approach is designed to work on diagrams produced using computer-aided drawing tools or hand sketches, and does not rely on temporal information for recognition. Our approach leverages a Convolutional Neural Network (CNN) as a trainable engineering symbol recognizer. The CNN is capable of learning the visual features of the defined symbol categories from a few user-supplied prototypical diagrams and a set of synthetically generated training samples. When deployed, the trained CNN is applied either to the entire input diagram using a multi-scale sliding window or, where applicable, to each isolated pixel cluster obtained through Connected Component Analysis (CCA). Then the connectivity between the detected symbols is analyzed to obtain an attributed graph representing the engineering model conveyed by the diagram. We evaluate the performance of the approach with benchmark datasets and demonstrate its utility in different application scenarios, including the construction and simulation of control system or mechanical vibratory system models from hand-sketched or camera-captured images, content-based image retrieval for resonant circuits and semantic-aware image editing for floor plans.
Article
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
Article
Edges characterize boundaries and are therefore a problem of fundamental importance in image processing. Image edge detection significantly reduces the amount of data and filters out useless information, while preserving the important structural properties in an image. Since edge detection is at the forefront of image processing for object detection, it is crucial to have a good understanding of edge detection algorithms. In this paper a comparative analysis of various image edge detection techniques is presented. The software is developed using MATLAB 7.0. It has been shown that Canny's edge detection algorithm performs better than all the other operators under almost all scenarios. Evaluation of the images showed that under noisy conditions Canny, LoG (Laplacian of Gaussian), Roberts, Prewitt, and Sobel exhibit better performance, respectively. It has also been observed that Canny's edge detection algorithm is computationally more expensive compared to the LoG (Laplacian of Gaussian), Sobel, Prewitt, and Roberts operators.
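As a concrete instance of the gradient operators compared above, the Sobel response to a vertical step edge can be sketched with plain NumPy. This is a naive correlation for illustration only; real implementations use optimized convolution routines:

```python
import numpy as np

KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)  # Sobel horizontal-gradient kernel
KY = KX.T                                 # Sobel vertical-gradient kernel

def conv2_valid(img, k):
    """Naive 'valid' 2-D correlation, sufficient for a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def sobel_magnitude(img):
    gx = conv2_valid(img, KX)
    gy = conv2_valid(img, KY)
    return np.hypot(gx, gy)  # gradient magnitude per pixel

# A vertical step edge: the response is strong only at the boundary
# columns and exactly zero in the flat regions on either side.
img = np.zeros((5, 8))
img[:, 4:] = 1.0
mag = sobel_magnitude(img)
```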