Conference Paper

Activity Recognition of Construction Equipment Using Generated Sound Data

... As seen in Fig. 6, the most frequently employed data collection method was "camera & smartphone photography" (e.g., [44,45]), used in 34 of the 140 studies, followed by "fixed sensors / scanners" (e.g., [46,47]) with 30 studies, "video recordings" (e.g., [37,39]) with 23 studies, and "UGV / UAV hardware" (e.g., [48,49]) with 22 studies. "Event logs / BIM documents" (e.g., [50,51]), "wearable sensors" (e.g., [41,52]), and "audio recordings" (e.g., [53,54]) were the next most frequently used data collection methods. In total, 8 studies collected data using more than one of these methods in various combinations (e.g., [5,55]). ...
... Studies that process information about the construction site by determining the location of a worker, vehicle, piece of equipment, or material have been categorized in this area. Seven studies rely on "sound recognition or classification" (e.g., [53,54]). Of the 4 studies using "NLP" (e.g., [64]), 3 also employed "computer vision" technology (e.g., [55]). ...
... On the other hand, articles collecting data via camera / smartphone photography and video recordings have predominantly employed computer vision for data processing (e.g., [67,68]). Similarly, articles using wearable sensors have typically processed data using non-imagery machine learning (e.g., [52]), those utilizing event logs/BIM documents have processed data using BIM models (e.g., [50]), and those collecting data through audio recordings have relied on sound recognition and classification methods for processing (e.g., [54]). ...
Article
Construction exhibits relatively low efficiency and a modest annual productivity growth rate despite being one of the largest industries. As digitalizing the construction progress monitoring process and ensuring the objectivity of the obtained data are considered critical for enhancing the efficiency of the AEC (Architecture, Engineering, and Construction) industry, AEC companies invest resources to extract and report construction data obtained from the site. This paper examines digitalization efforts in construction progress monitoring, focusing particularly on studies proposing new monitoring models. Databases were searched for articles proposing a specific model for the digitalization of progress monitoring. A framed literature review was conducted, bibliometric analysis was performed via keywords and the VOSviewer tool, and finally the semantic and methodological variations of the models were studied. An in-depth analysis of these studies reveals that the proposed models fundamentally consist of four elements: data collection methods, data processing methodologies, the specific construction areas they focus on, and the primary input objects they address. All studies were examined, categorized under these four headings, and mapped accordingly. The relationships between the emerging methods and their frequency of preference were also analyzed, identifying both more frequently and less commonly employed approaches. Additionally, this paper highlights potential research areas concerning possible data collection and processing methods, specific construction areas, and input sources by emphasizing the existing relationships and correlations between the mentioned categories.
... Scarpiniti et al. demonstrated up to 98% accuracy in categorizing more than 10 different types of construction equipment and devices using a DBN [22]. Sherafat et al. developed CNN models for recognizing multiple-equipment activities [23,40], using a two-level multi-label sound classification scheme that enables concurrent detection of the equipment type and its associated activities. ...
Article
Full-text available
Automated construction monitoring assists site managers in managing safety, schedule, and productivity effectively. Existing research focuses on identifying construction sounds to determine the type of construction activity. However, there are two major limitations: the inability to handle a mixed sound environment in which multiple construction activity sounds occur simultaneously, and the inability to precisely locate the start and end times of each individual construction activity. This research aims to fill this gap by developing an innovative deep learning-based method. The proposed model combines the benefits of a Convolutional Neural Network (CNN) for extracting features and a Recurrent Neural Network (RNN) for leveraging contextual information to handle construction environments with polyphony and noise. In addition, the dual-threshold output permits precise identification of the start and end times of individual construction activities. Before training and testing with construction sounds collected from a modular construction factory, the model was pre-trained with publicly available general sound event data. All of the innovative designs were confirmed by an ablation study, and two extended experiments were also performed to verify the versatility of the present model in additional construction environments and activities. This model has great potential to be used for autonomous monitoring of construction activities.
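The dual-threshold idea described above can be sketched in a few lines: an event opens when the per-frame activity probability crosses a high onset threshold and closes only when it falls below a lower offset threshold, which stabilizes the detected boundaries against brief dips. This is a minimal illustration, not the paper's implementation; the threshold values and function names are assumptions.

```python
def detect_events(probs, t_on=0.6, t_off=0.3):
    """Return (start, end) frame indices of detected events.

    An event starts when the probability crosses t_on and ends only
    when it drops below the lower threshold t_off (hysteresis), so
    brief dips inside an activity do not split it in two.
    """
    events, start = [], None
    for i, p in enumerate(probs):
        if start is None and p >= t_on:
            start = i                    # onset: high threshold crossed
        elif start is not None and p < t_off:
            events.append((start, i))    # offset: low threshold crossed
            start = None
    if start is not None:                # event still open at the end
        events.append((start, len(probs)))
    return events
```

For example, the frame probabilities `[0.1, 0.7, 0.8, 0.5, 0.2, 0.1]` yield a single event spanning frames 1 to 4, even though the probability dips to 0.5 mid-event.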
Conference Paper
Full-text available
Sound event detection is a reasonable choice for several application domains, such as cattle sheds, dense forests, or any dark environment where visual objects are usually obscured or unseen. The aim of this study is the development of an autonomous monitoring system for welfare management in large cow farms based on sound characteristics. In this paper, we prepare a cow sound artificial dataset and develop a sound event annotation tool for annotating the data. We propose a convolutional neural network (CNN) architecture for rare sound event detection. The applied object detection method achieves a higher quantitative evaluation score and a more precise qualitative result than past related studies. Finally, we conclude that the CNN-based architecture for rare sound event detection can be one solution for domestic welfare management. Indeed, the artificial data preparation strategy can be a way to deal with the data scarcity problem and annotation difficulties for rare sound event detection. Keywords: rare sound event detection; audio event annotator; convolutional neural network; domestic cow welfare management
Conference Paper
Full-text available
Dynamic and complex construction sites including incomplete structures and unsecured resources are among the most vulnerable environments to windstorms such as hurricanes. To better secure unstructured construction sites, this paper aims at proposing a new vision-based method to analyze potential risk of wind-induced damage in construction sites. First, by leveraging large-scale images collected from drones, we reconstruct a 3D point cloud model of construction sites and perform the semantic segmentation to categorize potential wind-borne debris. Then, we identify the positions of the potential wind-borne debris given wind speeds and perform the volumetric measurement on such vulnerable objects. Finally, building on the position and the volume of the potential wind-borne debris, we quantify the associated threat level in the context of their kinetic energy in wind situations. A case study was conducted on a real construction site to validate the proposed method. The proposed Imaging-to-Simulation framework enables practitioners to automatically flag vulnerable objects/areas in construction sites with respect to the severity of wind events, which helps better secure their jobsites in a timely manner before potential extreme wind events in order to minimize the associated damage.
Conference Paper
Full-text available
Several studies have been conducted to automatically recognize activities of construction equipment using their generated sound patterns. Most of these studies are focused on single-machine scenarios under controlled environments. However, real construction job sites are more complex and often consist of several types of equipment with different orientations, directions, and locations working simultaneously. The current state of research for recognizing activities of multiple machines on a job site is hardware-oriented, based on microphone arrays (i.e., several single microphones installed on a board in a specific geometric layout) and beamforming principles for classifying sound directions for each machine. While effective, the common hardware approach has limitations, and using microphone arrays is not always a feasible option at ordinary job sites. In this paper, the authors propose a software-oriented approach using Deep Neural Networks (DNNs) and Time-Frequency Masks (TFMs) to address this issue. The proposed method requires only single microphones, as the sound sources can be differentiated by training a DNN. The presented approach has been tested and validated under simulated job site conditions where two machines operated simultaneously. Results show that the average accuracy for soft TFMs is 38% higher than for binary TFMs.
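The distinction between the two mask types compared above can be sketched directly: a soft mask assigns each time-frequency bin a ratio of one source's magnitude to the mixture, while a binary mask makes a hard winner-take-all decision per bin. This is an illustrative sketch only; the epsilon and function names are assumptions, not the study's actual formulation.

```python
import numpy as np

def time_frequency_masks(mag_a, mag_b, eps=1e-8):
    """Given magnitude spectrograms of two sources, return the soft
    (ratio) and binary (winner-take-all) masks for source A."""
    soft = mag_a / (mag_a + mag_b + eps)    # continuous values in [0, 1]
    binary = (mag_a > mag_b).astype(float)  # hard 0/1 decision per bin
    return soft, binary
```

Applying the soft mask element-wise to the mixture spectrogram keeps a proportional share of each bin's energy, which is one intuition for why soft masks can outperform binary ones on overlapping machine sounds.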
Article
Full-text available
In this paper, a new monaural singing voice separation algorithm is presented. This field of signal processing provides important information in many areas dealing with voice recognition, data retrieval, and singer identification. The proposed approach includes a sparse and low-rank decomposition model using spectrogram of the singing voice signals. The vocal and non-vocal parts of a singing voice signal are investigated as sparse and low-rank components, respectively. An alternating optimization algorithm is applied to decompose the singing voice frames using the sparse representation technique over the vocal and non-vocal dictionaries. Also, a novel voice activity detector is presented based upon the energy of the sparse coefficients to learn atoms related to the non-vocal data in the training step. In the test phase, the learned non-vocal atoms of the music instrumental part are updated according to the non-vocal components captured from the test signal using domain adaptation technique. The proposed dictionary learning process includes two coherence measures: atom–data coherence and mutual coherence to provide a learning procedure with low reconstruction error along with a proper separation in the test step. The simulation results using different measures show that the proposed method leads to significantly better results in comparison with the earlier methods in this context and the traditional procedures.
Article
Full-text available
Automatically recognizing and tracking construction equipment activities is the first step towards performance monitoring of a job site. Recognizing equipment activities helps construction managers to detect equipment downtime/idle time in a real-time framework, estimate the productivity rate of each piece of equipment based on its progress, and efficiently evaluate the cycle time of each activity. Thus, it leads to project cost reduction and time schedule improvement. Previous studies on this topic have been based on single sources of data (e.g., kinematic, audio, video signals) for automated activity-detection purposes. However, relying on only one source of data is not appropriate, as the selected data source may not be applicable under certain conditions and may fail to provide accurate results. To tackle this issue, the authors propose a hybrid system for recognizing multiple activities of construction equipment. The system integrates two major sources of data, audio and kinematic, through a robust data fusion procedure. The presented system includes recording audio and kinematic signals, preprocessing data, extracting several features, as well as dimension reduction, feature fusion, equipment activity classification using Support Vector Machines (SVM), and label smoothing. The proposed system was implemented in several case studies (i.e., ten different equipment types and models operating at various construction job sites) and the results indicate that a hybrid system is capable of providing up to 20% more accurate results compared to cases using individual sources of data.
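A minimal sketch of the feature-level fusion step, assuming each modality arrives as an (n_samples, n_features) matrix: each modality is normalized so that neither dominates, then the two are concatenated before classification. The z-scoring choice and names here are assumptions for illustration, not necessarily the normalization the authors used.

```python
import numpy as np

def fuse_features(audio_feats, kinematic_feats):
    """Feature-level fusion: z-score each modality per feature so
    neither dominates by scale, then concatenate column-wise."""
    def zscore(x):
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    return np.hstack([zscore(audio_feats), zscore(kinematic_feats)])
```

The fused matrix can then be passed to any classifier (here an SVM); optionally a dimension-reduction step such as PCA sits between fusion and classification.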
Article
Full-text available
In this paper, we present an end-to-end approach for environmental sound classification based on a 1D Convolutional Neural Network (CNN) that learns a representation directly from the audio signal. Several convolutional layers are used to capture the signal’s fine time structure and learn diverse filters that are relevant to the classification task. The proposed approach can deal with audio signals of any length, as it splits the signal into overlapped frames using a sliding window. Different architectures considering several input sizes are evaluated, including the initialization of the first convolutional layer with a Gammatone filterbank that models the human auditory filter response in the cochlea. The performance of the proposed end-to-end approach in classifying environmental sounds was assessed on the UrbanSound8k dataset and the experimental results have shown that it achieves a mean accuracy of 89%. Therefore, the proposed approach outperforms most of the state-of-the-art approaches that use handcrafted features or 2D representations as input. Moreover, the proposed approach outperforms all approaches that use the raw audio signal as input to the classifier. Furthermore, the proposed approach has a small number of parameters compared to other architectures found in the literature, which reduces the amount of data required for training.
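The sliding-window step that lets the network accept signals of any length can be sketched as follows; window and hop sizes are illustrative, not the paper's actual settings.

```python
def frame_signal(x, win, hop):
    """Split a 1-D signal into overlapping frames of length `win`,
    advancing `hop` samples each time (only frames fully inside x)."""
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]
```

Each frame is then fed to the 1D CNN independently, and the per-frame predictions are aggregated (e.g., by averaging class probabilities) into a single clip-level label.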
Article
Full-text available
Various activities of construction equipment are associated with distinctive sound patterns (e.g., excavating soil, breaking rocks, etc.). Considering this fact, it is possible to extract useful information about construction operations by recording the audio at a jobsite and then processing this data to determine what activities are being performed. Audio-based analysis of construction operations mainly depends on specific hardware and software settings to achieve satisfactory performance. This paper explores the impacts of these settings on the ultimate performance of the task of interest. To achieve this goal, an audio-based system has been developed to recognize the routine sounds of construction machinery. The next step evaluates three types of microphones (off-the-shelf, contact, and a multichannel microphone array) and two installation settings (microphones placed in machines’ cabins and installed on the jobsite in relative proximity to the machines). Two different jobsite conditions have been considered: (1) jobsites with single machines and (2) jobsites with multiple machines operating simultaneously. In terms of software settings, two different SVM classifiers (RBF and linear kernels) and two common frequency feature extraction techniques (STFT and CWT) were selected. Experimental data from several jobsites was gathered and the results show an accuracy of over 85% for the proposed audio-based recognition system. To better illustrate the practical value of the proposed system, a case study for calculating productivity rates of a sample piece of equipment is presented at the end.
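One of the two frequency feature extraction techniques mentioned above, the STFT, can be sketched in a few lines: window the signal, apply an FFT per frame, and keep the magnitudes. The Hann window, frame length, and hop here are common illustrative defaults, not the paper's actual parameters.

```python
import numpy as np

def stft_magnitudes(x, win=256, hop=128):
    """Magnitude spectrogram via short-time Fourier transform:
    Hann-windowed overlapping frames, one real FFT per frame."""
    frames = np.stack([x[i:i + win] * np.hanning(win)
                       for i in range(0, len(x) - win + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win//2 + 1)
```

The resulting (frames × frequency bins) matrix, or statistics derived from it, is what the SVM classifiers would consume.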
Article
Full-text available
Over the last decade, researchers have explored various technologies and methodologies to enhance worker safety at construction sites. The use of advanced sensing technologies has mainly focused on detecting and warning about safety issues by directly relying on the detection capabilities of these technologies. Until now, very little research has explored methods to quantitatively assess individual workers’ safety performance. To this end, this study uses a tracking system to collect and use individuals’ location data in the proposed safety framework. A computational and analytical procedure/model was developed to quantify the safety performance of individual workers beyond detection and warning. The framework defines parameters for zone-based safety risks and establishes a zone-based safety risk model to quantify potential risks to workers. To demonstrate the safety analysis model, the study conducted field tests at different construction sites, using various interaction scenarios. Probabilistic evaluation showed a slight underestimation and overestimation in certain cases; however, the model represented the overall safety performance of a subject quite well. Test results showed clear evidence of the model’s ability to capture safety conditions of workers in pre-identified hazard zones. The developed approach presents a way to provide visualized and quantified information in the form of a safety index, which has not been available in the industry. In addition, such an automated method may present a suitable safety monitoring method that can eliminate human deployment that is expensive, error-prone, and time-consuming.
Conference Paper
Full-text available
A large portion of the expenses in a construction project is allocated to the capital and operating costs of heavy equipment. Most construction heavy equipment and tools carry out activities in the form of repetitive cycles (e.g., a cycle of digging, swinging, loading). Precisely estimating cycle times for those operations is a crucial step toward productivity analysis, cost estimation, and scheduling of a construction project. The traditional approaches for estimating cycle times of cyclic construction activities are twofold: (1) direct observations and recordings; and (2) available graphs and approximate formulas for estimation. The first approach is time-consuming and labor-intensive, and the second might not be sufficiently accurate and realistic. To tackle the above-mentioned issues, this paper proposes an automated, Bayesian system for estimating cycle times of construction heavy equipment. Considering that construction equipment usually produces distinct acoustic patterns while performing various tasks, the main input for the system is recorded audio data. The presented system includes a de-noising algorithm for enhancing the quality of audio data as well as a short-time Fourier transform (STFT) and support vector machines (SVM) for classifying various activities in a primary stage. A Markov chain model for activity transitions is calculated from ground truth data and used to code an adaptive filter that converts SVM-labeled time-frequency bins into higher-level labels of the full period for each activity. Preliminary results show that, through this system, the accuracy of predicting cycle times could be as high as 90%.
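The idea of using an activity-transition Markov chain to turn noisy per-frame classifier labels into coherent activity periods can be illustrated with a standard Viterbi decoder. This is a generic sketch of the concept, not the authors' adaptive filter; the transition matrix and probabilities below are invented for illustration.

```python
import numpy as np

def smooth_labels(emissions, trans, init):
    """Viterbi decoding: most likely state path given per-frame class
    probabilities (`emissions`, shape (T, S)) and an activity-transition
    matrix (`trans`, shape (S, S), rows = from-state)."""
    T, S = emissions.shape
    logp = np.log(init) + np.log(emissions[0])
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logp[:, None] + np.log(trans)   # (from, to)
        back[t] = scores.argmax(axis=0)          # best predecessor
        logp = scores.max(axis=0) + np.log(emissions[t])
    path = [int(logp.argmax())]                  # backtrack best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With "sticky" transition probabilities (high self-transition), a single frame whose classifier output mildly favors the wrong activity gets overridden by its context, which is exactly the smoothing effect the system relies on before measuring cycle times.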
Conference Paper
Full-text available
Falls from heights are among the most lethal incidents in the construction industry. To mitigate the risk of fall hazards, safety managers need to continuously monitor jobsite conditions to identify potentially hazardous situations. While previous studies have developed algorithms to automatically analyze a building to detect fall hazards, their application is limited because real-time data cannot be collected amid the dynamic nature of construction processes. One of the emerging technologies that can address this limitation is the unmanned aerial system (UAS). UASs can provide several advantages for safety managers, as they can move faster than humans, reach inaccessible areas of jobsites, and can be equipped with video cameras, wireless sensors, radars, or different communication hardware to transfer real-time data. This research study aims to provide a proof of concept of the potential application of UASs in developing an automated aerial system to collect, identify, and assess fall hazards in construction projects. The objective of the study is achieved by collecting a real-time video feed of the construction jobsite using UASs, generating point cloud data using image/videogrammetric techniques, and developing an algorithm to process spatial point cloud data to identify fall hazards. The algorithm ultimately searches the point cloud data to detect the locations of current guardrails and openings and then checks whether they are safety-approved. This paper proposes a workflow for identifying fall hazards using unmanned aerial systems and later presents how some parts of the workflow were implemented and tested in a pilot study.
Article
Full-text available
Purpose: Sound pressure level (SPL) measurement of voice and speech is often considered a trivial matter, but the measured levels are often reported incorrectly or incompletely, making them difficult to compare among various studies. This article aims at explaining the fundamental principles behind these measurements and providing guidelines to improve their accuracy and reproducibility.
Method: Basic information is put together from standards, technical, voice and speech literature, and practical experience of the authors and is explained for nontechnical readers.
Results: Variation of SPL with distance, sound level meters and their accuracy, frequency and time weightings, and background noise topics are reviewed. Several calibration procedures for SPL measurements are described for stand-mounted and head-mounted microphones.
Conclusions: SPL of voice and speech should be reported together with the mouth-to-microphone distance so that the levels can be related to vocal power. Sound level measurement settings (i.e., frequency weighting and time weighting/averaging) should always be specified. Classified sound level meters should be used to assure measurement accuracy. Head-mounted microphones placed in the proximity of the mouth improve signal-to-noise ratio and can be taken advantage of for voice SPL measurements when calibrated. Background noise levels should be reported besides the sound levels of voice and speech.
Conference Paper
Full-text available
The term Deep Learning or Deep Neural Network refers to Artificial Neural Networks (ANN) with multiple layers. Over the last few decades, it has been considered one of the most powerful tools and has become very popular in the literature, as it is able to handle huge amounts of data. The interest in deeper hidden layers has recently begun to surpass the performance of classical methods in different fields, especially pattern recognition. One of the most popular deep neural networks is the Convolutional Neural Network (CNN), which takes its name from the mathematical linear operation between matrices called convolution. A CNN has multiple layers, including a convolutional layer, a non-linearity layer, a pooling layer, and a fully-connected layer. The convolutional and fully-connected layers have parameters, whereas the pooling and non-linearity layers do not. The CNN performs excellently in machine learning problems, especially in applications that deal with image data, such as the largest image classification dataset (ImageNet), computer vision, and natural language processing (NLP), where the results achieved were remarkable. In this paper we explain and define all the elements and important issues related to CNNs, and how these elements work. In addition, we also state the parameters that affect CNN efficiency. This paper assumes that the readers have adequate knowledge of both machine learning and artificial neural networks.
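The layer sequence described above (convolution, non-linearity, pooling) can be shown in a toy 1-D numpy sketch; this is a minimal illustration of the three operations, not a trainable network, and all names and values are assumptions.

```python
import numpy as np

def conv_relu_pool(x, kernel):
    """One CNN stage on a 1-D signal: convolutional layer,
    non-linearity layer (ReLU), then max pooling over pairs."""
    # convolutional layer: 'valid' cross-correlation (np.convolve
    # flips its kernel, so we pre-flip to get cross-correlation)
    y = np.convolve(x, kernel[::-1], mode="valid")
    y = np.maximum(y, 0)                      # non-linearity (ReLU)
    n = len(y) // 2 * 2                       # drop odd trailing sample
    return y[:n].reshape(-1, 2).max(axis=1)   # pooling layer (max, size 2)
```

Note that only the kernel would carry learnable parameters; the ReLU and pooling steps are parameter-free, matching the distinction drawn in the abstract.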
Conference Paper
Full-text available
The efficient utilization of construction equipment plays a significant role in engineering and building operations, since utilization information is instrumental for construction equipment management and project cost control. Therefore, it is necessary to monitor the utilization rate of onsite equipment in order to maintain the efficiency of construction operations. In current practice, the monitoring of the utilization rate of onsite construction equipment mainly relies on manual observations by field engineers, which is time-consuming and labor-intensive. The main objective of this paper is to present a vision-based method to automate the monitoring of the utilization rate of onsite construction equipment. Under this method, the equipment of interest is first localized and tracked in video frames, and its positions in the frames are extracted. Then, the location information of the equipment is analyzed in comparison with the location information of work zones at the jobsite. In this way, the utilization rate of construction equipment in each work zone can be measured. Accordingly, construction professionals can conduct further analysis in terms of productivity analysis and equipment rental expense calculation. The method has been tested on the construction site of Poste De Lorimier in Canada. The test results demonstrated the capability and effectiveness of the method to monitor the utilization rate of construction equipment in an automatic manner.
Article
Full-text available
Underground pipelines suffer severe external breakage caused by excavation devices due to arbitrary road excavation. Acoustic signal-based recognition has recently shown effectiveness in underground pipeline network surveillance. However, merely relying on recognition may lead to a high false alarm rate. The reason is that underground pipelines are generally paved along a fixed direction, and excavations outside the region also trigger the surveillance system. To enhance the reliability of the surveillance system, direction-of-arrival (DOA) estimation of target sources is combined with the recognition algorithm to reduce false detections in this paper. Two hybrid recognition algorithms are developed. The first employs an extreme learning machine (ELM) for acoustic recognition followed by a focusing matrix-based multiple signal classification algorithm (ELM-MUSIC) for DOA estimation. The second introduces a decision matrix (DM) to characterize the statistical distribution of results obtained by ELM-MUSIC. Real acoustic signals collected by a cross-layer sensor array are used for performance comparison. Four representative excavation devices working at a metro construction site are used to generate the signals. Multiple experimental scenarios are designed. Comparisons show that the proposed ELM-MUSIC and DM algorithms outperform the conventional focusing matrix-based MUSIC (F-MUSIC). In addition, the improved DM method is capable of localizing multiple devices working in sequence. In summary, two hybrid acoustic signal recognition and source direction estimation algorithms are developed for excavation device classification, and the novel recognition-plus-DOA-estimation scheme can work efficiently for underground pipeline network protection in real-world complex environments.
Article
In this paper, we propose a Deep Belief Network (DBN) based approach for the classification of audio signals to improve work activity identification and remote surveillance of construction projects. The aim of the work is to obtain an accurate and flexible tool for consistently executing and managing the unmanned monitoring of construction sites by using distributed acoustic sensors. In this paper, ten classes of construction equipment and tools, frequently and broadly used on construction sites, have been collected and examined to conduct and validate the proposed approach. The input provided to the DBN consists of the concatenation of several statistics evaluated over a set of spectral features, such as MFCCs and the mel-scaled spectrogram. The proposed architecture, along with the preprocessing and feature extraction steps, is described in detail, while the effectiveness of the proposed idea is demonstrated by numerical results evaluated on real-world recordings. The final overall accuracy on the test set is up to 98%, a significantly improved performance compared to other state-of-the-art approaches. A practical, real-time application of the presented method is also proposed in order to apply the classification scheme to sound data recorded in different environmental scenarios.
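The "concatenation of several statistics over spectral features" step can be sketched as follows; the particular statistics chosen here (mean, standard deviation, min, max) are illustrative assumptions, and the paper's actual statistic set may differ.

```python
import numpy as np

def statistics_vector(spectral_feats):
    """Collapse a (n_frames, n_coeffs) matrix of spectral features
    (e.g., MFCCs per frame) into one fixed-length vector by
    concatenating per-coefficient statistics across frames."""
    return np.concatenate([spectral_feats.mean(axis=0),
                           spectral_feats.std(axis=0),
                           spectral_feats.min(axis=0),
                           spectral_feats.max(axis=0)])
```

The fixed length of this vector is what makes it usable as the input layer of a DBN regardless of how long each recording is.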
Article
In recent years, computer vision algorithms have been shown to effectively leverage visual data from jobsites for video-based activity analysis of construction equipment. However, earthmoving operations are restricted to site work and the surrounding terrain, and the presence of other structures, particularly in urban areas, limits the number of viewpoints from which operations can be recorded. These considerations lower the degree of intra-activity and inter-activity category variability to which such algorithms are exposed, hindering their potential for generalizing effectively to new jobsites. Secondly, training computer vision algorithms is also typically reliant on large quantities of hand-annotated ground truth. These annotations are burdensome to obtain and can offset the cost-effectiveness gained from automating activity analysis. The main contribution of this paper is a means of inexpensively generating synthetic data to improve the capabilities of vision-based activity analysis methods based on virtual, kinematically articulated three-dimensional (3D) models of construction equipment. The authors introduce an automated synthetic data generation method that outputs a two-dimensional (2D) pose corresponding to simulated excavator operations that vary according to camera position with respect to the excavator and activity length and behavior. The presented method is validated by training a deep learning-based method on the synthesized 2D pose sequences and testing on pose sequences corresponding to real-world excavator operations, achieving 75% precision and 71% recall. This exceeds the 66% precision and 65% recall obtained when training and testing the deep learning-based method on the real-world data via cross-validation. Limited access to reliable amounts of real-world data incentivizes using synthetically generated data for training vision-based activity analysis algorithms.
Article
Improving the performance of online speech and music separation is a hard problem, and separation optimization increases the complexity of Robust Principal Component Analysis (RPCA) methods, which are time-consuming for large matrix computations. This paper presents an RPCA-based speech and music separation method that reduces computational complexity and is robust to artificial noise, by proposing two novel algorithms. The key idea of our real-time method is a novel randomized singular value decomposition algorithm in a non-convex optimization setting that significantly decreases the complexity of previous RPCA methods from min(mn2,m2n) flops to mnr flops, where r≪min(m,n), to obtain better performance and qualified results. Experimental results on different datasets, compared with the best state-of-the-art method, show that the proposed method is more reliable; it achieves an average 339% speedup through the significant reduction of computational complexity, increases the quality of the speech signal by 295%, improves the quality of the music signal by 244%, and improves robustness to artificial noise without needing any learning technique or particular features.
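The complexity claim above rests on the standard randomized SVD idea: project the matrix onto a random low-dimensional subspace, so the expensive decomposition runs on a much smaller matrix (roughly mnr work instead of a full SVD). The sketch below is the generic textbook construction, not the authors' specific non-convex algorithm; the oversampling parameter and names are assumptions.

```python
import numpy as np

def randomized_svd(A, rank, oversample=5, seed=0):
    """Approximate rank-r SVD of A via random projection."""
    rng = np.random.default_rng(seed)
    # sample the range of A with a random Gaussian test matrix
    omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ omega)          # orthonormal range basis
    # exact SVD of the small projected matrix Q^T A
    U_b, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_b)[:, :rank], s[:rank], Vt[:rank]
```

When A is exactly (or nearly) low-rank, as the non-vocal accompaniment spectrogram is assumed to be in RPCA-style separation, the projection loses essentially nothing while cutting the cost dramatically.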
Article
Emerging vision-based frameworks have demonstrated the great potential to robustly perform volumetric measurements on point cloud models, which has several applications for site material management (e.g., during earthworks). However, prevalent vision-based frameworks to date involve human interventions to manually trim objects of interest from point cloud models, which would be time-consuming and labor-intensive. In addition, point cloud models for volumetric measurements are often incomplete and noisy. To address such challenges, we automatically detect and segment target objects in point cloud models via a deep learning-based approach and then map the semantic values onto point cloud models for 3D semantic segmentation. Once target objects are segmented, the associated volumes are quantified through the proposed vision-based computational process. For evaluation, case studies were performed on material piles in the real-world. The proposed method has the potential to enhance vision-based volumetric measurements, which supports systematic decision-making for material management in jobsites.
Article
The sounds of work activities and equipment operations at a construction site provide critical information regarding construction progress, task performance, and safety issues. The construction industry, however, has not investigated the value of sound data and their applications, which would offer an advanced approach to unmanned management and remote monitoring of construction processes and activities. For analyzing sounds emanating from construction work activities and equipment operations, which generally have complex characteristics that entail overlapping construction and environmental noise, a highly accurate sound classifier is imperative for data analysis. To establish a robust foundation for sound recognition, analysis, and monitoring frameworks, this research study examines diverse classifiers and selects those that accurately identify construction sounds. Employing nine types of sounds from about 100 sound recordings originating from construction work activities, we assess the accuracy of 17 classifiers and find that sounds can be classified with 93.16% accuracy. A comparison with deep learning techniques has also been provided, obtaining results similar to the best of the traditional machine learning methods. The outcomes of this study are expected to help enhance advanced processes for audio-based construction monitoring and safety surveillance by providing appropriate classifiers for construction sound data analyses.
Article
Modular construction is an attractive building method due to its advantages over traditional stick-built methods in terms of reduced waste and construction time, more control over resources and environment, and easier implementation of novel techniques and technologies in a controlled factory setting. However, efficient and timely decision-making in modular factories requires spatiotemporal information about the resources regarding their locations and activities, which motivates the necessity for an automated activity identification framework. Thus, this paper utilizes sound, a ubiquitous data source present in every modular construction factory, for the automatic identification of commonly performed manual activities such as hammering, nailing, sawing, etc. To develop a robust activity identification model, it is imperative to engineer the appropriate features of the data source (i.e., traits of the signal) that provide a compact yet descriptive representation of the parameterized audio signal based on the nature of the sound, which is very dependent on the application domain. In-depth analysis regarding appropriate feature selection and engineering for audio-based activity identification in construction is missing from current research. Thus, this research extensively investigates the effects of various features extracted from four different domains related to audio signals (time-, time-frequency-, cepstral-, and wavelet-domains) on the overall performance of the activity identification model. The effect of these features on activity identification performance was tested by collecting and analyzing audio data generated from manual activities at a modular construction factory. The collected audio signals were first balanced using time-series data augmentation techniques and then used to extract a 318-dimensional feature vector containing 18 different feature sets from the abovementioned four domains.
Several sensitivity analyses were performed to optimize the feature space using a feature ranking technique (i.e., the Relief algorithm) and to evaluate the contribution of features in the top feature sets using a support vector machine (SVM). Eventually, a final feature space was designed containing a 130-dimensional feature vector and a 0.5-second window size, yielding about 97% F1 score for identifying different activities. The contributions of this study are two-fold: 1. A novel means of automated manual construction activity identification using audio signals is presented; and 2. Foundational knowledge on the selection and optimization of the feature space from four domains is provided for future work in this research field. The result of this study demonstrates the potential of the proposed system to be applied for automated monitoring and data collection in modular construction factories in conjunction with other activity recognition frameworks based on computer vision (CV) and/or inertial measurement units (IMU).
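The windowing and time-domain feature extraction described above can be illustrated with a small NumPy sketch (only three toy features are shown, not the paper's 318-dimensional, four-domain vector; the 0.5-second window follows the abstract):

```python
import numpy as np

def frame_signal(x, sr, win_s=0.5):
    """Split an audio signal into non-overlapping 0.5 s windows."""
    n = int(sr * win_s)
    n_frames = len(x) // n
    return x[: n_frames * n].reshape(n_frames, n)

def time_domain_features(frames):
    """A few illustrative time-domain features per window."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))                            # energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)    # zero-crossing rate
    peak = np.max(np.abs(frames), axis=1)                                  # peak amplitude
    return np.column_stack([rms, zcr, peak])

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)            # 1 s of a 440 Hz tone as a stand-in signal
F = time_domain_features(frame_signal(x, sr))
```

Each row of `F` is the feature vector for one window; a real pipeline would append frequency-, cepstral-, and wavelet-domain features the same way before feature ranking.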
Article
Equipment and workers are two important resources in the construction industry. Performance monitoring of these resources would help project managers improve the productivity rates of construction jobsites and discover potential performance issues. A typical construction workface monitoring system consists of four major levels: location tracking, activity recognition, activity tracking, and performance monitoring. These levels are employed to evaluate work sequences over time and also assess the workers’ and equipment’s well-being and abnormal edge cases. Results of an automated performance monitoring system could be used to employ preventive measures to minimize operating/repair costs and downtimes. The authors of this paper have studied the feasibility of implementing a wide range of technologies and computational techniques for automated activity recognition and tracking of construction equipment and workers. This paper provides a comprehensive review of these methods and techniques as well as describes their advantages, practical value, and limitations. Additionally, a multifaceted comparison between these methods is presented, and potential knowledge gaps and future research directions are discussed.
Article
Construction activities have become a common cause of underground pipeline accidents. Detecting construction activities near pipelines based on acoustic signals can effectively reduce pipeline damage. Within a "data collection, feature extraction, classifier training, monitoring scheme" framework, this paper presents a construction sound monitoring system to prevent underground pipeline damage caused by construction. The construction sounds of electric hammers, road cutters and excavator breaking hammers are collected, as well as environmental noise. A double-layer identification scheme consisting of two random forest-based classifiers makes up the core of the system. Classifier-1 is first established to detect suspicious sounds. Using the suspected segments cut out by classifier-1 as inputs, classifier-2 is then trained to correct the results of classifier-1 and further identify the types of construction activities. In testing, 95.59% of the construction sounds near pipelines are detected, while only 1.12% of environmental noise causes false positives.
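The double-layer scheme can be sketched with dependency-free stand-ins (the paper trains two random forests; the threshold detector and nearest-centroid classifier below are illustrative substitutes):

```python
import numpy as np

def classifier_1(energy, threshold=0.1):
    """Stage 1 stand-in: flag windows whose energy exceeds an illustrative
    threshold as suspicious (the paper uses a trained random forest here)."""
    return energy > threshold

def classifier_2(features):
    """Stage 2 stand-in: nearest-centroid over illustrative per-class values,
    applied only to the segments stage 1 flagged."""
    centroids = {"electric_hammer": 0.8, "road_cutter": 0.5, "breaking_hammer": 0.3}
    names = list(centroids)
    vals = np.array([centroids[k] for k in names])
    return [names[np.argmin(np.abs(vals - f))] for f in features]

energy = np.array([0.02, 0.75, 0.48, 0.05, 0.31])   # toy per-window energies
suspicious = classifier_1(energy)                   # stage 1: detect
labels = classifier_2(energy[suspicious])           # stage 2: identify activity type
```

The key design point survives the simplification: stage 2 only ever sees segments that stage 1 flagged, which is what lets it both refine stage-1 decisions and keep the environmental-noise false-positive rate low.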
Article
As the construction industry experiences a high rate of casualties and significant economic loss associated with accidents, safety has always been a primary concern. In response, several studies have attempted to develop new approaches and state-of-the-art technology for conducting autonomous safety surveillance of construction work zones, such as vision-based monitoring. The current and proposed methods, including human inspection, however, are limited in their ability to provide consistent, real-time monitoring and rapid recognition of construction safety events. In addition, the health and safety risks inherent in construction projects make it challenging for construction workers to be aware of possible safety risks and hazards according to daily planned work activities. To address the industry's urgent demand to improve worker safety, this study develops an audio-based event detection system that alerts laborers to daily safety issues through the rapid identification of construction accidents. As an evidence-driven approach, the proposed framework incorporates occupational injury and illness manual data, consisting of historical construction accident data classified by types of sources and events, into an audio-based safety event detection framework. This evidence-driven framework, integrated with a daily project schedule, can automatically provide construction workers with prenotifications regarding safety hazards at a pertinent work zone as well as consistently contribute to enhanced construction safety monitoring by audio-based event detection. By using a machine learning algorithm, the framework can clearly categorize the narrowed-down sound training data according to a daily project schedule and dynamically restrict sound classification types in advance.
The proposed framework is expected to contribute to an emerging knowledge base for integrating an automated safety surveillance system into occupational accident data, significantly improving the accuracy of audio-based event detection.
Article
Construction sites are among the most hazardous places with various safety issues. The high rate of hazards on construction sites can be attributed to the dynamic and complex characteristics of construction-related entities, such as the movement of construction equipment and workers as well as the interactions among them. Tracking construction equipment and workers can help avoid potential collisions and other accidents to achieve safer on-site conditions. As construction equipment (e.g. excavators, trucks, cranes, and bulldozers) plays a significant role in construction projects, it is important to track the location, pose and movement of construction equipment. Currently, with the wide installation of surveillance cameras on construction sites, computer vision techniques are explored to process the captured surveillance videos and images, such as to monitor the site conditions and prevent potential hazards. Previous studies have attempted to identify and locate different types of construction equipment on construction sites based on surveillance videos using computer vision techniques. However, there are limited studies that automatically estimate the full body pose and movement of on-site construction equipment, which can greatly influence the safety condition of construction sites and the utilization of the equipment itself. In this study, a methodology framework is developed for automatically estimating the poses of different construction equipment in videos captured on construction sites using computer vision and deep learning techniques. Firstly, keypoints of equipment are defined, based on which the images collected from the surveillance cameras are annotated to generate the ground truth labels. 70%, 10%, and 20% of the annotated image dataset are used for training, validation and testing, respectively. Then, the architectures of three types of deep learning networks, i.e.
Stacked Hourglass Network (HG), Cascaded Pyramid Network (CPN), and an ensemble model (HG-CPN) integrating Stacked Hourglass and Cascaded Pyramid Network, are constructed and trained in the same training environment. After training, the three models are evaluated on the testing dataset in terms of normalized errors (NE), percentage of correct keypoints (PCK), area under the curve (AUC), detection speed, and training time. The experiment results demonstrate the promising performance of the proposed methodology framework for automatically estimating different full body poses of construction equipment with high accuracy and fast speed. Experiments indicate that both HG and CPN can achieve relatively high accuracy, with PCK values of 91.19% and 91.78% respectively for estimating equipment full body poses. In addition, the ensemble model with online data augmentation can further improve the accuracy, achieving a NE of 14.57 × 10⁻³, a PCK of 93.43%, and an AUC of 39.72 × 10⁻³ at a detection speed of 125 milliseconds (ms) per image. This study lays the foundation for applying computer vision and deep learning techniques to full body pose estimation of construction equipment, which can contribute to real-time safety monitoring on construction sites.
Article
A novel differential received signal strength (RSS) positioning algorithm is proposed in this paper. Unlike traditional methods that model the relationship between RSS and distance, this new positioning approach is based on linear regression of angle and differential received signal strengths. The advantage of this positioning algorithm is that it is robust to the heterogeneity of RFID tags as well as to the direction between tag and reader. Several experiments were first carried out in an open environment, and a further experiment was then conducted in an LNG training centre to validate the proposed algorithm. The results show that the proposed algorithm achieves better accuracy than existing RFID positioning approaches for equipment tracking.
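A hedged sketch of the idea, assuming the differential RSS varies approximately linearly with bearing angle (the paper's actual calibration model may differ): fit the linear relation on calibration data, then invert it to estimate the angle from a new reading.

```python
import numpy as np

# Synthetic calibration sweep: assumed linear angle-vs-differential-RSS relation
# with measurement noise. Slope/noise values are illustrative, not from the paper.
rng = np.random.default_rng(1)
angles = np.linspace(-60, 60, 25)                              # degrees
diff_rss = 0.12 * angles + rng.normal(0, 0.05, angles.size)    # dB difference

slope, intercept = np.polyfit(angles, diff_rss, 1)  # calibrate the linear model

def estimate_angle(measured_diff):
    """Invert the calibrated model: differential RSS reading -> tag bearing."""
    return (measured_diff - intercept) / slope

est = estimate_angle(0.12 * 30.0)  # a reading corresponding to roughly 30 degrees
```

Because the model regresses on the *difference* of two RSS readings, tag-to-tag offsets in absolute signal strength cancel, which is the robustness-to-heterogeneity property the abstract highlights.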
Article
Automated, real-time, and reliable equipment activity recognition on construction sites can help to minimize idle time, improve operational efficiency, and reduce emissions. Previous efforts in activity recognition of construction equipment have explored different classification algorithms applied to data from accelerometers and gyroscopes. These studies utilized pattern recognition approaches such as statistical models (e.g., hidden Markov models), shallow neural networks (e.g., artificial neural networks), and distance algorithms (e.g., k-nearest neighbors) to classify the time-series data collected from sensors mounted on the equipment. Such methods necessitate the segmentation of continuous operational data with fixed or dynamic windows to extract statistical features. This heuristic and manual feature extraction process is limited by human knowledge and can only extract human-specified shallow features. However, recent developments in deep neural networks, specifically recurrent neural networks (RNNs), present new opportunities to classify sequential time-series data with recurrent lateral connections. An RNN can automatically learn high-level representative features through the network instead of relying on manual design, making it more suitable for complex activity recognition. However, the application of RNNs requires a large training dataset, which poses a practical challenge to obtain from real construction sites. Thus, this study presents a data-augmentation framework for generating synthetic time-series training data for an RNN-based deep learning network to accurately and reliably recognize equipment activities. The proposed methodology is validated by generating synthetic data from sample datasets that were collected from two earthmoving operations in the real world. The synthetic data, along with the collected data, were used to train a long short-term memory (LSTM)-based RNN.
The trained model was evaluated by comparing its performance with classification algorithms traditionally used for construction equipment activity recognition. The deep learning framework presented in this study outperformed the traditional machine learning classification algorithms in terms of model accuracy and generalization.
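The data-augmentation step can be illustrated with two common time-series transforms, jittering and magnitude scaling (the paper's exact augmentation techniques and parameters are not specified here; the values below are assumptions):

```python
import numpy as np

def augment(window, rng, sigma=0.05, scale_range=(0.9, 1.1)):
    """Generate one synthetic training window from a real one by jittering
    (additive Gaussian noise) and random magnitude scaling. Parameter values
    are illustrative, not the paper's settings."""
    noise = rng.normal(0.0, sigma, size=window.shape)
    scale = rng.uniform(*scale_range)
    return scale * window + noise

rng = np.random.default_rng(42)
real = np.sin(np.linspace(0, 4 * np.pi, 200))        # one "collected" sensor window
synthetic = [augment(real, rng) for _ in range(10)]  # 10 synthetic variants
```

Each synthetic window keeps the temporal shape of the real signal (so its activity label carries over) while varying amplitude and noise, which is what lets a data-hungry LSTM train on a small field-collected dataset.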
Conference Paper
Automated monitoring of construction operations, especially operations of equipment and machines, is an essential step toward cost estimating and planning of construction projects. In recent years, a number of methods have been suggested for recognizing activities of construction equipment. These methods are based on processing single types of data (audio, visual, or kinematic data). Considering the complexity of construction jobsites, using one source of data is not reliable enough to cover all conditions and scenarios. To address this issue, we utilized a data fusion approach based on collecting audio and kinematic data, which includes the following steps: 1) recording audio and kinematic data generated by machines, 2) preprocessing the data, 3) extracting time- and frequency-domain features, 4) feature fusion, and 5) categorizing activities using a machine learning algorithm. The proposed approach was implemented on multiple machines, and the experiments show that it is possible to obtain up to 25% more accurate results compared to cases using single data sources.
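Step 4, feature fusion, is commonly implemented as early fusion: per-source feature vectors are concatenated before classification. A minimal sketch under that assumption (the feature set shown is illustrative, not the authors'):

```python
import numpy as np

def extract_features(sig):
    """A tiny illustrative time/frequency feature set for one windowed signal:
    mean, standard deviation, dominant frequency bin, and its magnitude."""
    spec = np.abs(np.fft.rfft(sig))
    return np.array([sig.mean(), sig.std(), float(spec.argmax()), spec.max()])

def fuse(audio_win, kinematic_win):
    """Early fusion: concatenate per-source feature vectors into one input
    for the downstream classifier."""
    return np.concatenate([extract_features(audio_win), extract_features(kinematic_win)])

rng = np.random.default_rng(0)
audio = rng.standard_normal(256)   # stand-in audio window
kin = rng.standard_normal(256)     # stand-in accelerometer window
fused = fuse(audio, kin)
```

The fused vector then feeds a single classifier, so the model can exploit correlations between the audio and kinematic channels rather than voting over separate per-source classifiers.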
Article
Although video surveillance systems have shown potential for analyzing jobsite contexts, the necessity of a complex multi-camera surveillance system and workers' privacy issues remain substantive hurdles to adopting such systems in practice. To address these issues, this study presents a non-intrusive earthmoving productivity analysis method using imaging and simulation. The site access log of dump trucks is used to infer earthmoving contexts; the log is produced by analyzing videos recorded at the entrance and the exit of a construction site. An algorithm for license plate detection and recognition in an uncontrolled environment is developed to automatically produce the site access log, by leveraging video deinterlacing, a deep convolutional network, and rule-based post-processing. The experimental results show the effectiveness of the proposed method for producing the site access log. Based on the site access log, simulation-based productivity analysis is conducted to produce a daily productivity report, which can provide the basis for earthmoving resource planning. It is expected that the resulting daily productivity report promotes data-driven decision-making for earthmoving resource allocation, thereby improving the potential for saving cost and time for earthworks with an updated resource allocation plan.
Article
In recent years, emerging mobile devices and camera-equipped platforms have offered great convenience for visually capturing and constantly documenting the as-is status of construction sites. In this regard, visual data are regularly collected in the form of numerous photos or lengthy videos. However, the massive amounts of visual data being collected from jobsites (e.g., data collection on a daily or weekly basis by Unmanned Aerial Vehicles, UAVs) have made visual data overload an inevitable problem. To address this data overload issue in the construction domain, this paper proposes a new method to automatically retrieve photo-worthy frames containing construction-related contents that are scattered in collected video footage or consecutive images. In the proposed method, the presence of objects of interest (i.e., construction-related contents) in given image frames is recognized by semantic segmentation, and then scores of the image frames are computed based on the spatial composition of the identified objects. To improve the filtering performance, high-score image frames are further analyzed to estimate their likelihood of being intentionally taken. Case studies at two construction sites have revealed that the accuracy of the proposed method is close to human judgment in filtering visual data to retrieve photo-worthy image frames containing construction-related contents. The performance metrics demonstrate around 91% accuracy in the semantic segmentation, and we observed enhanced human-like judgment in filtering construction visual data compared to prior works. It is expected that the proposed automated method enables practitioners to assess the as-is status of construction sites efficiently through selective visual data, thereby facilitating data-driven decision making at the right time.
Article
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude component of the spectrogram calculated on each audio channel as the feature, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular array format datasets with different overlapping sound events in anechoic, reverberant and real-life scenarios. The proposed method is compared with two SED, three DOA estimation, and one SELD baselines. The results show that the proposed method is generic and applicable to any array structures, robust to unseen DOA values, reverberation, and low SNR scenarios. The proposed method achieved a consistently higher recall of the estimated number of DOAs across datasets in comparison to the best baseline. Additionally, this recall was observed to be significantly better than the best baseline method for a higher number of overlapping sound events.
Conference Paper
Sound recognition technology, which has been adopted in diverse disciplines, has not received much attention in the construction industry. Since each work and operation activity on a construction site generates a distinct sound, identifying these sounds provides imperative information regarding work processes, task performance, and safety-relevant issues. Thus, the accurate analysis of construction sound data is vital for construction project participants to monitor project procedures, make data-driven decisions, and evaluate task productivity. To accomplish this objective, this paper investigates sound recognition technology for construction activity identification and task performance analyses. For sound identification, Mel-frequency cepstral coefficients are extracted as the features of six types of sound data. In addition, a supervised machine learning algorithm, the hidden Markov model (HMM), is used to perform sound classification. The research findings show that the maximum classification accuracy is 94.3%, achieved by a 3-state HMM. With this accuracy, the adopted technique is expected to perform construction sound recognition reliably, which can significantly enhance construction monitoring, performance evaluation, and safety surveillance approaches.
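The HMM scoring step can be sketched with the scaled forward algorithm: each activity gets its own model, and a clip is assigned to the model with the highest log-likelihood. The tiny two-state, hand-set models below are illustrative only (the paper trains 3-state HMMs on MFCC features):

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the numerically stable scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    ll = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # predict, then weight by emission
        c = alpha.sum()
        ll += np.log(c)
        alpha = alpha / c
    return ll

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])           # sticky state transitions
B_hammer = np.array([[0.9, 0.1], [0.8, 0.2]])    # "hammering" emits mostly symbol 0
B_saw = np.array([[0.2, 0.8], [0.1, 0.9]])       # "sawing" emits mostly symbol 1

clip = [0, 0, 1, 0, 0]                           # quantized features of one clip
scores = {"hammering": forward_loglik(clip, pi, A, B_hammer),
          "sawing": forward_loglik(clip, pi, A, B_saw)}
best = max(scores, key=scores.get)
```

Classification is then just an argmax over per-class likelihoods, which is exactly how a bank of per-activity HMMs turns into a sound classifier.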
Article
Computer vision approaches have been widely used to automatically recognize the activities of workers from videos. While considerable advancements have been made to capture complementary information from still frames, it remains a challenge to obtain motion between them. As a result, this has hindered the ability to conduct real-time monitoring. Considering this challenge, an improved convolutional neural network (CNN) that integrates Red-Green-Blue (RGB), optical flow, and gray stream CNNs, is proposed to accurately monitor and automatically assess workers’ activities associated with installing reinforcement during construction. A database containing photographs of workers installing reinforcement is created from activities undertaken on several construction projects in Wuhan, China. The database is then used to train and test the developed CNN network. Results demonstrate that the developed method can accurately detect the activities of workers. The developed computer vision-based approach can be used by construction managers as a mechanism to assist them to ensure that projects meet pre-determined deliverables.
Article
This paper proposes a deep learning method for intra prediction. Different from traditional methods utilizing some fixed rules, we propose using a fully connected network to learn an end-to-end mapping from neighboring reconstructed pixels to the current block. In the proposed method, the network is fed by multiple reference lines. Compared with traditional single line-based methods, more contextual information of the current block is utilized. For this reason, the proposed network has the potential to generate better prediction. In addition, the proposed network has good generalization ability on different bitrate settings. The model trained from a specified bitrate setting also works well on other bitrate settings. Experimental results demonstrate the effectiveness of the proposed method. When compared with HEVC reference software HM-16.9, our network can achieve an average of 3.4% bitrate saving. In particular, the average result of 4K sequences is 4.5% bitrate saving, where the maximum one is 7.4%.
Article
Analyzing and measuring construction equipment operation are key tasks for managing construction projects. In monitoring construction equipment operation, the cycle-time provides fundamental information. Traditional cycle-time measurement methods have been limited by requiring significant efforts such as additional observers, time, and cost. Thus, this study investigates the feasibility of measuring cycle times by using inertial measurement units (IMUs) embedded in a smartphone. Because the mixed activities of construction equipment involve simultaneous actions of multiple parts, they cause low accuracy in equipment activity classification and cycle-time measurement. To enhance the recognition of these mixed activities and translate the results into reliable cycle time measurements, a dynamic time warping (DTW) algorithm was applied and the DTW distances of IMU signals were used as additional features in activity classification. To test its feasibility, data was collected on-site and the excavator's operation was recorded via IMUs embedded in a smartphone attached to a cabin. Using DTW, the suggested method achieved 91.83% accuracy for cycle-time measurement. This result demonstrates an opportunity to use operators' prevalent mobile devices to measure and report their equipment's cycle times in a cost-effective and continuous manner.
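The DTW distance used as an additional classification feature can be computed with the classic dynamic-programming recursion (a minimal 1-D sketch; the study applies such distances to IMU signal channels against reference templates):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D signals: minimum
    cumulative |a_i - b_j| cost over all monotone alignments."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 60)
template = np.sin(t)                                   # reference activity template
same_shifted = np.sin(np.linspace(0, 2 * np.pi, 80))   # same shape, different speed
different = np.cos(t)                                  # different shape
```

Because DTW warps the time axis, a signal performed at a different speed stays close to its template while a genuinely different motion does not, which is why these distances help disambiguate mixed equipment activities.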
Chapter
Spectral decomposition by nonnegative matrix factorisation (NMF) has become state-of-the-art practice in many audio signal processing tasks, such as source separation, enhancement or transcription. This chapter reviews the fundamentals of NMF-based audio decomposition, in unsupervised and informed settings. We formulate NMF as an optimisation problem and discuss the choice of the measure of fit. We present the standard majorisation-minimisation strategy to address optimisation for NMF with the common β-divergence, a family of measures of fit that takes the quadratic cost, the generalised Kullback-Leibler divergence and the Itakura-Saito divergence as special cases. We discuss the reconstruction of time-domain components from the spectral factorisation and present common variants of NMF-based spectral decomposition: supervised and informed settings, regularised versions, temporal models.
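The β = 2 (squared Euclidean) special case of the majorisation-minimisation updates reviewed in this chapter reduces to the familiar multiplicative update rules, sketched here on a synthetic nonnegative matrix:

```python
import numpy as np

def nmf(V, r, n_iter=1000, seed=0):
    """Multiplicative-update NMF minimizing the squared Euclidean distance
    (the beta = 2 member of the beta-divergence family). V ~ W @ H with
    W, H nonnegative; epsilon terms guard against division by zero."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3
    H = rng.random((r, n)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # update spectral templates
    return W, H

# An exactly rank-2 nonnegative "spectrogram" should be recovered almost perfectly.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, r=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In audio use, the columns of `W` play the role of spectral templates and the rows of `H` their time activations; source separation keeps a subset of components when reconstructing.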
Article
In the construction industry, especially for civil infrastructure projects, a large portion of overall project expenses is allocated towards various costs associated with heavy equipment. As a result, continuous tracking and monitoring of tasks performed by construction heavy equipment is vital for project managers and jobsite personnel. The current approaches for automated construction equipment monitoring include both location and action tracking methods. Current construction equipment action recognition and tracking methods can be divided into two major categories: 1) using active sensors such as accelerometers and gyroscopes, and 2) implementing computer vision algorithms to extract information by processing images and videos. While both categories have their own advantages, the limitations of each mean that the industry still suffers from the lack of an efficient and automatic solution for the construction equipment activity analysis problem. In this paper, we propose an innovative audio-based system for activity analysis (and tracking) of construction heavy equipment. Such equipment usually generates distinct sound patterns while performing certain tasks, and hence audio signal processing could be an alternative solution for solving the activity analysis problem within construction jobsites. The proposed system consists of multiple steps, including filtering the audio signals, converting them into time-frequency representations, classifying these representations using machine learning techniques (e.g., a support vector machine), and window filtering the output of the classifier to differentiate between different patterns of activities. The proposed audio-based system has been implemented and evaluated using multiple case studies from several construction jobsites, and the results demonstrate the potential capabilities of the system in accurately recognizing various actions of construction heavy equipment.
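The final window-filtering step of the pipeline can be sketched as a sliding majority vote over per-frame predictions, which suppresses spurious single-frame label flips (window size and labels below are illustrative):

```python
import numpy as np

def window_filter(labels, k=5):
    """Smooth per-frame activity labels with a sliding majority vote over a
    k-frame window, so isolated misclassifications get overruled by context."""
    half = k // 2
    out = []
    for i in range(len(labels)):
        window = labels[max(0, i - half): i + half + 1]
        vals, counts = np.unique(window, return_counts=True)
        out.append(str(vals[np.argmax(counts)]))
    return out

raw = ["dig", "dig", "idle", "dig", "dig",
       "swing", "swing", "dig", "swing", "swing"]   # noisy per-frame classifier output
smoothed = window_filter(raw, k=5)
```

The isolated "idle" and the stray "dig" inside the swing run are both voted out, leaving two clean activity segments.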
Conference Paper
Construction machines and devices often generate distinct sound patterns while performing different tasks, making it possible to extract useful information about jobsites by placing microphones and recording and processing the generated audio files. This paper presents the results of current studies conducted by the authors on the necessary hardware and software for audio modeling of construction jobsites. As the first step, an audio-based system for recognizing activities of construction equipment has been devised. The presented system includes a de-noising algorithm for enhancing the quality of audio files as well as a short-time Fourier transform (STFT) and support vector machines (SVM) for classifying various activities. In the second step, three types of audio recorders (off-the-shelf microphones, contact microphones, and multichannel microphone arrays) and two types of installation settings (microphones mounted on board vs. installed on the job site) have been selected and several experiments were conducted to optimize the hardware settings, tune the algorithmic parameters, and evaluate the different approaches. The results show that for several different types of machines, the accuracy of the audio-based activity recognition system can exceed 85%.
Article
We propose an unsupervised speech separation framework for mixtures of two unseen speakers in a single-channel setting based on deep neural networks (DNNs). We rely on a key assumption that two speakers could be well segregated if they are not too similar to each other. A dissimilarity measure between two speakers is first proposed to characterize the separation ability between competing speakers. We then show that speakers with the same or different genders can often be separated if two speaker clusters, with large enough distances between them, can be established for each gender group, resulting in four speaker clusters. Next, a DNN-based gender mixture detection algorithm is proposed to determine whether the two speakers in the mixture are females, males or from different genders. This detector is based on a newly proposed DNN architecture with four outputs, two of them representing the female speaker clusters and the other two characterizing the male groups. Finally, we propose to construct three independent speech separation DNN systems, one for each of the female-female, male-male and female-male mixture situations. Each DNN gives dual outputs, one representing the target speaker group and the other characterizing the interfering speaker cluster. Trained and tested on the Speech Separation Challenge corpus, our experimental results indicate that the proposed DNN-based approach achieves large performance gains over the state-of-the-art unsupervised techniques without using any specific knowledge about the mixed target and interfering speakers being segregated.
Conference Paper
Acoustic-signal-based recognition of excavation devices has been investigated in the past because of its significance in preventing underground cables from being damaged during ground excavation. However, existing excavation-device classification algorithms have paid little attention to the reduction of background noise, which usually severely degrades recognition performance. This paper uses a cross microphone array to record the acoustic signals of excavation devices, which are then filtered by the minimum variance distortionless response (MVDR) beamforming algorithm to reduce environmental noise and enhance the desired signals. The filtered signals are then fed into feature extraction and classifier learning. To show the effectiveness of the proposed method, we collected real acoustic data from two representative devices at a construction site for performance testing. Experiments show that, compared with conventional recognition methods, the performance of the proposed method is significantly improved.
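The MVDR beamformer mentioned above minimizes output power subject to unit gain in the look direction, giving weights w = R^{-1} d / (d^H R^{-1} d) for covariance R and steering vector d. A minimal narrowband sketch, assuming a uniform linear array and a single interferer (these specifics are for illustration, not from the paper):

```python
import numpy as np

def steering_vector(n_mics, spacing, wavelength, theta):
    """Plane-wave steering vector for a uniform linear array
    (theta measured from broadside)."""
    k = 2 * np.pi / wavelength
    delays = np.arange(n_mics) * spacing * np.sin(theta)
    return np.exp(-1j * k * delays)

def mvdr_weights(R, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d): unit gain in the
    look direction, minimum total output power elsewhere."""
    Rinv_d = np.linalg.solve(R, d)
    return Rinv_d / (d.conj() @ Rinv_d)

n_mics, spacing, wavelength = 8, 0.05, 0.1
d_target = steering_vector(n_mics, spacing, wavelength, 0.0)       # desired device
d_noise = steering_vector(n_mics, spacing, wavelength, np.pi / 4)  # interferer

# Interferer-plus-sensor-noise covariance; diagonal loading keeps R invertible.
R = np.outer(d_noise, d_noise.conj()) + 0.01 * np.eye(n_mics)
w = mvdr_weights(R, d_target)

gain_target = abs(w.conj() @ d_target)  # constrained to ~1
gain_noise = abs(w.conj() @ d_noise)    # interferer direction is nulled
```

In practice R is estimated from snapshots per frequency bin, and the enhanced output then feeds the feature-extraction and classification stages the abstract describes.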
Conference Paper
In this paper, a new voice activity detection method is proposed. It is based on the total spectral energy in overlapping speech frames. The noise energy estimated from the higher frequency band is subtracted from the noisy speech spectrum in the lower frequency band. In addition, a moving-average filter is used to smooth the spectral-energy waveform. The proposed voice activity detection strategy is robust and works well across a variety of signal-to-noise-ratio levels.
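The combination of frame energies, smoothing, and thresholding can be illustrated with a simplified detector. The sketch below replaces the paper's high-band noise subtraction with a percentile-based noise-floor estimate, an assumption made purely to keep the example self-contained:

```python
import numpy as np

def vad(signal, frame_len=256, hop=128, smooth=5, thresh_ratio=2.0):
    """Energy-based voice activity detection sketch: per-frame spectral
    energy, smoothed with a moving-average filter, then thresholded
    against a noise floor estimated from the quietest frames."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(np.abs(np.fft.rfft(f)) ** 2) for f in frames])
    # Moving-average smoothing of the energy contour.
    kernel = np.ones(smooth) / smooth
    energy = np.convolve(energy, kernel, mode="same")
    noise_floor = np.percentile(energy, 10)      # estimate from quiet frames
    return energy > thresh_ratio * noise_floor   # True = active frame

# Low-level noise throughout, with a tone burst from 0.4 s to 0.6 s.
fs = 8000
t = np.arange(fs) / fs
noise = 0.01 * np.random.default_rng(0).standard_normal(fs)
tone = np.where((t > 0.4) & (t < 0.6), np.sin(2 * np.pi * 300 * t), 0.0)
active = vad(noise + tone)
```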
Article
An efficient algorithm for earthmoving-device recognition is essential for protecting underground high-voltage cables in mainland China. Utilizing acoustic signals generated either by the engine or by impacts during operation, an intelligent classification system for four representative types of excavation equipment (namely, electric hammers, hydraulic hammers, cutting machines, and excavators) is developed in this paper. A benchmark acoustic-wave database collected from a real construction site is first established. Then, an improved feature extraction approach based on Mel-Frequency Cepstral Coefficients (MFCC), which can efficiently describe the dynamics of acoustic waves, is developed. The recent fast and effective extreme learning machine is employed as the classifier in the proposed classification system. Experiments on real collected signals and field tests using our developed software platform demonstrate the efficiency of the proposed classification system.
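A minimal MFCC front end of the kind the abstract describes (magnitude spectrum, triangular mel filterbank, log compression, DCT-II) might look as follows; the filter counts and the omission of pre-emphasis and delta features are simplifications for illustration, not the paper's settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=20, n_ceps=12):
    """Minimal MFCC for one frame: |FFT| -> triangular mel filterbank
    -> log -> DCT-II."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft)))
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fbank[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[i, k] = (hi - k) / max(hi - mid, 1)
    log_energy = np.log(fbank @ spectrum + 1e-10)
    # DCT-II decorrelates the log filterbank energies.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1)) / (2 * n_filters))
    return dct @ log_energy

fs = 8000
t = np.arange(512) / fs
feats = mfcc(np.sin(2 * np.pi * 500 * t), fs)  # 12-dimensional feature vector
```

Per-frame vectors like this one would then be aggregated and passed to the extreme learning machine classifier the abstract mentions.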
Article
In this report we present an overview of the approaches and techniques used in automatic audio segmentation. Audio segmentation aims to find change points in the content of an audio stream. We first present the basic steps of an automatic audio segmentation procedure. We then present the main categories of segmentation algorithms, specifically unsupervised, data-driven, and mixed algorithms. For each category, the segmentation analysis is followed by details about proposed architectural parameters, such as the audio descriptor set, the mathematical functions used in unsupervised algorithms, and the machine learning algorithms of data-driven modules. Finally, we review proposed architectures in the automatic audio segmentation literature, with details about the experimental audio environment (the database used and the list of audio events of interest), the basic modules of each procedure (algorithm category, audio descriptor set, architectural parameters, and optional modules), and the maximum accuracy achieved.
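A common metric-based segmentation scheme of the kind surveyed above slides two adjacent windows over the stream and flags large statistical distances between them as candidate change points. The sketch below scores a simple log-energy distance; a fuller system would substitute BIC or model-based scoring, as the report discusses:

```python
import numpy as np

def change_points(signal, fs, win=0.25, thresh=3.0):
    """Metric-based segmentation sketch: slide two adjacent windows,
    score the distance between their log-energies, and flag scores
    above a threshold as candidate change points."""
    n = int(win * fs)
    hop = n // 2
    scores, times = [], []
    for start in range(0, len(signal) - 2 * n, hop):
        left = signal[start:start + n]
        right = signal[start + n:start + 2 * n]
        e_l = np.log(np.mean(left ** 2) + 1e-12)
        e_r = np.log(np.mean(right ** 2) + 1e-12)
        scores.append(abs(e_l - e_r))
        times.append((start + n) / fs)  # boundary between the two windows
    return [t for t, s in zip(times, scores) if s > thresh]

# One second of quiet noise followed by one second of much louder noise:
# the only content change is at t = 1.0 s.
fs = 8000
rng = np.random.default_rng(1)
quiet = 0.01 * rng.standard_normal(fs)
loud = 1.0 * rng.standard_normal(fs)
cps = change_points(np.concatenate([quiet, loud]), fs)
```

Peak-picking over the score contour (rather than a fixed threshold) is the usual refinement to avoid flagging several adjacent boundaries around one true change.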