Deep Learning-enhanced Wind Load Identification with Multi-Camera Videos
Haifeng Wang1, Teng Wu1
1Department of Civil, Structural and Environmental Engineering, University at Buffalo, 14260, Buffalo, NY, USA
email: hwang48@buffalo.edu, tengwu@buffalo.edu
Proceedings of the 10th International Conference on Structural Health Monitoring of Intelligent Infrastructure, SHMII 10
Porto, Portugal, 30 June - 2 July 2021
A. Cunha, E. Caetano (eds.)
ABSTRACT: Accurate and efficient evaluation of wind loads is critical for safe and cost-effective designs of wind-sensitive structures. Wind tunnel testing is considered one of the most reliable ways to acquire wind loads on structures; however, the limited ability to reproduce transient winds, complex surroundings and high Reynolds number effects in the laboratory is often detrimental to the experimental accuracy. On the other hand, field measurement of wind loads offers high accuracy but is very expensive due to the implementation of a large number of wind pressure sensors. Recent advances in computer vision techniques shed light on an indirect way to acquire wind loads from camera videos. In this study, a multi-camera video-based wind load identification framework is proposed to reliably obtain a large amount of wind load data at low cost. Specifically, a camera array is utilized to simultaneously capture motion videos of the target structure. The pixel motions related to the structural response are extracted through the phase-based motion extraction technique. The motions extracted from the camera-array videos are then fused with knowledge-enhanced deep learning to achieve high-accuracy response data. Finally, the wind load is identified from the obtained structural response based on the inverse method. A case study is conducted to demonstrate the efficacy of the multi-camera video-based, deep learning-enhanced wind load identification framework. The identified wind loads match well with the ground-truth data. With the advantages of low cost, quick deployment, and automatic data processing, the proposed wind load identification scheme presents great promise in engineering applications.
KEY WORDS: Wind load identification; Multi-camera video; Phase-based motion extraction.
1 INTRODUCTION
The evaluation of wind loads is the basis of designing wind-
sensitive structures. While wind tunnel testing is considered
one of the most reliable ways to acquire wind loads on
structures, the limited ability to reproduce transient winds,
complex surrounding terrain and high Reynolds number effects
in the laboratory is often detrimental to the experimental
accuracy. With the rapid increase of building heights and
bridge spans, there is a continuously growing need to
comprehensively validate model-scale wind loads with
sufficient full-scale measurements. However, it is difficult
and costly to directly
measure the wind load on a structure since an invisible wind
pressure field needs to be measured. Compared with measuring
the wind load, it is easier to measure the wind-induced
responses and then identify the wind load from the measured
response. The conventional response measurement methods,
including accelerometers and laser vibrometers, only capture
responses at very limited locations, making it difficult to
accurately identify the wind load from responses [1]–[3]. On
the other hand, video cameras are capable of obtaining high
density spatial data, where the motion of the entire structure is
captured. Additionally, as a non-contact measurement method,
the video-based measurement has distinct advantages in terms
of cost and labor.
To identify the structural motion from captured videos, it is
critical to identify the pixels that correspond to the target
structure. Hand labeling pixels from images is time consuming,
and the issue becomes worse when a large number of images
needs to be labeled, e.g., many frames in a video. To this end,
the deep learning-based video instance segmentation is utilized
in this study. Specifically, the MaskR-CNN architecture is
utilized here for the video segmentation [4]. The trained model
shows high segmentation performance in terms of both
accuracy and efficiency. With the identified pixels that belong
to the structure, the pixel motions can be used for extracting the
structural motion.
The wind-induced motion of structures is usually small in the
captured videos, especially when the camera is placed at a
distance to capture the entire structure. The small motion makes
it difficult to extract responses from videos. To address this
issue, the phase-based motion extraction technique is utilized
in this study [5], [6]. The phase-based motion extraction
technique can extract structural responses from videos with
small motions. Since the phase-based motion extraction
technique is affected by a number of external disturbances, it is
inevitable that the extracted responses are contaminated by
noise. To address this issue, the motion video is captured by a
camera array, where multiple cameras are used to
simultaneously capture the building motion. Then, the deep
learning-based data fusion technique is utilized for effectively
extracting the building motion. The fusion strategy based on
the deep neural network can adaptively fuse the extracted
building motion signals and hence achieve higher accuracy
[7], [8].
The deep learning-based data fusion essentially extracts the
underlying information from time series. In this study, a long
short-term memory (LSTM)-convolutional neural network
(CNN) architecture is proposed to simultaneously fuse and
extract response time series. LSTM, which introduces the
forget- and input-gate mechanisms into the recurrent neural
network, has been successfully implemented in time series
modeling [9]–[11]. CNN is known to have advantages in
feature extraction [12], [13]. Accordingly, the combination of
CNN and LSTM presents good performance in fusing time
series [14], [15]. In this study, an LSTM-CNN architecture is
proposed to fuse the response data. To train the proposed
neural network, the knowledge-enhanced deep learning
(KEDL) strategy is utilized to achieve a data-efficient and
robust training process [7], [8].
This paper presents the video-based wind load identification
framework. The deep learning-based data fusion is combined
with the phase-based motion extraction technique to obtain the
wind-induced response data from camera-array videos. The
response data is then used to identify the wind load using the
inverse method. A case study is conducted to demonstrate the
efficacy of the proposed framework. The proposed scheme
presents great promise in engineering applications.
2 VIDEO-BASED WIND LOAD IDENTIFICATION
FRAMEWORK
The proposed wind load identification framework is composed
of four sequential modules, i.e., video segmentation, motion
extraction, data fusion, and wind load identification. In this
section, each module of the proposed video-based wind load
identification framework is presented.
Video segmentation
Video segmentation, where the pixels that belong to the target
building are identified, is the first step towards the motion
extraction and the following load identification. Specifically,
the video segmentation assigns a label to every pixel in a
video frame, and hence the pixels that belong to the building
of interest can be identified. Video segmentation is one of the
fundamental topics in computer vision. Traditional
segmentation algorithms are typically based on the information
of contours and edges [16]. With the development of deep
convolutional neural networks, the performance of video
segmentation on benchmark tasks has gained significant
improvement [17]. In this study, the Mask R-CNN architecture
is utilized to perform the segmentation task [4]. Figure 1
schematically depicts the Mask R-CNN architecture.
Figure 1. Mask R-CNN architecture.
Response extraction from videos with small motions
The wind-induced structural response is typically small. When
cameras are placed at a long distance to capture the whole
structure, the structural motions captured in the videos are
even smaller; in fact, they can be at the sub-pixel level. Such
small motions make it difficult to estimate the structural
motion from videos. Structural vibration in videos can be
measured in terms of the temporally displaced or translated
image intensity $V(x, y, t)$. Compared with the widely used
image correlation method [18] and the optical flow method
[19], the phase-based motion extraction method provides
better motion identification. Specifically,
the phase-based method converts the image intensity into the
combination of local amplitude and phase, and the pixel motion
is well represented by the phase information. The phase
information has been utilized for motion magnification and
yields better results than pixel-based methods [20]. Chen et
al. [6] utilized the phase-based method to achieve pixel-level
motion identification for videos with small motions.
In the phase-based method, the grayscale information (image
intensity) $V(x, y, t)$ is first decomposed into multiple spatial
scales with specific orientations, where $x$, $y$ and $t$ represent the
horizontal location, vertical location and time, respectively. For
each spatial scale with a specific orientation, the spatially
localized signal can be expressed as:

$$A_{s,\theta}(x, y, t)\, e^{i \phi_{s,\theta}(x, y, t)} = G_{s,\theta} \otimes V(x, y, t) \qquad (1)$$
where $\otimes$ denotes convolution; $A_{s,\theta}$ and $\phi_{s,\theta}$ are respectively
the spatially localized amplitude and phase; and $G_{s,\theta}$ is the
complex filter with spatial scale $s$ and orientation $\theta$. For
example, if the horizontal movement of a 10-pixel object is of
interest, a filter with $\theta = 0$ (horizontal orientation) and a scale
$s$ corresponding to 10 pixels can be used. Fleet and Jepson [5]
showed that the local phase contains the velocity information
as:
$$\left[ \frac{\partial \phi_{s,\theta}(x, y, t)}{\partial x},\; \frac{\partial \phi_{s,\theta}(x, y, t)}{\partial y},\; \frac{\partial \phi_{s,\theta}(x, y, t)}{\partial t} \right] \cdot \left[ \dot{u}, \dot{v}, 1 \right] = 0 \qquad (2)$$

where "$\cdot$" denotes the dot product; $\dot{u}$ and $\dot{v}$ are the pixel
velocities in the $x$ and $y$ directions, respectively. For filters
with horizontal orientations ($\theta = 0$), the vertical velocity
is approximately zero, i.e., $\dot{v} \approx 0$. Accordingly, the
horizontal pixel velocity $\dot{u}$ can be obtained as:

$$\dot{u} = -\frac{\partial \phi_{s,0}(x, y, t)}{\partial t} \left[ \frac{\partial \phi_{s,0}(x, y, t)}{\partial x} \right]^{-1} \qquad (3)$$
The vertical pixel velocity can similarly be obtained with a
vertically oriented filter. In this study, the response in the
horizontal direction is of concern; accordingly, the horizontal
filter is utilized. The displacement and acceleration can then
be obtained from the velocity by integration and
differentiation, respectively.
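A one-dimensional numerical sketch of Eqs. (1)-(3) is given below (the Gabor-type filter, the Gaussian intensity profile, and the sub-pixel shift are illustrative assumptions, not the study's actual multi-scale filter bank):

```python
import numpy as np

# Two "frames" of a 1D intensity profile, the second shifted by a
# sub-pixel amount (the quantity we want to recover).
n = 256
x = np.arange(n)
shift = 0.3
frame0 = np.exp(-((x - 128) / 20.0) ** 2)
frame1 = np.exp(-((x - 128 - shift) / 20.0) ** 2)

# Complex filter G_s (Eq. 1): Gaussian window times a complex carrier.
k = 2 * np.pi / 16.0                    # spatial frequency, i.e., scale s
g = np.exp(-((x - 128) / 24.0) ** 2) * np.exp(1j * k * (x - 128))
r0 = np.convolve(frame0, g, mode="same")
r1 = np.convolve(frame1, g, mode="same")

# Local phase derivatives at a pixel inside the filter's support.
c = 128
dphi_dt = np.angle(r1[c] * np.conj(r0[c]))                # d(phi)/dt
dphi_dx = np.angle(r0[c + 1] * np.conj(r0[c - 1])) / 2.0  # d(phi)/dx

# Eq. (3): pixel velocity from the phase gradients.
u = -dphi_dt / dphi_dx
print(round(u, 2))  # 0.3
```

The recovered velocity matches the imposed sub-pixel shift, illustrating why the local phase is a robust carrier of small motions.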
Data fusion with knowledge-enhanced deep learning
Due to the external disturbances, e.g., the motion of the camera
and the lighting condition variation, the extracted velocity can
be contaminated by noise. The situation becomes even worse
for the extraction of small motions. The existence of noise
results in biased estimations of structural response and wind
load. Conventional denoising techniques, such as band-pass
filtering, thresholding, and Kalman filtering, present various
drawbacks and pose limitations on the analyzed signals.
Advances in machine learning shed light on utilizing deep
learning to conduct the denoising with fewer restrictions and
higher efficiency. The noise issue is essentially approached by
capturing more information. First,
the data source is extended to a camera array, where multiple
cameras are simultaneously utilized to capture the structure
motion. Knowledge-enhanced deep learning is then utilized to
fuse the extracted response signals to obtain a high-accuracy
structural response. In this study, the LSTM-CNN
architecture is proposed and trained for data fusion. Figure 2
presents the proposed LSTM-CNN architecture. The
dimension of the input layer is determined by the number of
time steps and the number of cameras utilized. Three LSTM
layers are stacked to memorize the previous time steps and
hence enable better estimation of the current time step. The
convolution layers are then used to fuse the output of the LSTM
layers. In this study, a 1D convolution kernel is utilized so that
the time dimensions of the input and output layers are
identical.
Figure 2. Proposed LSTM-CNN architecture.
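The architecture described above can be sketched with Keras as follows (the layer widths, kernel sizes, and sequence length are illustrative assumptions; the paper does not report these hyperparameters):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def build_lstm_cnn(time_steps=256, n_cameras=2, lstm_units=32, conv_filters=16):
    # Input: one noisy response time series per camera.
    inp = layers.Input(shape=(time_steps, n_cameras))
    # Three stacked LSTM layers memorize previous time steps.
    x = layers.LSTM(lstm_units, return_sequences=True)(inp)
    x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.LSTM(lstm_units, return_sequences=True)(x)
    # 1D convolutions fuse the LSTM output; "same" padding keeps the
    # time dimension of the output identical to that of the input.
    x = layers.Conv1D(conv_filters, kernel_size=5, padding="same",
                      activation="relu")(x)
    out = layers.Conv1D(1, kernel_size=5, padding="same")(x)
    return models.Model(inp, out)

model = build_lstm_cnn()
# AdaMax optimizer, as in the case study's training setup.
model.compile(optimizer=tf.keras.optimizers.Adamax(), loss="mse")

# Untrained forward pass on one synthetic two-camera sample.
fused = model.predict(np.random.randn(1, 256, 2).astype("float32"), verbose=0)
print(fused.shape)  # (1, 256, 1)
```

The fused output has the same number of time steps as the input, so the network maps the camera-array signals directly onto a single denoised response history.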
One of the challenges in applying deep learning to
engineering problems is the difficulty of obtaining large
amounts of high-quality training data. For example, it is
very expensive to simultaneously obtain the contaminated and
ground-truth response data for training the neural network. To
this end, the KEDL strategy [7], [8], where prior knowledge
is integrated into the loss function, is utilized in this study.
Specifically, the dynamic equilibrium equations are integrated
into the loss function. An advantage of using KEDL is that the
deep network training can be accomplished with less data and
becomes more robust. Figure 3 schematically presents the
KEDL strategy. As presented in the figure, the update of the
neural network parameters (red lines) is guided by minimizing
both the knowledge loss $L_k$ and the data loss $L_d$.
Figure 3. Knowledge-enhanced deep learning.
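The idea can be illustrated with a minimal composite loss (a hedged sketch: the single-DOF equilibrium residual and the unit weighting are assumptions; the actual study integrates the full dynamic equilibrium equations into the training loop):

```python
import numpy as np

# KEDL-style composite loss: data misfit plus a physics residual.
def kedl_loss(u_pred, u_meas, u_dot, u_ddot, f, m, c, k, lam=1.0):
    # Data loss L_d: misfit between predicted and measured displacement.
    L_d = np.mean((u_pred - u_meas) ** 2)
    # Knowledge loss L_k: residual of the dynamic equilibrium equation
    # m*u'' + c*u' + k*u = f, evaluated on the predicted response.
    L_k = np.mean((m * u_ddot + c * u_dot + k * u_pred - f) ** 2)
    return L_d + lam * L_k

# A response that satisfies equilibrium exactly gives zero total loss.
t = np.linspace(0.0, 10.0, 1001)
omega = 2 * np.pi * 0.65
u = np.sin(omega * t)
u_dot = omega * np.cos(omega * t)
u_ddot = -omega**2 * np.sin(omega * t)
m, c, k = 1.0, 0.1, omega**2
f = m * u_ddot + c * u_dot + k * u
print(kedl_loss(u, u, u_dot, u_ddot, f, m, c, k))  # 0.0
```

In training, the physics term penalizes fused responses that violate the equations of motion, which is what allows the network to learn from fewer labeled samples.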
Wind load identification
With the extracted wind-induced response of a structure, the
wind load is identified by the inverse method [1]. The
wind-induced response of a structure can be represented as:

$$\mathbf{M}\ddot{\mathbf{u}} + \mathbf{C}\dot{\mathbf{u}} + \mathbf{K}\mathbf{u} = \mathbf{f} \qquad (4)$$

where $\mathbf{M}$, $\mathbf{C}$ and $\mathbf{K}$ are respectively the mass, damping and
stiffness matrices; $\ddot{\mathbf{u}}$, $\dot{\mathbf{u}}$ and $\mathbf{u}$ are respectively the acceleration,
velocity and displacement vectors; and $\mathbf{f} = [f_1, f_2, \cdots]^{\mathsf{T}}$ is the
wind load vector, with $\mathsf{T}$ denoting transpose. The response can
be approximated as the linear combination of multiple modes:

$$\mathbf{u}_{p \times 1} \approx \mathbf{\Phi}_{p \times q}\, \mathbf{U}_{q \times 1} \qquad (5)$$

where $p$ is the number of response measurement locations;
$q$ is the number of accounted modes; $\mathbf{\Phi}$ is the mode shape
matrix with dimension $p \times q$; and $\mathbf{U}$ is the modal response
vector. Accordingly, the modal response can be estimated from
the measurements with the pseudo inverse $\mathbf{\Phi}^{+}$. Presenting the
$i$th modal state as:

$$\mathbf{X}_i = \left[ U_i, \dot{U}_i \right]^{\mathsf{T}} \qquad (6)$$

the discrete-time response can be expressed as:

$$\mathbf{X}_i(k+1) = \mathbf{\Psi}_i\, \mathbf{X}_i(k) + \mathbf{\Gamma}_i\, F_i(k) \qquad (7)$$

where $k$ is the discrete time step; $\mathbf{\Psi}_i = e^{\mathbf{A}_i \Delta t}$;
$\mathbf{\Gamma}_i = \int_0^{\Delta t} e^{\mathbf{A}_i (\Delta t - \tau)} \mathbf{B}_i\, d\tau$;
$\mathbf{A}_i = \begin{bmatrix} 0 & 1 \\ -\omega_i^2 & -2\zeta_i \omega_i \end{bmatrix}$;
$\mathbf{B}_i = [0, 1]^{\mathsf{T}}$; $F_i$ is the modal load; $\zeta_i$ and $\omega_i$ are the
damping ratio and natural circular frequency of the $i$th mode;
and $\Delta t$ is the sampling interval. As a result, the modal load
can be estimated as:

$$F_i(k) = \mathbf{\Gamma}_i^{+} \left[ \mathbf{X}_i(k+1) - \mathbf{\Psi}_i\, \mathbf{X}_i(k) \right] \qquad (8)$$

where $\mathbf{\Gamma}_i^{+}$ is the pseudo inverse of $\mathbf{\Gamma}_i$. Then, the wind load at
each measurement location can be determined with $\mathbf{f} = \mathbf{\Phi} \mathbf{F}$.
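A single-mode sketch of Eqs. (7)-(8) is shown below: a discrete-time modal response is simulated under a known load, and that load is then recovered by the inverse method (the first-mode frequency and the 60 Hz sampling rate are taken from the case study; the 2% damping ratio and the sinusoidal load are assumptions):

```python
import numpy as np
from scipy.linalg import expm

# Assumed single-mode parameters: 0.65 Hz mode, 2% damping, 60 Hz sampling.
omega, zeta, dt = 2 * np.pi * 0.65, 0.02, 1 / 60
A = np.array([[0.0, 1.0], [-omega**2, -2 * zeta * omega]])
B = np.array([[0.0], [1.0]])

Psi = expm(A * dt)                               # Psi = e^(A dt)
# Exact zero-order-hold integral: Gamma = A^-1 (Psi - I) B (A invertible).
Gamma = np.linalg.solve(A, Psi - np.eye(2)) @ B

# Forward simulation of Eq. (7) with a known modal load F_true.
n = 600
F_true = np.sin(2 * np.pi * 0.5 * dt * np.arange(n))
X = np.zeros((2, n + 1))
for k in range(n):
    X[:, k + 1] = Psi @ X[:, k] + (Gamma * F_true[k]).ravel()

# Inverse identification, Eq. (8): F(k) = pinv(Gamma) [X(k+1) - Psi X(k)].
Gp = np.linalg.pinv(Gamma)
F_hat = np.array([(Gp @ (X[:, k + 1] - Psi @ X[:, k]))[0] for k in range(n)])
print(np.max(np.abs(F_hat - F_true)) < 1e-8)  # True
```

With noise-free modal states the load is recovered to machine precision, which is why the accuracy of the fused response directly controls the quality of the identified wind load.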
3 CASE STUDY
Model setup
A case study is conducted to demonstrate the effectiveness of
the proposed analysis framework. The input building motion
video is generated via Blender, a powerful open-source 3D
modeling software [21]. The width, depth and height of the 20-
story building are respectively 20 m, 20 m and 72 m. The
approaching wind is set to be from the west of the building
(Fig. 4). The camera array, which has two cameras with
identical parameters, is placed to the south of the building.
The distance from the center of the camera array to the center
of the building is 120 m. Each camera has a focal length of 50
mm. The video capture frame rate is 60 Hz, and the original
resolution is 1920 × 1920. Figure 5 depicts a selected frame of
the generated videos. The rendered image generally captures
the building geometry and texture features.
Figure 4. Schematic of the case study setup.
The building motion is animated based on the structural
response simulation. The building dynamics are modeled with
OpenSees, an open-source software framework designed for
finite element modeling of structures. In this study, the
structure is simplified as a lumped-mass model. The first three
natural frequencies of the building are set to be 0.65 Hz, 4.34
Hz and 12.16 Hz. The wind load is obtained based on the quasi-
steady theory, where the wind field is simulated using the
Hilbert-wavelet-based method [22]. In this way, the building
motion video, the underlying building motion and the wind
load are simultaneously achieved, which is necessary for
testing the proposed wind load identification framework.
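A lumped-mass model of this kind can be sketched as follows (the story masses and interstory stiffnesses are placeholders for illustration, not the values that produce the paper's 0.65 Hz first mode):

```python
import numpy as np

# Illustrative lumped-mass shear building: modal frequencies from the
# generalized eigenvalue problem K phi = omega^2 M phi.
n = 20
m = 4.0e5 * np.ones(n)            # assumed story mass [kg]
k = 2.0e8 * np.ones(n)            # assumed interstory stiffness [N/m]

M = np.diag(m)
K = np.zeros((n, n))
for i in range(n):
    # Tridiagonal shear-building stiffness assembly.
    K[i, i] = k[i] + (k[i + 1] if i + 1 < n else 0.0)
    if i + 1 < n:
        K[i, i + 1] = K[i + 1, i] = -k[i + 1]

# Eigenvalues of M^-1 K are omega^2; frequencies in Hz, ascending.
omega2 = np.sort(np.linalg.eigvals(np.linalg.solve(M, K)).real)
freqs = np.sqrt(omega2) / (2 * np.pi)
print(freqs[:3])
```

In the actual study the masses and stiffnesses would be tuned so that the first three frequencies match the target values above.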
Figure 5. Selected frame of generated building motion videos.
Motion extraction
Figure 6 shows the segmentation results. It can be observed that
the pixels that correspond to the target building have been
identified. The motions of the identified pixels can be used for
extracting the building motions. According to the video
segmentation results, the video is cropped to a resolution of 500
× 1750. The black rectangle in Figure 6 depicts the crop result.
Figure 6. Segmentation results.
For each frame, the horizontal pixel velocities are extracted.
Scaled by the actual size of each pixel, the pixel motion is
converted to the motion of the target building. Assuming the
building motion in the horizontal direction is uniform, the
along-wind response of the building is estimated by averaging
the extracted motions along the horizontal direction. Figure 7
shows the extracted velocities at the height of 36 m (half the
building height). As shown in the figure, the extracted motions
are contaminated by noise, which will lead to wrongly estimated
wind loads. To address this issue, the proposed LSTM-CNN
architecture is trained to fuse the extracted responses and
achieve the high-accuracy response.
Figure 7. Motion extracted from camera-array videos.
The neural network model is set up and trained with the
widely used machine learning platform Tensorflow [23], where
the rectified linear unit (ReLU) activation function is adopted
[24], and the AdaMax optimization scheme is employed with
an adaptive learning rate [25]. A total of 20,000 samples are
generated and used to train the neural network with the
proposed LSTM-CNN architecture. Figure 8 shows the fusion
result from the neural network. As shown in the figure, the
error is reduced compared with Figure 7. It should be noted
that the performance of the data fusion can be further improved
by increasing the scale of the neural network and enlarging the
training dataset.
Figure 8. Data fusion result.
With the extracted motions at different heights, the first modal
load is identified following Equation (8). The identified load
and the actual wind load are presented in Figure 9. As shown
in the figure, the modal load is successfully identified.
Considering that the wind-induced motion is usually dominated
by the first mode, the extracted modal load can be used for
identifying the actual wind load on structures.
Figure 9. Identified modal load.
4 CONCLUDING REMARKS
In this study, a video-based wind load identification framework
was proposed. Based on this method, the small structural
motions captured by a camera array were utilized for
identifying the wind load. A case study was conducted to
demonstrate the efficacy of the proposed framework. With the
advantages of low cost, easy deployment and high spatial
resolution, the proposed video-based wind load identification
method presents great promise in engineering applications. In
future studies, the response estimation accuracy will be further
increased by designing better filters, increasing the number of
cameras, and improving the performance of the data-fusion
neural network.
REFERENCES
[1] J.-S. Hwang, A. Kareem, and H. Kim, “Wind load identification using wind tunnel test data by inverse analysis,” Journal of Wind Engineering and Industrial Aerodynamics, vol. 99, no. 1, pp. 18–26, 2011, doi: 10.1016/j.jweia.2010.10.004.
[2] Y. Chen, D. Joffre, and P. Avitabile, “Underwater dynamic response at limited points expanded to full-field strain response,” Journal of Vibration and Acoustics, vol. 140, no. 5, 2018.
[3] Y. Chen, P. Logan, P. Avitabile, and J. Dodson, “Non-model based expansion from limited points to an augmented set of points using Chebyshev polynomials,” Experimental Techniques, vol. 43, no. 5, pp. 521–543, 2019.
[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[5] D. J. Fleet and A. D. Jepson, “Computation of component image velocity from local phase information,” International Journal of Computer Vision, vol. 5, no. 1, pp. 77–104, 1990, doi: 10.1007/BF00056772.
[6] J. G. Chen, N. Wadhwa, Y.-J. Cha, F. Durand, W. T. Freeman, and O. Buyukozturk, “Modal identification of simple structures with high-speed video using motion magnification,” Journal of Sound and Vibration, vol. 345, pp. 58–71, 2015.
[7] R. Snaiki and T. Wu, “Knowledge-enhanced deep learning for simulation of tropical cyclone boundary-layer winds,” Journal of Wind Engineering and Industrial Aerodynamics, vol. 194, p. 103983, 2019.
[8] H. Wang and T. Wu, “Knowledge-enhanced deep learning for wind-induced nonlinear structural dynamic analysis,” Journal of Structural Engineering, vol. 146, no. 11, p. 04020235, 2020.
[9] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735.
[10] T. Li, T. Wu, and Z. Liu, “Nonlinear unsteady bridge aerodynamics: Reduced-order modeling based on deep LSTM networks,” Journal of Wind Engineering and Industrial Aerodynamics, vol. 198, p. 104116, 2020.
[11] S. Siami-Namini, N. Tavakoli, and A. S. Namin, “A comparison of ARIMA and LSTM in forecasting time series,” in 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018, pp. 1394–1401.
[12] K. Fukushima and S. Miyake, “Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition,” in Competition and Cooperation in Neural Nets, Springer, 1982, pp. 267–285.
[13] S. Albawi, T. A. Mohammed, and S. Al-Zawi, “Understanding of a convolutional neural network,” in 2017 International Conference on Engineering and Technology (ICET), 2017, pp. 1–6.
[14] I. E. Livieris, E. Pintelas, and P. Pintelas, “A CNN–LSTM model for gold price time-series forecasting,” Neural Computing and Applications, vol. 32, no. 23, pp. 17351–17360, 2020.
[15] T. Kim and H. Y. Kim, “Forecasting stock prices with a feature fusion LSTM-CNN model using different representations of the same data,” PLoS ONE, vol. 14, no. 2, p. e0212320, 2019.
[16] M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision. Nelson Education, 2014.
[17] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, “A review of semantic segmentation using deep neural networks,” International Journal of Multimedia Information Retrieval, vol. 7, no. 2, pp. 87–93, 2018.
[18] M. A. Sutton, J. J. Orteu, and H. Schreier, Image Correlation for Shape, Motion and Deformation Measurements: Basic Concepts, Theory and Applications. Springer Science & Business Media, 2009.
[19] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, no. 1–3, pp. 185–203, 1981.
[20] N. Wadhwa, M. Rubinstein, F. Durand, and W. T. Freeman, “Phase-based video motion processing,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, p. 80, 2013.
[21] Blender Online Community, Blender: a 3D modelling and rendering package. Blender Foundation, 2018.
[22] H. Wang and T. Wu, “Fast Hilbert-wavelet simulation of nonstationary wind field using noniterative simultaneous matrix diagonalization,” Journal of Engineering Mechanics, vol. 147, no. 3, p. 04020153, 2021.
[23] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” arXiv preprint arXiv:1502.01852, 2015.
[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.