ALMA engineering fault detection framework
José L. Ortiz(a) and Rodrigo A. Carrasco(b)
(a) Atacama Large Millimeter/submillimeter Array (ALMA), San Pedro de Atacama, Chile
(b) Faculty of Engineering and Sciences, Universidad Adolfo Ibáñez, Santiago, Chile
ABSTRACT
The Atacama Large Millimeter/Submillimeter Array (ALMA) Observatory, with its 66 individual radio telescopes
and other central equipment, generates a massive set of monitoring data every day, collecting information on the
performance of a variety of critical and complex electrical, electronic, and mechanical components. By using
this crucial data, engineering teams have developed and implemented both model-based and machine learning-based
fault detection methodologies that have greatly enhanced the early detection or prediction of hardware malfunctions.
This paper presents the results of the development of a fault detection and diagnosis framework and the impact
it has had on corrective and predictive maintenance schemes.
Keywords: fault detection, fault diagnosis, framework, automation, predictive maintenance
1. INTRODUCTION
The start of full operations at the ALMA Observatory, coupled with its very complex technical and environmental conditions, demands increased efficiency in its maintenance strategies. Refined tools for hardware performance evaluation are therefore essential, starting with the early detection and anticipation of device malfunctions.
In this work, we describe our implementation of a fault detection and diagnosis framework (FDDF) tested on
several study cases deemed critical by ALMA engineers. The tools provided by the FDDF assist in every step
of the detection process, from early data exploration to detection results and reporting. Results obtained show
that the developed framework can be of great assistance in the maintenance decision-making process, providing
valuable and timely information to engineering teams. Furthermore, the implemented tools have proven flexible enough to adapt to several different settings and faults, enabling a quick detection and diagnosis monitoring system.
1.1 Atacama Large Millimeter/submillimeter Array
The Atacama Large Millimeter/submillimeter Array (ALMA) is a revolutionary instrument operating in the
very thin and dry air of northern Chile’s Atacama desert, at an altitude of 5,000 meters above sea level. ALMA
is composed of an array of 66 high-precision antennas working together at the millimeter and submillimeter
wavelengths, corresponding to frequencies from about 30 to 950 GHz. These 7- and 12-meter parabolic antennas, with extremely precise surface accuracy, can be moved around on the high-altitude plateau to provide different array configurations, ranging in size from about 150 meters up to 15 kilometers. The ALMA Observatory is an international partnership between Europe, North America, and Japan, in cooperation with the Republic of Chile.[1]
Further author information:
E-mail: jortiz@alma.cl, Telephone: +56 (2) 2467-6244
E-mail: rax@uai.cl, Telephone: +56 (2) 2331-1594
1.2 ALMA Full Operations and the Need for Condition-based and Predictive
Maintenance
The ALMA Observatory is entering its steady-state, full-operations phase. Pressure to keep downtime to a minimum is high, both for the array as a whole and for individual antennas, the latter driven by the need for more fully operational antennas in each new science cycle. Further complicating the problem, the engineering time used for testing and maintenance is constantly reduced as more time is allocated to achieve 24/7 observations.
ALMA is a very complex instrument. Each telescope is composed of hundreds of individual electronic and mechanical assemblies, each carefully calibrated, set up, and interconnected. This complexity is multiplied by 66, the number of individual antennas, with additional particularities contributed by the four distinct antenna designs developed by three different vendors. Added to the mix is the central equipment, the correlators and the central local oscillator, which allows the whole array to perform as a single instrument through interferometry. Although not part of the telescopes per se, ancillary or infrastructural systems, such as weather stations and power plants, are critical to the attainment of any and all scientific objectives.
On top of the aforementioned technical intricacy is the observatory's setting. The Chajnantor Plateau, with its perfect skies for astronomical observation, is also known for its extreme weather and oxygen-deprived air, conditions that severely diminish the troubleshooting and decision-making skills of human operators. Remote and automated task execution and investigation of arising problems are a must, to the maximum possible extent.[2]
In this complex, harsh, and time-restricted scenario, enhanced maintenance planning and execution is a
necessity, making intelligent use of the immense collection of available monitoring, logging, and configuration
data, combined with the significant experience accumulated over several years of operations by the ALMA
Department of Engineering (ADE). Condition-based maintenance (CBM) aims to act on equipment based on proper observation of the state of its systems. This data-driven approach, fed by the data sources mentioned above, is intended to complement traditional planned and corrective maintenance schemes. In some implementations, it can even lead to predicting potential equipment malfunctions, improving overall system reliability and decreasing maintenance costs.
In line with these goals, this paper presents a framework whose purpose is the early detection, and even prediction, of equipment failures that could lead to unplanned operational downtime.
2. DATA SCIENCE PROCESS APPROACH
Development of a fault detection framework must keep in mind its final objective: triggering maintenance actions
based on analysis of available system data, supporting the decision-making process of engineering management
and staff. The nature of the problem being solved and the copious sources of available data, coupled with the expertise and skills of the ADE personnel, make this a task that can be modeled and designed as a formal data science process. As such, the steps needed to implement an FDDF can be defined as shown in Figure 1 and detailed in the following sections. All steps of this iterative process need to be designed and implemented by
domain specialists.
Figure 1: Data science process
2.1 Objective definition
The final goal of each fault detection case study is to trigger well-advised and targeted maintenance actions
based on an advanced understanding of the status of involved systems (insight). A refined understanding of such
systems is in itself an objective as well, given the complexity of the components involved in every troubleshooting
effort and the intricate nature of the maintenance tasks performed by ALMA engineering teams. These goals
shall be achieved through this specialized software framework, under the assumption that all needed hardware
elements are already in place.
For each fault detection case added to the framework, a domain specialist must define the particular fault detection objective. For example, for a certain hardware variable this can be a threshold that must not be exceeded, a maximum rate of change, a stability figure, or a combination of similar indicators. Section 3.1 Fault cases provides actual examples.
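As an illustration only (the FDDF's actual configuration format is not shown in this paper), such an objective can be captured as a small declarative structure. The following Python sketch is hypothetical; the class, field, and monitor point names are invented, while the 5.8 V threshold and falling trend are taken from the IFP 7 V case in Section 3.1.4.

# Hypothetical sketch of a fault detection objective; names are illustrative,
# not the FDDF's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaultObjective:
    monitor_point: str            # monitored hardware variable (invented name below)
    fault_trend: str              # 'rising' or 'falling'
    threshold: Optional[float]    # value that must not be crossed
    max_rate: Optional[float]     # maximum allowed rate of change, units per day
    days_min: int = 30            # minimum acceptable predicted days to failure

# Example objective loosely based on the IFP 7 V power line case (Section 3.1.4)
ifp_7v = FaultObjective(
    monitor_point="IFP0/VOLTAGE_7V",   # hypothetical monitor point identifier
    fault_trend="falling",
    threshold=5.8,                     # empirically estimated critical voltage
    max_rate=None,
    days_min=30,
)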
2.2 Data acquisition
ALMA has already implemented a very rich software infrastructure that collects a vast amount of data from its hardware, software, and scientific tools, component tracking and configuration, user interactions, and other sources.
Many of these tools have readily available interfaces as needed for proper integration with the FDD framework.
ALMA engineers have identified and evaluated several data sources as relevant for the FDDF. Only offline
data sources are considered as inputs to this framework, as access to online resources needs a totally different
approach in terms of availability, planning, safety and costs. In any case, many online data sources become
available offline with a latency in the order of minutes or tens of minutes. Currently, the FDDF mainly employs
data from the ALMA Monitoring system, but plans are underway to further incorporate relevant data from other
data sources.
2.2.1 ALMA Monitoring System
In order to ensure correct performance, ALMA collects a massive amount of data from its sensors and actuators. Each antenna hardware component communicates abundant information, including sensor measurements, status updates, settings, and even component serial numbers, via the Antenna Master Bus (AMB), a CAN-based digital data bus.
The ALMA Monitoring System is a service that permanently queries and collects all this data from every subsystem in every antenna, plus central equipment and other auxiliary hardware. It then stores it in a central database for offline use. Currently, around 30 GB of monitoring data is collected each day from a total of 140,000 monitor points. Of the total number of monitor points, about 80% belong to hardware located in antennas, with the rest coming from equipment in the central building.[3] This information is crucial for hardware malfunction investigation and is made available to the engineering teams in an indexed fashion based on plain-text files. Storage is currently being migrated to a database cluster based on Apache Cassandra.
For the FDDF presented in this paper, software modules were developed to query the ALMA monitoring
system for the data needed for each fault analysis.
2.2.2 Other data sources
Experience has shown that many hardware failures can be diagnosed with information from data sources other
than hardware monitoring points. These sources may reveal, for example, undesired human intervention of
devices or widespread disturbing events (e.g., power outages, general software shutdown, etc.). The following
list presents some of the proposed additional information sources:
Software logs: 50 million log entries every 24 hours.[4]
Shift Log Tool: history of all the observing activities in ALMA and problem logging.
Smartsheet: work management and automation platform for daily task allocation.
Computerized Maintenance Management System: tracking of component location, identification, and status, and of maintenance work orders.
Revision control repository: tracking of hardware calibration and configuration files.
ALMA Science Data Model database: metadata of each astronomical observation.
Figure 2: MonitorPlotter application. (a) GUI; (b) generated plot.
2.3 Data preparation
Data is made available to engineering teams downstream, meaning that there is little or no control over how the
data is collected. Therefore, any existing data quality issues need to be detected and corrected in preparation for
the following data analysis phase. Domain knowledge is essential to make informed decisions on how to handle
incomplete or incorrect data. Failing to do so can lead to erroneous conclusions. This is a two-step process: data exploration and data pre-processing.
2.3.1 Data exploration
A preliminary analysis is needed to gain a better understanding of the specific characteristics of the data in each study case. This is essential to make the most effective use of the selected data and to determine whether more sources are needed. Data exploration may include:
Basic visualization: assess trends, outliers, and/or errors.
Descriptive statistics: review mean, standard deviation, minimum, maximum, and other statistical mea-
sures, as a basic summary of the data.
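As a minimal sketch of this step (not the MonitorPlotter implementation), the same exploration can be done directly with Pandas, assuming the monitoring samples have been exported to a CSV file; the file and column names below are hypothetical.

# Minimal exploration sketch with Pandas; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("cm05_ifp0_current_6v5.csv",
                 parse_dates=["timestamp"], index_col="timestamp")

print(df["value"].describe())      # mean, std, min, max, quartiles
print(df["value"].isna().sum())    # number of missing samples

# Basic visualization: daily means to assess trends, outliers, and errors
ax = df["value"].resample("1D").mean().plot(title="Daily mean of monitor point")
ax.figure.savefig("exploration.png")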
Exploration of hardware monitoring data is done through the MonitorPlotter application, specifically devel-
oped for this purpose by ADE engineers. MonitorPlotter retrieves raw data from the ALMA Monitoring System
database and other sources, such as the ALMA Science Data Model database, and allows users to select, filter,
and further process data as needed to generate meaningful time series plots or output processed data to other
formats for further analysis. Figure 2 shows the application in use.
Given that any particular examination done on one ALMA antenna can be generalized to any other element of the array, a very useful feature has been added that allows comparing the performance of hardware devices across antennas. This plotting style, known as multiplot mode, takes one existing graph as a template and extends it to all or a subset of antennas to easily create a web dashboard, as seen in Figure 3. The MonitorPlotter application is extensively used on a daily basis for general troubleshooting by engineers and technicians, accounting for more than 65,000 user-generated plots since mid-2014. Multiplot dashboards have been available since late 2016 and, just as a visualization tool, have contributed significantly to performance characterization and visual inspection-based fault detection by expert users.
In the context of the FDDF, the MonitorPlotter is essential to understand the behavior of the analyzed variables and evaluate the best next steps for robust fault detection: the time range needed, the data cleaning and filters to apply, and candidate detection algorithms to try, among others. In addition, multiplot dashboards are the basis for the fault detection reports described in Section 2.5 Insight reporting.
Figure 3: MonitorPlotter in multiplot mode
2.3.2 Data pre-processing
Data needs to be first cleaned to address data quality issues and then transformed to facilitate the next analysis
stage. Cleaning can consist of removing (or handling):
Inconsistent values
Duplicate records
Missing values
Invalid data
Outliers
The method of choice for handling these particular cases requires specific domain knowledge of the problem at hand. For example, when analyzing Intermediate Frequency Processor (IFP) amplifier degradation by examining its 6.5/8/10 V currents (see Section 3.1 Fault cases), it is known that the information of interest lies mainly in the trend of the resulting curves. Hence, outliers can be dropped. On the contrary, when investigating Antenna Calibration Device (ACD) cabling damage by examining ACD load temperature readings, the information is in the outliers, as they are a symptom of communication errors caused by defective cables.
Once data is clean, transformation can consist of:
Conversion of raw or unstructured data into structured objects: for example, into timestamp-indexed Pandas DataFrames.
Scaling
Data aggregation and resampling
Filtering
Feature selection
Dimensionality reduction
Other data manipulation techniques
Some data transformation techniques, such as data aggregation, result in less detailed data. Domain knowledge is needed to evaluate whether the intended application of the data can bear the cost of such a conversion. The MonitorPlotter package includes utilities for data cleaning and transformation, which are used by the FDDF.
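As a generic illustration of this step (not the MonitorPlotter utilities themselves), the following Pandas sketch drops duplicates and missing values, optionally removes outliers with a rolling-median criterion, and resamples the series; thresholds, window sizes, and column names are assumptions.

# Generic cleaning and transformation sketch with Pandas; thresholds and names
# are illustrative, not the MonitorPlotter/FDDF implementation.
import pandas as pd

def preprocess(df: pd.DataFrame, drop_outliers: bool = True) -> pd.DataFrame:
    s = df["value"]
    s = s[~s.index.duplicated(keep="first")]   # remove duplicate records
    s = s.dropna()                             # remove missing values

    if drop_outliers:
        # For trend-based cases (e.g., IFP currents) outliers are dropped;
        # for cases like ACD cabling faults they would be kept instead.
        med = s.rolling("1D").median()
        mad = (s - med).abs().rolling("1D").median()
        s = s[(s - med).abs() <= 5 * (mad + 1e-9)]

    # Aggregate/resample to hourly means to reduce noise and data volume
    out = s.resample("1H").mean().interpolate(limit=4)
    return out.to_frame("value")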
2.4 Data analysis
Data analysis involves building and validating a model from the dataset. Depending on the data and the defined objective, there are different analysis techniques that can be applied to build such models. The FDDF provides several fault detection algorithms ready for use with any fault detection case, and is easily extensible to add new detection routines. In the following subsection, we describe some of the implemented algorithms and the related methodology.
2.4.1 Fault detection methodologies
Linear regression rule-based fault detection This technique is useful for faults that trigger degradation or a trend in the related measured variables. The basic implementation uses a linear regression over the data (1st order polynomial fit), and the resulting linear equation is used as a detection parameter. If the slope is bigger (or smaller) than a certain threshold, the fault is reported and the slope is used to predict the number of days until failure. Its main drawback is that it is sensitive to big step changes in the analyzed time frame, resulting in false positives. Figure 4 shows an example of this technique being applied; a minimal sketch follows the parameter list below.
Parameters:
faultTrend (fault is expected when rising [or falling] trend is observed)
slopeLimit (maximum [or minimum] slope of the regression line, or else failure)
avgLimit (mean value must be below [or above] this value, or else failure)
daysMin (predicted days to failure must be above this value, or else failure)
preFilterType: optionally apply this rolling-window filter before the linear regression (mean, median, sum, diff, among others), with a configurable window size in samples
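A minimal sketch of this rule, assuming a Pandas series with a datetime index; parameter names mirror the list above, but the code is an illustration, not the FDDF implementation.

# Illustrative linear-regression rule; not the FDDF implementation.
import numpy as np
import pandas as pd

def linear_regression_check(s: pd.Series, fault_trend="falling",
                            slope_limit=-0.001, avg_limit=None, days_min=30):
    t_days = (s.index - s.index[0]).total_seconds() / 86400.0
    slope, intercept = np.polyfit(t_days, s.values, 1)   # 1st order polynomial fit

    falling = fault_trend == "falling"
    faulty = slope < slope_limit if falling else slope > slope_limit

    days_to_failure = None
    if avg_limit is not None:
        faulty = faulty or (s.mean() < avg_limit if falling else s.mean() > avg_limit)
        if slope != 0:
            # Extrapolate the fit to predict days until the limit is crossed
            current = slope * t_days[-1] + intercept
            days_to_failure = (avg_limit - current) / slope
            faulty = faulty or (0 <= days_to_failure < days_min)

    return {"slope": slope, "days_to_failure": days_to_failure, "faulty": bool(faulty)}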
Figure 4: Linear regression example
Figure 5: Double exponential forecasting example
Double exponential smoothing rule-based fault detection This is a more sophisticated version of the previous technique. Unlike the linear regression, double exponential smoothing can be used to predict the slope (trend) of a signal and adapts to it. At each time step, the algorithm estimates the next value by separately tracking the level (mean) and the trend, each of which is then corrected according to the observed error. As Figure 5 shows, with the same data as Figure 4, the model adapts its trend estimate and can track changes in the slope of the signal much faster. We use the same classification criteria as in the linear regression method; a minimal sketch follows the parameter list below.
Parameters:
faultTrend: ’rising’ or ’falling’ for general indication of faulty slope
threshold: final value of the filtered signal must be greater than (faultTrend='falling') or less than (faultTrend='rising') this value
daysMin: forecasted days to reach the threshold must be greater than this value
alpha: smoothing parameter of the level (estimation) adjustment equation of the double exponential smoothing
beta: smoothing parameter of the trend adjustment equation of the double exponential smoothing
preFilterType: optionally apply this rolling-window filter before the double exponential smoothing and forecasting (mean, median, sum, diff, among others), with a configurable window size in samples
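A minimal sketch of Holt's double exponential smoothing used in this way; the sampling cadence (samples_per_day) and the default smoothing constants are assumptions, and the code is illustrative rather than the FDDF implementation.

# Illustrative Holt (double exponential) smoothing with forecast-to-threshold.
import numpy as np

def double_exp_check(y, alpha=0.3, beta=0.1, fault_trend="falling",
                     threshold=5.7, days_min=30, samples_per_day=288):
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]
    for value in y[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (level + trend)      # level equation
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend equation

    falling = fault_trend == "falling"
    days_to_threshold = None
    heading_to_fault = trend < 0 if falling else trend > 0
    if heading_to_fault:
        samples = (threshold - level) / trend    # positive when heading toward threshold
        if samples > 0:
            days_to_threshold = samples / samples_per_day   # assumes ~5-minute sampling

    crossed = level < threshold if falling else level > threshold
    faulty = crossed or (days_to_threshold is not None and days_to_threshold < days_min)
    return {"level": level, "trend": trend,
            "days_to_threshold": days_to_threshold, "faulty": faulty}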
Clustering-based fault detection This technique is used to detect abnormal operation (signal patterns). We assume that most antennas are in a correct working condition; hence, comparing equivalent signals from different antennas can lead us to identify abnormal operation. In particular, we compute two basic statistical features of the signal (mean and standard deviation, by default) over a predetermined time window, and use those features to classify the signals of all the antennas. We use k-means as the clustering algorithm. The cluster with the most antennas is assumed to represent the normal state of the signal, whereas the other clusters are considered faulty; a minimal sketch is given after the parameter list.
Parameters:
funcStats1: statistical function 1 (mean, median, std, min, max, or count)
funcStats2: statistical function 2 (mean, median, std, min, max, or count)
useEstimator: if true, use instead an estimator based on a weighted sum of the functions above
weight1: weight of statistical function 1
weight2: weight of statistical function 2
Figure 6: Clustering examples. (a) Clustering by mean and std; (b) clustering by the estimator 0.1 * mean + 0.9 * std.
Figure 6 shows an example of two datasets classified using this technique. The abnormal signals are drawn in red.
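A minimal sketch of the clustering approach with scikit-learn, computing the two per-antenna features and flagging antennas outside the largest cluster; it is illustrative (column names are assumed), not the FDDF code.

# Illustrative clustering-based detection sketch using Pandas and scikit-learn.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def clustering_check(df: pd.DataFrame, n_clusters: int = 2) -> pd.DataFrame:
    """df has columns ['antenna', 'value'] for one monitor point across the array."""
    feats = df.groupby("antenna")["value"].agg(["mean", "std"]).fillna(0.0)
    X = StandardScaler().fit_transform(feats.values)

    feats["cluster"] = KMeans(n_clusters=n_clusters, n_init=10,
                              random_state=0).fit_predict(X)

    # The cluster containing the most antennas is assumed to be the normal state
    normal = feats["cluster"].value_counts().idxmax()
    feats["faulty"] = feats["cluster"] != normal
    return feats

# Usage: clustering_check(df).query("faulty") lists the abnormal antennas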
Model-based fault detection using Kalman filters This method was extensively explained in [2]. As shown in Figure 7, it consists of a bank of Kalman filters, where each filter has an embedded model of the system. By default, model M0 describes normal operation, whereas each of the other models represents one of the faulty modes of operation. It is much harder to generalize, since it requires specific domain knowledge to model the normal and faulty cases. On the other hand, its main advantage is that it is very robust, and the different models can be used for fault diagnosis, identification, and prediction.
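The full multiple-model scheme is developed in [2]; the following is only a highly simplified sketch of the idea, comparing two scalar Kalman filters (M0: no drift; M1: a known negative drift) through the cumulative log-likelihood of their innovations. The drift value and noise variances are assumed placeholders.

# Highly simplified multiple-model sketch with two scalar Kalman filters;
# the actual FDDF scheme is described in [2]. Drift and noise values are placeholders.
import numpy as np

def kalman_loglik(y, drift, q=1e-5, r=1e-2):
    """Scalar filter for x_{k+1} = x_k + drift + w, y_k = x_k + v; returns the
    cumulative log-likelihood of the innovation sequence."""
    x, p, loglik = y[0], 1.0, 0.0
    for z in y[1:]:
        x_pred, p_pred = x + drift, p + q              # prediction step
        innov, s = z - x_pred, p_pred + r              # innovation and its variance
        loglik += -0.5 * (np.log(2 * np.pi * s) + innov ** 2 / s)
        k = p_pred / s                                 # Kalman gain
        x, p = x_pred + k * innov, (1 - k) * p_pred    # update step
    return loglik

def model_based_check(y, fault_drift=-1e-4):
    y = np.asarray(y, dtype=float)
    ll_normal = kalman_loglik(y, drift=0.0)            # M0: normal operation
    ll_fault = kalman_loglik(y, drift=fault_drift)     # M1: slow degradation
    return {"faulty": bool(ll_fault > ll_normal),
            "log_likelihood_ratio": ll_fault - ll_normal}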
2.5 Insight reporting
Findings resulting from the analysis step need to be communicated to all stakeholders, highlighting the main results and the value they provide toward the initial objective. It is important to include any inconclusive or puzzling results too, as they may trigger further analysis as part of the actions in the next stage.
For the FDDF, reporting is mainly done through a custom web dashboard developed for this purpose, as seen
in Figure 3. Each fault case analysis is automatically assigned a unique URL that interested parties can access
from within the ALMA network. Users can filter results by the available detection algorithms and accompanying results, as shown in Figure 8. For each fault candidate, details on the decision taken by the algorithm can be seen as well, as in Figure 12.
Figure 7: Kalman filter based fault detection scheme
Figure 8: Fault detection dashboard showing filtered fault candidates
On each particular fault case dashboard, users can choose to subscribe by e-mail and/or Twitter to receive fault detection reports. These are automatically generated, typically on a daily basis.
2.6 Acting on insights
The main objective of the engineering fault detection framework is to generate well-advised maintenance actions
based on data analysis. For each specific target case, insight obtained in this data science process should be
converted into specific maintenance actions, if results are conclusive. If not, an action could be to refine the
process and revisit the selected data sources or analysis techniques.
Conclusions and recommended actions are delivered to ADE Array Maintenance Group supervisors for them to plan and allocate resources. If the process is mature and reliable enough, the framework could provide tools that automate action planning by directly creating corresponding JIRA (issue tracking software) tickets or Computerized Maintenance Management System (CMMS) work orders.
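As an illustrative sketch of such automation (not something currently implemented in the FDDF), a detection result could be turned into an issue through JIRA's standard REST interface; the server URL, project key, and credentials below are placeholders.

# Illustrative sketch: creating a JIRA issue from a detection result through
# JIRA's REST API. URL, project key, and credentials are placeholders.
import requests

def create_jira_ticket(summary: str, description: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "MAINT"},       # placeholder project key
            "summary": summary,
            "description": description,
            "issuetype": {"name": "Task"},
        }
    }
    resp = requests.post(
        "https://jira.example.org/rest/api/2/issue",   # placeholder server URL
        json=payload,
        auth=("fddf-bot", "api-token"),                # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["key"]   # issue key, e.g. 'MAINT-1234'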
Some of the actions that can result from FDDF outcomes are:
Line Replaceable Units (LRU) repairs/replacements
LRU configuration files optimization/tuning
Further data analysis for more conclusive results
Online testing for improved understanding of device behavior (noise temperature, IV curves, total power detector stability, etc.)
Figure 9: Data for one IFP displaying degradation
3. EXPERIMENTAL RESULTS
3.1 Fault cases
In this section, we present some of the fault study cases chosen for testing the FDDF performance.
3.1.1 Intermediate Frequency Processor (IFP) Amplifiers Slow Degradation - 1 year
The Intermediate Frequency Processors (IFPs) are responsible for the second down-conversion, signal filtering, amplification, and total power measurement of sidebands and basebands within ALMA's super-heterodyne receivers. Each antenna has two IFPs, one for each polarization, and their performance directly affects the accuracy of the measurements required by the scientists. For the past few years, a slow performance degradation has been observed in the amplifiers that compose the IFP. This has direct implications for the signal levels in the next steps of the signal processing chain. Figure 9 shows the monitoring data coming from one IFP of antenna CM05.
Each IFP has sensors measuring current at three different voltage levels: 6.5, 8, and 10 volts. For each
voltage, four different currents are monitored every few seconds. Additionally, the serial number of the current
IFP installed in the antenna is also monitored, in case it is replaced. Through data analysis, engineers at ALMA
have identified that the IFP degradation fault is directly related to the currents at these three voltages, which start to slowly decrease when the fault appears. In Figure 9 it is possible to observe the degradation pattern that appeared in three of the 6.5 V and 8 V currents, whereas the 10 V currents showed no degradation. Note that the degradation can be very slow: in the case of the 6.5 V currents, the drop is less than 100 mA over more than two years, making it really difficult to observe above the noise level unless several months of data are collected and visually analyzed. In the case of Figure 9, after the IFP replacement, done in April 2014, all currents return to nominal values. The precise moment at which the IFP was changed can be observed in the serial number graph, at the bottom of Figure 9.
Figure 10: Model-based fault detection and diagnosis example
Present practice by maintenance engineers is to periodically generate these long-term plots and examine them for currents exhibiting a steady decreasing trend. If one or more signals exhibit degradation, the IFP is flagged for replacement. Since the noise levels are high compared to the degradation rate, it is not possible, without further filtering techniques, to determine when the current will be low enough to make the amplifier useless. This is especially important for prioritizing IFP maintenance.
For these types of faults, a model-based fault detection and diagnosis technique was used, as described in [2]. The methodology is based on a bank of Kalman filters and has proven to yield excellent detection results, as seen in Figure 10.
3.1.2 Cryostat 4K temperature - last 30 days
This case monitors the cryostat 4K-stage temperature over the last 30 days. The 4K stage in the cryostat is where the most sensitive electronic components, the receivers (SIS mixers), are located. If the 4K-stage temperature rises above 4 K, the receivers no longer conform to noise temperature requirements and astronomical observations are stopped or flagged.
Fault detection in this case was done using the linear regression algorithm, with results as shown in Figure 11.
3.1.3 CMPR supply pressures - 1 week
This case monitors the cryostat compressor helium supply pressure over the last 7 days. The reference pressure is 2.0 MPa. Pressures below the reference can potentially lead to an inability to reach the 4 K temperature in the cryostat.
An example of successful detection of anomalies using clustering is seen in Figure 8, with detailed detection
results in Figure 6a.
3.1.4 IFPs 7V power line - last 30 days
This case monitors the 7 V power line of the Intermediate Frequency Processors (IFPs, one for each polarization) over the last 30 days. Low voltage on this line indicates that the 7 V fuse in the power supply distributor may be becoming resistive. As a consequence, the IFP disappears from the communications bus, causing a fatal software exception during an observation. The critical voltage was empirically estimated at 5.8 V.
Figure 12 shows fault detection details from the FDDF dashboard using the double exponential algorithm with 5.7 V as the threshold. Failure prediction is included in this case.
4. IMPLEMENTATION DETAILS
The FDDF's MonitorPlotter application described in Section 2.3.1 Data exploration was developed using Python 2.7, with the Pandas and NumPy libraries (data structuring and analysis) and HDF5 (big data storage) for the backend. For the frontend, PyQt4, the Python binding for Qt4, was employed.
The FDDF detection algorithms and base classes were developed using Python 3.6, again with Pandas, NumPy, and HDF5 for data structuring, analysis, and storage, and scikit-learn for machine learning tools. SQLite was used as the database engine. The FDDF web frontend was developed with the Flask framework.
(a) Dashboard filtered for 4K fault candidates
(b) Faulty DV13 antenna 4K fault detection de-
tails
Figure 11: Linear regression used to detect failures on 4K cryostat temperatures
5. CONCLUSIONS
In this work we have developed a fault detection and diagnosis framework (FDDF). This system uses offline
monitoring data to determine if faults or anomalies are present. Its findings are then presented to interested
parties in a variety of formats, the main one being a web dashboard for easy inspection. Using the FDDF, several
relevant fault study cases have been made available, each one automatically examined with one or more of the
available fault detection algorithms. The framework has been modeled following a well-known data science process and best practices, and is flexible enough to quickly incorporate new fault detection methods and study cases. Results obtained have been very positive, displaying early detection or prediction of failures, presented in a timely manner and directly assisting the decision-making process of condition-based and predictive maintenance schemes.
6. FUTURE WORK
It is planned to port this framework to an open-source or commercial data science framework, providing improved
scalability, security and collaboration features. Exploration of available options is currently ongoing.
New fault detection algorithms need to be developed, as well as improvements to existing ones for increased robustness. For example, triple exponential smoothing is expected to render better detection results in those cases where seasonality is observed. For those cases where fault detection is rule-based, decision tree classification could be explored. Model-based fault detection using Kalman filters needs to be generalized, to the extent possible.
Figure 12: Double exponential smoothing predicting 7V power line failure
On the other hand, several data sources other than the ALMA monitoring database were mentioned. These need to be integrated into the FDDF. In those cases where adequate interfaces are not available, they need to be designed and developed.
ACKNOWLEDGMENTS
The Atacama Large Millimeter/submillimeter Array (ALMA), an international astronomy facility, is a part-
nership of the European Organisation for Astronomical Research in the Southern Hemisphere (ESO), the U.S.
National Science Foundation (NSF) and the National Institutes of Natural Sciences (NINS) of Japan in coopera-
tion with the Republic of Chile. ALMA is funded by ESO on behalf of its Member States, by NSF in cooperation
with the National Research Council of Canada (NRC) and the National Science Council of Taiwan (NSC) and
by NINS in cooperation with the Academia Sinica (AS) in Taiwan and the Korea Astronomy and Space Science
Institute (KASI).
ALMA construction and operations are led by ESO on behalf of its Member States; by the National Radio
Astronomy Observatory (NRAO), managed by Associated Universities, Inc. (AUI), on behalf of North America;
and by the National Astronomical Observatory of Japan (NAOJ) on behalf of East Asia. The Joint ALMA
Observatory (JAO) provides the unified leadership and management of the construction, commissioning and
operation of ALMA.
REFERENCES
[1] Ortiz, J. and Castillo, J., “Automating engineering verification in ALMA subsystems,” in [Observatory
Operations: Strategies, Processes, and Systems V], Peck, A. B., Benn, C. R., and Seaman, R. L., eds., Proc.
SPIE 9146 (2014).
[2] Ortiz, J. L. and Carrasco, R. A., “Model-based fault detection and diagnosis in ALMA subsystems,” in
[Observatory Operations: Strategies, Processes, and Systems VI], Peck, A. B., Benn, C. R., and Seaman,
R. L., eds., Proc. SPIE 9910 (2016).
[3] Shen, T.-C., Soto, R., Merino, P., Pena, L., Bartsch, M., Aguirre, A., and Ibsen, J., “Exploring No-SQL
alternatives for ALMA monitoring system,” in [Software and Cyberinfrastructure for Astronomy III], Chiozzi,
G. and Radziwill, N. M., eds., Proc. SPIE 9152 (2014).
[4] Gil, J. P., Tejeda, A., Shen, T.-C., and Saez, N., “Unveiling ALMA software behavior using a decoupled
log analysis framework,” in [Software and Cyberinfrastructure for Astronomy III], Chiozzi, G. and Radziwill,
N. M., eds., Proc. SPIE 9152 (2014).