MNRAS 000, 1–15 (2022)    Preprint 25 April 2022    Compiled using MNRAS LaTeX style file v3.0
Machine Learning methods to estimate observational properties of
galaxy clusters in large volume cosmological N-body simulations
Daniel de Andres^{1,2}, Gustavo Yepes^{1,2}, Federico Sembolini^{1,3}, Gonzalo Martínez-Muñoz^{4}, Weiguang Cui^{1,2,5}, Francisco Robledo^{6,7}, Chia-Hsun Chuang^{8,9}, Elena Rasia^{10,11}
^1 Departamento de Física Teórica, M-8, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain
^2 Centro de Investigación Avanzada en Física Fundamental (CIAFF), Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain
^3 Equifax Ibérica, Data & Analytics, Paseo de la Castellana 259D, Madrid, Spain
^4 Computer Science Department, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain
^5 Institute for Astronomy, University of Edinburgh, Royal Observatory, Edinburgh EH9 3HJ, UK
^6 Departamento de Fundamentos del Análisis Económico II, Universidad del País Vasco/Euskal Herriko Unibertsitatea, Barrio Sarriena s/n, 48940 Leioa, Bizkaia, Spain
^7 Laboratoire de Mathématiques et de leurs Applications, Université de Pau et des Pays de l'Adour, Avenue de l'Université, BP 576, 64012 Pau, France
^8 Department of Physics and Astronomy, University of Utah, Salt Lake City, UT 84112, USA
^9 Kavli Institute for Particle Astrophysics and Cosmology, Stanford University, 452 Lomita Mall, Stanford, CA 94305, USA
^10 INAF - Osservatorio Astronomico di Trieste, via Tiepolo 11, 34123 Trieste, Italy
^11 Institute of Fundamental Physics of the Universe, via Beirut 2, 34151 Grignano, Trieste, Italy
Accepted —. Received —; in original form —
ABSTRACT
In this paper we study the applicability of a set of supervised machine learning (ML) models specifically trained to infer observation-related properties of the baryonic component (stars and gas) from a set of features of dark-matter-only cluster-size halos. The training set is built from The Three Hundred project, which consists of a series of zoomed hydrodynamical simulations of cluster-size regions extracted from the 1 Gpc volume MultiDark dark-matter-only simulation (MDPL2). As target variables we use a set of baryonic properties for the intracluster gas and stars derived from the hydrodynamical simulations and correlate them with the properties of the dark matter halos from the MDPL2 N-body simulation. The different ML models are trained on this database and subsequently used to infer the same baryonic properties for the whole range of cluster-size halos identified in MDPL2. We also test the robustness of the model predictions against the mass resolution of the dark matter halos and conclude that the inferred baryonic properties are rather insensitive to resolution: halos resolved with almost an order of magnitude fewer particles yield statistically equivalent predictions. We conclude that the ML models presented in this paper can be used as an accurate and computationally efficient tool for populating cluster-size halos with observation-related baryonic properties in large-volume N-body simulations, making them more valuable for comparison with full-sky galaxy cluster surveys at different wavelengths. We make the best trained ML model publicly available.
Key words: cosmology: theory – cosmology: dark matter – cosmology: large-scale structure of Universe – methods: numerical – galaxies: clusters: general – galaxies: halos
1 INTRODUCTION
Galaxy clusters are the largest gravitationally bound objects in the Universe and constitute one of the best probes for constraining cosmological parameters. The main component of galaxy clusters is dark matter (DM), which accounts for 85 per cent of the total mass (for a full review see e.g. Allen et al. 2011; Kravtsov & Borgani 2012). Although the existence of DM is now widely accepted by the scientific community and strongly supported by modern cosmological theories, it has not been directly detected yet. To study galaxy clusters, we therefore have to focus on their baryonic component, which represents the remaining 15 per cent of the mass. It is composed of the hot gas of the Intra Cluster Medium (ICM, around 10-15 per cent of
the total cluster mass) and stars (less than 5 per cent of the mass).
Numerical simulations play a fundamental role in studying the properties of galaxy clusters. In the simplest scenario, N-body simulations can easily describe the dark-matter component of clusters, which is governed only by gravity; nowadays it is computationally possible to perform very large cosmological simulations, up to a few Gpc^3 in volume, e.g. MillenniumXXL (Angulo et al. 2012), MICE (Fosalba et al. 2015), MultiDark (Klypin et al. 2016), Dark Sky (Skillman et al. 2014), OuterRim (Habib et al. 2016), FLAGSHIP (Potter et al. 2017), Uchuu (Ishiyama et al. 2021), BACCO (Angulo et al. 2021) and the UNIT project (Chuang et al. 2019), which include thousands of galaxy clusters. Nevertheless, when aiming to describe the baryon component of clusters, radiative hydrodynamical simulations have to be used, owing to the complex physics involved in the processes of cluster formation. These simulations are computationally very expensive, which puts strong limitations on the size of the computational volumes. Examples of such state-of-the-art simulations are Illustris (Vogelsberger et al. 2014), Eagle (Schaye et al. 2015), Horizon-AGN (Chisari et al. 2016), Magneticum (Dolag et al. 2016) and BAHAMAS (McCarthy et al. 2018). Hydrodynamical simulations are essential to calibrate mass proxies and to study the systematics affecting observational measurements. They are also essential to understand in depth the formation and evolution of clusters of galaxies and all their gas-dynamical effects. For this reason, numerical simulations have been a powerful tool for guiding galaxy cluster observations for more than 20 years (Evrard et al. 1996; Bryan & Norman 1998).
In an ideal scenario one would need a large sample of simulated galaxy clusters with sufficient numerical resolution, both in mass and in the gravity and pressure forces. This high resolution would make it possible to accurately resolve the internal substructures and to obtain a detailed modelling of the most relevant physical processes. The best way to achieve this would be to simulate large cosmological boxes containing up to tens of thousands of galaxy clusters. Unfortunately, due to the large computational effort demanded by these simulations, one needs to find a compromise between their three main components: volume size, mass resolution and the physical processes included. A possible solution to the computational problems related to the scalability of present-day hydrodynamical codes is to proceed with so-called 'zoom' simulations, such as the MUSIC^1 simulation (Sembolini et al. 2013), the Dianoga clusters (Planelles et al. 2013), Rhapsody-G (Wu et al. 2015), MACSIS (Barnes et al. 2016), Cluster-EAGLE (Barnes et al. 2017), the Hydrangea clusters (Bahé et al. 2017) and The Three Hundred (The300)^2 simulation project (Cui et al. 2018). Zoom simulations mimic observations by creating a catalogue of resimulated galaxy clusters extracted from low-resolution N-body simulations. The regions containing clusters of galaxies are then resimulated at very high resolution, adding gas physics in the resimulated areas and keeping the rest of the box at low resolution in order to reproduce the same gravitational evolution.
An alternative approach to hydrodynamical simulations for describing the gas and stellar properties of galaxy clusters is to use Semi-Analytic Models (SAMs), such as GALACTICUS (Benson 2012), SAG (Cora et al. 2018), SAGE (Croton et al. 2016) and GALFORM (Lacey et al. 2016). In this approach, the numerous complex non-linear radiative physical processes associated with the
^1 https://music.ft.uam.es
^2 https://the300-project.org
gas-star components are modelled using a combination of analytic approximations and empirical calibrations of many free parameters against a set of observational constraints (see e.g. Baugh 2006 for a review). Nevertheless, SAMs are also computationally expensive, since most of them are based on the information provided by the merger history of each individual dark matter halo. A complementary approach is the use of phenomenological models to derive physical properties of the ICM, as in Zandanel et al. (2018). Describing the gas physics in simulated galaxy clusters therefore requires a large computational effort and imposes a compromise between numerical resolution and the size of the cosmological volume to be simulated.
The main goal of supervised Machine Learning (ML) is to generate models that can learn complex relationships between input and output variables from high-dimensional data, which can later be used to make predictions on unseen data. In this scenario, ML could offer a powerful alternative for inferring some fundamental information on the main properties (e.g. gas and star masses, gas temperature, etc.) of the baryon component of galaxy clusters, without the large computational cost required by hydrodynamical simulations or SAMs. Applications of ML to find a mapping between hydrodynamical and N-body simulations have already been presented in previous works. Kamdar et al. (2016) first presented a promising technique to study galaxy formation using numerical simulations and ML; Jo & Kim (2019) estimated galactic baryonic properties mimicking the IllustrisTNG simulation (Nelson et al. 2019); Wadekar et al. (2021) generated neutral hydrogen from dark matter; Bernardini et al. (2022) predicted high-resolution baryon fields from dark matter simulations; and Moews et al. (2021) used a hybrid analytic and machine learning model to paint dark matter galactic halos with hydrodynamical properties. Recently, the CAMELS collaboration (Villaescusa-Navarro et al. 2022) has released results from almost ten thousand simulations (both hydrodynamical and N-body) with different cosmologies and baryon physical models, an invaluable tool for training current and future artificial intelligence algorithms for galaxy formation studies. Unfortunately, given the box sizes, cluster-size objects are poorly represented in these simulations.
The purpose of this study is to explore the applicability of ML techniques for generating baryon cluster properties from DM-only halo catalogues, mimicking the results from The Three Hundred hydrodynamical simulations. More precisely, we use the properties of the cluster-size halos extracted from the parent dark-matter-only full-box simulation MDPL2 as the features of our dataset. Then we collect, as targets (the predicted variables) of the ML models, several baryon properties of the objects that have been resimulated with radiative processes and hydrodynamics. Our work differs from previous studies in that the baryon properties are extracted from 'zoom' simulations and therefore we have paired, one to one, the objects from the full N-body-only simulation with their hydrodynamical counterparts. As explained below, The300 simulations correspond to spherical regions centred on the 324 most massive clusters found in the MDPL2 box, but more cluster-size halos of lower masses are found within each region. The masses of the hydrodynamically simulated objects in our cluster-size catalogue range from $\sim 10^{13}\,h^{-1}\,{\rm M_\odot}$ up to $\sim 10^{15}\,h^{-1}\,{\rm M_\odot}$.
The article is structured as follows. In § 2, we describe how the training dataset is generated using The300 and the MDPL2 simulations. In § 3, we explain the different ML algorithms used in this work and the training setup; we also study the feature importance and dimensionality reduction of our feature space. In § 4, the main results of this work are presented, including an analysis of the performance of the ML models and their dependence on the mass resolution
of the simulations. In § 5, we study the scaling relations extracted
from the new ML-generated catalogues and finally in § 6, we draw
our main conclusions and propose possible future studies.
2 THE TRAINING DATASET
In order to create the database for training the ML models, we use the MDPL2^3 simulation, which was run using the cosmological parameters measured by the Planck Collaboration (Planck Collaboration et al. 2016). The MDPL2 simulation consists of a periodic cube of comoving side length $1\,h^{-1}$ Gpc containing $3840^3$ dark-matter particles of mass $1.5\times 10^{9}\,h^{-1}\,{\rm M_\odot}$.
To build this training dataset, we first need to identify and extract from the MDPL2 simulation the same cluster objects that were used to run The300 hydrodynamical simulations. We then select the main properties of the dark matter clusters and associate them with the baryonic properties extracted from their The300 hydrodynamical counterparts.
2.1 MDPL2: Dark Matter input variables
In order to identify the dark matter halos and measure their internal properties in the MDPL2 N-body simulation, we have used the Rockstar halo finder (Behroozi et al. 2012), complemented with additional information based on the halo mass accretion history from the Consistent Trees analysis (Behroozi et al. 2013). We have extracted a total of 26 relevant physical Rockstar + Consistent Trees variables^4 (masses at different radii, velocities, symmetry factors, properties related to the mass accretion history, etc.) to create our dark matter catalogue. In addition, we have also considered the scale factor $a(z)$ of the clusters as an input variable. Furthermore, we have introduced a cutoff in halo mass such that $\log(M/(h^{-1}\,{\rm M_\odot})) \geq 13.5$ and redshift $z \leq 1.03$.
In Fig. 1, we show the Spearman correlation matrix of the 26 Rockstar variables and the scale factor $a(z)$. These variables are ordered using a hierarchical clustering algorithm based on Ward's linkage on a condensed distance matrix; we used the Python implementation of this algorithm from SciPy (Virtanen et al. 2020), as sketched in the code example at the end of this section. We can easily identify 5 groups in the correlation matrix. The first group (variables 0 to 12) corresponds to masses and velocities at different radii. A second group contains the different ellipticity shape factors (13 to 16). Variables 17 to 21 correspond to the scale radius, the ratio between kinetic and potential energy, and the offsets between the density peak and the centre of mass, which are directly related to the dynamical state of the cluster halos. The next group of variables (22 and 23) corresponds to the dimensionless spin parameters of the cluster. Finally, variables 24 to 26 represent the scale factor (redshift) and the time evolution of the mass accretion. As can be seen in the figure, feature variables inside the same block are strongly correlated among themselves and are weakly, or not at all, correlated with variables inside other blocks. This implies that selecting more than one feature belonging to the same block might not add any new predictive information; this is studied in detail in § 3. A more detailed description of the selected feature variables can be found in Appendix A.
^3 www.cosmosim.org
^4 More information regarding the selection of Rockstar variables can be found in Appendix A.
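The block ordering of Fig. 1 can be reproduced with SciPy. Below is a minimal sketch, assuming `X` is an array with one row per cluster and one column per feature; the function name is illustrative, not from the paper.

# Sketch: order DM features by hierarchical clustering of their Spearman
# correlations (Ward's linkage on a condensed distance matrix), as in Fig. 1.
import numpy as np
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import ward, leaves_list
from scipy.spatial.distance import squareform

def order_features_by_correlation(X):
    corr, _ = spearmanr(X)                    # (D, D) Spearman matrix
    dist = 1.0 - np.abs(corr)                 # correlated features -> small distance
    np.fill_diagonal(dist, 0.0)
    linkage = ward(squareform(dist, checks=False))  # Ward's linkage, condensed matrix
    order = leaves_list(linkage)              # leaf order groups correlated blocks
    return corr[np.ix_(order, order)], order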
2.2 The300: baryonic output variables
We then need the baryonic properties of a subset of the MDPL2 cluster halos. For this purpose we have used the results of The300 project, which re-simulated spherical regions of radius $15\,h^{-1}$ Mpc centred on the 324 most massive clusters found in the MDPL2 simulation at $z=0$. These regions were mapped back to the initial conditions and their particles were split into gas and dark matter, while the rest of the particles in the remaining box were re-sampled into different levels of lower resolution and larger masses. With this zoom-in technique, we ensure that the subsequent gravitational evolution reproduces the same objects in the high-resolution area, while we minimise the contamination by low-resolution particles from external regions due to mass segregation. In any case, we checked that all the clusters used in this work are free from contamination by low-mass-resolution particles at least within their virial radii.
The300 project has produced different versions of hydrodynamical simulations from these initial zoomed conditions, which include different baryonic physics modules: radiative cooling, star formation and supernova feedback using the Gadget-MUSIC SPH+TreePM code (Sembolini et al. 2013), and newer versions that include feedback from supermassive black holes: Gadget-X (Murante et al. 2010; Rasia et al. 2015) and GIZMO-SIMBA (Davé et al. 2019).
However, in this work we only make use of the Gadget-X runs. The halos in these simulations are identified and analysed with the Amiga Halo Finder (AHF; Knollmann & Knebe 2009), which is more suitable than Rockstar for simulations with multiple particle species (i.e. dark matter, gas, stellar particles and black holes). From the information contained in the AHF catalogues, we have collected the following baryon properties:
• The total gas mass $M_{\rm gas}$ inside a spherical volume whose overdensity is 500 times the critical density of the Universe. The radius of this sphere is denoted as $R_{500}$.
• The stellar mass $M_{\rm star}$ inside $R_{500}$.
• The gas temperature $T_{\rm gas}$, computed as the mass-weighted temperature inside $R_{500}$:

    $T = \frac{\sum_{i \in R_{500}} T_i m_i}{\sum_{i \in R_{500}} m_i}$,    (1)

where $T_i$ and $m_i$ are, respectively, the temperature and mass of each gas particle.
• The X-ray $Y$-parameter $Y_{\rm X}$, defined as $T_{\rm gas}\times M_{\rm gas}$, which is related to the total thermal energy of the gas and has been shown to be a good proxy for the total cluster mass (Kravtsov et al. 2006). Note that this quantity can be derived from the others; however, we prefer to treat it independently, i.e. the ML models are also trained to predict $Y_{\rm X}$ as one of the target variables.
• The Compton-$y$ parameter $Y_{\rm SZ}$ integrated over $R_{500}$, given by the Sunyaev-Zel'dovich (SZ) effect (Sunyaev & Zeldovich 1972). The integrated value $Y_{\rm SZ}$ is computed from Compton-$y$ parameter maps estimated as follows:

    $y = \frac{\sigma_{\rm T} k_{\rm B}}{m_{\rm e} c^2} \int n_{\rm e} T_{\rm e}\,{\rm d}l$,    (2)

where $\sigma_{\rm T}$ is the Thomson cross section, $k_{\rm B}$ the Boltzmann constant, $c$ the speed of light, $m_{\rm e}$ the electron rest mass, $n_{\rm e}$ the electron number density, $T_{\rm e}$ the electron temperature, and the integration is done along the observer's line of sight. Assuming ${\rm d}V = {\rm d}A\,{\rm d}l$, Eq. (2) is computed in our simulated data as in Sembolini et al.
(2013) and Le Brun et al. (2015):

    $y = \frac{\sigma_{\rm T} k_{\rm B}}{m_{\rm e} c^2\,{\rm d}A} \sum_i T_i N_{{\rm e},i} W(r, h_i)$.    (3)

Note that here we have used the number of electrons in the gas particles, $N_{\rm e}$, given that $n_{\rm e} = N_{\rm e}/({\rm d}A\,{\rm d}l)$. Moreover, $W(r, h_i)$ is the same SPH smoothing kernel as in the hydrodynamical simulation, with smoothing length $h_i$. The $y$-maps are generated centred on the projected maximum density peak position of the halo. Each image has a fixed angular resolution of 5'' and extends to at least $R_{200}$ in all the clusters. The clusters at $z=0$ are placed at $z=0.05$ to generate the mock images, while the clusters at higher redshifts simply use their original redshift from the simulations. We then integrate the Compton-$y$ map up to $R_{500}$ using only the z-plane projection; since the dataset is large, the effect of projections is negligible. Note that this approach of estimating $Y_{\rm SZ}$ gives us the cylindrically integrated Compton-$y$ parameter.
2.3 The Final Training Dataset
After defining our input and output variables, we finally match one-
by-one the clusters between the two simulations that fulfil these two
conditions for the relative shifts between the cluster centres and the
halos mass differences:
distance(CMDPL2,CThe300)<0.4×𝑅200,The300 , (4)
𝑀MDPL2,200
𝑀The300,200
<0.1. (5)
Here, 𝐶MDPL2 and 𝐶The300 stand for the centre of mass of the
clusters while 𝑀MDPL2,200 and 𝑀The300,200 stand for the mass in-
side a sphere of radius 𝑅=𝑅200 for each simulations (between DM
only Rockstar catalogue and the AHF catalogue respectively). Due
to both the baryon effect (see Cui et al.,2012,2014, for example)
and to different algorithms used by the halo finders, it is not possible
to determine with all certainty that all the halos are exactly matched.
Notice that the centre difference can be as high as 0.4𝑅200. How-
ever, with this restrictive selection criteria, only the true/very close
counterparts are selected. In this way, we finally provide the baryon
properties for the matched MDPL2 clusters using the corresponding
The300 objects.
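For illustration, the matching of Eqs. (4) and (5) can be sketched as follows; the array names are ours, and positions are assumed to be in the same comoving units as $R_{200}$.

# Sketch of the one-to-one cluster matching between the Rockstar (MDPL2)
# and AHF (The300) catalogues, Eqs. (4)-(5).
import numpy as np
from scipy.spatial import cKDTree

def match_halos(pos_mdpl2, m200_mdpl2, pos_300, m200_300, r200_300):
    tree = cKDTree(pos_mdpl2)
    pairs = []
    for i, (centre, r200, m200) in enumerate(zip(pos_300, r200_300, m200_300)):
        # candidates closer than 0.4 * R200,The300 to the The300 centre (Eq. 4)
        for j in tree.query_ball_point(centre, 0.4 * r200):
            # relative M200 difference below 10 per cent (Eq. 5)
            if abs(m200_mdpl2[j] / m200 - 1.0) < 0.1:
                pairs.append((j, i))
                break
    return pairs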
After this procedure, our dataset is finally composed of 49,540 different objects. Note that all 33 halo catalogues available from $z=0$ to $z=1.03$ in the two simulations have been considered. Only 1264 objects correspond to clusters at $z=0$; the rest are the progenitors of the same objects at higher redshifts. The number of objects as a function of mass and redshift can be found in Fig. 2 and Fig. 3, respectively. Our final dataset is composed of 27 DM input variables and 5 baryon output variables. These are the features and targets used for training and testing the ML algorithms described in the next section.
3 MACHINE LEARNING ALGORITHMS: DESCRIPTION AND TRAINING
In this section, we first describe the machine learning algorithms
used in this work and the training setup. Then, we study the impor-
tance of our feature variables in order to reduce the dimensionality
of our dataset.
3.1 Machine Learning Algorithms and Training Setup
In order to estimate the baryon properties of the dark-matter-only clusters, several effective supervised machine learning methods have been employed. We particularly focus on two ensemble tree-based methods, random forest (RF; Breiman 2001) and extreme gradient boosting (XGBoost; Chen & Guestrin 2016), and on dense neural networks, or Multilayer Perceptrons (MLP; Schmidhuber 2015). RF and XGBoost have been shown to be among the best machine learning methods for tabular data, i.e. data without a known grid-like topology such as images (Fernández-Delgado et al. 2014; Bentéjac et al. 2021; Zhang et al. 2017). Convolutional deep neural network models have shown spectacular performance for image-based and structured data in general (Schmidhuber 2015); however, for tabular data, as is the case in this study, their performance is poor (Zhang et al. 2017). Nevertheless, deep dense networks can perform well in these scenarios, so we also consider these models.
Random forest and XGBoost are metamodels composed of decision trees. During training, these algorithms build hundreds of decision trees from a single training dataset. The processes for building these trees in random forest and in XGBoost are based on quite different ideas, although the objective in both cases is to build decision tree models that complement each other in order to obtain a classification/regression model better than any of its parts (Dietterich 1998).
Random forest relies on stochastic techniques to generate many random solutions to the problem at hand. In order to generate each single tree, the random forest algorithm first creates a new dataset by drawing at random, with replacement, $N$ instances from the training data of size $N$ (i.e. a bootstrap sample). This bootstrap sample is used to train a decision tree in which the best split at each node is selected from a random subsample of the features of the data. Generally, the size of the random subset of features is of the order of $\sqrt{D}$ or $\log_2(D)$, with $D$ the number of features of the problem. The final output of the random forest for a given instance is obtained as the mode or the mean of all trees, for classification and regression respectively. In addition, since the randomisation processes used to build the trees are independent, the construction of a random forest can be easily parallelised.
XGBoost, on the other hand, relies mainly on a gradient descent approach, although it also incorporates stochastic techniques to further increase its performance. XGBoost is an additive model based on gradient boosting: the output of an additive model is the sum of the outputs of its components. In order to create this ensemble, regression trees are trained sequentially to approximate the gradient of the loss function of the data at the previous iteration. Hence, each new tree learns the part of the concept not learned in previous steps. XGBoost also includes a penalisation term on the number of leaves of the trees to avoid over-fitting. In addition, XGBoost incorporates random feature selection, bootstrap sampling and several other randomisation features.
In order to perform a fair comparison among algorithms, and also to obtain good estimates of the performance of the different algorithms, we carried out the following experimental procedure based on K-fold cross-validation and grid search. K-fold cross-validation consists of splitting the data into K disjoint sets of approximately equal size and then iteratively using $K-1$ sets for training the model and the remaining set for validation. The main experiment is performed using the same 10-fold cross-validation for the prediction of the five baryonic properties analysed in this study, using the Rockstar halo catalogue of The300 hydro clusters. The steps for each of the 10 partitions of the cross-validation are:
Figure 1. Spearman correlation coefficient matrix for the (feature) variables of the Rockstar identified clusters. The variables are organised in different
blocks according to their correlation values. Variables for each block are denoted in the x-axis in brackets: [1,...,12] are mass and velocity variables, [13,...,16]
correspond to ellipticity, [17,...,21] are related to the dynamical state of the cluster, [22,23] represent dimensionless spin parameters and [24,25,26] are related
to the scale factor and time evolution of mass accretion. Note that this matrix is symmetric with respect to the diagonal. Each variable description can be found
in Appendix A.
Figure 2. Mass distribution of The300 galaxy clusters analysed in this work.
(i) Find the best hyper-parameters for each of the tested algorithms (RF, XGBoost and MLP). For this, a grid search with 5-fold cross-validation within the training dataset only was performed. The values for the grid of hyper-parameters are shown below;
Figure 3. Redshift distribution of The300 galaxy clusters analysed in this work.
(ii) The best set of hyper-parameters for each method was used to train a single model on the whole training set;
(iii) The models were validated using the test set.
In order to generate dark-matter-only halo catalogues with hydrodynamic properties, the 10 trained models from each of the 10 folds of
the cross-validation were used. The hydrodynamic features of each
halo are then computed as the average of the inferred values from
these 10 models.
For the grid search, the sets of tested hyper-parameter values for each of the analysed methods are:
• Random Forest:
– number of trees in the forest: 'n_estimators' = [100, 500]
– number of features to consider when looking for the best split: 'max_features' = ['sqrt', 'log2']
• XGBoost:
– 'n_estimators' = [100, 500]
– maximum depth of a tree: 'max_depth' = [6, 10, 14, 15, 16, 20]
– minimum loss reduction required to make a further partition on a leaf node of the tree: 'gamma' = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
– step-size shrinkage used in the update to prevent overfitting: 'eta' = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
• MLP:
– 'hidden_layer_sizes' = [(8,), (20,), (100,), (8,8), (8,20,8), (20,20,20), (100,100,100), (20,20,20,20), (100,100,100,100)]
– 'activation' = 'relu'
– 'solver' = 'adam'
– 'learning_rate' = $10^{-4}$
Furthermore, the MLP has been trained for 500 epochs, or until the training loss is constant for 20 epochs. For more information on these hyper-parameters, we refer the reader to the Python libraries used throughout this work: for RF and MLP we have used https://scikit-learn.org (Pedregosa et al. 2011), and for XGBoost https://github.com/dmlc/xgboost. A minimal sketch of this training setup is given below.
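The following sketch shows the procedure for XGBoost and one target; RF and MLP are handled analogously with their own grids. `X` and `y` are assumed to hold the logarithmic features and one logarithmic target (e.g. $\log M_{\rm gas}$); note that the scikit-learn wrapper of XGBoost calls the 'eta' parameter 'learning_rate'.

# Sketch: outer 10-fold cross-validation with an inner 5-fold grid search
# (steps i-iii), keeping the best model of each fold for later inference.
import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [6, 10, 14, 15, 16, 20],
    'gamma': [0.1 * k for k in range(11)],
    'learning_rate': [0.1 * k for k in range(11)],   # 'eta' in the native interface
}

fold_models, fold_scores = [], []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    search = GridSearchCV(XGBRegressor(), param_grid, cv=5,
                          scoring='neg_mean_squared_error')  # log MSE, since y is logged
    search.fit(X[train_idx], y[train_idx])                   # steps (i) and (ii)
    best = search.best_estimator_
    fold_models.append(best)
    fold_scores.append(np.mean((best.predict(X[test_idx]) - y[test_idx])**2))  # step (iii)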
In order to train these models, the mean squared error of the logarithmic values of the targets (log MSE) was used as the loss function:

    $\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} (\log y_{{\rm true},i} - \log y_{{\rm pred},i})^2$,    (6)

where $y_{{\rm true},i}$ is the true value of the target extracted from The300 simulation and $y_{{\rm pred},i}$ is the target value predicted by our model. Note that, since the model is trained with $\log y$ as target, the prediction of the model is directly the logarithm of the given target. In addition, $n$ corresponds to the number of objects in the dataset (e.g. train, validation or test set) on which the log MSE is computed. For numerical reasons, we have also used the logarithmic values of the features during the training process.
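Since the models are trained on $\log y$, the final catalogue value for each halo is obtained by averaging the 10 fold models in logarithmic space. A minimal sketch, continuing the illustrative names above:

# Sketch: average the predictions of the 10 fold models (still in log space).
import numpy as np

def predict_baryon_property(fold_models, X_new):
    preds = np.stack([model.predict(X_new) for model in fold_models])
    return preds.mean(axis=0)   # the logarithm of the predicted target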
3.2 Feature importance and dimensionality reduction
Although machine learning models can generalise complex functions, it is generally not trivial to interpret their decisions; in fact, they are often referred to as black-box estimators (e.g. Barredo Arrieta et al. 2020). Therefore, it is of great value to be able to inspect the learnt relation between features and targets for a particular model. One such inspection technique is feature importance. Particularly, feature importance is a family of techniques that assigns a score $F_s(x, y)$ to each input feature $x$ depending on how useful it is for predicting a particular target $y$. Furthermore, feature importance is closely related to dimensionality reduction, a family of techniques that aims at removing non-informative variables from a model (e.g. Kuhn et al. 2013). In this section, we use feature importance algorithms to determine which features are most relevant and thereby reduce the dimensionality of our 27-dimensional input space.
One commonly used algorithm to estimate feature importance for ensembles of decision trees (such as RF and XGBoost) is permutation importance (Breiman 2001). In this algorithm, the importance of each feature is estimated as the decrease of the model score when the values of that feature are randomly shuffled. This technique, however, fails when correlated features are present in the dataset (Altmann et al. 2010). A second shortcoming of this algorithm is that it only considers the importance of individual features.
Another technique is the use of forests of trees to evaluate the importance of features, computed as the mean and standard deviation of the accumulated impurity decrease within each tree (Breiman 2001), which for regression is the variance reduction. In random forests, internal node features are selected according to some criterion, or loss function; we can then measure how much, on average, each feature decreases the criterion in the splits of the forest. Nevertheless, this technique also fails because our features are highly correlated, and it is also known to be biased in favour of variables with many possible split points (e.g. Nembrini et al. 2018).
Instead, we use the Greedy Search Feature Importance Algorithm (GSFIA; see for example Ferri et al. 1994). This technique considers the importance of combinations of features and not only the individual feature importance. It works iteratively by selecting and evaluating one variable at a time until all features are ordered from the most to the least relevant. The algorithm works with a list of selected variables $L$, initially empty, $L = [\,]$, and a pool of possible variables to be selected, $P$, initially containing all $D$ variables of the problem, $P = \{x_1, \ldots, x_D\}$. A procedure is then repeated $D$ times in which, at each step, one variable from the pool $P$ is selected and moved to the list $L$. In the $k$-th step of the loop, the procedure creates $|P|$ models, each trained on all the features in $L$ plus one feature from $P$. The model that minimises the MSE identifies the most important variable of $P$ in combination with the variables in $L$. This variable is then removed from $P$ and appended to the list $L$. At the end of the algorithm, all variables of the problem are sorted by importance in the list $L$, together with the loss function associated with them. GSFIA is depicted using pseudocode in Algorithm 1.
With this algorithm we can define the feature importance score $F(x, y)$ as follows:
• Run GSFIA to rank all features from the most to the least important and save the corresponding values of the log MSE.
• The score is then defined as the log MSE at every iteration, normalised to the corresponding value of the first iteration.
Note that the normalised log MSE is 1 for the first feature and decreases progressively as more features are considered, until it converges to a minimum value. It can happen that, after including several features, the normalised log MSE increases as more features are included (see Fig. 4 for the case of $M_{\rm star}$). This indicates that the last features included do not improve, or even degrade, the performance of the model.
The algorithm was run using random forest as the model (line 9 of Algorithm 1). In addition, due to the randomness of the ML model, the inner loop of the algorithm was repeated 10 times in order to reduce the variability of the results. In Fig. 4, the average of the normalised log MSE and its standard deviation are shown for the different targets considered. On the horizontal axis, the final ordering of the feature variables is shown.
Algorithm 1: Pseudocode of the Greedy Search Feature Importance Algorithm
inputs : x = feature dataset; y = target dataset
outputs: L = list of features ordered by their degree of importance; score = normalised MSE of every element in L
 1   L = [empty list];
 2   score = [empty list];
 3   P = [x_1, ..., x_D];
 4   i = 0;
 5   while i < D do
 6       loss = zeros(length(P));
 7       j = 0;
 8       while j < length(P) do
 9           dataset = L + P[j]  # concatenation of lists
10           model.train(dataset, y);
11           loss[j] = model.MSE;
12           j = j + 1;
13       indx = argmin(loss);
14       L.append(P[indx]);
15       P.drop(indx);
16       score.append(loss[indx]);
17       i = i + 1;
18   score = score / score[0];
19   return L, score
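A runnable Python version of Algorithm 1 could look as follows; scoring each candidate feature set with a 5-fold cross-validated MSE is our illustrative choice for the model.MSE step.

# Sketch of GSFIA with a random forest, assuming X is a pandas DataFrame of
# (logged) features and y the (logged) target, so the MSE is the log MSE of Eq. (6).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def greedy_feature_importance(X, y):
    selected, scores = [], []
    pool = list(X.columns)
    while pool:
        losses = []
        for candidate in pool:                       # |P| models per step
            model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
            mse = -cross_val_score(model, X[selected + [candidate]], y,
                                   scoring='neg_mean_squared_error', cv=5).mean()
            losses.append(mse)
        best = int(np.argmin(losses))                # most important remaining feature
        selected.append(pool.pop(best))
        scores.append(losses[best])
    return selected, np.array(scores) / scores[0]    # ranked features + normalised MSE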
Variables in red are the reduced set of features that will be considered for further analysis. These features are summarised in Table 1.
As shown in Table 1, the selected variables generally come from different correlation blocks of Fig. 1, as expected: variables from the same block are correlated, so once the algorithm chooses one feature it skips variables carrying the same information. However, this is not always the case (e.g. variables 2, 6 and 7 are selected together for a couple of targets). This can be explained by the fact that the correlation between those variables, although high, is not exactly 1; in those cases, the marginal information provided by a second variable inside a correlated block is higher than that given by other variables. As far as the meaning of the selected variables is concerned, we can distinguish two important blocks in the correlation matrix: the mass and velocity block (the first block, variables 0 to 12) and the time evolution block (variables 24 to 26). The conclusion of this analysis is that the remaining blocks are redundant or contribute little to estimating the baryon properties, i.e. the ellipticity block (13 to 16), the dynamical state block (17 to 21) and the spin block. Moreover, masses and velocities are the most important features for estimating baryon properties, while the variables associated with the time evolution of the mass accretion onto halos play a secondary role in the regression algorithms. The redundancy of the ellipticity variables can be explained by taking into account that we are estimating integrated quantities from the particles within spheres of radius $R = R_{500}$, regardless of the shape of their 3D distributions.
Note that we combine data from different redshifts in our training and test samples. We do not expect the evolution of these baryon properties to affect our results because (1) as shown in Cui et al. (2022), these quantities in the Gadget-X simulations hardly depend on redshift, especially at $z \lesssim 1$ (see also Truong et al. 2018, for example); and (2) we also include the scale factor as a feature variable in the training. If there were a clear redshift dependence of any target
Table 1. Lists of selected DM-only features for the different targets after applying GSFIA (Algorithm 1).

target    Important features
$M_{\rm gas}$    M500c(2), Vpeak(7), scale_of_last_MM(25), Macc(6), a(24)
$M_{\rm star}$   Vpeak(7), Halfmass_Scale(26), scale_of_last_MM(25), a(24)
$T_{\rm gas}$    Vpeak(7), a(24), scale_of_last_MM(25)
$Y_{\rm X}$      Vpeak(7), M500c(2), scale_of_last_MM(25), vrms(12), Macc(6)
$Y_{\rm SZ}$     Vpeak(7), Mpeak(4), scale_of_last_MM(25), a(24), rs(17)
variable, the scale factor feature would show a higher contribution. However, as shown in Table 1 and Fig. 4, the scale factor contributes only weakly to the normalised loss function $\mathcal{L}$.
Furthermore, we have to highlight that, although we have used random forest for the GSFIA, other machine learning algorithms could also be used. However, GSFIA is computationally expensive, given that its computing time increases with the number of features $D$ as $O(D^2)$. We therefore prefer RF because it is computationally more efficient and does not have as many hyper-parameters to tune. Consequently, this choice might introduce a bias, given that a particular model is being used for the selection of the important variables. However, in the next section we will show that this particular selection of variables yields similar performance for the different ML algorithms considered throughout this work.
4 RESULTS
In this section, we first study which machine learning algorithm performs best on our particular dataset and analyse the accuracy of our model predictions. Then, we populate the dark-matter-only MDPL2 simulation with baryon properties and determine whether we can also successfully apply the trained machine learning model to dark-matter-only low-resolution simulations.
4.1 Error analysis
In order to determine the accuracy of our ML models, we have trained our 3 models both on the dataset composed of all features and on the dataset with the reduced set of features summarised in Table 1, using the experimental setup described in the previous section. The average performance of the models is shown in Fig. 5. In the left panel we show the log MSE defined in Eq. (6) for the different tested models as a function of the target variables when all input features are used. In the right panel, the same quantities are displayed for the reduced set of features.
As a general result, it can be observed from Fig. 5 that the XGBoost algorithm has the best performance for all targets except for $Y_{\rm SZ}$ with the reduced set of features; in any case, the performance of XGBoost on this last target is almost identical to that of the best model. For RF, we find equivalent performances for both sets of features in $M_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$, a somewhat worse result for the reduced set on $T_{\rm gas}$, and better performance on $M_{\rm star}$ for the reduced set. For XGBoost, the trends are similar to those of random forest, although the difference in performance between the two sets of features is negligible for $T_{\rm gas}$ and smaller for $M_{\rm star}$. For the MLP model, all results using the reduced set of features are worse than those obtained using all the features in the catalogues. These differences between the tree-based approaches (RF and XGBoost) and MLP can be explained by the fact that the selection of important features was done using random forest. In any case, the performance of MLP is
Figure 4. Normalised MSE (y-axis) given by GSFIA (Algorithm 1) as a function of the DM-only variables described in Appendix A, ranked by feature importance in descending order. From top to bottom we show our results for the different targets: $M_{\rm gas}$, $M_{\rm star}$, $T_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$. Blue dashed lines represent the average value of the normalised MSE for 10 different k-folds and error bars correspond to the standard deviation. The selected features for each target are highlighted in red and listed in Table 1.
the worst for all targets, even when all features are considered. From this analysis we conclude that XGBoost gives the most accurate model predictions; therefore, we will only consider this algorithm for the rest of this work. A summary of the performance of all models with the reduced set of features can be found in Table 2.
The scores shown in Fig. 5 summarise the performance of the models in a single value; however, they do not allow us to understand how the models perform in different regions of the space of features and targets. In order to analyse this, we first define the relative difference in performance for a single target $y$ as

    ${\rm diff}(y) = \frac{y_{\rm pred} - y_{\rm true}}{y_{\rm pred}}$,    (7)

where $y_{\rm pred}$ and $y_{\rm true}$ are the predicted and true target values for a given instance.
Figure 5. The logarithmic MSE defined in Eq. (6) for the 3 ML models considered: RF in blue, XGBoost in red and MLP in black. The x-axis indicates the different baryon targets. The points with error bars represent the mean and the standard deviation of the logarithmic MSE for the test set using 10 different k-folds. The left panel corresponds to training the ML models using all the DM-only variables shown in Fig. 1, while in the right panel the algorithms are trained with the reduced set of features listed in Table 1.
Note that in Eq. (7) we are not considering the logarithmic values of the targets. These differences can be interpreted as a probability distribution: given a value of $y_{\rm pred}$, one can estimate the intrinsic scatter associated with that particular predicted value. These differences are shown in Fig. 6 as a function of the predicted target $y_{\rm pred}$ (first column), the cluster mass $M_{500}$ (second column) and the peak of the circular velocity profile along the mass accretion history, $V_{\rm peak}$ (third column), for all redshifts. In Fig. 6, instead of plotting the individual differences for all instances, the mean value (dashed black) and the 66% (red region) and 95% (blue region) confidence intervals are represented for sliding windows (bins) containing roughly the same number of objects.
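The binned statistics of Fig. 6 can be sketched as follows, with `x` one of the binning variables ($y_{\rm pred}$, $M_{500}$ or $V_{\rm peak}$) and an illustrative window size.

# Sketch: sliding windows with roughly equal numbers of clusters, each
# summarised by the mean and the 66% and 95% intervals of diff (Eq. 7).
import numpy as np

def sliding_window_stats(x, diff, n_per_window=500):
    order = np.argsort(x)
    stats = []
    for start in range(0, len(x) - n_per_window + 1, n_per_window // 2):
        w = order[start:start + n_per_window]
        stats.append((x[w].mean(), diff[w].mean(),
                      np.percentile(diff[w], [17, 83]),      # 66% interval
                      np.percentile(diff[w], [2.5, 97.5])))  # 95% interval
    return stats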
The main result that can be observed from Fig. 6 is that the predictions are unbiased with respect to the most important features ($M_{500}$ and $V_{\rm peak}$) and with respect to the predicted targets, as the mean is very close to 0 over the whole range. However, the scatter varies depending on the target, as depicted in Fig. 5 and Fig. 6. In particular, $T_{\rm gas}$ is the most accurately predicted target, with an average scatter of 7% (the standard deviation of Eq. 7), while $Y_{\rm X}$ is the predicted variable with the highest average scatter (16%). A summary of the overall statistics of the error scatter can be found in Table 2. In addition, we find a slight dependence of the scatter on $M_{500}$, $V_{\rm peak}$ and the inferred target values (except for $T_{\rm gas}$): the scatter seems to decrease as these values increase. A possible explanation is that massive clusters are more self-similar than smaller groups, which present a larger halo-to-halo variation as a consequence of the stronger impact of non-gravitational processes.
4.2 ML inference of baryonic properties in dark-matter-only datasets
We now proceed to apply the trained ML model to infer the different baryonic properties in the full set of MDPL2 halo catalogues. For this, we use the 10 XGBoost models trained on the reduced set of features of The300 clusters. In order to create the catalogue, we first build a dataset with the reduced set of features (shown in Table 1) for each halo of the full MDPL2 box. Note that the same transformations and cutoffs are applied to the full MDPL2 Rockstar catalogue as in § 2. Next, we discard clusters whose feature values are not inside the hyper-cube defined by the features used
Figure 6. The relative difference (y-axis) defined in Eq. (7) as a function of the XGBoost-predicted target variable (first column), the cluster 3D dynamical mass M500c(2) (second column) and the peak value of the radial circular velocity profile across the halo's mass accretion history, Vpeak(7) (third column). From top to bottom, each row corresponds to a different baryon target: $M_{\rm gas}$, $M_{\rm star}$, $T_{\rm gas}$, $Y_{\rm X}$, $Y_{\rm SZ}$. Dashed black lines are the mean values of the relative difference, which are very close to diff = 0. Red and blue regions represent the 66% and 95% confidence intervals, respectively. Additionally, the 0% and 20% relative difference lines are drawn in dashed green. The data are binned using sliding windows that contain roughly the same number of clusters.
Table 2. Logarithmic value of the MSE defined in Eq. (6) for the reduced set of features in Table 1. In brackets, we show the standard deviation (scatter) $\sigma$ of the relative difference defined in Eq. (7). Rows correspond to values of the log MSE for the different models, while columns correspond to the different baryonic targets.

Value                        $M_{\rm gas}$   $M_{\rm star}$   $T_{\rm gas}$   $Y_{\rm X}$   $Y_{\rm SZ}$
log MSE ×10^{-3} XGBoost     2.17 (11%)      3.43 (14%)       0.94 (7%)       4.81 (17%)    4.07 (16%)
log MSE ×10^{-3} RF          2.20 (11%)      3.47 (14%)       1.56 (10%)      4.85 (17%)    3.96 (16%)
log MSE ×10^{-3} MLP         2.75 (12%)      8.16 (21%)       2.23 (11%)      5.97 (19%)    5.82 (19%)
for training, since ML models are not designed for extrapolation. This means that only MDPL2 clusters satisfying

    $x^{\rm training}_{\rm min} \leq x^{\rm MDPL2} \leq x^{\rm training}_{\rm max}$, for $x \in$ features,    (8)

are taken into consideration, where $x^{\rm training}$ is a feature of the training dataset and $x^{\rm MDPL2}$ is the same feature for the full MDPL2 simulation. Only 397 clusters out of 1,306,185 lie outside the hyper-cube defined by the most important features and are therefore not considered in the analysis.
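A minimal sketch of this cut, assuming `X_train` and `X_mdpl2` are arrays restricted to the selected features:

# Sketch: keep only MDPL2 halos whose features lie inside the training
# hyper-cube of Eq. (8); in the paper only 397 of 1,306,185 halos fail it.
import numpy as np

def inside_training_hypercube(X_train, X_mdpl2):
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.all((X_mdpl2 >= lo) & (X_mdpl2 <= hi), axis=1)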
In order to evaluate whether the generated catalogue presents properties that are consistent with those of the fully simulated data, we compare several baryon properties as a function of halo mass for The300 and the full MDPL2 generated catalogue. These results are shown in Fig. 7 for different redshift values (columns). In these plots, the values of the targets (rows) are plotted against $M_{500}$. For the targets $M_{\rm gas}$ and $M_{\rm star}$, the plots show the relative fractions

    $f_i = M_i / M_{500}$,    (9)

where $i$ is either gas or star. Error bars represent the standard deviation of our model predictions on the MDPL2 catalogue ($1\sigma$), orange/brown regions correspond to the $1\sigma$ region for The300 test-set predictions (over all k-folds) and blue regions are the equivalent for The300 true targets. Moreover, in the last row of the figure the number of clusters per bin is shown as a function of $M_{500}$ for both The300 and MDPL2 datasets.
As a general result, the mean predicted values for the MDPL2 objects (black error bars) are similar to, and their distributions per mass bin comparable with, the true values (blue regions), in agreement with Fig. 6. However, the scatter of the predictions is slightly smaller (by around 10-20%) than the corresponding scatter of the true values of The300 data for $f_{\rm gas}$ and $f_{\rm star}$. Furthermore, results similar to those shown in Fig. 7 are obtained when plotting as a function of $V_{\rm peak}$ instead of $M_{500}$. We point out that for massive clusters ($> 8\times 10^{14}\,h^{-1}\,{\rm M_\odot}$), the number of objects is similar in The300 and MDPL2 simulations. In particular, the last two mass bins are mostly composed of the same objects, and the difference lies in the baryon properties of The300 simulation.
4.2.1 Dependence of ML model predictions on DM mass resolution
The ML model has been trained on a particular DM simulation with a fixed mass resolution. Here we are interested in comparing the predictions of the ML model when applied to halo catalogues from simulations with lower mass resolution. Since some of the halo features are expected to be affected by resolution, the baryon quantities inferred by the ML model could also be affected. Since our goal is to make this ML model as universal as possible, so that it can be applied to different DM-only simulations with larger volumes, it is important to test for these effects. To do so, we apply the trained XGBoost model to two simulations run with identical initial conditions but differing by a factor of 8 in particle mass. For this test, we also use a realisation completely different from MDPL2, namely the UNIT project. The UNIT^5 N-body cosmological simulations (UNITSIM; Chuang et al. 2019) are designed to provide accurate predictions of the clustering properties of dark matter halos using the suppressed variance method proposed by Angulo & Pontzen (2016). We particularly focus on one of the UNIT simulations with the same box side length as MDPL2 (i.e. $1\,h^{-1}$ Gpc) and a similar number of particles ($4096^3$). Furthermore, this simulation has also been performed with 8 times fewer particles ($2048^3$). For simplicity we will refer to these two simulations as UNITSIM4096 and UNITSIM2048 for the high- and low-resolution versions, respectively.
Dark matter cluster-size halo catalogues from Rockstar + Consistent Trees are then selected for UNITSIM4096 and UNITSIM2048, following the same procedure described in § 2. We then apply the trained XGBoost model to these catalogues to infer the target baryon properties for each DM halo in the two versions. These baryon properties present similar statistics (mean and scatter per mass bin) to those shown in Fig. 7. In order to make a more quantitative comparison of the results for the two UNIT simulations, we bin the data as in Fig. 7 according to $M_{500}$, compute the difference of the mean values and estimate an upper limit for its scatter as

    $\bar{\mu} = \mu_{2048} - \mu_{4096}$ and $\bar{\sigma} = \sqrt{\sigma^2_{2048} + \sigma^2_{4096}}$.    (10)
Here, $\mu$ stands for the mean value and $\sigma$ for the standard deviation of a bin. The particular values of $\bar{\mu}$ and $\bar{\sigma}$ are shown for 3 different snapshots in Fig. 8. As can be seen in this figure, $\bar{\mu} \simeq 0$ with a small scatter for all mass bins. The scatter $\bar{\sigma}$ is within $\sim$2% for $f_{\rm gas}$ and $\sim$0.5% for $f_{\rm star}$. For $T_{\rm gas}$, the residuals amount to $\sim$0.1 dex, and for $Y_{\rm X}$ and $Y_{\rm SZ}$ up to $\sim$0.2 dex. Therefore, we conclude that the baryonic properties predicted by the ML model for the same halos simulated with a factor of 8 difference in mass resolution are statistically equivalent.
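A minimal sketch of this comparison, with illustrative names for the predictions of the two resolution versions:

# Sketch: per mass bin, difference of the mean predictions and the
# quadrature upper limit on its scatter, Eq. (10).
import numpy as np

def resolution_residuals(pred_2048, pred_4096, m500, bin_edges):
    mu_bar, sigma_bar = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        sel = (m500 >= lo) & (m500 < hi)
        mu_bar.append(pred_2048[sel].mean() - pred_4096[sel].mean())
        sigma_bar.append(np.hypot(pred_2048[sel].std(), pred_4096[sel].std()))
    return np.array(mu_bar), np.array(sigma_bar)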
5 VALIDATION OF THE GAS SCALING RELATIONS
Scaling relations are generally power laws that relate properties of astrophysical systems, such as the colour-magnitude relation or the Tully-Fisher relation (Tully & Fisher 1977) for galaxies. The applications of scaling relations are manifold, such as inferring the masses of galaxy clusters, which are sensitive to cosmological parameters (e.g. Planck Collaboration et al. 2016). For a recent review of scaling relations for galaxy clusters we refer the reader to e.g. Lovisari & Maughan (2022). The temperature-mass relation can be
^5 https://unitsims.ft.uam.es
Figure 7. XGBoost predictions (y-axis) as a function of cluster mass, M500c(2), at 4 different redshifts (columns) from $z=0$ to $z=1.032$. The first five rows correspond to the predictions of our baryonic targets: the gas and star fractions $f_{\rm gas}$ and $f_{\rm star}$, the gas temperature $T_{\rm gas}$, the X-ray Y-parameter $Y_{\rm X}$ and the SZ Y-parameter $Y_{\rm SZ}$. The data are binned along the x-axis and the means of the predicted values for the test set are shown as red dashed lines, with their scatter (standard deviation) represented as shaded brown/orange regions. True values of The300 train set are shown as blue dashed lines, with their scatter corresponding to the shaded blue regions. Black points represent the average values of the predictions for MDPL2 clusters per mass bin, with error bars corresponding to the $1\sigma$ scatter. The bottom row shows the number of cluster objects (N) per mass bin for The300 (blue histogram) and MDPL2 (orange) simulations.
written as

    $E(z)^{-2/3} \frac{T_{\rm gas}}{\rm keV} = 10^{A_T} \left(\frac{M}{M_\star}\right)^{B_T}$,    (11)

where $E(z) = H(z)/H_0$ and $H(z)$ is the Hubble parameter. Similarly, for the $Y_{\rm X}-M$ and $Y_{\rm SZ}-M$ relations we use

    $E(z)^{-2/3} \frac{Y_{\rm X}}{h^{-1}\,{\rm M_\odot}\,{\rm keV}} = 10^{A_X} \left(\frac{M}{M_\star}\right)^{B_X}$    (12)

and

    $E(z)^{-2/3} \frac{d_A^2\, Y_{\rm SZ}}{{\rm Mpc}^2} = 10^{A_{SZ}} \left(\frac{M}{M_\star}\right)^{B_{SZ}}$.    (13)
Here, $A_i$ and $B_i$ ($i = T, {\rm X}, {\rm SZ}$) are the parameters that we are interested in obtaining by fitting the above equations to our data, and $M_\star$ is a pivot mass. Once we have generated baryon catalogues for the different N-body simulations, we apply a simple linear fit in logarithmic space to the equations listed above. However, selecting data from different snapshots gives us small variations of the $A_i$
Figure 8. The difference of the XGBoost-predicted baryonic properties for halos in the UNITSIM2048 and UNITSIM4096 DM-only simulations as a function of the cluster total mass M500c(2). The y-axis shows the values of $\bar{\mu}$ (black points) and $\bar{\sigma}$ (error bars and blue region) defined in Eq. (10) for different mass bins. From top to bottom, the rows show the baryonic properties considered: $f_{\rm gas}$, $f_{\rm star}$, $T_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$. From left to right, we show our results for three different redshifts: z=0 (first column), z=0.52 (second column) and z=1 (third column).
and $B_i$ best-fitting parameters with redshift. Therefore, we use the following parametrization to study the redshift dependence:

    $A_i(z) = A_{i,0}(1+z)^{\alpha_i}$,    (14)
    $B_i(z) = B_{i,0}(1+z)^{\beta_i}$,    (15)

where $A_{i,0}$ and $B_{i,0}$ are the values of the intercept and slope at $z=0$, and $\alpha_i$ and $\beta_i$ describe their possible dependence on redshift. With this parametrization, we apply a non-linear least-squares fit to the functions described by equations (11), (12) and (13), updated with equations (14) and (15). The best-fitting parameters are shown in Table 3 and Table 4. Note that we have used the mass corresponding to the N-body simulation (the feature variable M500c(2)) as the mass of the cluster.
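As an illustration, the $T-M$ fit of Eq. (11) with the parametrization of Eqs. (14)-(15) can be performed with scipy.optimize.curve_fit; here `m` is the cluster mass in units of the pivot mass, `z` the redshift, `Ez` the corresponding $E(z)$ values and `t` the gas temperatures in keV (the starting values are only rough guesses).

# Sketch: non-linear least-squares fit of the T-M relation in log space,
# with redshift-dependent intercept and slope (Eqs. 11, 14 and 15).
import numpy as np
from scipy.optimize import curve_fit

def log_T_model(data, A0, B0, alpha, beta):
    m, z, Ez = data
    A = A0 * (1.0 + z)**alpha                 # Eq. (14)
    B = B0 * (1.0 + z)**beta                  # Eq. (15)
    return A + B * np.log10(m) + (2.0 / 3.0) * np.log10(Ez)

params, cov = curve_fit(log_T_model, (m, z, Ez), np.log10(t),
                        p0=[0.2, 0.6, 0.0, 0.0])
A_T0, B_T0, alpha_T, beta_T = params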
As a general result, the fitting parameters are in agreement among the three different N-body simulations and are slightly different from those of The300 hydrodynamical simulation. This deviation, though small, is caused by considering the full box of the dark-matter-only simulations instead of the smaller volume of the 'zoom' simulation. The effect of resolution is negligible for galaxy clusters. There is also a small difference between the true data of The300 simulations (The300) and the fit obtained using the ML-predicted data (The300*). This slight difference is mainly visible in the intrinsic scatter of the linear fit, which is generally smaller in the case of The300*. It is important to note that the scatter of the scaling laws for The300 simulations is generally larger when compared with the values shown in Table 2, where the scatter (standard deviation of the relative difference) is reduced by a factor of 0.5 for the gas temperature, 0.3 for $Y_{\rm X}$ and 0.45 for $Y_{\rm SZ}$. Moreover, the most relevant variables for each gas property presented in Table 1 can be used for finding analytical expressions for scaling laws with a reduced MSE using genetic algorithms (Wadekar et al. 2022).
As far as the redshift dependence is concerned, it is negligible for $Y_{\rm X}$ and $Y_{\rm SZ}$, where the parameters $\alpha$ and $\beta$ are of order $\lesssim 10^{-3}$. However, the parameter $\alpha_T \simeq -0.3$ cannot be ignored. This indicates that the evolution of $T_{\rm gas}$ is relevant, as can also be appreciated in Table 1, where the scale factor a(24) is the second most important variable, reducing the normalised loss function $\mathcal{L}$ from 1 to 0.6.
6 SUMMARY AND CONCLUSIONS
Numerical simulations are key to studying galaxy clusters. On the one hand, with current technology it is possible to perform large-volume N-body simulations that describe the dark-matter component. However, large-volume hydrodynamical simulations cannot be carried out due to their computational demands. We have therefore trained a set of machine learning models to populate large-volume dark-matter-only simulations with baryonic properties. In particular, we have defined our feature space as the Rockstar variables of the DM-only halos, and our target variables are directly estimated from The Three Hundred hydrodynamical simulations: the gas mass $M_{\rm gas}$, the stellar mass $M_{\rm star}$, the gas temperature $T_{\rm gas}$, the X-ray Y-parameter $Y_{\rm X}$ and the integrated Compton-y parameter $Y_{\rm SZ}$. All these quantities are integrated
Table 3. The best-fit parameters for the $T-M$, $Y_{\rm X}-M$ and $Y_{\rm SZ}-M$ relations for the different simulation sets. The log MSE of Eq. (6) and the average scatter of the relative difference of Eq. (7) (in parentheses) are also shown. For The300 simulation, the true values of the baryon properties have been used, while for The300* the XGBoost-predicted values are used instead. The relative error in the estimated parameters $A_i$ and $B_i$ is always $\leq 10^{-3}$.

Simulation     $A_{T,0}$  $B_{T,0}$  log MSE$_T$          $A_{X,0}$  $B_{X,0}$  log MSE$_X$          $A_{SZ,0}$  $B_{SZ,0}$  log MSE$_{SZ}$
The300         0.2083     0.6081     1.8×10^{-3} (10%)    13.09      1.718      8.3×10^{-3} (25%)    -5.499      1.697       9.5×10^{-3} (29%)
The300*        0.2082     0.6054     1.6×10^{-3} (10%)    13.08      1.718      3.9×10^{-3} (17%)    -5.497      1.692       7.2×10^{-3} (25%)
MDPL2          0.2133     0.5863     3.3×10^{-3} (11%)    13.07      1.767      2.8×10^{-3} (13%)    -5.513      1.710       6.5×10^{-3} (21%)
UNITSIM4096    0.2122     0.5865     3.3×10^{-3} (11%)    13.07      1.767      2.8×10^{-3} (13%)    -5.514      1.709       6.5×10^{-3} (21%)
UNITSIM2048    0.2126     0.5854     3.3×10^{-3} (11%)    13.07      1.766      2.8×10^{-3} (13%)    -5.515      1.709       6.5×10^{-3} (21%)
Table 4. The best-fitting redshift-dependence parameters for the scaling relations defined in Eq. (14) and Eq. (15).

Simulation  | α_T (×10⁻³)  | β_T (×10⁻³) | α_X (×10⁻³)    | β_X (×10⁻³)  | α_SZ (×10⁻³)  | β_SZ (×10⁻³)
The300      | −339.0 ± 5.3 | −3.8 ± 4.3  | −1.15 ± 0.16   | 3.4 ± 3.3    | 5.36 ± 0.41   | −0.113 ± 0.035
The300*     | −336.7 ± 4.8 | −11.1 ± 4.0 | −0.73 ± 0.11   | 6.4 ± 2.2    | 6.22 ± 0.45   | −0.122 ± 0.031
MDPL2       | −314.3 ± 1.2 | 30.4 ± 1.8  | 0.417 ± 0.021  | 0.42 ± 0.65  | 7.181 ± 0.074 | 0.2 ± 1.0
UNITSIM4096 | −308.1 ± 1.3 | 30.5 ± 1.8  | 0.52 ± 0.21    | 0.22 ± 0.68  | 7.162 ± 0.075 | 0.0 ± 1.1
UNITSIM2048 | −302.9 ± 1.3 | 29.9 ± 1.8  | −0.052 ± 0.021 | −0.93 ± 0.71 | 6.279 ± 0.075 | −2.0 ± 1.1
In particular, we have considered three different ML models: random forest (RF), extreme gradient boosting (XGBoost) and the multi-layer perceptron (MLP). We have determined that XGBoost is the algorithm best suited to our dataset and the one whose predictions are closest to the true hydrodynamical targets, as shown in Table 2. We have applied an algorithm, the Greedy Search Feature Importance Algorithm (GSFIA), to identify the features that carry the most predictive information. Using GSFIA, we have reduced the dimensionality of our feature space from 27 to approximately 5 variables, depending on the target. We have shown that masses and velocities carry most of the predictive information, while time-evolution variables play a secondary role in the prediction of our targets. Furthermore, the ellipticity, dynamical-state and spin features are redundant. A possible explanation is that our baryon targets are integrated over spherical regions.
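As an illustration of how such a greedy search can work, the sketch below implements a generic forward-selection loop with XGBoost: at each step it adds the feature that most reduces the validation MSE. The hyper-parameters, validation split and stopping rule are placeholders rather than the paper's tuned GSFIA setup, which optimizes a normalised loss.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def greedy_feature_search(X, y, feature_names, n_keep=5):
    """GSFIA-like sketch: greedily add the feature whose inclusion most
    reduces the validation MSE of an XGBoost regressor."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # Placeholder hyper-parameters, not the paper's tuned values.
            model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
            model.fit(X_tr[:, cols], y_tr)
            scores[j] = mean_squared_error(y_va, model.predict(X_va[:, cols]))
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        print(f"added {feature_names[best]}: val MSE = {scores[best]:.4f}")
    return [feature_names[j] for j in selected]
```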
We have then applied our trained ML model to populate the halo catalogues of two full-box N-body simulations with baryonic properties: the MultiDark simulation (MDPL2) and the UNIT N-body cosmological simulations (UNITSIM). The baryon properties predicted for MDPL2 are compatible with those of The300 simulations, as shown in Fig. 7. The application to two UNITSIM simulations with 1 h⁻¹Gpc box size and 2048³ and 4096³ particles has shown that our model can be successfully applied to boxes whose mass resolution is up to 1/8 of that of the simulation used for training. This suggests that this is a promising method to populate the UNITSIM large-volume N-body halos with baryon properties up to 27 Gpc³ (i.e. a 3 h⁻¹Gpc box with 6144³ particles). This will be an excellent tool to study the large-scale distribution of galaxy clusters in an unprecedented way. For instance, we can estimate the cosmic variance in the number counts of X-ray detected clusters from the eROSITA all-sky survey (Liu et al. 2021) by extracting many different light-cones from this large computational volume. This will be the subject of a forthcoming paper.
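A minimal sketch of this 'painting' step is given below, assuming an XGBoost regressor saved with its native serialization. The file names and the expected feature ordering are hypothetical; the actual released models and their inputs are documented in the DarkML repository listed in the Data Availability section.

```python
import numpy as np
import xgboost as xgb

# Load a trained regressor (hypothetical file name; one model per target).
model = xgb.XGBRegressor()
model.load_model("darkml_xgboost_Ysz.json")

# Hypothetical dump of the Rockstar features for all cluster-size halos,
# with columns assumed to follow the enumeration of Table A1.
halos = np.load("mdpl2_rockstar_features.npy")

# Predict the baryon property for every halo and store the mock catalogue.
log_Ysz = model.predict(halos)
np.save("mdpl2_predicted_Ysz.npy", log_Ysz)
```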
Furthermore, scaling relations are powerful mass-observable proxies. We have checked that the best-fitting parameters inferred from our three mock DM full-box baryon catalogues are compatible. They nevertheless differ slightly from those of The300, partially because of the considerably smaller number of cluster objects in the hydrodynamical simulations used to obtain the best-fitting values. This suggests that mass completeness has a small, though not negligible, impact on the calibration of the mass proxies.
To conclude, our work shows that ML models are a very useful tool for finding a mapping between the dark matter halo properties found in N-body simulations and the baryonic properties of the complex hydrodynamical simulations. We have checked that, on average, the catalogues generated for the three dark-matter-only simulations used throughout this paper have the same distributions as the true training set. They can therefore be used for painting dark matter halos with baryonic properties that are directly related to observed quantities, providing added value to large-volume collisionless N-body simulations.
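One quick, illustrative way to verify this kind of distributional agreement (not necessarily the paper's exact procedure) is a two-sample Kolmogorov-Smirnov test between the training targets and the painted catalogue; the file names below are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical arrays: a property from the hydrodynamical training set and
# the ML-predicted counterpart in a painted N-body catalogue.
train = np.load("the300_logTgas.npy")
mock = np.load("mdpl2_predicted_logTgas.npy")

# Two-sample KS test: a small statistic indicates similar distributions.
stat, pval = ks_2samp(train, mock)
print(f"KS statistic = {stat:.3f}, p-value = {pval:.3g}")
```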
ACKNOWLEDGEMENTS
DA and GY would like to thank MINECO/FEDER for financial support under the research grant PGC2018-094975-C21. WC is supported
by the STFC AGP Grant ST/V000594/1 and by the ATRAC-
CIÓN DE TALENTO INVESTIGADOR DE LA COMUNIDAD
DE MADRID 2020-T1/TIC-19882. He further acknowledges the
science research grants from the China Manned Space Project with
NO. CMS-CSST-2021-A01 and CMS-CSST-2021-B01. GM ac-
knowledges financial support from PID2019-106827GB-I00/AEI/10.13039/501100011033. The CosmoSim database used in this paper is a service by the Leibniz-Institute for Astrophysics Potsdam
(AIP). The MultiDark database was developed in cooperation
with the Spanish MultiDark Consolider Project CSD2009-00064.
The authors acknowledge The Red Española de Supercomputación
for granting computing time for running the hydrodynamical simu-
lations of The300 galaxy cluster project in the Marenostrum super-
computer at the Barcelona Supercomputing Center.
DATA AVAILABILITY
The trained models and data products for MDPL2, UNITSIM2048
and UNITSIM4096 are publicly available at https://github.com/The300th/DarkML.
References

Allen S. W., Evrard A. E., Mantz A. B., 2011, Annual Review of Astronomy and Astrophysics, 49, 409
Allgood B., Flores R. A., Primack J. R., Kravtsov A. V., Wechsler R. H., Faltenbacher A., Bullock J. S., 2006, MNRAS, 367, 1781
Altmann A., Toloşi L., Sander O., Lengauer T., 2010, Bioinformatics, 26, 1340
Angulo R. E., Pontzen A., 2016, MNRAS, 462, L1
Angulo R., Springel V., White S., Jenkins A., Baugh C., Frenk C., 2012, Monthly Notices of the Royal Astronomical Society, 426, 2046
Angulo R. E., Zennaro M., Contreras S., Aricò G., Pellejero-Ibañez M., Stücker J., 2021, MNRAS, 507, 5869
Bahé Y. M., et al., 2017, Monthly Notices of the Royal Astronomical Society, 470, 4186
Barnes D. J., Kay S. T., Henson M. A., McCarthy I. G., Schaye J., Jenkins A., 2016, Monthly Notices of the Royal Astronomical Society, p. stw2722
Barnes D. J., et al., 2017, Monthly Notices of the Royal Astronomical Society, 471, 1088
Barredo Arrieta A., et al., 2020, Information Fusion, 58, 82
Baugh C. M., 2006, Reports on Progress in Physics, 69, 3101
Behroozi P. S., Wechsler R. H., Wu H.-Y., 2012, The Astrophysical Journal, 762, 109
Behroozi P. S., Wechsler R. H., Wu H.-Y., Busha M. T., Klypin A. A., Primack J. R., 2013, ApJ, 763, 18
Benson A. J., 2012, New Astronomy, 17, 175
Bentéjac C., Csörgő A., Martínez-Muñoz G., 2021, Artificial Intelligence Review, 54, 1937
Bernardini M., Feldmann R., Anglés-Alcázar D., Boylan-Kolchin M., Bullock J., Mayer L., Stadel J., 2022, MNRAS, 509, 1323
Breiman L., 2001, Machine Learning, 45, 5
Bryan G. L., Norman M. L., 1998, ApJ, 495, 80
Bullock J. S., Kolatt T. S., Sigad Y., Somerville R. S., Kravtsov A. V., Klypin A. A., Primack J. R., Dekel A., 2001, MNRAS, 321, 559
Chen T., Guestrin C., 2016, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. ACM, New York, NY, USA, pp 785–794
Chisari N., et al., 2016, MNRAS, 461, 2702
Chuang C.-H., et al., 2019, MNRAS, 487, 48
Cora S. A., et al., 2018, MNRAS, 479, 2
Croton D. J., et al., 2016, ApJS, 222, 22
Cui W., Borgani S., Dolag K., Murante G., Tornatore L., 2012, MNRAS, 423, 2279
Cui W., Borgani S., Murante G., 2014, MNRAS, 441, 1769
Cui W., et al., 2018, Monthly Notices of the Royal Astronomical Society, 480, 2898
Cui W., et al., 2022, arXiv e-prints, p. arXiv:2202.14038
Davé R., Anglés-Alcázar D., Narayanan D., Li Q., Rafieferantsoa M. H., Appleby S., 2019, MNRAS, 486, 2827
Dietterich T. G., 1998, AI Magazine, 18, 97
Dolag K., Komatsu E., Sunyaev R., 2016, Monthly Notices of the Royal Astronomical Society, 463, 1797
Evrard A. E., Metzler C. A., Navarro J. F., 1996, ApJ, 469, 494
Fernández-Delgado M., Cernadas E., Barro S., Amorim D., 2014, Journal of Machine Learning Research, 15, 3133
Ferri F. J., Pudil P., Hatef M., Kittler J., 1994, in Machine Intelligence and Pattern Recognition, Vol. 16. Elsevier, pp 403–413
Fosalba P., Crocce M., Gaztañaga E., Castander F., 2015, Monthly Notices of the Royal Astronomical Society, 448, 2987
Habib S., et al., 2016, New Astronomy, 42, 49
Ishiyama T., et al., 2021, MNRAS, 506, 4210
Jo Y., Kim J.-h., 2019, Monthly Notices of the Royal Astronomical Society, 489, 3565
Kamdar H. M., Turk M. J., Brunner R. J., 2016, MNRAS, 457, 1162
Klypin A. A., Trujillo-Gomez S., Primack J., 2011, ApJ, 740, 102
Klypin A., Yepes G., Gottlöber S., Prada F., Heß S., 2016, MNRAS, 457, 4340
Knollmann S. R., Knebe A., 2009, ApJS, 182, 608
Kravtsov A. V., Borgani S., 2012, Annual Review of Astronomy and Astrophysics, 50, 353
Kravtsov A. V., Vikhlinin A., Nagai D., 2006, ApJ, 650, 128
Kuhn M., Johnson K., et al., 2013, Applied Predictive Modeling. Vol. 26, Springer
Lacey C. G., et al., 2016, MNRAS, 462, 3854
Le Brun A. M., McCarthy I. G., Melin J.-B., 2015, Monthly Notices of the Royal Astronomical Society, 451, 3868
Liu A., et al., 2021, arXiv e-prints, p. arXiv:2106.14518
Lovisari L., Maughan B. J., 2022, arXiv e-prints, p. arXiv:2202.07673
McCarthy I. G., Bird S., Schaye J., Harnois-Deraps J., Font A. S., Van Waerbeke L., 2018, Monthly Notices of the Royal Astronomical Society, 476, 2999
Moews B., Davé R., Mitra S., Hassan S., Cui W., 2021, Monthly Notices of the Royal Astronomical Society, 504, 4024
Murante G., Monaco P., Giovalli M., Borgani S., Diaferio A., 2010, MNRAS, 405, 1491
Navarro J. F., Frenk C. S., White S. D. M., 1997, ApJ, 490, 493
Nelson D., et al., 2019, Computational Astrophysics and Cosmology, 6, 2
Nembrini S., König I. R., Wright M. N., 2018, Bioinformatics, 34, 3711
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Peebles P. J. E., 1969, ApJ, 155, 393
Planck Collaboration et al., 2016, A&A, 594, A13
Planelles S., Borgani S., Dolag K., Ettori S., Fabjan D., Murante G., Tornatore L., 2013, Monthly Notices of the Royal Astronomical Society, 431, 1487
Potter D., Stadel J., Teyssier R., 2017, Computational Astrophysics and Cosmology, 4, 1
Rasia E., et al., 2015, ApJ, 813, L17
Schaye J., et al., 2015, MNRAS, 446, 521
Schmidhuber J., 2015, Neural Networks, 61, 85
Sembolini F., Yepes G., De Petris M., Gottlöber S., Lamagna L., Comis B., 2013, MNRAS, 429, 323
Skillman S. W., Warren M. S., Turk M. J., Wechsler R. H., Holz D. E., Sutter P. M., 2014, arXiv e-prints, p. arXiv:1407.2600
Sunyaev R. A., Zeldovich Y. B., 1972, Comments on Astrophysics and Space Physics, 4, 173
Truong N., et al., 2018, MNRAS, 474, 4089
Tully R. B., Fisher J. R., 1977, Astronomy and Astrophysics, 54, 661
Villaescusa-Navarro F., et al., 2022, arXiv e-prints, p. arXiv:2201.01300
Virtanen P., et al., 2020, Nature Methods, 17, 261
Vogelsberger M., et al., 2014, MNRAS, 444, 1518
Wadekar D., Villaescusa-Navarro F., Ho S., Perreault-Levasseur L., 2021, The Astrophysical Journal, 916, 42
Wadekar D., et al., 2022, arXiv e-prints, p. arXiv:2201.01305
Wu H.-Y., Evrard A. E., Hahn O., Martizzi D., Teyssier R., Wechsler R. H., 2015, Monthly Notices of the Royal Astronomical Society, 452, 1982
Zandanel F., Fornasa M., Prada F., Reiprich T. H., Pacaud F., Klypin A., 2018, MNRAS, 480, 987
Zhang C., Liu C., Zhang X., Almpanidis G., 2017, Expert Systems with Applications, 82, 128
APPENDIX A: DESCRIPTION AND ENUMERATION OF FEATURE VARIABLES

In this appendix, we describe the 27 selected features from the Rockstar + Consistent Trees catalogues. Although this information can be found in Behroozi et al. (2012) and Behroozi et al. (2013), as well as in the CosmoSim MultiDark database https://www.cosmosim.org/, we include a brief description of the variables in Table A1 for the reader's convenience.
This paper has been typeset from a TeX/LaTeX file prepared by the author.
Table A1. The feature variables used in this text from the Rockstar catalogue. The first column gives the variable name with its enumeration in parentheses.

Variable | Units | Description
M2500c (0) | h⁻¹M⊙ | Mass inside the radius of a sphere within which the matter density is 2500 times the critical density at the cluster's redshift
num_prog (1) | – | Total number of progenitors of the cluster
M500c (2) | h⁻¹M⊙ | Mass inside the radius of a sphere within which the matter density is 500 times the critical density at the cluster's redshift
M200c (3) | h⁻¹M⊙ | Mass inside the radius of a sphere within which the matter density is 200 times the critical density at the cluster's redshift
Mpeak (4) | h⁻¹M⊙ | Peak value of the halo mass across its accretion history
mvir (5) | h⁻¹M⊙ | Halo mass within the virial radius
Macc (6) | h⁻¹M⊙ | Halo mass at accretion time
Vpeak (7) | km/s | Peak value of Vmax (9) across the mass accretion history
Vmax@Mpeak (8) | km/s | Vmax at the expansion time at which Mpeak was reached
Vmax (9) | km/s | Maximum value of the circular velocity
Vacc (10) | km/s | Vmax at accretion time
rvir (11) | h⁻¹kpc | Halo radius at the virial overdensity
vrms (12) | km/s | Root mean squared velocity dispersion
b_to_a(500c) (13) | – | Ratio between the second-largest and the largest shape ellipsoid axes, for particles within R500
c_to_a(500c) (14) | – | Ratio between the third-largest and the largest shape ellipsoid axes, for particles within R500
b_to_a (15) | – | Ratio between the second-largest and the largest shape ellipsoid axes, determined with the method of Allgood et al. (2006)
c_to_a (16) | – | Ratio between the third-largest and the largest shape ellipsoid axes, determined with the method of Allgood et al. (2006)
rs (17) | h⁻¹kpc | Comoving scale radius from the fit to an NFW (Navarro et al. 1997) density profile
Rs_Klypin (18) | h⁻¹kpc | Comoving scale radius determined using the Vmax and Mvir formula (Klypin et al. 2011)
T/|U| (19) | – | Ratio between the total kinetic and potential energies of particles within the virial radius
Xoff (20) | h⁻¹kpc | Offset between the comoving density peak and the centre-of-mass position of the particles
Voff (21) | km/s | Offset between the halo core velocity and the centre-of-mass velocity of particles within the virial radius
Spin (22) | – | Peebles's dimensionless spin parameter of the halo (Peebles 1969)
Spin_Bullock (23) | – | Bullock's dimensionless spin parameter (Bullock et al. 2001)
a (24) | – | Expansion scale factor of the corresponding simulation snapshot
scale_of_last_MM (25) | – | Expansion scale factor of the last major merger with a mass ratio greater than 0.3
Halfmass_Scale (26) | – | Expansion scale factor at which the most massive halo progenitor reached 0.5 × Mpeak (4)