
MNRAS 000, 1–15 (2022)    Preprint 25 April 2022    Compiled using MNRAS LaTeX style file v3.0

Machine Learning methods to estimate observational properties of

galaxy clusters in large volume cosmological N-body simulations

Daniel de Andres1,2, Gustavo Yepes1,2, Federico Sembolini1,3, Gonzalo Martínez-Muñoz4, Weiguang Cui1,2,5, Francisco Robledo6,7, Chia-Hsun Chuang8,9, Elena Rasia10,11

1Departamento de Física Teórica, M-8, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

2Centro de Investigación Avanzada en Física Fundamental,(CIAFF), Universidad Autónoma de Madrid, Cantoblanco, 28049 Madrid, Spain

3Equifax Ibérica, Data & Analytics, Paseo de la Castellana 259D, Madrid, Spain

4Computer Science Department, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Cantoblanco, 28049, Spain

5Institute for Astronomy, University of Edinburgh, Royal Observatory, Edinburgh EH9 3HJ, UK

6Departamento de Fundamentos del Análisis Económico II, Universidad del País Vasco/Euskal Herriko Unibertsitatea, Barrio Sarriena s/n,

48940 Leioa, Bizkaia, Spain

7Laboratoire de Mathématiques et de leurs Applications. Université de Pau et des Pays de l’Adour, Avenue de l’Université, BP 576, 64012 Pau, France

8Department of Physics and Astronomy, University of Utah, Salt Lake City, UT 84112, USA

9Kavli Institute for Particle Astrophysics and Cosmology, Stanford University, 452 Lomita Mall, Stanford, CA 94305, USA

10 INAF - Osservatorio Astronomico di Trieste, via Tiepolo 11, 34123 Trieste, Italy

11 Institute of Fundamental Physics of the Universe, via Beirut 2, 34151 Grignano, Trieste, Italy

Accepted —. Received —; in original form —

ABSTRACT

In this paper we study the applicability of a set of supervised machine learning (ML) models specifically trained to infer observation-related properties of the baryonic component (stars and gas) from a set of features of dark-matter-only cluster-size halos. The training set is

built from The Three Hundred project which consists of a series of zoomed hydrodynamical

simulations of cluster-size regions extracted from the 1 Gpc volume Multidark dark-matter

only simulation (MDPL2). We use as target variables a set of baryonic properties for the intra

cluster gas and stars derived from the hydrodynamical simulations and correlate them with

the properties of the dark matter halos from the MDPL2 N-body simulation. The diﬀerent

ML models are trained from this database and subsequently used to infer the same baryonic

properties for the whole range of cluster-size halos identified in the MDPL2. We also test the robustness of the model predictions against the mass resolution of the dark matter halos and conclude that the inferred baryonic properties are rather insensitive to resolution, even when the halos are resolved with almost an order of magnitude fewer particles. We conclude

that the ML models presented in this paper can be used as an accurate and computationally

efficient tool for populating cluster-size halos with observation-related baryonic properties in large-volume N-body simulations, making them more valuable for comparison with full-sky galaxy cluster surveys at different wavelengths. We make the best-performing trained ML model publicly

available.

Key words: cosmology: theory – cosmology: dark matter – cosmology: large-scale structure of Universe – methods: numerical – galaxies: clusters: general – galaxies: halos

1 INTRODUCTION

Galaxy clusters are the largest gravitationally bound objects of

the Universe and constitute one of the best cosmological probes

to constrain cosmological parameters of the Universe. The main

component of galaxy clusters is dark matter (DM), which accounts

for 85 per cent of the total mass (for a full review see e.g. Allen

et al.,2011;Kravtsov & Borgani,2012). Although the existence

of DM is now widely accepted by the scientiﬁc community and

strongly supported by modern cosmological theories, it has not

been directly detected yet. To study galaxy clusters, we have

therefore to focus on their baryonic component, which represents

the remaining 15 per cent of the mass. It is composed of the hot

gas of the Intra Cluster Medium (ICM, around 10-15 per cent of


the total cluster mass) and stars (less than 5 per cent of the mass).

Numerical simulations play a fundamental role to study the

properties of galaxy clusters. In the simplest scenario, N-body sim-

ulations can easily describe the dark-matter component of clusters,

which is governed only by gravity; nowadays it is computationally

possible to perform very large cosmological simulations, up to a few

Gpc3, e.g. MillenniumXXL (Angulo et al.,2012), MICE (Fosalba

et al.,2015), MultiDark (Klypin et al.,2016), Dark sky (Skillman

et al.,2014), OuterRim (Habib et al.,2016), FLAGSHIP (Potter

et al.,2017), Uchuu (Ishiyama et al.,2021), BACCO (Angulo et al.,

2021) and UNIT project (Chuang et al.,2019), which include thou-

sands of galaxy clusters. Nevertheless, when aiming to describe

the baryon component of clusters, due to the complex physics in-

volved in the processes of cluster formation, radiative hydrodynamic

numerical simulations have to be used. These simulations are com-

putationally very expensive, which puts strong limitations on the size of the computational volumes. Examples of state-of-the-art simulations of this kind are: Illustris (Vogelsberger et al., 2014), Ea-

gle (Schaye et al.,2015), Horizon-AGN (Chisari et al.,2016),

Magneticum (Dolag et al.,2016) or BAHAMAS (McCarthy et al.,

2018). Hydrodynamical simulations are essential to calibrate mass

proxies and to study the systematics aﬀecting observational mea-

surements. They are also essential to deeply understand the forma-

tion and evolution of clusters of galaxies and all their gas-dynamical

eﬀects. For this reason, numerical simulations have been a powerful

tool to guide galaxy clusters observations for more than 20 years

(Evrard et al.,1996;Bryan & Norman,1998).

In an ideal scenario one would need to have a large sample of

simulated galaxy clusters with enough numerical resolution, both

in mass and in the gravity and pressure forces. This high resolution

would allow us to accurately resolve the internal substructures and to

obtain a detailed modelling of the most relevant physical processes.

The best way to achieve this would be by simulating large cosmo-

logical boxes containing up to tens of thousands of galaxy clusters.

Unfortunately, due to the large computational eﬀort demanded by

these simulations, one needs to ﬁnd a compromise between their

three main components: volume size, mass resolution and physical

processes included. A possible solution to the computational prob-

lems related with scalability of present-day hydrodynamical codes

is to proceed with the so-called ‘zoom’ simulations, such as the MUSIC¹ simulation (Sembolini et al., 2013), the Dianoga clusters

(Planelles et al.,2013), Rhapsody-G (Wu et al.,2015), MACSIS

(Barnes et al.,2016), Cluster-EAGLE (Barnes et al.,2017), hy-

drangea (Bahé et al.,2017) clusters and The Three Hundred

(The300)² simulation project (Cui et al., 2018). Zoom simulations

are performed mimicking the observations, by creating a catalogue

of resimulated galaxy clusters that are extracted from low-resolution

N-body simulations. The regions containing clusters of galaxies are

then resimulated at very high resolution, adding gas physics in the

resimulated areas and keeping the rest of the box at low resolution

in order to reproduce the same gravitational evolution.

An alternative approach to hydrodynamical simulations to de-

scribe the gas and stellar properties of galaxy clusters, is to use

Semi-Analytic Models (SAMs), such as GALACTICUS (Benson,

2012), SAG (Cora et al.,2018), SAGE (Croton et al.,2016) and

GALFORM (Lacey et al.,2016). In this approach, the numerous

complex non-linear radiative physical processes associated with the

¹ https://music.ft.uam.es
² https://the300-project.org

gas-star components are modelled using a combination of analytic

approximations and empirical calibrations of many free parameters

against a set of observational constraints (see e.g. Baugh 2006 for

a review). Nevertheless, SAMs are also computationally expensive

since most of them are based on the information provided by the merger history of each individual dark matter halo. A complementary ap-

proach is the use of phenomenological models to derive physical

properties of the ICM as in Zandanel et al. (2018). Describing the

gas physics in simulated galaxy clusters requires therefore a big

computational effort and imposes a compromise between numerical

resolution and size of the cosmological volume to simulate.

The main goal of supervised Machine Learning (ML) is to gen-

erate models that can learn complex relationships between input and

output variables from high-dimensional data that can later be used

to make predictions on unseen data. In this scenario, ML could oﬀer

a powerful alternative to infer some fundamental information on the

main properties (e.g. gas and star masses, gas temperature, etc) of

the baryon component of galaxy clusters, without the large com-

putational cost required by hydrodynamical simulations or SAMs.

Applications of ML to ﬁnd a mapping between hydrodynamical

and N-body simulations have been already presented in previous

works. Firstly, in Kamdar et al. (2016), a promising technique to

study galaxy formation using numerical simulations and ML was

presented; Jo & Kim (2019) estimated galactic baryonic proper-

ties mimicking the IllustrisTNG simulation (Nelson et al.,2019);

Wadekar et al. (2021) generated neutral hydrogen from dark mat-

ter; Bernardini et al. (2022) predicted high resolution baryon ﬁelds

from dark matter simulations and Moews et al. (2021) used hybrid

analytic and machine learning model to paint dark matter galac-

tic halos with hydrodynamical properties. Recently, the CAMELS collaboration (Villaescusa-Navarro et al., 2022) has released results from almost ten thousand simulations (both hydrodynamical and

N-body) with diﬀerent cosmologies and baryon physical models

that are an invaluable tool for training current and future Artiﬁcial

Intelligence algorithms that will be very useful for galaxy formation

studies. Unfortunately, given the box sizes, cluster-size objects are poorly represented in these simulations.

The purpose of this study is to explore the applicability of

ML techniques to generate baryon cluster properties from DM-only

halo catalogues mimicking the results from The Three Hundred

hydrodynamical simulations. More precisely, we use the properties

of the cluster-size halos extracted from the parent dark-matter-only full-box simulation MDPL2 as the features of our dataset. Then we

collect several baryon properties of the objects that have been res-

imulated with radiative processes and hydrodynamics as targets (the

predicted variables) of the ML models. Our work diﬀers from previ-

ous studies in that the baryon properties are extracted from ‘zoom’

simulations and therefore, we have paired one to one the objects

corresponding to the full N-body only simulations with their hydro-

dynamical counterparts. As explained below, The300 simulations correspond to spherical regions centred on the 324 most massive clusters found in the MDPL2 box, but there are also lower-mass cluster-size halos within each region. The masses of the hydrodynamically simulated cluster-size objects we use range from $\sim 10^{13}\,h^{-1}\,{\rm M}_\odot$ up to $\sim 10^{15}\,h^{-1}\,{\rm M}_\odot$.

The article is structured as follows: In § 2, we describe how the

training dataset is generated using The300 and the MDPL2 simu-

lations. In § 3, we explain the diﬀerent ML algorithms used in this

work and the training setup. We also study the feature importance

and dimensionality reduction of our feature space. In § 4, the main

results for this work are shown, including an analysis of the perfor-

mance of the ML models and their dependence on mass resolution


of the simulations. In § 5, we study the scaling relations extracted

from the new ML-generated catalogues and ﬁnally in § 6, we draw

our main conclusions and propose possible future studies.

2 THE TRAINING DATASET

In order to create the database for training the ML models, we

use the MDPL2³ simulation, which has been run using the cosmological parameters measured by the Planck Collaboration (Planck Collaboration et al., 2016). The MDPL2 simulation consists of a periodic cube of comoving side length $1\,h^{-1}$ Gpc containing $3840^3$ dark-matter particles of mass $1.5\times 10^{9}\,h^{-1}\,{\rm M}_\odot$.

To build this training dataset, we first need to identify and extract, in the MDPL2 simulation, the same cluster objects that were used to run The300 hydrodynamical simulations. We then se-

lect the main properties of the dark matter clusters and associate

them with the baryonic properties extracted from The300 hydrody-

namical counterparts.

2.1 MDPL2: Dark Matter input variables

In order to identify the dark matter halos and measure their in-

ternal properties in the MDPL2 N-body simulation we have used

the Rockstar halo ﬁnder (Behroozi et al.,2012), complemented

with additional information based on the halo mass accretion his-

tory from the Consistent Halo Merger Trees analysis (Behroozi

et al.,2013). We have extracted a total of 26 relevant physical Rock-

star + Consistent Trees variables⁴ (masses at different radii, ve-

locities, symmetry factors, properties related with mass accretion

history, etc) to create our dark matter catalogue. In addition, we have

also considered the scale factor $a(z)$ of the clusters as an input variable. Furthermore, we have introduced a cutoff in halo mass such that $\log(M/(h^{-1}\,{\rm M}_\odot)) \geq 13.5$ and redshift $z \leq 1.03$.

In Fig. 1, we show the Spearman correlation matrix of the 26

Rockstar variables and the scale factor 𝑎(𝑧). These variables are

ordered using a hierarchical clustering algorithm based on Ward’s

linkage on a condensed distance matrix. We used the Python im-

plementation of this algorithm from SciPy (Virtanen et al.,2020).

We can easily identify 5 groups in the correlation matrix. The ﬁrst

group (variables 0 to 12) corresponds to masses and velocities at

diﬀerent radii. In a second group, diﬀerent ellipticity shape factors

(from 13 to 16) are included. Variables from 17 to 21 correspond to

the scale radius, the ratio between the kinetic and potential energy

and the oﬀsets between density peak and centre-of-mass, which are

directly related to the dynamical state of the cluster halos. The next

group of variables (22 and 23) correspond to the dimensionless spin

parameters of the cluster. Finally, variables from 24 to 26 represent

the scale factor (redshift) and the time evolution of mass accretion.

As can be seen in the ﬁgure, feature variables inside the same block

are strongly correlated among themselves and are weakly, or not at all, correlated with variables inside other blocks. This implies that selecting more than one feature belonging to the same block might not add any new predictive information. This is studied in detail

in section § 3. A more detailed description of the selected feature

variables can be found in the Appendix A.

³ www.cosmosim.org

⁴ More information regarding the selection of Rockstar variables can be found in Appendix A.

2.2 The300: baryonic output variables

Subsequently, for a subset of the MDPL2 cluster halos we need to

have their baryonic properties. For this purpose we have used the

results of The300 project, which has re-simulated spherical regions

of radius $15\,h^{-1}$ Mpc centred around the 324 most massive clusters found in the MDPL2 simulation at $z=0$. These regions were

then mapped back to the initial conditions and their particles were

split into gas and dark-matter, while the rest of the particles in the

remaining box were re-sampled into diﬀerent levels of lower reso-

lution and larger masses. With this zoom-in technique, we ensure

that the subsequent gravitational evolution will reproduce the same

objects in the high resolution area while we minimise the eﬀects of

contamination of low resolution particles from external regions due

to mass segregation. In any case, we checked that all the clusters

used in this work are free from contamination of low mass resolution

particles at least within their virial radii.

The300 project has produced diﬀerent versions of hydrody-

namical simulations from these initial zoomed conditions which

include diﬀerent baryonic physics modules: radiative cooling, star

formation and Supernovae Feedback using the Gadget-MUSIC

SPH+TreePM code (Sembolini et al., 2013), and newer versions that include feedback from supermassive black holes: Gadget-X (Murante et al., 2010; Rasia et al., 2015) and GIZMO-SIMBA (Davé et al., 2019).

However, in this work, we only make use of the Gadget-X runs.

The halos in these simulations are identiﬁed and analysed with the

Amiga Halo Finder (AHF) (Knollmann & Knebe,2009), which is

more suitable than Rockstar for simulations with multiple particle species (i.e. dark matter particles, gas, stellar particles and black holes). From the information contained in the AHF catalogues, we

have collected the following baryon properties:

• The total gas mass $M_{\rm gas}$ inside a spherical volume whose overdensity is 500 times the critical density of the Universe. The radius of this sphere is denoted as $R_{500}$.

• The stellar mass $M_{\rm star}$ inside $R_{500}$.

• The gas temperature $T_{\rm gas}$, computed as the mass-weighted temperature inside $R_{500}$ (a minimal sketch of how these integrated quantities can be computed from particle data is given after this list):

T = \frac{\sum_{i \in R_{500}} T_i m_i}{\sum_{i \in R_{500}} m_i} ,  (1)

where $T_i$ and $m_i$ are respectively the temperature and mass of gas particle $i$.

• The X-ray Y-parameter $Y_{\rm X}$, defined as $T_{\rm gas}\times M_{\rm gas}$, which is related to the total thermal energy of the gas and has been shown to be a good proxy of the total cluster mass (Kravtsov et al., 2006). Note that this quantity can be derived from the others. However, we prefer to treat it as an independent target, i.e. the ML models are also trained to predict $Y_{\rm X}$ as one of the target variables.

• The integrated Compton-y parameter $Y_{\rm SZ}$ over $R_{500}$, given by the Sunyaev-Zel'dovich (SZ) effect (Sunyaev & Zeldovich, 1972). In particular, the integrated value $Y_{\rm SZ}$ is computed from Compton-y parameter maps estimated as follows:

y = \frac{\sigma_{\rm T} k_{\rm B}}{m_{\rm e} c^2} \int n_{\rm e} T_{\rm e}\, {\rm d}l ,  (2)

where $\sigma_{\rm T}$ is the Thomson cross-section, $k_{\rm B}$ the Boltzmann constant, $c$ the speed of light, $m_{\rm e}$ the electron rest mass, $n_{\rm e}$ the electron number density, $T_{\rm e}$ the electron temperature, and the integration is done along the observer's line of sight. Assuming ${\rm d}V = {\rm d}A\,{\rm d}l$, Eq. (2) is computed in our simulated data as in Sembolini et al.


(2013) and Le Brun et al. (2015):

y = \frac{\sigma_{\rm T} k_{\rm B}}{m_{\rm e} c^2\, {\rm d}A} \sum_i T_i N_{{\rm e},i} W(r, h_i) .  (3)

Note that here we have used the number of electrons in the gas particles, $N_{\rm e}$, given that $n_{\rm e} = N_{\rm e}/({\rm d}A\,{\rm d}l)$. Moreover, $W(r, h_i)$ is the same SPH smoothing kernel as in the hydrodynamical simulation, with smoothing length $h_i$. The $y$-maps are generated with the centre on the projected maximum density peak position of the halo. Each image has a fixed angular resolution of $5''$ and extends to at least $R_{200}$ for all the clusters. The clusters at $z=0$ are placed at $z=0.05$ to generate the mock images, while the clusters at higher redshifts simply use their original value from the simulations. We then integrate the Compton-y map up to $R_{500}$ using only the z-plane projection. Since the dataset is large, the effect of projections is negligible. Note that this approach of estimating $Y_{\rm SZ}$ gives the cylindrically integrated Compton-y parameter.
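As a concrete illustration of how such integrated quantities are obtained from particle data, the following minimal Python sketch computes $M_{\rm gas}$, the mass-weighted temperature of Eq. (1) and $Y_{\rm X}$ for the gas particles inside $R_{500}$. The array names (pos, mass, temp, centre, r500) are illustrative and not taken from any specific analysis pipeline.

import numpy as np

def gas_summary(pos, mass, temp, centre, r500):
    """Integrated gas quantities within R500 from particle data.

    pos    : (N, 3) gas particle positions
    mass   : (N,)   gas particle masses
    temp   : (N,)   gas particle temperatures (keV)
    centre : (3,)   cluster centre
    r500   : float  overdensity radius R500
    """
    # select the particles inside the R500 sphere
    inside = np.linalg.norm(pos - centre, axis=1) < r500

    m_gas = mass[inside].sum()
    # mass-weighted temperature, Eq. (1)
    t_mw = np.average(temp[inside], weights=mass[inside])
    # X-ray Y-parameter, Y_X = T_gas * M_gas
    y_x = t_mw * m_gas
    return m_gas, t_mw, y_x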

2.3 The Final Training Dataset

After defining our input and output variables, we finally match one-by-one the clusters between the two simulations that fulfil the following two conditions on the relative shift between the cluster centres and on the halo mass difference:

{\rm distance}(C_{\rm MDPL2}, C_{\rm The300}) < 0.4 \times R_{200,{\rm The300}} ,  (4)

\left| \frac{M_{\rm MDPL2,200}}{M_{\rm The300,200}} - 1 \right| < 0.1 .  (5)

Here, $C_{\rm MDPL2}$ and $C_{\rm The300}$ stand for the centres of mass of the clusters, while $M_{\rm MDPL2,200}$ and $M_{\rm The300,200}$ stand for the mass inside a sphere of radius $R = R_{200}$ in each simulation (from the DM-only Rockstar catalogue and the AHF catalogue, respectively). Due both to baryon effects (see Cui et al., 2012, 2014, for example) and to the different algorithms used by the halo finders, it is not possible to determine with complete certainty that all the halos are exactly matched. Notice that the centre difference can be as high as $0.4\,R_{200}$. However, with this restrictive selection criterion, only the true/very close counterparts are selected. In this way, we finally provide the baryon properties for the matched MDPL2 clusters using the corresponding The300 objects.
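A minimal sketch of this matching step, assuming candidate pairs have already been identified (e.g. by a nearest-neighbour search) and using illustrative argument names, could look as follows:

import numpy as np

def is_counterpart(c_mdpl2, c_the300, m200_mdpl2, m200_the300, r200_the300):
    """Check the matching criteria of Eqs. (4) and (5) for one candidate pair."""
    # Eq. (4): the centres must be closer than 0.4 * R200 of The300 halo
    close_enough = np.linalg.norm(np.asarray(c_mdpl2) - np.asarray(c_the300)) < 0.4 * r200_the300
    # Eq. (5): the M200 masses must agree to within 10 per cent
    similar_mass = abs(m200_mdpl2 / m200_the300 - 1.0) < 0.1
    return close_enough and similar_mass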

After this procedure, our dataset is finally composed of 49540 different objects. Note that all 33 halo catalogues available from $z=0$ to $z=1.03$ in the two simulations have been considered. Only 1264 objects correspond to clusters at $z=0$; the rest of them are the

progenitors of the same objects at diﬀerent redshifts. The number

of objects as a function of their mass and redshift can respectively

be found in Fig. 2 and Fig. 3. Our ﬁnal dataset is composed of 27

DM input variables and 5 baryon output variables. These are the

features and targets which are used for training and testing the ML

algorithms described in the next section.

3 MACHINE LEARNING ALGORITHMS:

DESCRIPTION AND TRAINING

In this section, we ﬁrst describe the machine learning algorithms

used in this work and the training setup. Then, we study the impor-

tance of our feature variables in order to reduce the dimensionality

of our dataset.

3.1 Machine Learning Algorithms and Training Setup

In order to estimate the baryon properties of the dark matter only

clusters, several eﬀective supervised machine learning methods

have been employed. We particularly focus on ensemble tree-based

methods: random forest (RF; Breiman,2001) and extreme gradient

boosting (XGBoost; Chen & Guestrin,2016), and, dense Neural

Networks or Multilayer Perceptron (MLP; Schmidhuber,2015). RF

and XGBoost have shown to be among the best machine learning

methods for tabular data (i.e. without a known grid-like topology,

such as images) (Fernández-Delgado et al.,2014;Bentéjac et al.,

2021;Zhang et al.,2017). Convolutional deep neural network mod-

els have shown spectacular performance for image-based and struc-

tured data in general (Schmidhuber,2015). However, for tabular

data, as is the case of this study, their performance is poor (Zhang

et al.,2017). Notwithstanding, deep dense networks can perform

well in these scenarios, so we will also consider these models.

Random Forest and XGBoost are metamodels that are com-

posed of decision trees. During training, these algorithms build

hundreds of decision trees from a single training dataset. The pro-

cess for building these trees in random forest and XGBoost is based on quite different ideas, although the objective in both cases is to

build decision tree models that complement each other in order to

obtain a classiﬁcation/regression model better than any of its parts

(Dietterich,1998).

Random forests rely on stochastic techniques to generate many random solutions to the problem at hand. In order to generate each single tree, the random forest algorithm first generates a new dataset by extracting at random $N$ instances from the training data of size $N$, with replacement (i.e. a bootstrap sample). This bootstrap sample is used

to train a decision tree in which the best split at each node of the

tree is selected from a random subsample of features of the data.

Generally, the size of the random subset of features is of the order

of $\sqrt{D}$ or $\log_2(D)$, with $D$ the number of features of the problem. The final output of the random forest for a given instance is obtained as the mode or the mean of all trees, for classification and regression

respectively. In addition, since the randomisation process to build

the trees is independent, the process of building a random forest can

be easily parallelised.

On the other hand, XGBoost relies mainly on a gradient descent approach, although it also incorporates stochastic techniques

to further increase its performance. XGBoost is an additive model

based on Gradient Boosting. The output of an additive model is

the sum of the outputs of its components. In order to create this

ensemble, regression trees are trained sequentially to approximate

the gradient of the loss function of the data in the previous iter-

ations. Hence, each new tree learns the remainder of the concept

not learned in previous steps. XGBoost also includes a penalisation

term on the number of leaves of the trees to avoid over-fitting. In

addition, XGBoost incorporates random feature selection, bootstrap

sample and several other randomisation features.

In order to perform a fair comparison among algorithms and

also to obtain good estimations of the performance of the diﬀerent

algorithms, we carried out the following experimental procedure

based on K-fold cross-validation and grid-search. K-fold cross-

validation consists of splitting the data into K disjoint sets of approximately equal size and then iteratively using $K-1$ sets for

training the model and the remaining set for validation. The main

experiment is performed using the same 10-fold cross-validation for

the prediction of the ﬁve baryonic properties analysed in this study

using the Rockstar halo catalogue from The300 hydro clusters.

The steps for each of the 10 partitions of the cross-validation are:


Figure 1. Spearman correlation coeﬃcient matrix for the (feature) variables of the Rockstar identiﬁed clusters. The variables are organised in diﬀerent

blocks according to their correlation values. Variables for each block are denoted in the x-axis in brackets: [1,...,12] are mass and velocity variables, [13,...,16]

correspond to ellipticity, [17,...,21] are related to the dynamical state of the cluster, [22,23] represent dimensionless spin parameters and [24,25,26] are related

to the scale factor and time evolution of mass accretion. Note that this matrix is symmetric with respect to the diagonal. Each variable description can be found

in Appendix A.

Figure 2. Mass distribution of the The300 galaxy clusters analysed in this work.

(i) Find the best hyper-parameters of each of the tested algorithms:

RF, XGBoost and MLP. For that, a grid-search with 5-fold cross-

validation within the train dataset only was performed. The values

for the grid of hyper-parameters are shown below;

Figure 3. Redshift distribution of the The300 galaxy clusters analysed in this work.

(ii) The best set of hyper-parameters for each method was used to

train a single model using the whole training set;

(iii) The models were validated using the test set;

In order to generate dark-matter-only halo catalogues with hydrodynamic properties, the 10 models trained in the 10 folds of


the cross-validation were used. The hydrodynamic features of each

halo are then computed as the average of the inferred values from

these 10 models.

For the grid search the set of values of the tested hyper-

parameters for each of the analysed methods are:

•Random Forest:

– The number of trees in the forest: ‘n_estimators’=[100,500]

– the number of features to consider when looking for the best

split ‘max_features’ : [‘sqrt’,‘log2’]

•XGBoost:

– ‘n_estimators’= [100,500]

– Maximum depth of a tree:

‘max_depth’= [6,10,14,15,16,20]

– Minimum loss reduction required to make a further partition

on a leaf node of the tree:

‘gamma’ = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]

– Step size shrinkage used in update to prevents overﬁtting:

‘eta’ = [0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]

•MLP:

– ‘hidden_layer_sizes’ = [(8,),(20,),(100,),(8,8),(8,20,8),

(20,20,20),(100,100,100), (20,20,20,20),(100,100,100,100)]

– ‘activation’=‘relu’,

– ‘solver’=‘adam’,

– ‘learning_rate’ = $10^{-4}$

Furthermore, MLP has been trained for 500 epochs or until the

training loss remains constant for 20 epochs. For more information on

these hyper-parameters, we refer the reader to the Python libraries

used throughout this work: for RF and MLP we have used https:

//scikit-learn.org (Pedregosa et al.,2011) and for XGBoost

https://github.com/dmlc/xgboost.

In order to train these models, the mean squared error of the logarithmic values of the targets (log MSE) was used as the loss function:

\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\left(\log y_{{\rm true},i} - \log y_{{\rm pred},i}\right)^2 ,  (6)

where $y_{{\rm true},i}$ is the true value of the target extracted from The300 simulation and $y_{{\rm pred},i}$ is the target value predicted by our model. Note that, since the model is trained with $\log y$ as targets, the prediction of the model is directly the logarithm of the given target. In addition, $n$ corresponds to the number of objects in the dataset (e.g. train, validation or test set) where the log MSE is computed. For numerical reasons, we have also used the logarithmic values of the features during the training process.
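As an illustration of this training setup, the following minimal Python sketch runs one instance of the procedure for a single target with XGBoost: an inner 5-fold grid search over a reduced version of the hyper-parameter grids listed above, embedded in a 10-fold outer cross-validation, with features and target transformed to log space so that the reported error corresponds to Eq. (6). The toy data and variable names are illustrative assumptions, not part of our pipeline.

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from xgboost import XGBRegressor

# toy stand-ins: in the real application X holds the Rockstar features and y one target (e.g. M_gas)
rng = np.random.default_rng(0)
X = 10 ** rng.uniform(13.5, 15.0, size=(1000, 5))
y = X[:, 0] ** 0.8 * rng.lognormal(0.0, 0.1, size=1000)

X_log, y_log = np.log10(X), np.log10(y)           # train in log space, Eq. (6)

param_grid = {                                     # reduced version of the grids quoted in the text
    "n_estimators": [100, 500],
    "max_depth": [6, 10],
    "gamma": [0.0, 0.5],
    "learning_rate": [0.1, 0.3],                   # XGBoost's 'eta'
}

outer = KFold(n_splits=10, shuffle=True, random_state=0)
models, scores = [], []
for train_idx, test_idx in outer.split(X_log):
    # (i) grid search with an inner 5-fold cross-validation on the training part only
    search = GridSearchCV(XGBRegressor(objective="reg:squarederror"),
                          param_grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_log[train_idx], y_log[train_idx])
    # (ii) the best hyper-parameters are refit on the whole training fold by GridSearchCV
    best = search.best_estimator_
    # (iii) validate on the held-out test fold: log MSE of Eq. (6)
    resid = best.predict(X_log[test_idx]) - y_log[test_idx]
    scores.append(np.mean(resid ** 2))
    models.append(best)

# baryonic properties for new halos are taken as the average of the 10 fold models
def predict_log_target(X_new_log):
    return np.mean([m.predict(X_new_log) for m in models], axis=0)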

3.2 Feature importance and dimensionality reduction

Although machine learning models can generalise complex func-

tions, generally, it is not trivial to interpret their decisions. In fact,

they are often referred to as black-box estimators (e.g. Barredo Arrieta

et al.,2020). Therefore, it is of great value to be able to inspect

what is the learnt relation between features and targets given a par-

ticular model. One such inspection technique is feature importance.

Particularly, feature importance is a family of techniques that as-

signs a score $F_s(x, y)$ to each input feature $x$ depending on how useful it is when it comes to predicting a particular target $y$.

Furthermore, feature importance is closely related to dimensionality

reduction, a family of techniques that aim at getting rid of non-

informative variables from a model (e.g. Kuhn et al.,2013). In this

section, we use feature importance algorithms to determine what

features are more relevant and therefore, reduce the dimensionality

of our 27-dimensional input space.

One commonly used algorithm to estimate feature importance

for ensembles of decision trees (such as RF and XGBoost) is Per-

mutation Importance (Breiman,2001). In this algorithm, the impor-

tance of each feature is estimated as the decrease of the model score

when the values of a feature are randomly shuﬄed. This technique,

however, fails when correlated features are present in the dataset

(Altmann et al.,2010). A second shortcoming of this algorithm is

that it only considers the importance of individual features.

Another technique uses the forest of trees itself to evaluate the importance of features, computed as the mean and standard deviation of the accumulated impurity decrease within each tree (Breiman,

2001), which for regression is the variance reduction. In random

forest, internal node features are selected with some criterion, or

loss function. We can then measure how on average each feature

decreases the criterion in the splits of the forests. Nevertheless,

this technique also fails due to the fact that our features are highly

correlated, and it is also known to be biased in favour of variables

with many possible split points (e.g. Nembrini et al., 2018).

Instead, we use the Greedy Search Feature Importance Algo-

rithm (GSFIA; see for example Ferri et al., 1994). This technique

considers the importance of the combination of features and not

only the individual feature importance. It works iteratively by se-

lecting and evaluating one variable at a time until all features are

ordered from the most to the least relevant. The algorithm works

with a list of selected variables, 𝐿, initially empty, 𝐿=[], and a

pool of possible variables to be selected, 𝑃, initially containing all

$D$ variables of the problem, $P = \{x_1, \ldots, x_D\}$. Then, a procedure is repeated $D$ times in which, at each step, one variable from the pool $P$ is selected and moved to the list $L$. In the $k$-th step

of the loop the procedure creates |𝑃|models trained on all of the

features in 𝐿plus one feature from 𝑃. The model that minimises the

MSE identiﬁes the most important variable from 𝑃in combination

with the variables in 𝐿. This variable is then removed from 𝑃and

appended to the list 𝐿. At the end of the algorithm, all variables of

the problem are sorted by importance in list 𝐿together with the loss

function associated with them. GSFIA is depicted using pseudo-code in Algorithm 1.

With this algorithm we can deﬁne the feature importance score

𝐹(𝑥, 𝑦)as follows:

•Run GSFIA to rank all features from the most to the least

important variables and save the corresponding value of log MSE.

•The score is then deﬁned as the log MSE of every iteration

normalised to the corresponding value of the ﬁrst iteration.

Note that the normalised log MSE will be 1 for the ﬁrst feature, and

will decrease progressively as we consider more features until it

converges to a minimum value. It could happen that after including

several features, the normalised log MSE increases as more fea-

tures are included (see Fig. 4 for the case of 𝑀star). This indicates

that the last features included do not improve or even degrade the

performance of the model.

The algorithm was run using random forest as model (line 9 of

algorithm 1). In addition, due to the randomness of the ML model,

the inner loop of the algorithm was repeated 10 times in order to

reduce the variability of the results. In Figure 4, the average of the normalised log MSE and its standard deviation are shown for the different targets considered, with the feature variables ranked along the horizontal axis in their final order of importance.


Algorithm 1: Pseudo-code of the Greedy Search Feature Importance Algorithm
inputs : x = features dataset; y = target dataset
outputs: L = list of features ordered by decreasing importance; score = normalised MSE associated with every element of L
1   L = [empty list];
2   score = [empty list];
3   P = [x_1, ..., x_D];
4   i = 0;
5   while i < D do
6       loss = zeros(length(P));
7       j = 0;
8       while j < length(P) do
9           dataset = L + P[j]   # sum of lists;
10          model.train(dataset, y);
11          loss[j] = model.MSE;
12          j = j + 1;
13      indx = argmin(loss);
14      L.append(P[indx]);
15      P.drop(indx);
16      score.append(loss[indx]);
17      i = i + 1;
18  score = score / score[0];
19  return L, score
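For reference, a minimal Python sketch of the greedy procedure of Algorithm 1 is given below, using scikit-learn's RandomForestRegressor as the base model (as in our runs); the 10 repetitions used to average out the randomness of the model, and the k-folding, are omitted for brevity and a simple train/test split is used instead.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def greedy_feature_importance(X, y, feature_names):
    """Greedy forward selection: rank features by the MSE obtained when
    each one is added to the already selected set (Algorithm 1)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    selected, scores = [], []
    pool = list(range(X.shape[1]))
    while pool:
        losses = []
        for j in pool:
            cols = selected + [j]
            model = RandomForestRegressor(n_estimators=100, random_state=0)
            model.fit(X_tr[:, cols], y_tr)
            losses.append(mean_squared_error(y_te, model.predict(X_te[:, cols])))
        best = int(np.argmin(losses))          # most informative remaining feature
        selected.append(pool.pop(best))
        scores.append(losses[best])
    scores = np.array(scores) / scores[0]      # normalise to the first iteration
    return [feature_names[i] for i in selected], scores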

Variables in red colour are the reduced set of features that will be considered for further analysis.

These features are summarised in Table 1.

As shown in Table 1, the selected variables generally come from different correlation blocks (cf. Fig. 1). This is expected, since variables from the same block are correlated and, once the algorithm chooses one feature, it skips variables carrying the same information. However, this is not always the case

(e.g. variables 2, 6 and 7 are selected for a couple of targets). This

can be explained since the correlation between those variables is

high but it is not 1. Hence, for our case the marginal information

that a second variable inside a correlated block gives, is higher than

that given by other variables. As far as the meaning of selected

variables is concerned, we can distinguish two diﬀerent important

blocks in the correlation matrix: The mass and velocity block (the

ﬁrst block from 0 to 12), and the time evolution block (from 24 to

26). The conclusion of this analysis is that the rest of the blocks are

redundant or contribute little to the estimating baryon properties, i.e.

the ellipticity block (from 13 to 16), the dynamical state block (from

17 to 21) and the spin block. Moreover, masses and velocities are the

most important features for estimating baryon properties while the

variables associated with the time evolution of the mass accretion

into halos play a secondary role in the regression algorithms. The

redundant role of the ellipticity variables can be explained by taking

into consideration that we are estimating integrated quantities from

the particles within spheres of radius 𝑅=𝑅500 , regardless of the

shape of their 3D distributions.

Note that we combine data from different redshifts in our training and test samples. We do not think that the evolution of these baryon properties will affect our results because (1) as shown in Cui et al. (2022), these quantities in Gadget-X simulations hardly depend on redshift, especially at $z \lesssim 1$ (see also Truong et al., 2018, for example); (2) we also include the scale factor as a feature variable

for example); (2) we also include the scale factor as a feature variable

in the training. If there were a clear redshift dependence on any target

Table 1. Lists of selected DM-only features for the different targets after applying GSFIA (Algorithm 1).

target   Important features
M_gas    M500c(2), Vpeak(7), scale_of_last_MM(25), Macc(6), a(24)
M_star   Vpeak(7), Halfmass_Scale(26), scale_of_last_MM(25), a(24)
T_gas    Vpeak(7), a(24), scale_of_last_MM(25)
Y_X      Vpeak(7), M500c(2), scale_of_last_MM(25), vrms(12), Macc(6)
Y_SZ     Vpeak(7), Mpeak(4), scale_of_last_MM(25), a(24), rs(17)

variable, the scale factor feature would show a higher contribution.

However, as shown in Table 1 and Fig. 4, the scale factor contributes

only weakly to the normalised loss function $\mathcal{L}$.

Furthermore, we have to highlight that although we have used

Random Forest for the GSFIA, other Machine Learning algorithms

might also be used. However, GSFIA is computationally expensive

given the fact that its computing time increases with the number of

features $D$ as $O(D^2)$. Therefore, we prefer to use RF because it is

computationally more eﬃcient and it does not have as many hyper-

parameters to tune. Consequently, this choice might introduce

a bias given the fact that a particular model is being used for the

selection of the important variables. However, in the next section

we will show that this particular selection of variables yields similar

performance for the diﬀerent ML algorithms considered throughout

this work.

4 RESULTS

In this section, we first determine which machine learning algorithm performs best on our particular dataset and study the accuracy

of our model predictions. Then, we populate the dark-matter-only

MDPL2 simulation with baryon properties and determine whether

we can also successfully use the trained machine learning model on

dark-matter-only low resolution simulations.

4.1 Error analysis

In order to determine the accuracy of our ML models, we have

trained our 3 models on the dataset composed of all features and on

the dataset with the reduced set of features summarised in Table 1

using the experimental setup described in the previous section. The

average performance of the models is shown in Fig. 5. In the left

panel we show the log MSE deﬁned in Eq.(6) for the diﬀerent tested

models as a function of the target variables when all input features

are used. In the right panel, the same quantities are displayed for the

reduced set of features.

As a general result, it can be observed from Fig. 5 that the XGBoost algorithm has the best performance for all targets except for $Y_{\rm SZ}$ with the reduced set of features. In any case, the performance of XGBoost on this last target is almost identical to that of the best model. For RF, we find equivalent performances for both sets of features in $M_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$; a somewhat worse result for the reduced set on $T_{\rm gas}$; and better performance on $M_{\rm star}$ for the reduced set. For XGBoost, the trends

are similar to those of Random Forest, although the diﬀerence in

performance for XGBoost between both sets of features is negligible

for 𝑇gas and smaller for 𝑀star. For the MLP model, all results using

the reduced set of features are worse than those obtained when using

all the features in the catalogues. These diﬀerences between the tree

based approaches (RF and XGBoost) and MLP can be explained

taking into consideration that the selection of important features was done using Random Forest. In any case, the performance of MLP is the worst for all targets, even when all features are considered.


Figure 4. Normalised MSE (y-axis) given by GSFIA (algorithm 1) as a

function of the DM-only variables described in Appendix A, ranked by feature importance in descending order. From top to bottom we show our results for different targets: $M_{\rm gas}$, $M_{\rm star}$, $T_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$. Blue dashed lines represent the average value of the normalised MSE for 10 different k-folds and error

bars correspond to the standard deviation. The selected features for each

target are highlighted in red and shown in Table 1.

After the previous analysis, we can conclude that XGBoost gives the most

accurate model predictions. Therefore, we will only consider this

algorithm for the rest of this work. A summary of the performance for

all models can be found in Table 2 for the reduced set of features.

The scores shown in Fig. 5 summarise the performance of the models in a single value. However, they do not allow us to understand how the models perform in the different regions of the space of features and targets. In order to analyse this, we first define the relative difference for a single target $y$ as

{\rm diff}(y) = \frac{y_{\rm pred} - y_{\rm true}}{y_{\rm pred}} ,  (7)

where $y_{\rm pred}$ and $y_{\rm true}$ are the predicted and true target values for a given input $x$.

Figure 5. The logarithmic MSE deﬁned in Eq.(6) for the 3 ML models

considered: RF in blue, XGBoost in red and MLP in black colour. The

x-axis indicates the diﬀerent baryon targets. The points with error bars

represent the mean and the standard deviation of the logarithmic MSE for

the test set using 10 different k-folds. The left panel corresponds to training the ML models using all the DM-only variables shown in Fig. 1, and in the

right panel the algorithms are trained with the reduced set of features listed

in Table 1.

Note that in Eq. (7) we are not considering the logarithmic values of the targets. One can interpret these differences as a probability distribution: given a value of $y_{\rm pred}$, one can estimate the intrinsic scatter associated with that particular predicted value. These differences are shown in Fig. 6 as a function of the predicted target $y_{\rm pred}$ (first column), the cluster mass $M_{500}$ (second column) and the peak of the velocity profile along the mass accretion history, $V_{\rm peak}$ (third column), for all redshifts. In Fig. 6, instead of plotting the individual differences for all instances, the mean value (dashed black) and the 66% (red region) and 95% (blue region) confidence intervals are represented for sliding windows (bins) containing roughly the same number of objects.
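A minimal sketch of how such sliding-window statistics can be computed, with illustrative array names, is:

import numpy as np

def binned_diff_stats(y_pred, y_true, n_bins=20):
    """Relative difference of Eq. (7) in bins containing roughly equal numbers of clusters."""
    diff = (y_pred - y_true) / y_pred
    order = np.argsort(y_pred)
    for idx in np.array_split(order, n_bins):
        d = diff[idx]
        yield (np.median(y_pred[idx]),         # bin centre
               d.mean(),                       # mean relative difference
               np.percentile(d, [17, 83]),     # ~66% interval
               np.percentile(d, [2.5, 97.5]))  # 95% interval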

The main result that can be observed from Fig. 6 is that the predictions are unbiased with respect to the most important features ($M_{500}$ and $V_{\rm peak}$) and with respect to the predicted targets, as the mean is very close to 0 over the whole range. However, the scatter varies depending on the target, as depicted in Fig. 5 and Fig. 6. In particular, $T_{\rm gas}$ is the most accurately predicted target, with an average scatter of 7% (standard deviation of Eq. 7), and $Y_{\rm X}$ is the predicted variable with the highest average scatter (16%). A summary of the overall statistics of the error scatter can be found in Table 2. In addition, we find a slight dependence of the scatter on $M_{500}$, $V_{\rm peak}$ and the inferred target values (except for $T_{\rm gas}$): the scatter tends to decrease as these values increase. A possible explanation is that massive clusters are more self-similar than smaller groups, which present a larger halo-to-halo variation as a consequence of the stronger impact of non-gravitational processes.

4.2 ML inference of Baryonic properties in Dark matter only

datasets

We now proceed to apply the trained ML model to infer the diﬀerent

baryonic properties in the full set of MDPL2 halo catalogues. For

that, we will use the 10 XGBoost models trained on the reduced

set of features of The300 clusters. In order to create the catalogue,

we ﬁrst build a dataset with the reduced set of features (shown in

Table 1) for each halo of the full MDPL2 box. Note that the same

transformations and cutoffs are applied to the full MDPL2 Rockstar catalogue as in § 2. Next, since ML models are not designed for extrapolation, we discard clusters whose feature values are not inside the hyper-cube defined by the features of the training dataset.


Figure 6. The relative diﬀerence (y-axis) deﬁned in Eq.(7) as a function of the XGBoost-predicted target variable (ﬁrst column), the cluster 3D dynamical

mass 𝑀500𝑐(2)(second column) and the peak of the maximum value of the radial circular velocity proﬁle across the halo’s mass accretion history 𝑉peak(7)

(third column). From top to bottom, each row corresponds to diﬀerent baryon targets: 𝑀gas,𝑀star,𝑇gas,𝑌X,𝑌SZ. Dashed black lines are the mean value of the

relative difference, which are very close to diff = 0. Red regions and blue regions represent the 66% and 95% confidence intervals, respectively. Additionally, 0%

and 20% relative diﬀerence lines are represented in dashed green colour. The data are binned using sliding windows that contain roughly the same number of

clusters.


Table 2. Logarithmic value of the MSE defined in Eq. (6) for the reduced set of features in Table 1. In brackets, we show the standard deviation (scatter) σ of the relative difference defined in Eq. (7). Rows correspond to values of the log MSE for the different models, while columns correspond to the different baryonic targets.

Value                      M_gas        M_star       T_gas        Y_X          Y_SZ
log MSE ×10^-3  XGBoost    2.17 (11%)   3.43 (14%)   0.94 (7%)    4.81 (17%)   4.07 (16%)
log MSE ×10^-3  RF         2.20 (11%)   3.47 (14%)   1.56 (10%)   4.85 (17%)   3.96 (16%)
log MSE ×10^-3  MLP        2.75 (12%)   8.16 (21%)   2.23 (11%)   5.97 (19%)   5.82 (19%)

This means that only MDPL2 clusters such that

x^{\rm training}_{\min} \leq x^{\rm MDPL2} \leq x^{\rm training}_{\max} , \quad {\rm for}\ x \in {\rm features} ,  (8)

will be taken into consideration, where $x^{\rm training}$ is a feature of the training dataset and $x^{\rm MDPL2}$ is the same feature for the full MDPL2 simulation. Only 397 clusters out of 1,306,185 are outside the hyper-cube defined by the most important features and therefore they are not considered in the analysis.
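In code, this hyper-cube cut amounts to a simple per-feature range check against the training set (a sketch, with illustrative array names):

import numpy as np

def inside_training_hypercube(X_train, X_new):
    """Boolean mask selecting rows of X_new whose every feature lies
    within the [min, max] range spanned by the training set (Eq. 8)."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.all((X_new >= lo) & (X_new <= hi), axis=1)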

In order to evaluate if the generated catalogue presents prop-

erties that are coherent with the properties of fully simulated data,

we will compare several baryon properties with respect to the halo

mass for The300 and the full MDPL2 generated catalogue. These

results are shown in Fig. 7 for diﬀerent redshift values (columns). In

these plots, the values of the targets (rows) are plotted with respect

to $M_{500}$. For the targets $M_{\rm gas}$ and $M_{\rm star}$, the plots show the relative fractions

f_i = M_i / M_{500} ,  (9)

where 𝑖can be either gas or star. Error bars represent the standard

deviation of our model predictions on the MDPL2 catalogue (1𝜎),

orange/brown regions correspond to 1𝜎region for The300 test set

predictions (for all the k-folds) and blue regions are the equivalent

but for The300 true targets. Moreover, in the last row of the ﬁgure

the number of clusters per bin is represented as a function of 𝑀500

for both The300 and MDPL2 datasets.

As a general result, the mean predicted values for MDPL2

objects (black error bars) are similar and also their distributions

per mass bin are comparable with the true values (blue region), i.e.

in agreement with Fig. 6. However, the scatter of the predictions

is slightly smaller (around 10-20%) than the corresponding scatter

using the true values of The300 data for 𝑓gas and 𝑓star. Furthermore,

a similar result to the ones shown in Fig. 7 are obtained when

plotting as a function of 𝑉peak instead of 𝑀500 . We need to point out

that for massive clusters ( >8×1014 ℎ−1M), the number of objects

is similar in the The300 and MDPL2 simulations. Particularly, the

last two mass bins are mostly composed of the same objects and the

diﬀerence lies in the baryon properties of the The300 simulation.

4.2.1 Dependence of ML model predictions on DM mass

resolution

The ML model has been trained on a particular DM simulation with

a ﬁxed resolution in mass. Here we are interested to compare the

predictions of the ML model when applied to halo catalogues from

simulations with lower mass resolution. Since some of the features

of the halos are expected to be aﬀected by resolution, then the infer

baryon quantities from the ML model could also be aﬀect by that.

Since our goal is to make this ML model as universal as possible

so it can be applied to diﬀerent DM-only simulations with larger

volumes, it is important to test for these eﬀects. In order to do that,

we are going to apply the trained XGBoost model to two simulations run with identical initial conditions but with a factor of 8 difference in particle mass. For this test, we use a realisation completely different from MDPL2, i.e. the UNIT project. The UNIT⁵ N-body cosmological simulations (UNITSIM, Chuang et al., 2019) are designed to provide accurate predictions of the clustering properties of dark matter halos using the suppressed variance method proposed by Angulo & Pontzen (2016). We particularly focus on one of the UNIT simulations with the same box side length as MDPL2 (i.e. $1\,h^{-1}$ Gpc) and a similar number of particles ($4096^3$). Furthermore, this simulation has also been performed with 8 times fewer particles ($2048^3$). For simplicity we will refer to these two simulations as UNITSIM4096 and UNITSIM2048

for the high and low resolution versions respectively.

Dark matter cluster-size halo catalogues from Rockstar+

Consistent Trees are then selected for UNITSIM4096 and UNIT-

SIM2048 following the same procedure described in § 2. We then

apply the trained XGBoost model to these catalogues to infer the target baryon properties for each DM halo in the two versions.

These baryon properties present similar statistics (mean and scatter

per mass bin) as those shown in Fig. 7. In order to make a more

quantitative comparison of the results for the two UNIT simula-

tions, we bin the data as in Fig. 7 according to 𝑀500 and compute

the diﬀerence of the mean values and estimate an upper limit for its

scatter as

\bar{\mu} = \mu_{2048} - \mu_{4096} \quad {\rm and} \quad \bar{\sigma} = \sqrt{\sigma_{2048}^2 + \sigma_{4096}^2} .  (10)

Here, $\mu$ stands for the mean value and $\sigma$ for the standard deviation of a bin. The particular values of $\bar{\mu}$ and $\bar{\sigma}$ are shown for 3 different snapshots in Fig. 8. As can be seen in this figure, $\bar{\mu} \simeq 0$ with a small scatter for all mass bins. The scatter $\bar{\sigma}$ is within $\sim 2\%$ for $f_{\rm gas}$ and $\sim 0.5\%$ for $f_{\rm star}$. For $T_{\rm gas}$, the residuals amount to $\sim 0.1$ dex and for $Y_{\rm X}$ and $Y_{\rm SZ}$ up to $\sim 0.2$ dex. Therefore, we conclude that the baryonic properties predicted by the ML model for the same halos simulated with a factor of 8 difference in mass resolution are statistically equivalent.
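A sketch of this binned comparison, assuming each catalogue is binned by its own $M_{500}$ and with illustrative argument names, could be:

import numpy as np
from scipy.stats import binned_statistic

def resolution_residuals(m500_lo, y_lo, m500_hi, y_hi, bins):
    """Binned mean difference and combined scatter of Eq. (10) between the
    low-resolution (2048^3) and high-resolution (4096^3) predictions."""
    mu_lo, _, _ = binned_statistic(m500_lo, y_lo, statistic="mean", bins=bins)
    mu_hi, _, _ = binned_statistic(m500_hi, y_hi, statistic="mean", bins=bins)
    sd_lo, _, _ = binned_statistic(m500_lo, y_lo, statistic="std", bins=bins)
    sd_hi, _, _ = binned_statistic(m500_hi, y_hi, statistic="std", bins=bins)
    return mu_lo - mu_hi, np.sqrt(sd_lo ** 2 + sd_hi ** 2)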

5 VALIDATION OF THE GAS SCALING RELATIONS

Scaling relations are generally power laws that relate properties of astrophysical systems, such as the colour-magnitude relation or the Tully-Fisher relation (Tully & Fisher, 1977) for galaxies. The applications of scaling relations are manifold, such as inferring masses of galaxy clusters, which are sensitive to cosmological parameters (e.g. Planck Collaboration et al., 2016). For a recent review of scaling relations for galaxy clusters we refer the reader to e.g. Lovisari & Maughan (2022).

⁵ https://unitsims.ft.uam.es


Figure 7. XGBoost predictions (y-axis) as a function of cluster mass, $M_{500c}(2)$, at 4 different redshifts (different columns) from $z=0$ to $z=1.032$. The first five rows correspond to the predictions of our baryonic targets: gas and star fractions $f_{\rm gas}$ and $f_{\rm star}$, gas temperature $T_{\rm gas}$, X-ray Y-parameter $Y_{\rm X}$ and SZ Y-parameter $Y_{\rm SZ}$. The data are binned along the x-axis and the means of the predicted values for the test set are shown as red dashed lines, with their scatter (standard deviation) represented as a shaded brown/orange region. True values of The300 train set are shown as a blue dashed line and their scatter corresponds to the shaded blue region. Black points represent the average values of the predictions for MDPL2 clusters per mass bin and the error bars correspond to the $1\sigma$ scatter. The bottom row shows the number of cluster objects (N) per mass bin for The300 (blue histogram) and MDPL2 (orange) simulations.

The temperature-mass relation can be written as

E(z)^{-2/3}\,\frac{T_{\rm gas}}{\rm keV} = 10^{A_T}\left[\frac{M}{{\rm M}_\odot}\right]^{B_T} ,  (11)

where $E(z) = H(z)/H_0$ and $H(z)$ is the Hubble parameter. Similarly, for the $Y_{\rm X}-M$ and $Y_{\rm SZ}-M$ relations we use

E(z)^{-2/3}\,\frac{Y_{\rm X}}{h^{-1}\,{\rm M}_\odot\,{\rm keV}} = 10^{A_X}\left[\frac{M}{{\rm M}_\odot}\right]^{B_X}  (12)

and

E(z)^{-2/3}\,\frac{d_A^2\,Y_{\rm SZ}}{{\rm Mpc}^2} = 10^{A_{SZ}}\left[\frac{M}{{\rm M}_\odot}\right]^{B_{SZ}} .  (13)

Here, $A_i$ and $B_i$ ($i = T, {\rm X}, {\rm SZ}$) are the parameters that we are interested in obtaining by fitting the above equations to our data. Once we have generated baryon catalogues for the different N-body simulations, we apply a simple linear fitting function in logarithmic space to fit the data to the equations listed above. However, selecting data from different snapshots gives small variations of the best-fitting parameters $A_i$ and $B_i$ with redshift.


Figure 8. The diﬀerence of XGBoost-predicted baryonic properties for halos corresponding to the UNITSIM2048 and UNITSIM4096 DM-only simulations

as a function of the cluster total mass $M_{500c}(2)$. The y-axis represents values of $\bar{\mu}$ (black points) and $\bar{\sigma}$ (error bars and blue region) defined in Eq. (10) for different mass bins. From top to bottom, the rows represent the baryonic properties considered: $f_{\rm gas}$, $f_{\rm star}$, $T_{\rm gas}$, $Y_{\rm X}$ and $Y_{\rm SZ}$. From left to right, we show our results for three different redshifts: $z=0$ (first column), $z=0.52$ (second column) and $z=1$ (third column).

Therefore, we use the following parametrization to study the redshift dependence:

A_i(z) = A_{i,0}(1+z)^{\alpha_i} ,  (14)

B_i(z) = B_{i,0}(1+z)^{\beta_i} ,  (15)

where $A_{i,0}$ and $B_{i,0}$ are the values of the intercept and slope at $z=0$, and $\alpha_i$ and $\beta_i$ describe their possible dependence on redshift. With this new parametrization we apply a non-linear least-squares fitting model to fit the functions described by equations (11), (12) and (13), updated with equations (14) and (15). The best-fitting parameters are shown in Table 3 and Table 4. Note that we have used the mass corresponding to the N-body simulation (the feature variable $M_{500c}(2)$) as the mass of the cluster.
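A minimal sketch of this non-linear least-squares fit for the $T$-$M$ relation (Eqs. 11, 14 and 15), using scipy and toy stand-in data with illustrative variable names, is:

import numpy as np
from scipy.optimize import curve_fit

def log_T_model(X, A0, B0, alpha, beta):
    """log10 of Eq. (11), with A_T(z) and B_T(z) given by Eqs. (14)-(15)."""
    log_m, z = X
    A = A0 * (1.0 + z) ** alpha
    B = B0 * (1.0 + z) ** beta
    return A + B * log_m

# Toy stand-ins: log_m is log10 of the (pivot-scaled) cluster mass, z the redshift,
# e_z = E(z) = H(z)/H0 and t_gas the gas temperature in keV.
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 1.0, 2000)
log_m = rng.uniform(-0.5, 1.0, 2000)
e_z = np.sqrt(0.307 * (1.0 + z) ** 3 + 0.693)
t_gas = e_z ** (2.0 / 3.0) * 10 ** (0.21 + 0.61 * log_m) * rng.lognormal(0.0, 0.05, 2000)

y_data = np.log10(e_z ** (-2.0 / 3.0) * t_gas)
popt, pcov = curve_fit(log_T_model, (log_m, z), y_data, p0=[0.2, 0.6, 0.0, 0.0])
A_T0, B_T0, alpha_T, beta_T = popt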

As a general result, the ﬁtting parameters are in agreement

among the three diﬀerent N-body simulations and are slightly dif-

ferent from The300 hydrodynamical simulation. This deviation,

though small, is caused by considering the full box of the dark-matter-only simulations instead of the smaller volume of the

‘zoom’ simulation. The eﬀect of resolution is negligible for galaxy

clusters. There is also a small difference between The300 simulation's true data (The300) and the fitting counterpart using the ML-predicted data (The300*). This slight difference can mainly be ap-

preciated in the intrinsic scatter of the linear ﬁtting function, which

is generally smaller in the case of The300*. It is important to note

that the scatter of the scaling law for The300 simulations is generally

larger when compared with the values shown in Table 2, where the scatter (standard deviation of the relative difference) is reduced by a factor of 0.5 for the gas temperature, 0.3 for $Y_{\rm X}$ and 0.45 for $Y_{\rm SZ}$. Moreover, the most relevant variables for each gas property presented in Table 1 can be used for finding analytical expres-

sions for scaling laws with a reduced MSE using genetic algorithms

(Wadekar et al.,2022).

As far as the redshift dependence is concerned, it is negligible for $Y_{\rm X}$ and $Y_{\rm SZ}$, where the parameters $\alpha$ and $\beta$ are of order $\lesssim 10^{-3}$. However, the parameter $\alpha_T \simeq -0.3$ cannot be ignored. This indicates that the evolution of $T_{\rm gas}$ is relevant, as can also be appreciated in Table 1, where the scale factor a(24) is the second most important variable, reducing the normalised loss function $\mathcal{L}$ from 1 to 0.6.

6 SUMMARY AND CONCLUSIONS

Numerical simulations are key to studying galaxy clusters. On the one hand, with current technology it is possible to perform large-volume N-body simulations that describe the dark-matter component. On the other hand, hydrodynamical simulations of comparably large volumes cannot be carried out due to their computational demands. We have therefore trained a set of machine learning models to populate large-volume dark-matter-only simulations with baryonic properties. In particular, our feature space consists of the Rockstar variables of the DM-only halos, and our target variables are directly estimated from The Three Hundred hydrodynamical simulations: the gas mass 𝑀gas, the stellar mass 𝑀star, the gas temperature 𝑇gas, the X-ray 𝑌X parameter and the integrated Compton-y parameter 𝑌SZ. All these quantities are integrated within spherical regions of overdensity 500 times the critical density at the corresponding redshift.


Table 3. The best-fitting parameters of the 𝑇−𝑀, 𝑌X−𝑀 and 𝑌SZ−𝑀 relations for the different simulation sets. The log MSE defined in Eq. (6) and the average scatter of the relative difference defined in Eq. (7) (in parentheses) are also shown. For The300 simulation the true values of the baryon properties have been used, while for The300* the XGBoost-predicted values are used instead. The relative error in the estimated parameters 𝐴𝑖 and 𝐵𝑖 is always ≤ 10^−3.

Simulation     A_T,0    B_T,0    log MSE_T          A_X,0    B_X,0    log MSE_X          A_SZ,0    B_SZ,0    log MSE_SZ
The300         0.2083   0.6081   1.8×10^-3 (10%)    13.09    1.718    8.3×10^-3 (25%)    -5.499    1.697     9.5×10^-3 (29%)
The300*        0.2082   0.6054   1.6×10^-3 (10%)    13.08    1.718    3.9×10^-3 (17%)    -5.497    1.692     7.2×10^-3 (25%)
MDPL2          0.2133   0.5863   3.3×10^-3 (11%)    13.07    1.767    2.8×10^-3 (13%)    -5.513    1.710     6.5×10^-3 (21%)
UNITSIM4096    0.2122   0.5865   3.3×10^-3 (11%)    13.07    1.767    2.8×10^-3 (13%)    -5.514    1.709     6.5×10^-3 (21%)
UNITSIM2048    0.2126   0.5854   3.3×10^-3 (11%)    13.07    1.766    2.8×10^-3 (13%)    -5.515    1.709     6.5×10^-3 (21%)

Table 4. The best-fitting redshift-dependence parameters for the scaling relations defined in Eq. (14) and Eq. (15).

Simulation     α_T (×10^-3)    β_T (×10^-3)    α_X (×10^-3)     β_X (×10^-3)    α_SZ (×10^-3)    β_SZ (×10^-3)
The300         −339.0 ± 5.3    −3.8 ± 4.3      −1.15 ± 0.16     3.4 ± 3.3       5.36 ± 0.41      −0.113 ± 0.035
The300*        −336.7 ± 4.8    −11.1 ± 4.0     −0.73 ± 0.11     6.4 ± 2.2       6.22 ± 0.45      −0.122 ± 0.031
MDPL2          −314.3 ± 1.2    30.4 ± 1.8      0.417 ± 0.021    0.42 ± 0.65     7.181 ± 0.074    0.2 ± 1.0
UNITSIM4096    −308.1 ± 1.3    30.5 ± 1.8      0.52 ± 0.21      0.22 ± 0.68     7.162 ± 0.075    0.0 ± 1.1
UNITSIM2048    −302.9 ± 1.3    29.9 ± 1.8      −0.052 ± 0.021   −0.93 ± 0.71    6.279 ± 0.075    −2.0 ± 1.1


In particular, we have considered three different ML models: random forest (RF), extreme gradient boosting (XGBoost) and the multilayer perceptron (MLP). We have determined that XGBoost is the algorithm best suited to our dataset, with predictions closest to the true hydrodynamical targets, as shown in Table 2. We have applied an algorithm, the Greedy Search Feature Importance Algorithm (GSFIA), to identify the features that carry the most predictive information. Using GSFIA, we have reduced the dimensionality of our feature space from 27 to approximately 5 variables, depending on the target. We have shown that masses and velocities carry most of the predictive information, while time-evolution variables play a secondary role in the prediction of our targets. Moreover, the ellipticity, dynamical-state and spin features are redundant, possibly because our baryon targets are integrated within spherical regions.
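The greedy forward-selection idea behind GSFIA can be sketched as follows. This is a hedged, simplified version (scikit-learn cross-validation around an XGBoost regressor, with placeholder data), not the exact implementation used here.

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data: 27 Rockstar features per halo and one baryonic target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 27))
y = X[:, 2] + 0.5 * X[:, 9] + 0.1 * rng.normal(size=1000)

def greedy_forward_selection(X, y, n_keep=5):
    """At each step add the feature that most improves the cross-validated score."""
    remaining = list(range(X.shape[1]))
    selected = []
    for _ in range(n_keep):
        best = max(
            (cross_val_score(XGBRegressor(n_estimators=100, max_depth=4),
                             X[:, selected + [j]], y, cv=3,
                             scoring="neg_mean_squared_error").mean(), j)
            for j in remaining
        )
        selected.append(best[1])
        remaining.remove(best[1])
    return selected

print(greedy_forward_selection(X, y))   # indices of the most informative features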

We have then applied our trained ML model to populate halo catalogues with baryonic properties for two full-box N-body simulations: the MultiDark simulation (MDPL2) and the UNIT N-body cosmological simulations (UNITSIM). The baryon properties predicted for MDPL2 are compatible with those of The300 simulations, as shown in Fig. 7. The application to two UNITSIM simulations with a 1 h^−1 Gpc box size and 2048^3 and 4096^3 particles shows that our model can be successfully applied to boxes whose mass resolution is up to 1/8 of that of the simulation used for training. This suggests that it is a promising method to populate large-volume UNITSIM N-body halos with baryon properties in volumes of up to 27 Gpc^3 (i.e. a 3 h^−1 Gpc box with 6144^3 particles). This will be an excellent tool to study the large-scale distribution of galaxy clusters in an unprecedented way. For instance, we can estimate the cosmic variance in the number counts of X-ray detected clusters from the eROSITA all-sky survey (Liu et al. 2021) by extracting many different light-cones from this large computational volume. This will be the subject of a forthcoming paper.

Furthermore, the scaling relations are powerful mass–observable proxies. We have checked that the best-fitting parameters inferred using our three mock DM full-box baryon catalogues are compatible. They nevertheless differ slightly from those of The300, partially because of the considerably smaller number of cluster objects in the hydrodynamical simulations used to obtain the best-fitting values. This suggests that mass completeness has a small, though not negligible, impact on the calibration of the mass proxies.

To conclude, our work shows that ML models are very useful for finding a mapping between the dark-matter halo properties found in N-body simulations and those of the complex hydrodynamical simulations. We have checked that, on average, the catalogues generated for the three dark-matter-only simulations used throughout this paper have the same distributions as the true training set and, therefore, they can be used for painting dark matter halos with baryonic properties that are directly related to observed quantities, providing added value to large-volume collisionless N-body simulations.
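A comparison of this kind can be sketched with a two-sample Kolmogorov–Smirnov test, as below. The arrays are placeholders and the KS test is only one possible choice of statistic, not necessarily the exact comparison performed in this work.

import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays: log10 Tgas for the training clusters (The300) and for the
# ML-generated MDPL2 catalogue; in practice they come from the actual catalogues.
rng = np.random.default_rng(1)
log_T_train = rng.normal(0.8, 0.15, 2000)
log_T_mdpl2 = rng.normal(0.8, 0.15, 20000)

stat, pvalue = ks_2samp(log_T_train, log_T_mdpl2)
print(f"KS statistic = {stat:.3f}, p-value = {pvalue:.3f}")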

ACKNOWLEDGEMENTS

DA and GY would like to thank MINECO/FEDER for financial support under the research grant PGC2018-094975-C21. WC is supported by the STFC AGP Grant ST/V000594/1 and by the ATRACCIÓN DE TALENTO INVESTIGADOR grant of the Comunidad de Madrid, 2020-T1/TIC-19882. He further acknowledges the science research grants from the China Manned Space Project with No. CMS-CSST-2021-A01 and CMS-CSST-2021-B01. GM acknowledges financial support from PID2019-106827GB-I00/AEI/10.13039/501100011033. The CosmoSim database used in this paper is a service by the Leibniz-Institute for Astrophysics Potsdam

(AIP). The MultiDark database was developed in cooperation

with the Spanish MultiDark Consolider Project CSD2009-00064.

The authors acknowledge The Red Española de Supercomputación

for granting computing time for running the hydrodynamical simu-

lations of The300 galaxy cluster project in the Marenostrum super-

computer at the Barcelona Supercomputing Center.


DATA AVAILABILITY

The trained models and data products for MDPL2, UNITSIM2048 and UNITSIM4096 are publicly available at https://github.com/The300th/DarkML.
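As a usage note, a trained XGBoost model released in the repository could be loaded and applied to a new DM-only halo catalogue along the following lines. The file names and feature ordering below are hypothetical; the repository README should be consulted for the actual formats.

import numpy as np
import xgboost as xgb

# Hypothetical file names; see the DarkML repository for the released products.
model = xgb.XGBRegressor()
model.load_model("xgboost_Tgas.json")        # a trained model for one baryonic target

# X_dm: (n_halos, n_features) array of Rockstar features, ordered as in Table A1.
X_dm = np.load("dm_halo_features.npy")
T_gas_pred = model.predict(X_dm)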

References

Allen S. W., Evrard A. E., Mantz A. B., 2011, ARA&A, 49, 409
Allgood B., Flores R. A., Primack J. R., Kravtsov A. V., Wechsler R. H., Faltenbacher A., Bullock J. S., 2006, MNRAS, 367, 1781
Altmann A., Toloşi L., Sander O., Lengauer T., 2010, Bioinformatics, 26, 1340
Angulo R. E., Pontzen A., 2016, MNRAS, 462, L1
Angulo R., Springel V., White S., Jenkins A., Baugh C., Frenk C., 2012, MNRAS, 426, 2046
Angulo R. E., Zennaro M., Contreras S., Aricò G., Pellejero-Ibañez M., Stücker J., 2021, MNRAS, 507, 5869
Bahé Y. M., et al., 2017, MNRAS, 470, 4186
Barnes D. J., Kay S. T., Henson M. A., McCarthy I. G., Schaye J., Jenkins A., 2016, MNRAS, p. stw2722
Barnes D. J., et al., 2017, MNRAS, 471, 1088
Barredo Arrieta A., et al., 2020, Information Fusion, 58, 82
Baugh C. M., 2006, Reports on Progress in Physics, 69, 3101
Behroozi P. S., Wechsler R. H., Wu H.-Y., 2012, ApJ, 762, 109
Behroozi P. S., Wechsler R. H., Wu H.-Y., Busha M. T., Klypin A. A., Primack J. R., 2013, ApJ, 763, 18
Benson A. J., 2012, New Astronomy, 17, 175
Bentéjac C., Csörgő A., Martínez-Muñoz G., 2021, Artificial Intelligence Review, 54, 1937
Bernardini M., Feldmann R., Anglés-Alcázar D., Boylan-Kolchin M., Bullock J., Mayer L., Stadel J., 2022, MNRAS, 509, 1323
Breiman L., 2001, Machine Learning, 45, 5
Bryan G. L., Norman M. L., 1998, ApJ, 495, 80
Bullock J. S., Kolatt T. S., Sigad Y., Somerville R. S., Kravtsov A. V., Klypin A. A., Primack J. R., Dekel A., 2001, MNRAS, 321, 559
Chen T., Guestrin C., 2016, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '16. ACM, New York, NY, USA, pp 785–794
Chisari N., et al., 2016, MNRAS, 461, 2702
Chuang C.-H., et al., 2019, MNRAS, 487, 48
Cora S. A., et al., 2018, MNRAS, 479, 2
Croton D. J., et al., 2016, ApJS, 222, 22
Cui W., Borgani S., Dolag K., Murante G., Tornatore L., 2012, MNRAS, 423, 2279
Cui W., Borgani S., Murante G., 2014, MNRAS, 441, 1769
Cui W., et al., 2018, MNRAS, 480, 2898
Cui W., et al., 2022, arXiv e-prints, p. arXiv:2202.14038
Davé R., Anglés-Alcázar D., Narayanan D., Li Q., Rafieferantsoa M. H., Appleby S., 2019, MNRAS, 486, 2827
Dietterich T. G., 1998, AI Magazine, 18, 97
Dolag K., Komatsu E., Sunyaev R., 2016, MNRAS, 463, 1797
Evrard A. E., Metzler C. A., Navarro J. F., 1996, ApJ, 469, 494
Fernández-Delgado M., Cernadas E., Barro S., Amorim D., 2014, Journal of Machine Learning Research, 15, 3133
Ferri F. J., Pudil P., Hatef M., Kittler J., 1994, in Machine Intelligence and Pattern Recognition, Vol. 16. Elsevier, pp 403–413
Fosalba P., Crocce M., Gaztañaga E., Castander F., 2015, MNRAS, 448, 2987
Habib S., et al., 2016, New Astronomy, 42, 49
Ishiyama T., et al., 2021, MNRAS, 506, 4210
Jo Y., Kim J.-h., 2019, MNRAS, 489, 3565
Kamdar H. M., Turk M. J., Brunner R. J., 2016, MNRAS, 457, 1162
Klypin A. A., Trujillo-Gomez S., Primack J., 2011, ApJ, 740, 102
Klypin A., Yepes G., Gottlöber S., Prada F., Heß S., 2016, MNRAS, 457, 4340
Knollmann S. R., Knebe A., 2009, ApJS, 182, 608
Kravtsov A. V., Borgani S., 2012, ARA&A, 50, 353
Kravtsov A. V., Vikhlinin A., Nagai D., 2006, ApJ, 650, 128
Kuhn M., Johnson K., et al., 2013, Applied Predictive Modeling. Vol. 26, Springer
Lacey C. G., et al., 2016, MNRAS, 462, 3854
Le Brun A. M., McCarthy I. G., Melin J.-B., 2015, MNRAS, 451, 3868
Liu A., et al., 2021, arXiv e-prints, p. arXiv:2106.14518
Lovisari L., Maughan B. J., 2022, arXiv e-prints, p. arXiv:2202.07673
McCarthy I. G., Bird S., Schaye J., Harnois-Deraps J., Font A. S., Van Waerbeke L., 2018, MNRAS, 476, 2999
Moews B., Davé R., Mitra S., Hassan S., Cui W., 2021, MNRAS, 504, 4024
Murante G., Monaco P., Giovalli M., Borgani S., Diaferio A., 2010, MNRAS, 405, 1491
Navarro J. F., Frenk C. S., White S. D. M., 1997, ApJ, 490, 493
Nelson D., et al., 2019, Computational Astrophysics and Cosmology, 6, 2
Nembrini S., König I. R., Wright M. N., 2018, Bioinformatics, 34, 3711
Pedregosa F., et al., 2011, Journal of Machine Learning Research, 12, 2825
Peebles P. J. E., 1969, ApJ, 155, 393
Planck Collaboration et al., 2016, A&A, 594, A13
Planelles S., Borgani S., Dolag K., Ettori S., Fabjan D., Murante G., Tornatore L., 2013, MNRAS, 431, 1487
Potter D., Stadel J., Teyssier R., 2017, Computational Astrophysics and Cosmology, 4, 1
Rasia E., et al., 2015, ApJ, 813, L17
Schaye J., et al., 2015, MNRAS, 446, 521
Schmidhuber J., 2015, Neural Networks, 61, 85
Sembolini F., Yepes G., De Petris M., Gottlöber S., Lamagna L., Comis B., 2013, MNRAS, 429, 323
Skillman S. W., Warren M. S., Turk M. J., Wechsler R. H., Holz D. E., Sutter P. M., 2014, arXiv e-prints, p. arXiv:1407.2600
Sunyaev R. A., Zeldovich Y. B., 1972, Comments on Astrophysics and Space Physics, 4, 173
Truong N., et al., 2018, MNRAS, 474, 4089
Tully R. B., Fisher J. R., 1977, A&A, 54, 661
Villaescusa-Navarro F., et al., 2022, arXiv e-prints, p. arXiv:2201.01300
Virtanen P., et al., 2020, Nature Methods, 17, 261
Vogelsberger M., et al., 2014, MNRAS, 444, 1518
Wadekar D., Villaescusa-Navarro F., Ho S., Perreault-Levasseur L., 2021, ApJ, 916, 42
Wadekar D., et al., 2022, arXiv e-prints, p. arXiv:2201.01305
Wu H.-Y., Evrard A. E., Hahn O., Martizzi D., Teyssier R., Wechsler R. H., 2015, MNRAS, 452, 1982
Zandanel F., Fornasa M., Prada F., Reiprich T. H., Pacaud F., Klypin A., 2018, MNRAS, 480, 987
Zhang C., Liu C., Zhang X., Almpanidis G., 2017, Expert Systems with Applications, 82, 128

APPENDIX A: DESCRIPTION AND ENUMERATION OF

FEATURE VARIABLES

In this appendix, we describe the 27 selected features from the Rockstar + Consistent Trees catalogues. Although this information can be found in Behroozi et al. (2012) and Behroozi et al. (2013), as well as in the CosmoSim MultiDark database (https://www.cosmosim.org/), we include in Table A1 a brief description of the variables for the reader's convenience.

This paper has been typeset from a TEX/LATEX file prepared by the author.


Table A1. The feature variables used in this text from the Rockstar catalogue. The first column gives the variable name and its enumeration in brackets.

Variable                Units       Description
M2500c (0)              h^-1 M☉     Mass inside the radius of a sphere where the matter density is 2500 times the critical density at the cluster's redshift
num_prog (1)            -           Total number of progenitors of the cluster
M500c (2)               h^-1 M☉     Mass inside the radius of a sphere where the matter density is 500 times the critical density at the cluster's redshift
M200c (3)               h^-1 M☉     Mass inside the radius of a sphere where the matter density is 200 times the critical density at the cluster's redshift
Mpeak (4)               h^-1 M☉     Peak value of the halo mass across its accretion history
mvir (5)                h^-1 M☉     Halo mass within the virial radius
Macc (6)                h^-1 M☉     Halo mass at accretion time
Vpeak (7)               km/s        Peak value of Vmax (9) across the mass accretion history
Vmax@Mpeak (8)          km/s        Vmax at the expansion time at which Mpeak was reached
Vmax (9)                km/s        Maximum value of the circular velocity
Vacc (10)               km/s        Vmax at accretion time
rvir (11)               h^-1 kpc    Halo radius at the virial overdensity
vrms (12)               km/s        Root-mean-square velocity dispersion
b_to_a(500c) (13)       -           Ratio between the second largest and the largest shape-ellipsoid axes, for particles within R500
c_to_a(500c) (14)       -           Ratio between the third largest and the largest shape-ellipsoid axes, for particles within R500
b_to_a (15)             -           Ratio between the second largest and the largest shape-ellipsoid axes, determined with the method of Allgood et al. (2006)
c_to_a (16)             -           Ratio between the third largest and the largest shape-ellipsoid axes, determined with the method of Allgood et al. (2006)
rs (17)                 h^-1 kpc    Comoving scale radius from the fit to an NFW (Navarro et al. 1997) density profile
Rs_Klypin (18)          h^-1 kpc    Comoving scale radius determined from Vmax and Mvir (Klypin et al. 2011)
T/|U| (19)              -           Ratio between the total kinetic and potential energies of particles within the virial radius
Xoff (20)               h^-1 kpc    Offset between the comoving density peak and the centre-of-mass position of the particles
Voff (21)               km/s        Offset between the halo core velocity and the centre-of-mass velocity of particles within the virial radius
Spin (22)               -           Peebles' dimensionless spin parameter of the halo (Peebles 1969)
Spin_Bullock (23)       -           Bullock's dimensionless spin parameter (Bullock et al. 2001)
a (24)                  -           Expansion scale factor of the corresponding simulation snapshot
scale_of_last_MM (25)   -           Expansion scale factor of the last major merger with a mass ratio greater than 0.3
Halfmass_Scale (26)     -           Expansion scale factor at which the most massive halo progenitor reached 0.5 × Mpeak (4)
