The Body Mass Index recalculated

Koen Van de moortel, MSc experimental physics, independent math tutor
Jules de Saint-Genoisstraat 98, 9050 Gentbrugge, Belgium, info@lerenisplezant.be
15 June 2021 - small addition 20 February 2022
Abstract: The so-called ‘least squares regression’ for mathematical modeling is a widely used
technique. It is so common that one might think the algorithm cannot be improved anymore.
But it can, by minimizing the squares of the differences between measured and
predicted values not only in the vertical, but also in the horizontal direction. I call this
‘multidirectional regression’. The difference is very significant, especially for power function
models, which are often used in biomedical sciences. It also makes the regression invariant when
the dependent and independent variables are switched, a problem that the traditional method
neglects. The Body Mass Index and the Corpulence Index and their correlation with
body fat percentage are studied here as an example showing that this regression technique
produces better results.
Keywords: nonlinear regression, mathematical modeling, multidirectional regression, curve
fitting, software, algorithm, body mass index, BMI, corpulence index, ponderal index, body fat
percentage, anthropometry, scaling law, power function.
Introduction: the BMI mystery
In the process of writing a book about measuring methodology and regression analysis, I
thought the so-called “Body Mass Index” (BMI) might be a good example of quantification: how to
put a number on ‘overweight’. As you probably know, it is calculated by taking a person’s mass
m (in kg) and dividing it by the square of the height h (in meters). Now, this is quite awkward,
since the masses of objects with the same shape and similar density distributions are
proportional to the third power of the height (or any longitudinal dimension).
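As a minimal sketch of the two candidate definitions (the function names and the example person are illustrative assumptions, not taken from any dataset), the classic index and its cubic counterpart can be computed like this:

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Classic Body Mass Index: mass divided by height squared (kg/m^2)."""
    return mass_kg / height_m ** 2


def mass_over_height_cubed(mass_kg: float, height_m: float) -> float:
    """Mass divided by height cubed (kg/m^3), as geometric scaling would suggest."""
    return mass_kg / height_m ** 3


# Hypothetical person: 70 kg, 1.75 m
print(round(bmi(70.0, 1.75), 2))                     # 22.86
print(round(mass_over_height_cubed(70.0, 1.75), 2))  # 13.06
```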
So I started digging... Why did the inventor, Adolphe Quêtelet, who happens to have lived in the
same city as me (Ghent, Belgium), define this index with h² in 1832? I wanted to find the
original data that he analyzed, the ‘reference people’ to calibrate it. Strangely, there seems to
be no trace of them on the internet, and also no other dataset could be found! Thousands of
sites offer ‘ideal mass’ tables or calculators, and many use obscure disclosed formulas,
clearly not using h², some even using a linear relationship!
For me as a physicist, it’s hard to believe, but apparently it took almost a century before
someone (Fritz Rohrer, CH) came up with the idea to calculate the index with h³ anyway. This
number: m/h³ is then called ‘Corpulence Index’ (CI) or ‘Ponderal Index’ (PI) [Rohrer 1921]. It
took another century until someone (Sultan Babar, SA) found what was to be expected: “It has
the advantage that it doesn’t need to be adjusted for age after adolescence.” [Babar 2015] In
spite of that, the general public still only knows the BMI. Today, June 2021, a search on
Academia.edu on BMI produces more than 795000 results, while only 1735 articles mention the
CI, and 6428 the PI. It took until 2013 before someone like Nick Trefethen (numerical analyst at
the University of Oxford, GB) raised his eyebrows and dared to make this remark in The
Economist: “The body-mass index that you (and the National Health Service) count on to
assess obesity is a bizarre measure. We live in a three-dimensional world, yet the BMI is
defined as weight divided by height squared. It was invented in the 1840s, before calculators,
when a formula had to be very simple to be usable. As a consequence of this ill-founded
definition, millions of short people think they are thinner than they are, and millions of tall
people think they are fatter.” And then he said: “You might think that the exponent should
simply be 3, but that doesn’t match the data at all. It has been known for a long time that
people don’t scale in a perfectly linear fashion as they grow. I propose that a better
approximation to the actual sizes and shapes of healthy bodies might be given by an exponent
of 2.5. So here is the formula I think is worth considering as an alternative to the standard
BMI: ‘new BMI’ = $1.3\,m/h^{2.5}$.” [Trefethen, 2013; my emphasis on “that doesn’t match the data at all”]
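To see what the choice of exponent does, here is a small sketch under a strong assumption: three hypothetical people who are exact scaled copies of each other (same shape, same density, so mass proportional to h³). Real bodies need not scale this way, which is exactly the point under discussion; the names and numbers below are illustrative only.

```python
def body_index(mass_kg, height_m, exponent, factor=1.0):
    """Generic index: factor * mass / height**exponent."""
    return factor * mass_kg / height_m ** exponent


def similar_body_mass(ref_mass_kg, ref_height_m, height_m):
    """Mass of a body with the same shape and density as the reference (mass ~ h**3)."""
    return ref_mass_kg * (height_m / ref_height_m) ** 3


REF_HEIGHT, REF_MASS = 1.70, 65.0  # hypothetical reference person

for h in (1.50, 1.70, 1.90):
    m = similar_body_mass(REF_MASS, REF_HEIGHT, h)
    print(f"h={h:.2f} m  m={m:5.1f} kg  "
          f"BMI={body_index(m, h, 2):.1f}  "
          f"new BMI={body_index(m, h, 2.5, 1.3):.1f}  "
          f"m/h^3={body_index(m, h, 3):.2f}")
```

For such idealized bodies only the index with exponent 3 stays constant; both the classic BMI and the 2.5 variant rise with height.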
Now, how could it “not match the data”? I was curious now to inspect some data myself. After
a long search, I came in touch with Nir Krakauer (The City College of New York), who was also
doing BMI-related modeling, and he was so kind to refer me to his data:
rdrr.io/github/dtkaplan/NIMBIOS/man/nhanesOriginal.html.
From this large collection, I extracted the masses and heights of 90 adult men who had a more
or less ‘ideal’ body fat percentage: between 11.6 and 13.8%. I’m not a medical doctor, but
according to different sources these percentages seem to be considered good for young
adults. The most important point for this selection was to have a more or less homogeneous
group with a range of sizes, but with similar densities. Of course I know other factors like
bone density and body type play a role as well, but this is the best I could do.
First, I put the data in the popular math program ‘GeoGebra’ (version 5 Classic):
This calculated the following relationship as ‘best fitting’: $m = 18.8541\,h^{2.1419}$.
Aha, that must have been the reason why Quêtelet decided to round the exponent of h to 2,
because the empirical value seems to be 2.1419! Then I realized that this program takes the
logarithms of the variables, in order to reduce the regression problem to a linear one. This
causes errors, as I illustrated elsewhere [Van de moortel 2021].
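The difference between the two approaches can be sketched in a few lines of plain Python. This is not the code of GeoGebra, FittingKVdm or GraphPad; it is a hypothetical illustration with made-up data. `fit_loglog` fits a straight line to the logarithms, while `fit_direct` minimizes the squared residuals of the untransformed masses (a golden-section search over the exponent, with the factor a eliminated analytically). On noise-free data both agree; on noisy data they generally do not, because they minimize different sums.

```python
import math


def fit_loglog(xs, ys):
    """Straight-line fit of ln(y) against ln(x); returns (a, b) for y = a * x**b."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    return math.exp(my - b * mx), b


def best_a(xs, ys, b):
    """For a fixed exponent b, the a that minimizes sum((y - a * x**b)**2)."""
    return sum(y * x ** b for x, y in zip(xs, ys)) / sum(x ** (2 * b) for x in xs)


def sse(xs, ys, b):
    """Sum of squared vertical residuals at exponent b, with a already optimized."""
    a = best_a(xs, ys, b)
    return sum((y - a * x ** b) ** 2 for x, y in zip(xs, ys))


def fit_direct(xs, ys, lo=0.5, hi=6.0, tol=1e-9):
    """Golden-section search over b (assumes a single minimum between lo and hi)."""
    g = (math.sqrt(5) - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if sse(xs, ys, c) < sse(xs, ys, d):
            hi, d = d, c
            c = hi - g * (hi - lo)
        else:
            lo, c = c, d
            d = lo + g * (hi - lo)
    b = (lo + hi) / 2
    return best_a(xs, ys, b), b


# Made-up data: an exact cubic law plus arbitrary deviations in kg
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
exact = [12.0 * h ** 3 for h in heights]
noisy = [m + e for m, e in zip(exact, [5.0, -3.0, 4.0, -6.0, 2.0, -1.0])]

print("log-log fit:", fit_loglog(heights, noisy))
print("direct fit: ", fit_direct(heights, noisy))
```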
So I decided to put the data in my own software program, called ‘FittingKVdm’, which uses an
iterative algorithm to estimate the parameters. This produced: $m = 19.331\,h^{2.1084}$. Now the
exponent was even closer to 2! Strange!
GraphPad Prism 9.0.2, a program that seems well designed to me, and also uses iteration,
produced an identical result. Its authors also condemn the logarithm habit, by the way.
Still not being happy, I wanted to see the difference between a fit with a fixed exponent of 2
and one with exponent 3. The results: m = 20.5918 h² and m = 11.4640 h³.
The value of 20.5918 is indeed a good BMI, and 11.4640 is close to the ‘good’ value of 12 for the
CI according to Sultan Babar.
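For a fixed exponent, the classical (vertical) least squares value of the remaining factor has a closed form. The short derivation below is a sketch that assumes equal measuring errors for all points; the symbols $m_i$ and $h_i$ for the individual masses and heights are my notation, not the paper's.

```latex
S(a) = \sum_{i=1}^{n} \left(m_i - a\,h_i^{\,b}\right)^2
\qquad\Rightarrow\qquad
\frac{dS}{da} = -2 \sum_{i=1}^{n} h_i^{\,b}\left(m_i - a\,h_i^{\,b}\right) = 0
\qquad\Rightarrow\qquad
a = \frac{\sum_{i=1}^{n} m_i\,h_i^{\,b}}{\sum_{i=1}^{n} h_i^{\,2b}}
```

With b = 2 the resulting a is expressed in BMI units (kg/m²), with b = 3 in CI units (kg/m³).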
Now, a picture is worth a thousand words, so I would like you to take a look at the mass
versus height graphs of both fitted curves (don’t mind if you can’t read the small letters, just
look at the dots and the lines):
Which line visually fits the best through the cloud of points? Everyone I asked this question,
answered: the one on the right, obviously!
So now the question came up: is there something wrong with the regression method itself?
Well, there is definitely an asymmetry: the classical algorithm that everybody uses minimizes
the sum of the (weighted) vertical distances between the measured values $y_i$ and the predicted
values $f(x_i)$. The weights are inversely proportional to the squares of the measuring errors
$\sigma_{y,i}$. The parameters in the model function are adjusted in order to minimize this sum:

$$S = \sum_{i=1}^{n} \frac{\left(y_i - f(x_i)\right)^2}{\sigma_{y,i}^2}$$
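In code, this classical sum is a one-liner. The sketch below uses made-up heights, masses and errors, and the two parameter pairs are simply rounded candidates for a quadratic and a cubic law; all names are illustrative.

```python
def weighted_vertical_sum(f, xs, ys, sigma_y):
    """Sum over all points of ((y_i - f(x_i)) / sigma_y_i)**2."""
    return sum(((y - f(x)) / s) ** 2 for x, y, s in zip(xs, ys, sigma_y))


def power_model(a, b):
    """Return the function x -> a * x**b."""
    return lambda x: a * x ** b


# Made-up measurements: heights in m, masses in kg, mass errors in kg
heights = [1.60, 1.70, 1.80, 1.90]
masses = [48.0, 57.0, 68.0, 79.0]
sigma_m = [0.5, 0.5, 0.5, 0.5]

for a, b in ((20.6, 2), (11.5, 3)):
    s = weighted_vertical_sum(power_model(a, b), heights, masses, sigma_m)
    print(f"a={a}, b={b}: S={s:.1f}")
```

A fitting routine varies the parameters until this number is as small as possible.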
Would it make any difference if we used the sum of the horizontal distances? Why is it
not done? Well, in the case of non-invertible functions, especially periodic functions, there
are many such distances for every y value, but for a bijective function like the one above, it’s
perfectly possible. The simplest way to try it is by switching the so-called ‘independent’ and
the ‘dependent’ variable.
If the ‘best fit’ for our data, with free moving exponent, $m = 19.331\,h^{2.1084}$, was indeed the best fit,
it shouldn’t make any difference if we switched the h and m columns and fitted again, should
it? The expected outcome of this procedure would be:

$$h = \left(\frac{m}{19.331}\right)^{\frac{1}{2.1084}} = 0.24534\; m^{0.47429}$$
Now, what was the actual outcome? $h = 0.72877\,m^{0.21394}$. Or, inverted: $m = 4.38814\,h^{4.67421}$.
I double-checked it using GraphPad... same result. GeoGebra gave almost the same:
$h = 0.7225\,m^{0.2159}$.
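The inversions quoted above are easy to check. In this small sketch the helper name `invert_power_law` is illustrative; it turns y = a·x^b into x = a_inv·y^b_inv.

```python
def invert_power_law(a, b):
    """Given y = a * x**b, return (a_inv, b_inv) such that x = a_inv * y**b_inv."""
    return a ** (-1.0 / b), 1.0 / b


# Inverse of the mass-on-height fit: what the switched fit "should" have given
print(invert_power_law(19.331, 2.1084))

# The switched fit that was actually obtained, turned back into mass versus height
print(invert_power_law(0.72877, 0.21394))
```

The first call gives an exponent of about 0.474, more than twice the 0.214 that the switched fit actually produced.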
This is not just a small difference, like a ‘rounding error’ or so. This is obviously shocking and
dramatic!
Is it possible that nobody ever noticed this? Or that nobody cared? Well, after a long search, I
found some people who made the same observation, but I couldn’t find any textbooks or
software manuals mentioning this problem.
I experimented with other data and other invertible functions. The same happens every time.
Solution: multidirectional regression
The classical regression, minimizing the squares of the vertical distances (or residuals $r_{y,i}$, see
the graph below), seems to pull the line through a cloud of points too much horizontally.
Minimizing the horizontal distances $r_{x,i}$ (i.e. by switching x and y) pulls the line too much
vertically.
For reasons of symmetry, there is no reason to favor either of the two if f is a bijection.
Therefore it seems only logical to give the two ‘pulling forces’ equal rights, and minimize this
sum:

$$S = \sum_{i=1}^{n} \left[ \frac{\left(y_i - f(x_i)\right)^2}{\sigma_{y,i}^2} + \frac{\left(x_i - f^{-1}(y_i)\right)^2}{\sigma_{x,i}^2} \right]$$
I implemented this in a Windows program called ‘FittingKVdm, version 1.0’, and I would call it
‘multidirectional regression’, or shorter ‘xy-fitting’, abbreviated ‘MDLS’ (as opposed to ‘OLS’ =
‘Ordinary Least Squares’). It can be expanded in multiple directions of course, if there are
more variables.
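The sum to be minimized can be sketched for the power function y = a·x^b as follows. This is not the FittingKVdm source code, only a minimal illustration with made-up data and illustrative names. It also checks the symmetry numerically: evaluating the sum for the switched data with the inverted parameters gives the same number, so a minimizer of one is automatically a minimizer of the other. The minimization itself (any general-purpose optimizer would do) is left out.

```python
import math


def mdls_sum(a, b, xs, ys, sx, sy):
    """Weighted squared vertical plus horizontal residuals for y = a * x**b."""
    total = 0.0
    for x, y, ex, ey in zip(xs, ys, sx, sy):
        ry = y - a * x ** b              # vertical residual
        rx = x - (y / a) ** (1.0 / b)    # horizontal residual, via the inverse function
        total += (ry / ey) ** 2 + (rx / ex) ** 2
    return total


# Made-up data: heights (m) with 1 cm error, masses (kg) with 0.5 kg error
h = [1.60, 1.70, 1.80, 1.90]
m = [48.0, 57.0, 68.0, 79.0]
sh = [0.01] * 4
sm = [0.5] * 4

a, b = 11.5, 3.0                       # some trial parameters for m = a * h**b
a_inv, b_inv = a ** (-1.0 / b), 1.0 / b  # same curve written as h = a_inv * m**b_inv

s_forward = mdls_sum(a, b, h, m, sh, sm)
s_switched = mdls_sum(a_inv, b_inv, m, h, sm, sh)
print(s_forward, s_switched, math.isclose(s_forward, s_switched, rel_tol=1e-9))
```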
Fitting the same data now yielded:
$m = a\,h^b$, with a = 10.482 ± 0.058*, b = 3.1581 ± 0.0092*.
That exponent b is a lot closer to 3, as we physicists always expected! And 3.1581 is
approximately equal to the geometric mean of 2.1084 and 4.67421, which makes sense.
Now again switching the variables, we obtain an exponent of 1/3.1581 as expected by the
symmetry!
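As a quick check of that remark, the geometric mean of the two one-directional exponents works out as:

```latex
\sqrt{2.1084 \times 4.67421} = \sqrt{9.8551} \approx 3.139
```

This is about 0.6 % below the fitted 3.1581, so the agreement is approximate rather than exact.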
(*The confidence intervals are estimated by doing 100 fittings with the data + random noise with
the same magnitude as the probable error on the measurements, i.e. the $x_i$ and $y_i$ values are
replaced by $x_i + g(\sigma_{x,i})$ and $y_i + g(\sigma_{y,i})$, g being a Gauss-distributed random number function.)
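The noise procedure described in this note can be sketched as follows. Several things here are assumptions made for the sake of a short, runnable example: the data are made up, the fit is a simple stand-in (vertical least squares with a fixed exponent, not the multidirectional fit), and the spread is reported as the sample standard deviation of the 100 results, which the text does not specify.

```python
import random
import statistics


def fit_a_fixed_b(xs, ys, b):
    """Stand-in fit: least-squares a for y = a * x**b with b fixed (vertical residuals)."""
    return sum(y * x ** b for x, y in zip(xs, ys)) / sum(x ** (2 * b) for x in xs)


def noise_interval(xs, ys, sx, sy, fit, runs=100, seed=0):
    """Refit `runs` noisy copies of the data; return mean and spread of the fitted value."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        xn = [x + rng.gauss(0.0, s) for x, s in zip(xs, sx)]
        yn = [y + rng.gauss(0.0, s) for y, s in zip(ys, sy)]
        results.append(fit(xn, yn))
    return statistics.mean(results), statistics.stdev(results)


# Made-up data following m = 12 * h**3, with assumed errors of 1 cm and 0.5 kg
heights = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0]
masses = [12.0 * h ** 3 for h in heights]
sigma_h = [0.01] * len(heights)
sigma_m = [0.5] * len(masses)

mean_a, spread_a = noise_interval(
    heights, masses, sigma_h, sigma_m, lambda x, y: fit_a_fixed_b(x, y, 3.0)
)
print(f"a = {mean_a:.3f} ± {spread_a:.3f}")
```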
Confirmation: relationship with body fat percentage
The same analysis was done for men and women (aged 16 and more) with different fat
percentages, first with parameters a and b floating, then with a fixed value of b=2, so in that
case ‘a’ represents the classic ‘BMI’, and then with b fixed to 3, so in this case ‘a’ represents
the ‘BMI with h³’ aka ‘CI’.
Men

| fat % (f) | n | a (a and b floating) | b (a and b floating) | a = ‘BMI’ (b = 2 fixed) | a = ‘BMI3’ = ‘CI’ (b = 3 fixed) |
|---|---|---|---|---|---|
| [10, 12[ | 9 | 9.800 | 3.192 | 19.318 ± 0.017 | 10.921 ± 0.012 |
| [12, 14[ | 169 | 10.84 | 3.093 | 20.5621 ± 0.0051 | 11.4527 ± 0.0036 |
| [14, 16[ | 344 | 7.695 | 3.761 | 21.2382 ± 0.0043 | 11.9307 ± 0.0029 |
| [16, 18[ | 423 | 10.82 | 3.186 | 21.4028 ± 0.0033 | 12.0430 ± 0.0022 |
| [18, 20[ | 467 | 10.28 | 3.337 | 22.3881 ± 0.0038 | 12.5053 ± 0.0022 |
| [20, 22[ | 528 | 9.587 | 3.540 | 23.2707 ± 0.0035 | 13.0892 ± 0.0022 |
| [22, 24[ | 562 | 10.24 | 3.543 | 24.5169 ± 0.0030 | 13.9356 ± 0.0022 |
| [24, 26[ | 806 | 11.81 | 3.366 | 25.8724 ± 0.0029 | 14.5751 ± 0.0022 |
| [26, 28[ | 960 | 12.47 | 3.326 | 26.6912 ± 0.0022 | 15.0336 ± 0.0019 |
| [28, 30[ | 943 | 12.10 | 3.480 | 28.0654 ± 0.0027 | 15.8937 ± 0.0019 |
| [30, 32[ | 837 | 11.80 | 3.600 | 29.3692 ± 0.0035 | 16.6075 ± 0.0021 |
| [32, 34[ | 721 | 12.15 | 3.628 | 30.8169 ± 0.0031 | 17.4054 ± 0.0024 |
| [34, 36[ | 501 | 10.31 | 4.016 | 32.9015 ± 0.0041 | 18.4909 ± 0.0030 |
| [36, 38[ | 351 | 10.55 | 4.070 | 34.6159 ± 0.0046 | 19.5014 ± 0.0035 |
| [38, 40[ | 224 | 9.109 | 4.485 | 38.3387 ± 0.0063 | 21.5662 ± 0.0041 |
| [40, 42[ | 118 | 8.993 | 4.567 | 39.8371 ± 0.0084 | 22.4331 ± 0.0067 |
| [42, 44[ | 51 | 10.89 | 4.419 | 44.184 ± 0.014 | 24.8267 ± 0.0099 |
| [44, 46[ | 15 | 35.67 | 2.355 | 43.407 ± 0.024 | 25.308 ± 0.021 |
| [46, 48[ | 9 | 10.52 | 4.735 | 46.445 ± 0.032 | 27.236 ± 0.024 |
| [48, 50[ | 1 | | | 43.711 ± 0.054 | 24.836 ± 0.049 |
When a and b are left free to be adjusted, the fitting is not very stable. We can presume that
this is because many other factors besides the body fat percentage play a role, like the body
type, muscle weight etc. The measurement points form clouds rather than precise lines.
Anyhow, the exponents are almost always more than 3. The average is even 3.668, and the
standard deviation 0.579. With fixed exponents, we see a very nice correlation of a versus the
fat percentage. The ‘goodness of fit’ (indicated by the χ² per degree of freedom, but also by the
estimated errors on the fitted parameters) was always better with b=3 than with b=2.
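The quoted average and standard deviation for the men can be reproduced from the 19 floating exponents listed in the table. In this small sketch the list is copied from the table, and the standard deviation that matches the quoted value is the population form (dividing by n), which is my reading rather than something the text states.

```python
import statistics

# Floating exponents b for men, in table order (the bin with n = 1 has no free fit)
b_men = [3.192, 3.093, 3.761, 3.186, 3.337, 3.540, 3.543, 3.366, 3.326, 3.480,
         3.600, 3.628, 4.016, 4.070, 4.485, 4.567, 4.419, 2.355, 4.735]

print(round(statistics.mean(b_men), 3))    # 3.668
print(round(statistics.pstdev(b_men), 3))  # 0.579
print(sum(b > 3 for b in b_men), "of", len(b_men), "exponents exceed 3")
```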
Women

| fat % (f) | n | a (a and b floating) | b (a and b floating) | a = ‘BMI’ (b = 2 fixed) | a = ‘BMI3’ = ‘CI’ (b = 3 fixed) |
|---|---|---|---|---|---|
| [16, 18[ | 2 | | | 17.209 ± 0.031 | 10.927 ± 0.023 |
| [18, 20[ | 13 | 6.472 | 4.187 | 18.000 ± 0.021 | 11.302 ± 0.015 |
| [20, 22[ | 32 | 7.763 | 3.807 | 18.834 ± 0.011 | 11.5395 ± 0.0076 |
| [22, 24[ | 51 | 9.110 | 3.469 | 18.869 ± 0.092 | 11.5032 ± 0.0072 |
| [24, 26[ | 124 | 9.542 | 3.479 | 19.8013 ± 0.0077 | 12.0853 ± 0.0054 |
| [26, 28[ | 201 | 9.814 | 3.461 | 20.4095 ± 0.0055 | 12.3724 ± 0.0037 |
| [28, 30[ | 275 | 11.08 | 3.314 | 21.1685 ± 0.0054 | 12.9417 ± 0.0040 |
| [30, 32[ | 383 | 13.82 | 2.914 | 21.6030 ± 0.0039 | 13.2474 ± 0.0027 |
| [32, 34[ | 459 | 11.48 | 3.389 | 23.1021 ± 0.0046 | 13.9777 ± 0.0030 |
| [34, 36[ | 562 | 9.557 | 3.878 | 24.5112 ± 0.0042 | 14.8582 ± 0.0028 |
| [36, 38[ | 720 | 11.03 | 3.707 | 25.5553 ± 0.0031 | 15.6460 ± 0.0028 |
| [38, 40[ | 738 | 12.49 | 3.570 | 26.9707 ± 0.0030 | 16.5106 ± 0.0025 |
| [40, 42[ | 892 | 15.12 | 3.316 | 28.7631 ± 0.0030 | 17.6573 ± 0.0022 |
| [42, 44[ | 896 | 11.49 | 4.014 | 30.6793 ± 0.0030 | 18.8671 ± 0.0023 |
| [44, 46[ | 756 | 14.76 | 3.606 | 32.4786 ± 0.0032 | 19.8652 ± 0.0029 |
| [46, 48[ | 583 | 12.14 | 4.156 | 34.6952 ± 0.0039 | 21.3284 ± 0.0034 |
| [48, 50[ | 404 | 12.94 | 4.199 | 38.0436 ± 0.0049 | 23.2991 ± 0.0038 |
| [50, 52[ | 255 | 11.62 | 4.590 | 41.1029 ± 0.0074 | 25.2133 ± 0.0057 |
| [52, 54[ | 99 | 10.22 | 5.061 | 46.046 ± 0.011 | 28.013 ± 0.011 |
| [54, 56[ | 26 | 10.84 | 5.096 | 47.842 ± 0.027 | 29.441 ± 0.020 |
| [56, 58[ | 4 | 20.06 | 3.977 | 52.369 ± 0.048 | 32.142 ± 0.036 |
We can make the same remarks for the women. The average b is even higher here: 3.859, and
the standard deviation 0.556.
If the body mass is better correlated with h³ than with h², as it was found, we would also
expect that the body fat percentage (f) is better correlated with the CI (=BMI3) than with the
classic BMI.
The graphs below show CI vs f for men (from the table above). The relationship is clearly not
linear, but it seems to follow an exponential pattern. The dotted lines are ‘worst case
scenarios’, with parameters at the limits of their confidence intervals.
The curve through the data points is the best fitting exponential function $B = b\,a^f + c$, with a, b and
c fitted parameters and B = ‘classic BMI’ or ‘CI’ (CI in the above graph). The graphs are similar
for men and women, for BMI and CI, but with different parameters of course. They are listed
here:
[Figure: Corpulence Index vs Body Fat % (men 16 and older)]
[Figure: Corpulence Index vs Body Fat % (women 16 and older)]
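| | men aged 16 or more (n=8039), B = classic BMI | men aged 16 or more (n=8039), B = BMI3 = CI | women aged 16 or more (n=7475), B = classic BMI | women aged 16 or more (n=7475), B = BMI3 = CI |
|---|---|---|---|---|
| a | 1.0424 ± 0.0011 | 1.0433 ± 0.0011 | 1.0671 ± 0.0019 | 1.0687 ± 0.0019 |
| b | 5.08 ± 0.26 | 2.77 ± 0.14 | 0.97 ± 0.081 | 0.564 ± 0.050 |
| c | 11.42 ± 0.37 | 6.48 ± 0.21 | 14.7 ± 0.31 | 9.15 ± 0.16 |
| χ²x per degree of freedom | 1.79432 | 1.5959 | 0.581528 | 0.444813 |
| χ²y per degree of freedom | 11953.5 | 6344.76 | 3654.17 | 2086.1 |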
By comparing the χ² values (in both x and y directions), we see that the body fat percentage (f)
is better correlated with the CI than with the BMI.
We see that the Corpulence Index can be estimated from the body fat % (f), using:
‘CI’ = $2.77 \times 1.0433^f + 6.43$ (men)
‘CI’ = $0.564 \times 1.0687^f + 9.15$ (women)
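Reading the two formulas above as CI = b·a^f + c, they can be wrapped in a pair of small helper functions. This is only a sketch: the function names are illustrative, the numbers are copied from the formulas as printed, and the inverse is simply the algebraic inversion of the fitted curve, not a validated way to estimate body fat.

```python
import math

# (b, a, c) in CI = b * a**f + c, copied from the two formulas above
PARAMS = {"men": (2.77, 1.0433, 6.43), "women": (0.564, 1.0687, 9.15)}


def ci_from_fat(fat_percent, sex):
    """Corpulence Index (kg/m^3) predicted by the fitted exponential curve."""
    b, a, c = PARAMS[sex]
    return b * a ** fat_percent + c


def fat_from_ci(ci, sex):
    """Algebraic inverse of ci_from_fat; only defined for ci > c."""
    b, a, c = PARAMS[sex]
    return math.log((ci - c) / b) / math.log(a)


for sex in ("men", "women"):
    for f in (15, 25, 35):
        print(sex, f, round(ci_from_fat(f, sex), 2))
```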
Conclusion
I hope I have awakened your interest and you will be curious to test multidirectional
regression with your own data. Use it whenever the dependent and independent variables can
be switched. The necessary software ‘FittingKVdm’ can be downloaded from:
www.lerenisplezant.be/fitting.htm and a 25-day free trial is possible. More examples of the
method and some practical thoughts can be found in another text [Van de moortel, April 2021].
I also hope it has become clear now that there is no reason anymore to use the classic BMI.
The ‘BMI’ with h³, aka Corpulence Index should logically be a better estimator for all kinds of
health issues. Maybe the theoretical exponent of h should even be bigger than 3, as suggested
by the empirical evidence. Also the reason why the relationship between body fat and the CI
seems to be exponential, is a puzzle to be solved by biologists. Maybe they already did, but my
main point here is to remark that the correlation is better with m/h³ than with m/h².
All your remarks and suggestions are most welcome, of course.
References:
Babar, Sultan (March 2015): “Evaluating the Performance of 4 Indices in Determining
Adiposity”. Clinical Journal of Sport Medicine, Lippincott Williams & Wilkins, 25 (2): 183.
Rohrer, Fritz (1921): “Der Index der Körperfülle als Maß des Ernährungszustandes”. Münchner
Med. WSCHR. 68: 580–582.
Trefethen, Nick (2013) on his own website: people.maths.ox.ac.uk/trefethen/bmi.html
Van de moortel, Koen (February 2021): “Non-linear regression - Why you shouldn’t take the
logarithms of your variables”, DOI: 10.13140/RG.2.2.18442.80324
www.researchgate.net/publication/349324179_Non-linear_regression_-Why_you_shouldn't_take_the_logarithms_of_your_variables
Van de moortel, Koen (April 2021): “Multidirectional regression analysis”, DOI:
10.13140/RG.2.2.16703.64162
www.researchgate.net/publication/350838636_Multidirectional_regression_analysis
Also published here:
http://www.physicsjournal.net/article/view/30/3-3-11
Conflict of interest:
The author declares no conflict of interest.