Knowing the number of floors of all buildings in a city is vital in many areas of urban planning, such as energy demand prediction, estimation of the number of inhabitants of specific buildings, or the calculation of population densities. Novel augmented reality use cases also rely strongly on exact numbers and positions of floors. However, in many cases floor numbers are unknown, their collection is mostly a manual process, or existing data is not up to date. A major difficulty in automating floor counting lies in the architectural variety of buildings from different decades. Earlier approaches provide only rough geometric approximations. More recent approaches apply neural networks to achieve more precise results. Yet these neural network approaches rely on various sources of input that are not available to every municipality. They also tend to fail on building types they have not been trained on, and existing approaches are completely black-box, so it is difficult to determine when and why a prediction is wrong.
In this paper, we propose a grey-box approach. In a stepwise process, we predict floor counts with high quality while remaining explainable and parametrizable. Using data that is easy to obtain, namely the image of a building, we introduce two configurable methods to derive the number of floors. We demonstrate that the prediction quality can be significantly improved. In a thorough evaluation, we analyze the quality depending on a number of factors such as image quality or building type.