Markus Pauly’s research while affiliated with TU Dortmund University and other places
The performance of Large Language Models, such as ChatGPT, generally increases with each new model release. In this paper, we investigate whether and how well different ChatGPT models solve the exams of three different logistics undergraduate courses. We contribute to the discussion of ChatGPT's existing logistics knowledge, particularly in the field of warehousing. Both the free version (GPT-4o mini) and the paid version (GPT-4o) completed three different logistics exams using three different prompting techniques (with and without role assignment as a logistics expert or student). The o1-preview model was also used (without role assignment) for six attempts. The tests were repeated three times, yielding a total of 60 tests, which were compared to the in-class results of logistics students. Of these 60 tests, 46 were passed, and the best attempt solved 93% of an exam correctly. Compared to the students from the respective semester, ChatGPT outperformed the students on one exam; on the other two exams, the students performed better on average.
Many planning and decision activities in logistics and supply chain management are based on forecasts of multiple time-dependent factors, so the quality of planning depends on the quality of the forecasts. We compare different state-of-the-art forecasting methods in terms of forecasting performance. Unlike most existing research in logistics, we do not do this in a case-dependent way but consider a broad set of simulated time series in order to give more general recommendations. We therefore simulate various linear and nonlinear time series reflecting different situations. Our simulation results show that the machine learning methods, especially Random Forests, performed particularly well in complex scenarios, with training on differenced time series significantly improving model robustness. In addition, the classical time series approaches proved competitive in low-noise scenarios.
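To make the differenced-series setup concrete, here is a minimal Python sketch of the general idea (not the authors' simulation code): a Random Forest is trained on lagged values of the first-differenced series and used for a one-step-ahead forecast. The trend-plus-AR(1) generator, series length, and lag order are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Illustrative simulated series: linear trend plus AR(1) noise.
n = 300
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.6 * noise[t - 1] + rng.normal(scale=1.0)
y = 0.05 * np.arange(n) + noise

# First-difference the series so the forest only has to learn changes.
dy = np.diff(y)

# Supervised learning table: predict the next difference from the last p.
p = 5
X = np.column_stack([dy[i:len(dy) - p + i] for i in range(p)])
target = dy[p:]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X, target)

# One-step-ahead forecast: predict the next difference, then undo
# the differencing to return to the original scale.
dy_hat = rf.predict(dy[-p:].reshape(1, -1))[0]
y_hat = y[-1] + dy_hat
print(f"one-step-ahead forecast: {y_hat:.3f}")
```

Differencing removes the trend, so the forest, which cannot extrapolate beyond the range of its training targets, only has to learn the roughly stationary dynamics of the changes.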
Random Forests have become a widely used tool in machine learning since their introduction in 2001, known for their strong performance in classification and regression tasks. One key feature of Random Forests is the Random Forest Permutation Importance Measure (RFPIM), an internal, non-parametric measure of variable importance. While widely used, theoretical work on RFPIM is sparse, and most research has focused on empirical findings. However, recent progress has been made, such as establishing consistency of the RFPIM, although a mathematical analysis of its asymptotic distribution is still missing. In this paper, we provide a formal proof of a Central Limit Theorem for RFPIM using U-Statistics theory. Our approach deviates from the conventional Random Forest model by assuming a random number of trees and imposing conditions on the regression functions and error terms, which must be bounded and additive, respectively. Our result aims to improve the theoretical understanding of RFPIM rather than to enable comprehensive hypothesis testing. However, our contributions provide a solid foundation and demonstrate the potential for future work to extend to practical applications, which we also illustrate with a small simulation study.
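As a rough illustration of the quantity being studied, the following sketch computes a permutation importance for a Random Forest fitted to an additive regression model with bounded errors. Note that scikit-learn's permutation_importance is evaluated on held-out data rather than on out-of-bag samples as in Breiman's original RFPIM, and the data-generating model here is an assumption for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Additive regression model with bounded errors, loosely mirroring the
# paper's assumptions: y = f1(x1) + f2(x2) + noise; x3 is pure noise.
n = 1000
X = rng.uniform(-1, 1, size=(n, 3))
y = np.sin(np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.uniform(-0.1, 0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Permutation importance: the drop in accuracy when one feature is shuffled.
result = permutation_importance(rf, X_test, y_test, n_repeats=20,
                                random_state=0)
for j in range(X.shape[1]):
    print(f"x{j + 1}: {result.importances_mean[j]:.4f} "
          f"(+/- {result.importances_std[j]:.4f})")
```

The informative features x1 and x2 should receive clearly positive importances, while the noise feature x3 should hover near zero.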
Ordinal data are frequently encountered, e.g., in the life and social sciences. Predicting ordinal outcomes can inform important decisions, e.g., in medicine or education. Two methodological streams tackle prediction of ordinal outcomes: Traditional parametric models, e.g., the proportional odds model (POM), and machine learning-based tree ensemble (TE) methods. A promising TE approach involves selecting the best performing from sets of randomly generated numeric scores assigned to ordinal response categories (ordinal forest; Hornung, 2019). We propose a new method, the ordinal score optimization algorithm, that takes a similar approach but selects scores through non-linear optimization. We compare these and other TE methods with the computationally much less expensive POM. Despite selective efforts, the literature lacks an encompassing simulation-based comparison. Aiming to fill this gap, we find that while TE approaches outperform the POM for strong non-linear effects, the latter is competitive for small sample sizes even under medium non-linear effects.
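The score-based idea can be sketched as follows: assign numeric scores to the ordered categories, fit a regression forest to those scores, and map continuous predictions back to the closest category. The snippet below mimics the ordinal forest's random search over candidate score sets; the proposed ordinal score optimization algorithm would instead choose the scores via non-linear optimization. All data and settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)

# Illustrative data: 4 ordered categories driven by a latent non-linear signal.
n, K = 600, 4
X = rng.normal(size=(n, 3))
latent = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
y = np.digitize(latent, np.quantile(latent, [0.25, 0.5, 0.75]))  # labels 0..3

def accuracy_for_scores(scores):
    """Fit a regression forest on numeric scores; return CV accuracy."""
    z = scores[y]  # replace each ordinal label by its numeric score
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    z_hat = cross_val_predict(rf, X, z, cv=3)
    # Map continuous predictions back to the closest category score.
    y_hat = np.abs(z_hat[:, None] - scores[None, :]).argmin(axis=1)
    return (y_hat == y).mean()

# Randomly generated candidate score sets (ordinal forest style).
candidates = [np.sort(rng.uniform(0, 1, size=K)) for _ in range(20)]
best = max(candidates, key=accuracy_for_scores)
print("selected scores:", np.round(best, 3))
```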
Comparing mean vectors across different groups is a cornerstone of multivariate statistics, with quadratic forms commonly serving as test statistics. However, when the overall hypothesis is rejected, identifying the specific vector components or the groups between which differences exist requires additional investigation. In contrast, multiple contrast tests (MCTs) allow conclusions about which components or groups contribute to these differences, but this comes with a trade-off, as MCTs lose some of the benefits inherent to quadratic forms. In this paper, we combine both approaches into a quadratic-form-based multiple contrast test that leverages the advantages of both. To understand its theoretical properties, we investigate its asymptotic distribution in a semiparametric model. We thereby focus on two common quadratic forms, the Wald-type statistic and the ANOVA-type statistic, although our findings apply to any quadratic form. Furthermore, we employ Monte Carlo and resampling techniques to enhance the tests' performance in small-sample scenarios. Through an extensive simulation study, we assess the performance of our proposed tests against existing alternatives, highlighting their advantages.
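For orientation, in generic notation (a sketch only; the paper's semiparametric setting and exact studentization may differ), the two quadratic forms for a null hypothesis H mu = 0, with estimator mu-hat and covariance estimator Sigma-hat, are typically written as:

```latex
% Wald-type statistic (WTS), with (.)^+ the Moore-Penrose inverse:
WTS_n = n \, (H\hat{\mu})^\top (H \hat{\Sigma} H^\top)^+ (H\hat{\mu})

% ANOVA-type statistic (ATS), with projection matrix T = H^\top (H H^\top)^+ H:
ATS_n = \frac{n}{\operatorname{tr}(T \hat{\Sigma})} \, \hat{\mu}^\top T \hat{\mu}
```

The WTS studentizes with the full estimated covariance of the contrasts, while the ATS only standardizes by a trace, which is why the two statistics behave differently in small samples.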
In times of ever faster learning machines, trust is an essential precondition for effective human-technology interaction (Rutinowski et al., 2024). We follow the definition of trust as the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of their means to monitor or control each other (de Visser et al., 2023). When individuals are not used to interacting with a particular technology, such as AI-assisted (aerial) vehicles, they may either over-trust or distrust such systems. The objective of our research is to understand the process of trust calibration over time and to contribute to the "Theory of the Artificial Mind" (ToAM, e.g. Spector-Precel & Mioduser, 2015) within the context of work scenarios assumed to be crucial in developing an appropriate trust level. If humans do not trust an AI-assisted system, they will not be willing to delegate tasks to it, regardless of its technical performance potential. Conversely, over-reliance on the system can lead to human complacency and a failure to detect AI system errors (Alonso and De La Puente, 2018). The miscalibration of human trust in AI, and the resulting inappropriate reliance on an AI system agent, decreases performance in human-AI teams (Alonso and De La Puente, 2018). Thus, trust is a key factor in human-agent interaction and influences whether a human will rely on the system (Lyons, 2013). Trust in AI systems is closely related to the understandability and predictability of a system's behaviour (Akula et al., 2019; Rutinowski et al., 2024). Trust in AI-assisted systems occurs when users believe an AI system will work as intended, understand its strengths and limitations, and rely on or intend to rely on it.
The process of trust calibration consists of updating the trust stance by aligning the perception of an actor's trustworthiness with its actual trustworthiness so that the prediction error is minimized. Aspects of the calibrated trust dimension include the human's belief that an AI system will accomplish the task it was designed for, as well as an understanding of its errors and limitations (based on McDermott et al., 2019). Trustworthiness is a "felt" consequence of human-AI interaction and a fluid process that is shaped at calibration points. One such calibration point is a trust violation followed by a subsequent trust repair. In interactions with unfamiliar AI technology, trust initially depends on person-related variables such as prior experience with a similar technology, technical knowledge, and the individual's trust propensity. During further interaction, trust is then calibrated.
To empirically test the assumptions about the relationship between a human ToAM and its predictions, we designed the use case "Warehouse Drones". Technological advancements of aerial drones for such indoor applications receive a lot of attention from the robotics community (Awasthi et al., 2023). As of now, however, they are not common in industrial settings, because challenges like indoor localization, collision avoidance, and battery runtime are not yet solved to a degree that would allow for a large-scale roll-out.
Twenty-one participants took part in the study (approved by the local ethics committee, No. 812, with informed consent obtained from all participants) in August 2024 in the research hall of the Chair of Material Handling and Warehousing at TU Dortmund University. Two participants (participants 17 and 20) were excluded from the analysis due to incorrect execution of the experiment, leaving measurements from 19 participants (mean age M = 28.89 years, SD = 9.71; 8 female, 11 male).
The research hall is a testbed for swarms of unmanned aerial vehicles (Gramse et al., 2023). It is equipped with a marker-based motion-capturing system that tracks all moving entities (humans, drones, boxes, etc.) at up to 300 Hz with an accuracy of less than 1 mm. In addition, RGB-D cameras are deployed for video recording. The aerial drones were built in-house in cooperation with the Fraunhofer Institute for Material Flow and Logistics. Common warehouse equipment such as boxes, racks, and forklift trucks is available to create logistical scenarios that closely resemble real-world systems.
When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any of the groups, but also in where these differences lie. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must either apply corrections or use tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated [1], and (ii) the correction generally leads to lower power, especially when the test statistics are not independent [2]. We propose two new tests based on combined weighted log-rank tests: one is a simple multiple contrast test of weighted log-rank tests, and the other extends the so-called CASANOVA test [3], which was originally introduced for factorial designs, into a multiple contrast test. Our tests promise to be more powerful under crossing hazards and eliminate the need for additional p-value correction. We assess their performance through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios, and finally apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios; in some non-proportional settings they outperform the adjusted approaches in terms of power.
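To illustrate one building block of the combined tests (not the multiple contrast procedure itself), here is a minimal numpy sketch of a two-sample Fleming-Harrington G(p, q)-weighted log-rank statistic. The weight exponents, sample sizes, and uncensored exponential data are illustrative assumptions.

```python
import numpy as np

def weighted_logrank(time, event, group, p=0.0, q=1.0):
    """Two-sample Fleming-Harrington G(p, q)-weighted log-rank statistic.

    p = q = 0 recovers the standard log-rank test; q > 0 up-weights
    late differences, which helps under crossing hazards.
    """
    order = np.argsort(time)
    time, event, group = time[order], event[order], group[order]
    S_prev, Z, var = 1.0, 0.0, 0.0  # pooled left-continuous KM estimate S(t-)
    for t in np.unique(time[event == 1]):
        at_risk = time >= t
        n = at_risk.sum()                       # total number at risk
        n1 = (at_risk & (group == 1)).sum()     # group-1 number at risk
        d = ((time == t) & (event == 1)).sum()  # total events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        w = S_prev**p * (1.0 - S_prev)**q       # FH weight based on S(t-)
        Z += w * (d1 - d * n1 / n)              # observed minus expected
        if n > 1:                               # hypergeometric variance
            var += w**2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        S_prev *= 1.0 - d / n                   # update the KM estimate
    return Z / np.sqrt(var)                     # approx. N(0, 1) under H0

rng = np.random.default_rng(3)
t1, t2 = rng.exponential(1.0, 100), rng.exponential(1.3, 100)
time = np.concatenate([t1, t2])
event = np.ones(200, dtype=bool)  # no censoring, for simplicity
group = np.repeat([0, 1], 100)
print(f"standardized statistic: {weighted_logrank(time, event, group):.3f}")
```

The proposed procedures combine several such weighted statistics (e.g., with different p, q choices) into one multiple contrast test, so that no separate p-value correction is needed.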
... (Kane, Price, Scotch, & Rabinowitz, 2014) (Hossain, 2023) (Noureen, Atique, Roy, & Bayne, 2019). This model also outperformed the ARIMA and SARIMA models in predictive ability in the paper of Schmid, Roidl, and Pauly (2024). On the other hand, the random forest model presented in the paper of Nerjaku and Sinaj (2024) demonstrated low performance in predicting non-performing loans in Albania. In addition to the random forest model, the values of GDP are also predicted using the SARIMA and SARIMAX models. ...
... Additionally, Yu et al. (2012) propose a forecasting model to predict fashion color trends, leading to a higher success rate of new fashion products, and an example of how resource allocation can be improved through increased operational efficiency enabled by more accurate short- to mid-term demand forecasts is provided by Pauly and Kuhlmann (2023). Furthermore, forecasts play an important role in anticipating technological change (Foster, 1986; Modis, 1999) and estimating technology and product life cycles, which inform important portfolio and product development decisions (Modis, 1994; Petropoulos et al., 2022; Steinmeister et al., 2023). This is particularly true for the semiconductor industry, with its long lead times, dynamic technological environment, and shortening product life cycles (Lv et al., 2018; Macher, 2006; Wu & Chien, 2008). ...
... The studies presented offer novel methodologies that leverage sequential statistical tests and adversarial approaches to enhance the efficiency and reliability of model selection and evaluation processes. Buczak et al. (2024) introduce sequential random search (SQRS), a novel approach to hyperparameter tuning that incorporates sequential statistical tests to eliminate underperforming configurations early. Demonstrated across various datasets, SQRS achieves comparable optimization results with fewer evaluations, showcasing the efficiency and potential of integrating sequential testing into machine learning workflows. ...
... There are additional concerns about the accuracy and reliability of the content generated using ChatGPT and its use in education (Gill et al., 2024). Generative AI-based content is not free from biases, and such biased content can thus interfere with academic learning if tools such as ChatGPT are used as a knowledge source (Motoki et al., 2023; Ray, 2023; Rutinowski et al., 2024). Therefore, a thoughtful and responsible approach to the integration of AI tools is necessary to ensure their effective and ethical use in enhancing the learning experience for students. ...
... But vectorized correlation matrices r (see, e.g., Sattler and Pauly, 2024) or coefficient vectors of regression models (Fahrmeir et al., 2013) are also feasible. In survival analysis, the restricted mean survival time is a viable parameter as well (Munko et al., 2024). ...
... An effective statistical method for determining the significance of variations in more than two dependent variable means is multivariate analysis of variance (MANOVA) (Jadhav and Dolas 2023; Pursitasari et al. 2024). Whereas classic MANOVA methodologies rely on assumptions like normality and homogeneity, recent advancements have focused on more adaptable and universal techniques, such as using general quantiles like the median for a robust analysis (Baumeister et al. 2024). When assumptions such as homogeneity of variance are violated, Pillai's trace test is the recommended choice (Jadhav and Dolas 2023). ...
... As we aim for a fair and realistic method comparison, we take up the recent suggestion to follow a mixed approach between simulation and benchmarking, in which the simulation model is motivated by real data (see, e.g., Friedrich and Friede, 2023, and Thurow et al., 2023). To this end, our replications are closely based on the National Education Panel Study for the Starting Cohort Adults (NEPS, SC6: 13.0.0, ...
... Precise forecasting allows warehouse managers to optimize space use, reduce stock-out risk, and improve overall efficiency [6,7]. In supply chain management, accurate forecasts are, for example, used to optimize resource use across the entire supply chain [8][9][10]. The above references show that the use of forecasting techniques such as time series models and machine learning methods has become increasingly popular in logistics in recent years. ...
... This suggests that the observed trend might be explained by a neglected interaction of those variables. In fact, Knop et al. (2023) showed in a re-analysis of the above-mentioned data that it is vitally important to account for confounding and interaction effects when making inferences based on meta-regression with multiple moderators. Note that the importance of investigating interactions in meta-regression was already acknowledged by Li et al. (2017). ...
... At the same time, background knowledge exists in many domains that can help to compensate for the potential shortcomings of datasets. For instance, domain experts have an understanding of the causal relationships in the data-generation process [4]. The scope of this paper is to unify expert knowledge and datasets with missing data in order to derive approximations of the underlying joint distribution. ...