A New Flexible Train-Test Split Algorithm, an approach for choosing among the Hold-out, K-fold cross-validation, and Hold-out iteration

2026-01-04 20:00 GMT · aimagpro.com

arXiv:2501.06492v2 Announce Type: replace
Abstract: Choosing an appropriate strategy for partitioning data into training and evaluation sets is a critical step in machine learning, yet validation methods are often selected with default or conventional settings, without considering their impact on generalizability and real-world performance. Common approaches such as hold-out validation or k-fold cross-validation with a fixed k are frequently applied out of established practice rather than deliberate choice.
To address this issue, we propose a flexible Python-based framework that systematically examines how different validation strategies affect predictive performance across seven widely used machine learning algorithms, including Decision Trees, K-Nearest Neighbors, Naive Bayes variants, Logistic Regression, calibrated linear Support Vector Machines, and histogram-based gradient boosting. The framework evaluates these methods under a wide range of validation schemes, including hold-out splits from 10% to 90%, k-fold cross-validation with k between 3 and 15, repeated hold-out, and nested cross-validation.
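The announcement does not include the framework's code, so the following is only a minimal sketch of how such a sweep might be organized with scikit-learn. The built-in breast-cancer dataset stands in for the biomedical data, and the model list, ROC-AUC scoring, and grid values (test fractions, k values) are illustrative assumptions rather than the authors' exact configuration.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; the paper evaluates three biomedical datasets not shipped here.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "gaussian_nb": GaussianNB(),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "calibrated_linear_svm": make_pipeline(
        StandardScaler(), CalibratedClassifierCV(LinearSVC(max_iter=10000))
    ),
    "hist_gradient_boosting": HistGradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Hold-out validation with test fractions spanning 10% to 90%.
    for test_size in (0.1, 0.3, 0.5, 0.7, 0.9):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=0
        )
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        results[(name, f"holdout_test={test_size}")] = auc

    # k-fold cross-validation with k ranging from 3 to 15.
    for k in (3, 5, 10, 15):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        results[(name, f"{k}-fold")] = scores.mean()

for (name, scheme), auc in sorted(results.items()):
    print(f"{name:24s} {scheme:18s} ROC-AUC={auc:.3f}")

Repeated hold-out and nested cross-validation would follow the same pattern, wrapping the hold-out loop in repetitions or adding an inner tuning loop per outer fold.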
The framework is applied to three biomedical datasets of varying size, and performance is assessed using ROC-AUC, accuracy, and the Matthews correlation coefficient. The results show that no single validation strategy consistently outperforms the others across all algorithms and datasets, indicating that the optimal validation scheme depends on the interaction among the algorithm, dataset characteristics, and evaluation metric.
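For completeness, here is a small sketch of how the three reported metrics could be computed for a fitted binary classifier; the helper name evaluate_metrics and the assumption that the model exposes predict_proba are illustrative, not taken from the paper.

from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score

def evaluate_metrics(clf, X_test, y_test):
    """ROC-AUC, accuracy, and Matthews correlation coefficient for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
    return {
        "roc_auc": roc_auc_score(y_test, y_score),
        "accuracy": accuracy_score(y_test, y_pred),
        "mcc": matthews_corrcoef(y_test, y_pred),
    }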