Archives AI News

Non-Linear Counterfactual Aggregate Optimization

arXiv:2509.03438v1 Announce Type: new Abstract: We consider the problem of directly optimizing a non-linear function of an outcome, where this outcome itself is the sum of many small contributions. The non-linearity of the function means that the problem is not equivalent to the maximization of the expectation of the individual contribution. By leveraging the concentration properties of the sum of individual outcomes, we derive a scalable descent algorithm that directly optimizes for our stated objective. This allows for instance to maximize the probability of successful A/B test, for which it can be wiser to target a success criterion, such as exceeding a given uplift, rather than chasing the highest expected payoff.

Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

arXiv:2509.03456v1 Announce Type: new Abstract: Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, we argue this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and extensive empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as action spaces become large. We demonstrate that simpler weighted log-likelihood objectives enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.

Understanding and Improving the Shampoo Optimizer via Kullback-Leibler Minimization

arXiv:2509.03378v1 Announce Type: new Abstract: As an adaptive method, Shampoo employs a structured second-moment estimation, and its effectiveness has attracted growing attention. Prior work has primarily analyzed its estimation scheme through the Frobenius norm. Motivated by the natural connection between the second moment and a covariance matrix, we propose studying Shampoo's estimation as covariance estimation through the lens of Kullback-Leibler (KL) minimization. This alternative perspective reveals a previously hidden limitation, motivating improvements to Shampoo's design. Building on this insight, we develop a practical estimation scheme, termed KL-Shampoo, that eliminates Shampoo's reliance on Adam for stabilization, thereby removing the additional memory overhead introduced by Adam. Preliminary results show that KL-Shampoo improves Shampoo's performance, enabling it to stabilize without Adam and even outperform its Adam-stabilized variant, SOAP, in neural network pretraining.

Bayesian Additive Regression Trees for functional ANOVA model

arXiv:2509.03317v1 Announce Type: new Abstract: Bayesian Additive Regression Trees (BART) is a powerful statistical model that leverages the strengths of Bayesian inference and regression trees. It has received significant attention for capturing complex non-linear relationships and interactions among predictors. However, the accuracy of BART often comes at the cost of interpretability. To address this limitation, we propose ANOVA Bayesian Additive Regression Trees (ANOVA-BART), a novel extension of BART based on the functional ANOVA decomposition, which is used to decompose the variability of a function into different interactions, each representing the contribution of a different set of covariates or factors. Our proposed ANOVA-BART enhances interpretability, preserves and extends the theoretical guarantees of BART, and achieves superior predictive performance. Specifically, we establish that the posterior concentration rate of ANOVA-BART is nearly minimax optimal, and further provides the same convergence rates for each interaction that are not available for BART. Moreover, comprehensive experiments confirm that ANOVA-BART surpasses BART in both accuracy and uncertainty quantification, while also demonstrating its effectiveness in component selection. These results suggest that ANOVA-BART offers a compelling alternative to BART by balancing predictive accuracy, interpretability, and theoretical consistency.

Scale-Adaptive Generative Flows for Multiscale Scientific Data

arXiv:2509.02971v1 Announce Type: new Abstract: Flow-based generative models can face significant challenges when modeling scientific data with multiscale Fourier spectra, often producing large errors in fine-scale features. We address this problem within the framework of stochastic interpolants, via principled design of noise distributions and interpolation schedules. The key insight is that the noise should not be smoother than the target data distribution -- measured by Fourier spectrum decay rates -- to ensure bounded drift fields near the initial time. For Gaussian and near-Gaussian distributions whose fine-scale structure is known, we show that spectrum-matched noise improves numerical efficiency compared to standard white-noise approaches. For complex non-Gaussian distributions, we develop scale-adaptive interpolation schedules that address the numerical ill-conditioning arising from rougher-than-data noise. Numerical experiments on synthetic Gaussian random fields and solutions to the stochastic Allen-Cahn and Navier-Stokes equations validate our approach and demonstrate its ability to generate high-fidelity samples at lower computational cost than traditional approaches.

Inference on covariance structure in high-dimensional multi-view data

arXiv:2509.02772v1 Announce Type: cross Abstract: This article focuses on covariance estimation for multi-view data. Popular approaches rely on factor-analytic decompositions that have shared and view-specific latent factors. Posterior computation is conducted via expensive and brittle Markov chain Monte Carlo (MCMC) sampling or variational approximations that underestimate uncertainty and lack theoretical guarantees. Our proposed methodology employs spectral decompositions to estimate and align latent factors that are active in at least one view. Conditionally on these factors, we choose jointly conjugate prior distributions for factor loadings and residual variances. The resulting posterior is a simple product of normal-inverse gamma distributions for each variable, bypassing MCMC and facilitating posterior computation. We prove favorable increasing-dimension asymptotic properties, including posterior contraction and central limit theorems for point estimators. We show excellent performance in simulations, including accurate uncertainty quantification, and apply the methodology to integrate four high-dimensional views from a multi-omics dataset of cancer cell samples.

A Composite-Loss Graph Neural Network for the Multivariate Post-Processing of Ensemble Weather Forecasts

arXiv:2509.02784v1 Announce Type: cross Abstract: Ensemble forecasting systems have advanced meteorology by providing probabilistic estimates of future states, supporting applications from renewable energy production to transportation safety. Nonetheless, systematic biases often persist, making statistical post-processing essential. Traditional parametric post-processing techniques and machine learning-based methods can produce calibrated predictive distributions at specific locations and lead times, yet often struggle to capture dependencies across forecast dimensions. To address this, multivariate post-processing methods-such as ensemble copula coupling and the Schaake shuffle-are widely applied in a second step to restore realistic inter-variable or spatio-temporal dependencies. The aim of this study is the multivariate post-processing of ensemble forecasts using a graph neural network (dualGNN) trained with a composite loss function that combines the energy score (ES) and the variogram score (VS). The method is evaluated on two datasets: WRF-based solar irradiance forecasts over northern Chile and ECMWF visibility forecasts for Central Europe. The dualGNN consistently outperforms all empirical copula-based post-processed forecasts and shows significant improvements compared to graph neural networks trained solely on either the continuous ranked probability score (CRPS) or the ES, according to the evaluated multivariate verification metrics. Furthermore, for the WRF forecasts, the rank-order structure of the dualGNN forecasts captures valuable dependency information, enabling a more effective restoration of spatial relationships than either the raw numerical weather prediction ensemble or historical observational rank structures. By contrast, for the visibility forecasts, the GNNs trained on CRPS, ES, or the ES-VS combination outperform the calibrated reference.

Revisiting Clustering of Neural Bandits: Selective Reinitialization for Mitigating Loss of Plasticity

arXiv:2506.12389v2 Announce Type: replace-cross Abstract: Clustering of Bandits (CB) methods enhance sequential decision-making by grouping bandits into clusters based on similarity and incorporating cluster-level contextual information, demonstrating effectiveness and adaptability in applications like personalized streaming recommendations. However, when extending CB algorithms to their neural version (commonly referred to as Clustering of Neural Bandits, or CNB), they suffer from loss of plasticity, where neural network parameters become rigid and less adaptable over time, limiting their ability to adapt to non-stationary environments (e.g., dynamic user preferences in recommendation). To address this challenge, we propose Selective Reinitialization (SeRe), a novel bandit learning framework that dynamically preserves the adaptability of CNB algorithms in evolving environments. SeRe leverages a contribution utility metric to identify and selectively reset underutilized units, mitigating loss of plasticity while maintaining stable knowledge retention. Furthermore, when combining SeRe with CNB algorithms, the adaptive change detection mechanism adjusts the reinitialization frequency according to the degree of non-stationarity, ensuring effective adaptation without unnecessary resets. Theoretically, we prove that SeRe enables sublinear cumulative regret in piecewise-stationary environments, outperforming traditional CNB approaches in long-term performances. Extensive experiments on six real-world recommendation datasets demonstrate that SeRe-enhanced CNB algorithms can effectively mitigate the loss of plasticity with lower regrets, improving adaptability and robustness in dynamic settings.

A proximal augmented Lagrangian method for nonconvex optimization with equality and inequality constraints

arXiv:2509.02894v1 Announce Type: cross Abstract: We propose an inexact proximal augmented Lagrangian method (P-ALM) for nonconvex structured optimization problems. The proposed method features an easily implementable rule not only for updating the penalty parameters, but also for adaptively tuning the proximal term. It allows the penalty parameter to grow rapidly in the early stages to speed up progress, while ameliorating the issue of ill-conditioning in later iterations, a well-known drawback of the traditional approach of linearly increasing the penalty parameters. A key element in our analysis lies in the observation that the augmented Lagrangian can be controlled effectively along the iterates, provided an initial feasible point is available. Our analysis, while simple, provides a new theoretical perspective about P-ALM and, as a by-product, results in similar convergence properties for its non-proximal variant, the classical augmented Lagrangian method (ALM). Numerical experiments, including convex and nonconvex problem instances, demonstrate the effectiveness of our approach.

Dynamical Decoupling of Generalization and Overfitting in Large Two-Layer Networks

arXiv:2502.21269v2 Announce Type: replace Abstract: Understanding the inductive bias and generalization properties of large overparametrized machine learning models requires to characterize the dynamics of the training algorithm. We study the learning dynamics of large two-layer neural networks via dynamical mean field theory, a well established technique of non-equilibrium statistical physics. We show that, for large network width, the training dynamics exhibits a separation of timescales which implies: $(i)$ The emergence of a slow time scale associated with the growth in Gaussian/Rademacher complexity of the network; $(ii)$ Inductive bias towards small complexity if the initialization has small enough complexity; $(iii)$ A dynamical decoupling between feature learning and overfitting regimes; $(iv)$ A non-monotone behavior of the test error, associated `feature unlearning' regime at large times.