Archives AI News

Survival Analysis with Adversarial Regularization

arXiv:2312.16019v5 Announce Type: replace Abstract: Survival Analysis (SA) models the time until an event occurs, with applications in fields like medicine, defense, finance, and aerospace. Recent research indicates that Neural Networks (NNs) can effectively capture complex data patterns in SA, whereas simple generalized linear models often fall short in this regard. However, dataset uncertainties (e.g., noisy measurements, human error) can degrade NN model performance. To address this, we leverage advances in NN verification to develop training objectives for robust, fully-parametric SA models. Specifically, we propose an adversarially robust loss function based on a Min-Max optimization problem. We employ CROWN-Interval Bound Propagation (CROWN-IBP) to tackle the computational challenges inherent in solving this Min-Max problem. Evaluated over 10 SurvSet datasets, our method, Survival Analysis with Adversarial Regularization (SAWAR), consistently outperforms baseline adversarial training methods and state-of-the-art (SOTA) deep SA models across various covariate perturbations with respect to Negative Log Likelihood (NegLL), Integrated Brier Score (IBS), and Concordance Index (CI) metrics. Thus, we demonstrate that adversarial robustness enhances SA predictive performance and calibration, mitigating data uncertainty and improving generalization across diverse datasets by up to 150% compared to baselines.
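
The min-max structure is easy to see in a toy setting. Below is a minimal PyTorch sketch that adversarially trains an exponential survival model's negative log-likelihood, approximating the inner maximization with a one-step gradient-sign attack rather than the CROWN-IBP bounds the paper actually uses; the model, dimensions, and hyperparameters are illustrative.

```python
import torch

# Toy fully-parametric survival model: covariates -> log event rate.
model = torch.nn.Sequential(
    torch.nn.Linear(5, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))

def neg_log_lik(log_rate, t, event):
    # Exponential lifetime model: events contribute log f(t) = log_rate - rate*t,
    # censored observations contribute log S(t) = -rate*t.
    rate = log_rate.exp()
    return -(event * log_rate - rate * t).mean()

def adversarial_loss(x, t, event, eps=0.1):
    # Inner max: worst-case covariate perturbation in an L-inf ball,
    # approximated here by one gradient-sign step; the paper instead
    # bounds the inner problem with CROWN-IBP.
    x_adv = x.clone().requires_grad_(True)
    g = torch.autograd.grad(
        neg_log_lik(model(x_adv).squeeze(-1), t, event), x_adv)[0]
    x_worst = (x + eps * g.sign()).detach()
    # Outer min: train on the worst-case covariates.
    return neg_log_lik(model(x_worst).squeeze(-1), t, event)

x, t = torch.randn(32, 5), torch.rand(32)
event = torch.randint(0, 2, (32,)).float()  # 1 = event observed, 0 = censored
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
opt.zero_grad()
adversarial_loss(x, t, event).backward()
opt.step()
```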

Inferring Change Points in High-Dimensional Regression via Approximate Message Passing

arXiv:2404.07864v3 Announce Type: replace Abstract: We consider the problem of localizing change points in a generalized linear model (GLM), a model that covers many widely studied problems in statistical learning including linear, logistic, and rectified linear regression. We propose a novel and computationally efficient Approximate Message Passing (AMP) algorithm for estimating both the signals and the change point locations, and rigorously characterize its performance in the high-dimensional limit where the number of parameters $p$ is proportional to the number of samples $n$. This characterization is in terms of a state evolution recursion, which allows us to precisely compute performance measures such as the asymptotic Hausdorff error of our change point estimates, and allows us to tailor the algorithm to take advantage of any prior structural information on the signals and change points. Moreover, we show how our AMP iterates can be used to efficiently compute a Bayesian posterior distribution over the change point locations in the high-dimensional limit. We validate our theory via numerical experiments, and demonstrate the favorable performance of our estimators on both synthetic and real data in the settings of linear, logistic, and rectified linear regression.
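
The AMP algorithm itself is beyond a short sketch, but the Hausdorff error used to evaluate the change-point estimates is simple to compute. A minimal sketch, with illustrative change-point locations:

```python
import numpy as np

def hausdorff_error(est, true):
    """Hausdorff distance between two sets of change-point locations."""
    est, true = np.asarray(est, float), np.asarray(true, float)
    d = np.abs(est[:, None] - true[None, :])  # pairwise |est_i - true_j|
    return max(d.min(axis=1).max(),   # worst estimate -> nearest true point
               d.min(axis=0).max())   # worst true point -> nearest estimate

# e.g. samples indexed 1..n with true changes at 300 and 700:
print(hausdorff_error([310, 695], [300, 700]))  # -> 10.0
```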

Probabilistic operator learning: generative modeling and uncertainty quantification for foundation models of differential equations

arXiv:2509.05186v1 Announce Type: new Abstract: In-context operator networks (ICON) are a class of operator learning methods built on the architectures of foundation models. Trained on a diverse set of datasets of initial and boundary conditions paired with corresponding solutions to ordinary and partial differential equations (ODEs and PDEs), ICON learns to map example condition-solution pairs of a given differential equation to an approximation of its solution operator. Here, we present a probabilistic framework that reveals ICON as implicitly performing Bayesian inference: it computes the mean of the posterior predictive distribution over solution operators conditioned on the provided context, i.e., the example condition-solution pairs. The formalism of random differential equations describes the tasks ICON accomplishes and provides a basis for understanding other multi-operator learning methods. This probabilistic perspective also lets us extend ICON to generative settings, where one can sample from the posterior predictive distribution of solution operators. The generative formulation of ICON (GenICON) captures the underlying uncertainty in the solution operator, enabling principled uncertainty quantification of the solution predictions in operator learning.
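
The ICON-vs-GenICON distinction (posterior predictive mean vs. posterior predictive samples) can be illustrated with a toy conjugate-Gaussian "operator" family; everything below, from the one-parameter operator to the noise level, is illustrative rather than the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy operator family: u = theta * u0, with a Gaussian prior on theta.
# Context = example (condition, solution) pairs from one unknown operator.
theta_true, sigma = 2.5, 0.1
u0 = rng.normal(size=(5, 20))                # 5 example initial conditions
u = theta_true * u0 + sigma * rng.normal(size=u0.shape)

# Conjugate Gaussian posterior over theta given the context pairs.
prior_mu, prior_var = 0.0, 1.0
prec = 1 / prior_var + (u0**2).sum() / sigma**2
post_var = 1 / prec
post_mu = post_var * (prior_mu / prior_var + (u0 * u).sum() / sigma**2)

u0_query = rng.normal(size=20)
mean_pred = post_mu * u0_query               # "ICON": posterior predictive mean
samples = [(post_mu + np.sqrt(post_var) * rng.normal()) * u0_query
           for _ in range(100)]              # "GenICON": posterior samples
print(post_mu, post_var)
```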

Spectral Algorithms in Misspecified Regression: Convergence under Covariate Shift

arXiv:2509.05106v1 Announce Type: new Abstract: This paper investigates the convergence properties of spectral algorithms -- a class of regularization methods originating from inverse problems -- under covariate shift. In this setting, the marginal distributions of inputs differ between source and target domains, while the conditional distribution of outputs given inputs remains unchanged. To address this distributional mismatch, we incorporate importance weights, defined as the ratio of target to source densities, into the learning framework. This leads to a weighted spectral algorithm within a nonparametric regression setting in a reproducing kernel Hilbert space (RKHS). More importantly, in contrast to prior work that largely focuses on the well-specified setting, we provide a comprehensive theoretical analysis of the more challenging misspecified case, in which the target function does not belong to the RKHS. Under the assumption of uniformly bounded density ratios, we establish minimax-optimal convergence rates when the target function lies within the RKHS. For scenarios involving unbounded importance weights, we introduce a novel truncation technique that attains near-optimal convergence rates under mild regularity conditions, and we further extend these results to the misspecified regime. By addressing the intertwined challenges of covariate shift and model misspecification, this work extends classical kernel learning theory to more practical scenarios, providing a systematic framework for understanding their interaction.
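
One concrete instantiation is importance-weighted kernel ridge regression (a spectral algorithm with the Tikhonov filter), including the weight truncation used for unbounded density ratios. A minimal sketch, assuming a Gaussian kernel and a stand-in density ratio:

```python
import numpy as np

def weighted_krr(X, y, w, lam=1e-2, gamma=1.0, w_max=None):
    """Importance-weighted kernel ridge regression (Tikhonov filter).
    w[i] ~ p_target(x_i) / p_source(x_i); optionally truncated at w_max."""
    if w_max is not None:
        w = np.minimum(w, w_max)          # truncation for unbounded ratios
    d2 = ((X[:, None] - X[None, :])**2).sum(-1)
    K = np.exp(-gamma * d2)               # Gaussian kernel matrix
    W = np.diag(w)
    n = len(y)
    # First-order condition gives (W K + n*lam I) alpha = W y;
    # the predictor is f(x) = sum_i alpha_i k(x, x_i).
    return np.linalg.solve(W @ K + n * lam * np.eye(n), w * y)

X = np.random.randn(100, 2)
y = np.sin(X[:, 0])
w = np.exp(X[:, 0])                       # stand-in density ratio
alpha = weighted_krr(X, y, w, w_max=10.0)
```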

Gradient Methods with Online Scaling Part I. Theoretical Foundations

arXiv:2505.23081v2 Announce Type: replace-cross Abstract: This paper establishes the theoretical foundations of online scaled gradient methods (OSGM), a framework that uses online learning to adapt stepsizes and provably accelerate first-order methods. OSGM quantifies the effectiveness of a stepsize through a feedback function motivated by a convergence measure, and uses that feedback to adjust the stepsize via an online learning algorithm. Consequently, instantiations of OSGM achieve convergence rates that are asymptotically no worse than those of the optimal stepsize. OSGM yields desirable convergence guarantees on smooth convex problems, including 1) trajectory-dependent global convergence on smooth convex objectives; 2) an improved complexity result on smooth strongly convex problems; and 3) local superlinear convergence. Notably, OSGM constitutes a new family of first-order methods with non-asymptotic superlinear convergence, joining the celebrated quasi-Newton methods. Finally, OSGM explains the empirical success of the popular hypergradient-descent heuristic in optimization for machine learning.
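
The hypergradient-descent heuristic that OSGM formalizes fits in a few lines: the stepsize is itself updated by gradient descent on a one-step feedback signal. A sketch for a scalar stepsize on a quadratic, with illustrative constants (OSGM's actual feedback functions and online learners are more general):

```python
import numpy as np

def osgm_sketch(grad_f, x0, eta0=0.1, beta=1e-3, iters=100):
    """Scalar-stepsize sketch: after each step x_new = x - eta*g, update
    eta by descending the feedback f(x_new), whose derivative in eta is
    d f(x - eta*g) / d eta = -grad_f(x_new) . g."""
    x, eta = x0.copy(), eta0
    for _ in range(iters):
        g = grad_f(x)
        x_new = x - eta * g
        hyper = -grad_f(x_new) @ g           # hypergradient of the feedback
        eta = max(eta - beta * hyper, 1e-8)  # online update, kept positive
        x = x_new
    return x, eta

A = np.diag([1.0, 10.0])                     # ill-conditioned quadratic
grad_f = lambda x: A @ x
x, eta = osgm_sketch(grad_f, np.ones(2))
print(x, eta)
```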

The Broader Landscape of Robustness in Algorithmic Statistics

arXiv:2412.02670v3 Announce Type: replace Abstract: The last decade has seen a number of advances in computationally efficient algorithms for statistical methods subject to robustness constraints. An estimator may be robust in several different ways: to contamination of the dataset, to heavy-tailed data, or in the sense that it preserves the privacy of the data. We survey recent results in these areas with a focus on the problem of mean estimation, drawing technical and conceptual connections between the various forms of robustness and showing that the same underlying algorithmic ideas lead to computationally efficient estimators in all these settings.
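
For a flavor of heavy-tail robustness in mean estimation, here is the classical median-of-means estimator, a textbook device rather than any of the survey's specific algorithms (which target computationally efficient high-dimensional estimators):

```python
import numpy as np

def median_of_means(x, k=10):
    """Split the sample into k blocks, average each block, take the
    median of the block means. Robust to heavy tails (and to a small
    number of corrupted blocks)."""
    rng = np.random.default_rng(0)
    x = rng.permutation(x)
    blocks = np.array_split(x, k)
    return np.median([b.mean() for b in blocks])

x = np.random.standard_t(df=2, size=10_000)  # heavy-tailed sample
print(median_of_means(x), x.mean())
```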

The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations

arXiv:2507.05644v2 Announce Type: replace-cross Abstract: It is a central challenge in deep learning to understand how neural networks learn representations. A leading approach is the Neural Feature Ansatz (NFA) (Radhakrishnan et al. 2024), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess and lacks a theoretical basis, and thus it is unclear when it might fail, and how to improve it. In this paper, we take a first-principles approach to understanding why this observation holds, and when it does not. We use first-order optimality conditions to derive the Features at Convergence Theorem (FACT), an alternative to the NFA that (a) obtains greater agreement with learned features at convergence, (b) explains why the NFA holds in most settings, and (c) captures essential feature learning phenomena in neural networks such as grokking behavior in modular arithmetic and phase transitions in learning sparse parities, similarly to the NFA. Thus, our results unify theoretical first-order optimality analyses of neural networks with the empirically-driven NFA literature, and provide a principled alternative that provably and empirically holds at convergence.
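
The abstract does not spell out the FACT formula, but the NFA it refines is commonly stated as: a trained layer's Gram matrix $W^T W$ is proportional to the average gradient outer product (AGOP). The sketch below measures that alignment for a network's first layer; stating the NFA in this form is our assumption, not a statement of FACT.

```python
import torch

def agop(model, X):
    # Average gradient outer product (AGOP) of the scalar output w.r.t. inputs.
    X = X.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(model(X).sum(), X)
    return g.T @ g / len(X)

def nfa_cosine(model, X):
    # NFA (as commonly stated): W^T W of the first layer aligns with the AGOP.
    W = model[0].weight                   # shape (out_features, in_features)
    M1, M2 = (W.T @ W).detach(), agop(model, X)
    return (M1 * M2).sum() / (M1.norm() * M2.norm())

model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 1))
X = torch.randn(256, 10)
# The NFA predicts this cosine similarity rises toward 1 over training.
print(nfa_cosine(model, X).item())
```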

STADE: Standard Deviation as a Pruning Metric

arXiv:2503.22451v2 Announce Type: replace Abstract: Recently, Large Language Models (LLMs) have become widespread and are used to solve a wide variety of tasks. Handling these tasks successfully requires longer training times and larger model sizes, which makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Earlier methods require a retraining phase after pruning to recover the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building on Wanda, this study provides a theoretical explanation of why the method is effective and leverages these insights to improve the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in machine learning in which Wanda is the optimal pruning method. This analysis is then extended to cases where Wanda is no longer optimal, leading to a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that, depending on the training conditions, whether Wanda performs optimally varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
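
Wanda's pruning score is the weight magnitude scaled by the corresponding input feature's activation norm, computed from a small calibration set with no retraining. The sketch below implements that score plus a standard-deviation-based variant in the spirit of STADE; the exact STADE formula is not given in the abstract, and Wanda's per-output-row comparison groups are simplified here to a single global threshold.

```python
import numpy as np

def prune_mask(W, X, metric="wanda", sparsity=0.5):
    """W: (out, in) weight matrix; X: (n, in) calibration activations.
    Wanda score: |W_ij| * ||X[:, j]||_2 (no retraining needed).
    'std' scores by the input's standard deviation instead, in the
    spirit of STADE (the exact STADE formula may differ)."""
    if metric == "wanda":
        scale = np.linalg.norm(X, axis=0)  # per-input-feature L2 norm
    else:
        scale = X.std(axis=0)              # per-input-feature std
    score = np.abs(W) * scale[None, :]
    thresh = np.quantile(score, sparsity)  # global threshold (simplified)
    return score > thresh                  # True = keep this weight

W = np.random.randn(8, 16)
X = np.random.randn(100, 16) * np.linspace(0.1, 2, 16)
W_pruned = W * prune_mask(W, X, "wanda")
```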

Beyond I-Con: Exploring New Dimensions of Distance Measures in Representation Learning

arXiv:2509.04734v1 Announce Type: new Abstract: The Information Contrastive (I-Con) framework revealed that over 23 representation learning methods implicitly minimize KL divergence between data and learned distributions that encode similarities between data points. However, a KL-based loss may be misaligned with the true objective, and properties of KL divergence such as asymmetry and unboundedness may create optimization challenges. We present Beyond I-Con, a framework that enables systematic discovery of novel loss functions by exploring alternative statistical divergences and similarity kernels. Key findings: (1) on unsupervised clustering of DINO-ViT embeddings, we achieve state-of-the-art results by modifying the PMI algorithm to use total variation (TV) distance; (2) on supervised contrastive learning, we outperform the standard approach by using TV and a distance-based similarity kernel instead of KL and an angular kernel; (3) on dimensionality reduction, we achieve superior qualitative results and better performance on downstream tasks than SNE by replacing KL with a bounded f-divergence. Our results highlight the importance of considering divergence and similarity kernel choices in representation learning optimization.
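
The core swap behind findings (1)-(3) is replacing KL divergence with a bounded, symmetric alternative such as total variation when comparing the data's similarity distribution to the learned one. A toy illustration with softmax-normalized neighbor distributions:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))        # asymmetric, unbounded

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()        # symmetric, bounded in [0, 1]

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Row-normalized similarity distributions over neighbors, e.g. softmax
# over embedding similarities for a single anchor point.
p = softmax(np.array([4.0, 1.0, 0.5, 0.1]))  # "data" similarities
q = softmax(np.array([2.0, 2.0, 0.5, 0.1]))  # "learned" similarities
print(kl(p, q), tv(p, q))   # TV stays bounded even as some q_i -> 0
```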

Q-learning with Posterior Sampling

arXiv:2506.00917v2 Announce Type: replace Abstract: Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T = KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
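
The exploration mechanism is easy to sketch in the tabular setting: sample Q-values from a Gaussian posterior whose variance shrinks with visit counts, then act greedily on the sample. The update rule and posterior parameterization below are illustrative simplifications, not PSQL's exact recursion:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 3                          # number of states and actions
q_mean = np.zeros((S, A))
n_visits = np.ones((S, A))           # pseudo-count keeps the variance finite
alpha, gamma, sigma = 0.1, 0.9, 1.0

def act(s):
    # Thompson sampling: draw Q-values from a Gaussian posterior whose
    # variance shrinks with the visit count, then act greedily.
    q_sample = rng.normal(q_mean[s], sigma / np.sqrt(n_visits[s]))
    return int(q_sample.argmax())

def update(s, a, r, s_next):
    # Standard Q-learning update of the posterior mean.
    n_visits[s, a] += 1
    target = r + gamma * q_mean[s_next].max()
    q_mean[s, a] += alpha * (target - q_mean[s, a])

# e.g. one interaction step in a toy MDP:
s = 0
a = act(s)
update(s, a, r=1.0, s_next=int(rng.integers(S)))
```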