Archives AI News

Minimax optimal transfer learning for high-dimensional additive regression

arXiv:2509.06308v1 Announce Type: new Abstract: This paper studies high-dimensional additive regression under the transfer learning framework, where one observes samples from a target population together with auxiliary samples from different but potentially related regression models. We first introduce a target-only estimation procedure based on the smooth backfitting estimator with local linear smoothing. In contrast to previous work, we establish general error bounds under sub-Weibull($alpha$) noise, thereby accommodating heavy-tailed error distributions. In the sub-exponential case ($alpha=1$), we show that the estimator attains the minimax lower bound under regularity conditions, which requires a substantial departure from existing proof strategies. We then develop a novel two-stage estimation method within a transfer learning framework, and provide theoretical guarantees at both the population and empirical levels. Error bounds are derived for each stage under general tail conditions, and we further demonstrate that the minimax optimal rate is achieved when the auxiliary and target distributions are sufficiently close. All theoretical results are supported by simulation studies and real data analysis.

Convergence and Generalization of Anti-Regularization for Parametric Models

arXiv:2508.17412v2 Announce Type: replace-cross Abstract: Anti-regularization introduces a reward term with a reversed sign into the loss function, deliberately amplifying model expressivity in small-sample regimes while ensuring that the intervention gradually vanishes as the sample size grows through a power-law decay schedule. We formalize spectral safety conditions and trust-region constraints, and we design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention. Theoretical analysis extends to linear smoothers and the Neural Tangent Kernel regime, providing practical guidance on the choice of decay exponents through the balance between empirical risk and variance. Empirical results show that Anti-regularization mitigates underfitting in both regression and classification while preserving generalization and improving calibration. Ablation studies confirm that the decay schedule and safeguards are essential to avoiding overfitting and instability. As an alternative, we also propose a degrees-of-freedom targeting schedule that maintains constant per-sample complexity. Anti-regularization constitutes a simple and reproducible procedure that integrates seamlessly into standard empirical risk minimization pipelines, enabling robust learning under limited data and resource constraints by intervening only when necessary and vanishing otherwise.

Robust and Adaptive Spectral Method for Representation Multi-Task Learning with Contamination

arXiv:2509.06575v1 Announce Type: new Abstract: Representation-based multi-task learning (MTL) improves efficiency by learning a shared structure across tasks, but its practical application is often hindered by contamination, outliers, or adversarial tasks. Most existing methods and theories assume a clean or near-clean setting, failing when contamination is significant. This paper tackles representation MTL with an unknown and potentially large contamination proportion, while also allowing for heterogeneity among inlier tasks. We introduce a Robust and Adaptive Spectral method (RAS) that can distill the shared inlier representation effectively and efficiently, while requiring no prior knowledge of the contamination level or the true representation dimension. Theoretically, we provide non-asymptotic error bounds for both the learned representation and the per-task parameters. These bounds adapt to inlier task similarity and outlier structure, and guarantee that RAS performs at least as well as single-task learning, thus preventing negative transfer. We also extend our framework to transfer learning with corresponding theoretical guarantees for the target task. Extensive experiments confirm our theory, showcasing the robustness and adaptivity of RAS, and its superior performance in regimes with up to 80% task contamination.

Limit Theorems for Stochastic Gradient Descent with Infinite Variance

arXiv:2410.16340v4 Announce Type: replace Abstract: Stochastic gradient descent is a classic algorithm that has gained great popularity especially in the last decades as the most common approach for training models in machine learning. While the algorithm has been well-studied when stochastic gradients are assumed to have a finite variance, there is significantly less research addressing its theoretical properties in the case of infinite variance gradients. In this paper, we establish the asymptotic behavior of stochastic gradient descent in the context of infinite variance stochastic gradients, assuming that the stochastic gradient is regular varying with index $alphain(1,2)$. The closest result in this context was established in 1969 , in the one-dimensional case and assuming that stochastic gradients belong to a more restrictive class of distributions. We extend it to the multidimensional case, covering a broader class of infinite variance distributions. As we show, the asymptotic distribution of the stochastic gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate stable L'evy process. Additionally, we explore the applications of these results in linear regression and logistic regression models.

Automated Hierarchical Graph Construction for Multi-source Electronic Health Records

arXiv:2509.06576v1 Announce Type: new Abstract: Electronic Health Records (EHRs), comprising diverse clinical data such as diagnoses, medications, and laboratory results, hold great promise for translational research. EHR-derived data have advanced disease prevention, improved clinical trial recruitment, and generated real-world evidence. Synthesizing EHRs across institutions enables large-scale, generalizable studies that capture rare diseases and population diversity, but remains hindered by the heterogeneity of medical codes, institution-specific terminologies, and the absence of standardized data structures. These barriers limit the interpretability, comparability, and scalability of EHR-based analyses, underscoring the need for robust methods to harmonize and extract meaningful insights from distributed, heterogeneous data. To address this, we propose MASH (Multi-source Automated Structured Hierarchy), a fully automated framework that aligns medical codes across institutions using neural optimal transport and constructs hierarchical graphs with learned hyperbolic embeddings. During training, MASH integrates information from pre-trained language models, co-occurrence patterns, textual descriptions, and supervised labels to capture semantic and hierarchical relationships among medical concepts more effectively. Applied to real-world EHR data, including diagnosis, medication, and laboratory codes, MASH produces interpretable hierarchical graphs that facilitate the navigation and understanding of heterogeneous clinical data. Notably, it generates the first automated hierarchies for unstructured local laboratory codes, establishing foundational references for downstream applications.

KD$^{2}$M: A unifying framework for feature knowledge distillation

arXiv:2504.01757v3 Announce Type: replace Abstract: Knowledge Distillation (KD) seeks to transfer the knowledge of a teacher, towards a student neural net. This process is often done by matching the networks' predictions (i.e., their output), but, recently several works have proposed to match the distributions of neural nets' activations (i.e., their features), a process known as emph{distribution matching}. In this paper, we propose an unifying framework, Knowledge Distillation through Distribution Matching (KD$^{2}$M), which formalizes this strategy. Our contributions are threefold. We i) provide an overview of distribution metrics used in distribution matching, ii) benchmark on computer vision datasets, and iii) derive new theoretical results for KD.

Sequential Least-Squares Estimators with Fast Randomized Sketching for Linear Statistical Models

arXiv:2509.06856v1 Announce Type: new Abstract: We propose a novel randomized framework for the estimation problem of large-scale linear statistical models, namely Sequential Least-Squares Estimators with Fast Randomized Sketching (SLSE-FRS), which integrates Sketch-and-Solve and Iterative-Sketching methods for the first time. By iteratively constructing and solving sketched least-squares (LS) subproblems with increasing sketch sizes to achieve better precisions, SLSE-FRS gradually refines the estimators of the true parameter vector, ultimately producing high-precision estimators. We analyze the convergence properties of SLSE-FRS, and provide its efficient implementation. Numerical experiments show that SLSE-FRS outperforms the state-of-the-art methods, namely the Preconditioned Conjugate Gradient (PCG) method, and the Iterative Double Sketching (IDS) method.

Catapult Dynamics and Phase Transitions in Quadratic Nets

arXiv:2301.07737v2 Announce Type: replace-cross Abstract: Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In cite{lewkowycz2020large} it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogenous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.

Learning from one graph: transductive learning guarantees via the geometry of small random worlds

arXiv:2509.06894v1 Announce Type: new Abstract: Since their introduction by Kipf and Welling in $2017$, a primary use of graph convolutional networks is transductive node classification, where missing labels are inferred within a single observed graph and its feature matrix. Despite the widespread use of the network model, the statistical foundations of transductive learning remain limited, as standard inference frameworks typically rely on multiple independent samples rather than a single graph. In this work, we address these gaps by developing new concentration-of-measure tools that leverage the geometric regularities of large graphs via low-dimensional metric embeddings. The emergent regularities are captured using a random graph model; however, the methods remain applicable to deterministic graphs once observed. We establish two principal learning results. The first concerns arbitrary deterministic $k$-vertex graphs, and the second addresses random graphs that share key geometric properties with an ErdH{o}s-R'{e}nyi graph $mathbf{G}=mathbf{G}(k,p)$ in the regime $p in mathcal{O}((log (k)/k)^{1/2})$. The first result serves as the basis for and illuminates the second. We then extend these results to the graph convolutional network setting, where additional challenges arise. Lastly, our learning guarantees remain informative even with a few labelled nodes $N$ and achieve the optimal nonparametric rate $mathcal{O}(N^{-1/2})$ as $N$ grows.

Towards a General Time Series Forecasting Model with Unified Representation and Adaptive Transfer

arXiv:2405.17478v3 Announce Type: replace-cross Abstract: With the growing availability of multi-domain time series data, there is an increasing demand for general forecasting models pre-trained on multi-source datasets to support diverse downstream prediction scenarios. Existing time series foundation models primarily focus on scaling up pre-training datasets and model sizes to enhance generalization performance. In this paper, we take a different approach by addressing two critical aspects of general forecasting models: (1) how to derive unified representations from heterogeneous multi-domain time series data, and (2) how to effectively capture domain-specific features to enable adaptive transfer across various downstream scenarios. To address the first aspect, we propose Decomposed Frequency Learning as the pre-training task, which leverages frequency-based masking and reconstruction to decompose coupled semantic information in time series, resulting in unified representations across domains. For the second aspect, we introduce the Time Series Register, which captures domain-specific representations during pre-training and enhances adaptive transferability to downstream tasks. Our model achieves the state-of-the-art forecasting performance on seven real-world benchmarks, demonstrating remarkable few-shot and zero-shot capabilities.