Archives AI News

Inexact Column Generation for Bayesian Network Structure Learning via Difference-of-Submodular Optimization

arXiv:2505.11089v2 Announce Type: replace Abstract: In this paper, we consider a score-based Integer Programming (IP) approach for solving the Bayesian Network Structure Learning (BNSL) problem. State-of-the-art BNSL IP formulations suffer from the exponentially large number of variables and constraints. A standard approach in IP to address such challenges is to employ row and column generation techniques, which dynamically generate rows and columns, while the complex pricing problem remains a computational bottleneck for BNSL. For the general class of $ell_0$-penalized likelihood scores, we show how the pricing problem can be reformulated as a difference of submodular optimization problem, and how the Difference of Convex Algorithm (DCA) can be applied as an inexact method to efficiently solve the pricing problems. Empirically, we show that, for continuous Gaussian data, our row and column generation approach yields solutions with higher quality than state-of-the-art score-based approaches, especially when the graph density increases, and achieves comparable performance against benchmark constraint-based and hybrid approaches, even when the graph size increases.

Estimating the size of a set using cascading exclusion

arXiv:2508.05901v2 Announce Type: replace-cross Abstract: Let $S$ be a finite set, and $X_1,ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^frac{1}{2}$. On the other hand, if $S={1,2,ldots,|S|}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.

NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice

arXiv:2509.07123v1 Announce Type: new Abstract: Nested logit (NL) has been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by their limited representation capability and handcrafted utility specification. While researchers introduced deep neural networks (DNNs) to tackle such challenges, the existing DNNs cannot explicitly capture inter-alternative correlations in the discrete choice context. To address the challenges, this study proposes a novel concept - alternative graph - to represent the relationships among travel mode alternatives. Using a nested alternative graph, this study further designs a nested-utility graph neural network (NestGNN) as a generalization of the classical NL model in the neural network family. Theoretically, NestGNNs generalize the classical NL models and existing DNNs in terms of model representation, while retaining the crucial two-layer substitution patterns of the NL models: proportional substitution within a nest but non-proportional substitution beyond a nest. Empirically, we find that the NestGNNs significantly outperform the benchmark models, particularly the corresponding NL models by 9.2%. As shown by elasticity tables and substitution visualization, NestGNNs retain the two-layer substitution patterns as the NL model, and yet presents more flexibility in its model design space. Overall, our study demonstrates the power of NestGNN in prediction, interpretation, and its flexibility of generalizing the classical NL model for analyzing travel mode choice.

Counterfactual Cocycles: A Framework for Robust and Coherent Counterfactual Transports

arXiv:2405.13844v3 Announce Type: replace-cross Abstract: Estimating joint distributions (a.k.a. couplings) over counterfactual outcomes is central to personalized decision-making and treatment risk assessment. Two emergent frameworks with identifiability guarantees are: (i) bijective structural causal models (SCMs), which are flexible but brittle to mis-specified latent noise; and (ii) optimal-transport (OT) methods, which avoid latent noise assumptions but can produce incoherent counterfactual transports which fail to identify higher-order couplings. In this work, we bridge the gap with emph{counterfactual cocycles}: a framework for counterfactual transports that use algebraic structure to provide coherence and identifiability guarantees. Every counterfactual cocycle corresponds to an equivalence class of SCMs, however the cocycle is invariant to the latent noise distribution, enabling us to sidestep various mis-specification problems. We characterize the structure of all identifiable counterfactual cocycles; propose flexible model parameterizations; introduce a novel cocycle estimator that avoids any distributional assumptions; and derive mis-specification robustness properties of the resulting counterfactual inference method. We demonstrate state-of-the-art performance and noise-robustness of counterfactual cocycles across synthetic benchmarks and a 401(k) eligibility study.

Universality of High-Dimensional Logistic Regression and a Novel CGMT under Dependence with Applications to Data Augmentation

arXiv:2502.15752v3 Announce Type: replace-cross Abstract: Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors--an assumption that significantly limits its applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, $m$-dependence and special cases of mixing, and establish a novel CGMT framework that accommodates for correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.

Nonparametric Envelopes for Flexible Response Reduction

arXiv:2509.07248v1 Announce Type: cross Abstract: Envelope methods improve the estimation efficiency in multivariate linear regression by identifying and separating the material and immaterial parts of the responses or the predictors and estimating the regression coefficients using only the material part. Though envelopes have been extended to other models, such as GLMs, these extensions still largely fall under the restrictive parametric modeling framework. In this paper, we introduce a flexible, nonparametric extension of response envelopes for improving efficiency in nonlinear multivariate regressions. We propose the kernel envelope (KENV) estimator for simultaneously estimating the response envelope subspace and the enveloped nonparametric conditional mean function in a reproducing kernel Hilbert space, with a novel penalty that accounts for the envelope structure. We prove that the prediction risk for KENV converges to the optimal risk as the sample size diverges and show that KENV achieves a lower in-sample prediction risk than kernel ridge regression when the response has a non-trivial immaterial component. We compare the prediction performance of KENV with other envelope methods and kernel regression methods in simulations and a real data example, finding that KENV delivers more accurate predictions than both the envelope-based and kernel-based alternatives in both low and high dimensions.

Efficient Methods for Non-stationary Online Learning

arXiv:2309.08911v3 Announce Type: replace-cross Abstract: Non-stationary online learning has drawn much attention in recent years. In particular, dynamic regret and adaptive regret are proposed as two principled performance measures for online convex optimization in non-stationary environments. To optimize them, a two-layer online ensemble is usually deployed due to the inherent uncertainty of non-stationarity, in which multiple base-learners are maintained and a meta-algorithm is employed to track the best one on the fly. However, the two-layer structure raises concerns about computational complexity -- such methods typically maintain $O(log T)$ base-learners simultaneously for a $T$-round online game and thus perform multiple projections onto the feasible domain per round, which becomes the computational bottleneck when the domain is complicated. In this paper, we present efficient methods for optimizing dynamic regret and adaptive regret that reduce the number of projections per round from $O(log T)$ to $1$. The proposed algorithms require only one gradient query and one function evaluation at each round. Our technique hinges on the reduction mechanism developed in parameter-free online learning and requires non-trivial modifications for non-stationary online methods. Furthermore, we study an even stronger measure, namely "interval dynamic regret", and reduce the number of projections per round from $O(log^2 T)$ to $1$ for minimizing it. Our reduction demonstrates broad generality and applies to two important applications: online stochastic control and online principal component analysis, resulting in methods that are both efficient and optimal. Finally, empirical studies verify our theoretical findings.

Toric geometry of ReLU neural networks

arXiv:2509.05894v1 Announce Type: cross Abstract: Given a continuous finitely piecewise linear function $f:mathbb{R}^{n_0} to mathbb{R}$ and a fixed architecture $(n_0,ldots,n_k;1)$ of feedforward ReLU neural networks, the exact function realization problem is to determine when some network with the given architecture realizes $f$. To develop a systematic way to answer these questions, we establish a connection between toric geometry and ReLU neural networks. This approach enables us to utilize numerous structures and tools from algebraic geometry to study ReLU neural networks. Starting with an unbiased ReLU neural network with rational weights, we define the ReLU fan, the ReLU toric variety, and the ReLU Cartier divisor associated with the network. This work also reveals the connection between the tropical geometry and the toric geometry of ReLU neural networks. As an application of the toric geometry framework, we prove a necessary and sufficient criterion of functions realizable by unbiased shallow ReLU neural networks by computing intersection numbers of the ReLU Cartier divisor and torus-invariant curves.

Asynchronous Gossip Algorithms for Rank-Based Statistical Methods

arXiv:2509.07543v1 Announce Type: new Abstract: As decentralized AI and edge intelligence become increasingly prevalent, ensuring robustness and trustworthiness in such distributed settings has become a critical issue-especially in the presence of corrupted or adversarial data. Traditional decentralized algorithms are vulnerable to data contamination as they typically rely on simple statistics (e.g., means or sum), motivating the need for more robust statistics. In line with recent work on decentralized estimation of trimmed means and ranks, we develop gossip algorithms for computing a broad class of rank-based statistics, including L-statistics and rank statistics-both known for their robustness to outliers. We apply our method to perform robust distributed two-sample hypothesis testing, introducing the first gossip algorithm for Wilcoxon rank-sum tests. We provide rigorous convergence guarantees, including the first convergence rate bound for asynchronous gossip-based rank estimation. We empirically validate our theoretical results through experiments on diverse network topologies.

Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression

arXiv:2509.07300v1 Announce Type: new Abstract: Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and selection operator (LASSO PCR), a type of multi-voxel pattern analysis (MVPA). This model presumes that all components are equally likely to harbor relevant information, when in fact the task-related signal may be concentrated in specific components. In such cases, the model will fail to select the optimal set of principal components that maximizes the total signal relevant to the cognitive process under study. Here, we present modifications to LASSO PCR that allow for a regularization penalty tied directly to the index of the principal component, reflecting a prior belief that task-relevant signal is more likely to be concentrated in components explaining greater variance. Additionally, we propose a novel hybrid method, Joint Sparsity-Ranked LASSO (JSRL), which integrates component-level and voxel-level activity under an information parity framework and imposes ranked sparsity to guide component selection. We apply the models to brain activation during risk taking, monetary incentive, and emotion regulation tasks. Results demonstrate that incorporating sparsity ranking into LASSO PCR produces models with enhanced classification performance, with JSRL achieving up to 51.7% improvement in cross-validated deviance $R^2$ and 7.3% improvement in cross-validated AUC. Furthermore, sparsity-ranked models perform as well as or better than standard LASSO PCR approaches across all classification tasks and allocate predictive weight to brain regions consistent with their established functional roles, offering a robust alternative for MVPA.