Quantum-Classical Hybrid Molecular Autoencoder for Advancing Classical Decoding

arXiv:2508.19394v1 Announce Type: new Abstract: Although recent advances in quantum machine learning (QML) offer significant potential for enhancing generative models, particularly in molecular design, a large array of classical approaches still face challenges in achieving high fidelity and validity. In particular, the integration of QML with sequence-based tasks, such as Simplified Molecular Input Line Entry System (SMILES) string reconstruction, remains underexplored and usually suffers from fidelity degradation. In this work, we propose a hybrid quantum-classical architecture for SMILES reconstruction that integrates quantum encoding with classical sequence modeling to improve quantum fidelity and classical similarity. Our approach achieves a quantum fidelity of approximately 84% and a classical reconstruction similarity of 60%, surpassing existing quantum baselines. Our work lays a promising foundation for future QML applications, striking a balance between expressive quantum representations and classical sequence models and catalyzing broader research on quantum-aware sequence models for molecular and drug discovery.
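
The abstract does not say how its quantum fidelity figure is computed. As a minimal sketch, assuming the standard pure-state definition F = |<psi|phi>|^2, the quantity can be evaluated as below. The function names and example states are illustrative and do not come from the paper.

```python
import math

def normalize(state):
    """Rescale a list of complex amplitudes to unit norm."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in state))
    return [a / norm for a in state]

def fidelity(psi, phi):
    """Pure-state fidelity |<psi|phi>|^2 (assumed definition, see lead-in)."""
    psi, phi = normalize(psi), normalize(phi)
    overlap = sum(a.conjugate() * b for a, b in zip(psi, phi))
    return abs(overlap) ** 2

# Hypothetical single-qubit states: |0> and an unnormalized |+>.
zero = [1 + 0j, 0j]
plus = [1 + 0j, 1 + 0j]

same_state = fidelity(zero, zero)   # identical states give fidelity 1
half_overlap = fidelity(zero, plus) # |0> versus |+> gives fidelity 1/2
```

Under this definition, a fidelity of about 84% would mean the reconstructed state overlaps the target state with squared magnitude near 0.84.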

Vocoder-Projected Feature Discriminator

arXiv:2508.17874v2 Announce Type: replace-cross Abstract: In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.

Kolmogorov-Arnold Representation for Symplectic Learning: Advancing Hamiltonian Neural Networks

arXiv:2508.19410v1 Announce Type: new Abstract: We propose a Kolmogorov-Arnold Representation-based Hamiltonian Neural Network (KAR-HNN) that replaces the multilayer perceptrons (MLPs) with univariate transformations. While Hamiltonian Neural Networks (HNNs) ensure energy conservation by learning Hamiltonian functions directly from data, existing implementations, which often rely on MLPs, are hypersensitive to hyperparameters when exploring complex energy landscapes. Our approach exploits localized function approximations to better capture high-frequency and multi-scale dynamics, reducing energy drift and improving long-term predictive stability. The networks preserve the symplectic form of Hamiltonian systems and thus maintain interpretability and physical consistency. After assessing KAR-HNN on four benchmark problems, including the spring-mass system, the simple pendulum, and the two- and three-body problems, we foresee its effectiveness for accurate and stable modeling of realistic physical processes, often at high dimensions and with few known parameters.
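
To make the Hamiltonian structure concrete, the sketch below integrates the spring-mass benchmark using Hamilton's equations, dq/dt = dH/dp and dp/dt = -dH/dq. It is not KAR-HNN: the Hamiltonian here is written analytically, whereas an HNN would learn it from data. The leapfrog integrator, the unit constants, and all names are assumptions chosen for illustration.

```python
def hamiltonian(q, p, k=1.0, m=1.0):
    """Total energy of a spring-mass system: kinetic plus potential."""
    return p * p / (2 * m) + k * q * q / 2

def grad_hamiltonian(q, p, k=1.0, m=1.0):
    """Analytic partial derivatives (dH/dq, dH/dp)."""
    return k * q, p / m

def leapfrog(q, p, dt, steps):
    """Symplectic leapfrog integration of Hamilton's equations."""
    for _step in range(steps):
        dH_dq, _unused = grad_hamiltonian(q, p)
        p -= 0.5 * dt * dH_dq          # half step: dp/dt = -dH/dq
        _unused, dH_dp = grad_hamiltonian(q, p)
        q += dt * dH_dp                # full step: dq/dt = +dH/dp
        dH_dq, _unused = grad_hamiltonian(q, p)
        p -= 0.5 * dt * dH_dq          # second half step for p
    return q, p

q0, p0 = 1.0, 0.0
q1, p1 = leapfrog(q0, p0, dt=0.01, steps=10_000)
energy_drift = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
```

Because the update is symplectic, the energy error stays bounded over long rollouts instead of growing steadily, which is the kind of long-term stability the abstract refers to.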

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

arXiv:2508.20072v1 Announce Type: cross Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions to robot actions. However, prevailing VLA decoders either generate actions autoregressively in a fixed left-to-right order or attach continuous diffusion or flow matching heads outside the backbone, demanding specialized training and iterative sampling that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a single-transformer policy that models discretized action chunks with discrete diffusion and is trained with the same cross-entropy objective as the VLM backbone. The design retains diffusion's progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary remasking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pretrained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. SR on LIBERO, 71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv Bridge, improving over both autoregressive and continuous diffusion baselines. These findings indicate that a discrete-diffusion action decoder supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets.

Even Heads Fix Odd Errors: Mechanistic Discovery and Surgical Repair in Transformer Attention

arXiv:2508.19414v1 Announce Type: new Abstract: We present a mechanistic case study of a format-dependent reasoning failure in Llama-3.1-8B-Instruct, where the model incorrectly judges "9.11" as larger than "9.8" in chat or Q&A formats, but answers correctly in a simple format. Through systematic intervention, we discover that transformers implement even/odd attention head specialization: even-indexed heads handle numerical comparison, while odd heads serve incompatible functions. The bug requires exactly 8 even heads at Layer 10 for perfect repair. Any combination of 8+ even heads succeeds, while 7 or fewer completely fails, revealing sharp computational thresholds with perfect redundancy among the 16 even heads. SAE analysis reveals the mechanism: format representations separate (10% feature overlap at Layer 7), then re-entangle with different weightings (80% feature overlap at Layer 10), with specific features showing 1.5x amplification in failing formats. We achieve perfect repair using only 25% of attention heads and identify a 60% pattern replacement threshold, demonstrating that apparent full-module requirements hide sophisticated substructure with implications for interpretability and efficiency. All of our code is available at https://github.com/gussand/surgeon.

Generation of Geodesics with Actor-Critic Reinforcement Learning to Predict Midpoints

arXiv:2407.01991v4 Announce Type: replace Abstract: To find the shortest paths for all pairs on manifolds with infinitesimally defined metrics, we introduce a framework to generate them by predicting midpoints recursively. To learn midpoint prediction, we propose an actor-critic approach. We prove the soundness of our approach and show experimentally that the proposed method outperforms existing methods on several planning tasks, including path planning for agents with complex kinematics and motion planning for multi-degree-of-freedom robot arms.
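
The recursive midpoint idea can be sketched as follows. In the paper the midpoint predictor is a learned actor. Here a plain Euclidean midpoint stands in for it, so this only illustrates the recursion, not the learning. All function names are hypothetical.

```python
def euclidean_midpoint(a, b):
    """Stand-in for a learned midpoint predictor (assumption: flat metric)."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def generate_path(start, goal, midpoint, depth):
    """Build a path by recursively inserting predicted midpoints.

    A recursion depth of d yields 2**d segments, i.e. 2**d + 1 waypoints.
    """
    if depth == 0:
        return [start, goal]
    mid = midpoint(start, goal)
    left = generate_path(start, mid, midpoint, depth - 1)
    right = generate_path(mid, goal, midpoint, depth - 1)
    return left + right[1:]  # drop the duplicated midpoint

path = generate_path((0.0, 0.0), (1.0, 2.0), euclidean_midpoint, depth=2)
print(path)
# [(0.0, 0.0), (0.25, 0.5), (0.5, 1.0), (0.75, 1.5), (1.0, 2.0)]
```

The two halves of each split are independent of one another, so the waypoints at a given depth could in principle be predicted in parallel.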

Differentiable multiphase flow model for physics-informed machine learning in reservoir pressure management

arXiv:2508.19419v1 Announce Type: new Abstract: Accurate subsurface reservoir pressure control is extremely challenging due to geological heterogeneity and multiphase fluid-flow dynamics. Predicting behavior in this setting relies on high-fidelity physics-based simulations that are computationally expensive. Yet, the uncertain, heterogeneous properties that control these flows make it necessary to perform many of these expensive simulations, which is often prohibitive. To address these challenges, we introduce a physics-informed machine learning workflow that couples a fully differentiable multiphase flow simulator, which is implemented in the DPFEHM framework, with a convolutional neural network (CNN). The CNN learns to predict fluid extraction rates from heterogeneous permeability fields to enforce pressure limits at critical reservoir locations. By incorporating transient multiphase flow physics into the training process, our method enables more practical and accurate predictions for realistic injection-extraction scenarios compared to previous work. To speed up training, we pretrain the model on single-phase, steady-state simulations and then fine-tune it on full multiphase scenarios, which dramatically reduces the computational cost. We demonstrate that high-accuracy training can be achieved with fewer than three thousand full-physics multiphase flow simulations -- compared to previous estimates requiring up to ten million. This drastic reduction in the number of simulations is achieved by leveraging transfer learning from much less expensive single-phase simulations.

Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence

arXiv:2412.18164v3 Announce Type: replace Abstract: Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations, and demonstrate the practical effectiveness of the proposed PI-FT algorithm through numerical experiments. Our code is available at https://github.com/yinbinhan/fine-tuning-of-diffusion-models.

MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification

arXiv:2508.19424v1 Announce Type: new Abstract: Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping.
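
For readers unfamiliar with the NT-Xent objective, here is a minimal pure-Python sketch of the loss for two views of the same items. The tiny 2-D embeddings, the temperature value, and the function names are assumptions for illustration. The paper's TabNet encoders and multi-scale details are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent(view_a, view_b, temperature=0.5):
    """Normalized temperature-scaled cross-entropy over 2N embeddings.

    view_a[i] and view_b[i] are the two views of item i (the positive pair).
    Every other embedding in the batch acts as a negative.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        pos_logit = cosine(z[i], z[positive]) / temperature
        # Denominator sums over all embeddings except the anchor itself.
        denom = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        total += math.log(denom) - pos_logit
    return total / (2 * n)

# Hypothetical embeddings for two cancer types under two views.
gene_view = [[1.0, 0.0], [0.0, 1.0]]
chrom_view_aligned = [[1.0, 0.0], [0.0, 1.0]]
chrom_view_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = nt_xent(gene_view, chrom_view_aligned)
loss_swapped = nt_xent(gene_view, chrom_view_swapped)
```

The loss is lower when the two views of the same item agree, which is what pushes the encoders toward a shared embedding per cancer type.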

R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning

arXiv:2504.11195v2 Announce Type: replace Abstract: Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
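
The relationship between marginal entropy and pointwise entropy can be checked numerically. For N augmented views with class distributions p_i and mean distribution p_bar, the identity H(p_bar) = mean_i H(p_i) + mean_i KL(p_i || p_bar) always holds. Reading the KL part as the term R-TPT drops is an inference from the abstract, not a statement from the paper. The example distributions and function names below are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping zero-probability classes of p."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def mean_distribution(view_probs):
    """Average the class distributions of all augmented views."""
    n = len(view_probs)
    return [sum(column) / n for column in zip(*view_probs)]

def marginal_entropy(view_probs):
    """Entropy of the averaged prediction (the classic objective)."""
    return entropy(mean_distribution(view_probs))

def mean_pointwise_entropy(view_probs):
    """Average of the per-view entropies (the retained objective)."""
    return sum(entropy(p) for p in view_probs) / len(view_probs)

# Two hypothetical augmented views that confidently disagree.
views = [[0.9, 0.1], [0.1, 0.9]]
p_bar = mean_distribution(views)

marginal = marginal_entropy(views)
pointwise = mean_pointwise_entropy(views)
disagreement = sum(kl_divergence(p, p_bar) for p in views) / len(views)
```

In this example each view is confident on its own, so the pointwise term is small, while the marginal entropy is at its two-class maximum because the views disagree.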