Archives AI News

Building a Speech Enhancement and Automatic Speech Recognition (ASR) Pipeline in Python Using SpeechBrain

25b6

In this tutorial, we walk through an advanced yet practical workflow using SpeechBrain. We start by generating our own clean speech samples with gTTS, deliberately adding noise to simulate real-world scenarios, and then applying SpeechBrain’s MetricGAN+ model to enhance the audio. Once the audio is denoised, we run automatic speech recognition with a language model–rescored […] The post Building a Speech Enhancement and Automatic Speech Recognition (ASR) Pipeline in Python Using SpeechBrain appeared first on MarkTechPost.

MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes

arXiv:2508.17180v2 Announce Type: replace Abstract: A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images, moving beyond their established success in semantic description. Mathematical surface plots provide a rigorous testbed for this capability, as they isolate the task of reasoning from the semantic noise common in natural images. To measure progress on this frontier, we introduce MaRVL-QA (Mathematical Reasoning over Visual Landscapes), a new benchmark designed to quantitatively evaluate these core reasoning skills. The benchmark comprises two novel tasks: Topological Counting, identifying and enumerating features like local maxima; and Transformation Recognition, recognizing applied geometric transformations. Generated from a curated library of functions with rigorous ambiguity filtering, our evaluation on MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial heuristics instead of robust spatial reasoning. MaRVL-QA provides a challenging new tool for the research community to measure progress, expose model limitations, and guide the development of MLLMs with more profound reasoning abilities.

That’s So FETCH: Fashioning Ensemble Techniques for LLM Classification in Civil Legal Intake and Referral

arXiv:2509.07170v1 Announce Type: new Abstract: Each year millions of people seek help for their legal problems by calling a legal aid program hotline, walking into a legal aid office, or using a lawyer referral service. The first step to match them to the right help is to identify the legal problem the applicant is experiencing. Misdirection has consequences. Applicants may miss a deadline, experience physical abuse, lose housing or lose custody of children while waiting to connect to the right legal help. We introduce and evaluate the FETCH classifier for legal issue classification and describe two methods for improving accuracy: a hybrid LLM/ML ensemble classification method, and the automatic generation of follow-up questions to enrich the initial problem narrative. We employ a novel data set of 419 real-world queries to a nonprofit lawyer referral service. Ultimately, we show classification accuracy (hits@2) of 97.37% using a mix of inexpensive models, exceeding the performance of the current state-of-the-art GPT-5 model. Our approach shows promise in significantly reducing the cost of guiding users of the legal system to the right resource for their problem while achieving high accuracy.

A Hybrid CNN-LSTM Deep Learning Model for Intrusion Detection in Smart Grid

arXiv:2509.07208v1 Announce Type: new Abstract: The evolution of the traditional power grid into the "smart grid" has resulted in a fundamental shift in energy management, which allows the integration of renewable energy sources with modern communication technology. However, this interconnection has increased smart grids' vulnerability to attackers, which might result in privacy breaches, operational interruptions, and massive outages. The SCADA-based smart grid protocols are critical for real-time data collection and control, but they are vulnerable to attacks like unauthorized access and denial of service (DoS). This research proposes a hybrid deep learning-based Intrusion Detection System (IDS) intended to improve the cybersecurity of smart grids. The suggested model takes advantage of Convolutional Neural Networks' (CNN) feature extraction capabilities as well as Long Short-Term Memory (LSTM) networks' temporal pattern recognition skills. DNP3 and IEC104 intrusion detection datasets are employed to train and test our CNN-LSTM model to recognize and classify the potential cyber threats. Compared to other deep learning approaches, the results demonstrate considerable improvements in accuracy, precision, recall, and F1-score, with a detection accuracy of 99.70%.

Autoencoder-Based Denoising of Muscle Artifacts in ECG to Preserve Skin Nerve Activity (SKNA) for Cognitive Stress Detection

arXiv:2509.07146v1 Announce Type: new Abstract: The sympathetic nervous system (SNS) plays a central role in regulating the body's responses to stress and maintaining physiological stability. Its dysregulation is associated with a wide range of conditions, from cardiovascular disease to anxiety disorders. Skin nerve activity (SKNA) extracted from high-frequency electrocardiogram (ECG) recordings provides a noninvasive window into SNS dynamics, but its measurement is highly susceptible to electromyographic (EMG) contamination. Traditional preprocessing based on bandpass filtering within a fixed range (e.g., 500--1000 Hz) is susceptible to overlapping EMG and SKNA spectral components, especially during sustained muscle activity. We present a denoising approach using a lightweight one-dimensional convolutional autoencoder with a long short-term memory (LSTM) bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Using clean ECG-derived SKNA data from cognitive stress experiments and EMG noise from chaotic muscle stimulation recordings, we simulated contamination at realistic noise levels (--4 dB, --8 dB signal-to-noise ratio) and trained the model in the leave-one-subject-out cross-validation framework. The method improved signal-to-noise ratio by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, and restored burst-based SKNA features to near-clean discriminability (AUROC $geq$ 0.96). Classification of baseline versus sympathetic stimulation (cognitive stress) conditions reached accuracies of 91--98% across severe noise levels, comparable to clean data. These results demonstrate that deep learning--based reconstruction can preserve physiologically relevant sympathetic bursts during substantial EMG interference, enabling more robust SKNA monitoring in naturalistic, movement-rich environments.

PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning

arXiv:2509.07159v1 Announce Type: new Abstract: Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present emph{PaVeRL-SQL}, a framework that combines emph{Partial-Match Rewards} and emph{Verbal Reinforcement Learning} to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks -- Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4% higher than SOTA, and the CoT pipeline is 1.4% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, emph{PaVeRL-SQL} delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at https://github.com/PaVeRL-SQL/PaVeRL-SQL.

Neuro-Symbolic Frameworks: Conceptual Characterization and Empirical Comparative Analysis

arXiv:2509.07122v1 Announce Type: new Abstract: Neurosymbolic (NeSy) frameworks combine neural representations and learning with symbolic representations and reasoning. Combining the reasoning capacities, explainability, and interpretability of symbolic processing with the flexibility and power of neural computing allows us to solve complex problems with more reliability while being data-efficient. However, this recently growing topic poses a challenge to developers with its learning curve, lack of user-friendly tools, libraries, and unifying frameworks. In this paper, we characterize the technical facets of existing NeSy frameworks, such as the symbolic representation language, integration with neural models, and the underlying algorithms. A majority of the NeSy research focuses on algorithms instead of providing generic frameworks for declarative problem specification to leverage problem solving. To highlight the key aspects of Neurosymbolic modeling, we showcase three generic NeSy frameworks - textit{DeepProbLog}, textit{Scallop}, and textit{DomiKnowS}. We identify the challenges within each facet that lay the foundation for identifying the expressivity of each framework in solving a variety of problems. Building on this foundation, we aim to spark transformative action and encourage the community to rethink this problem in novel ways.

Instruction Agent: Enhancing Agent with Expert Demonstration

arXiv:2509.07098v1 Announce Type: new Abstract: Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids making mistakes during execution. The agent leverages the verifier and backtracker modules further to improve robustness. Both modules are critical to understand the current outcome from each action and handle unexpected interruptions(such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. The Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.

Statistical Methods in Generative AI

arXiv:2509.07054v1 Announce Type: new Abstract: Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.

Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks

arXiv:2509.07756v1 Announce Type: cross Abstract: Next to decision tree and k-nearest neighbours algorithms deep convolutional neural networks (CNNs) are widely used to classify audio data in many domains like music, speech or environmental sounds. To train a specific CNN various spectral and rhythm features like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCC), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams can be used as digital image input data for the neural network. The performance of these spectral and rhythm features for audio category level as well as audio class level classification is investigated in detail with a deep CNN and the ESC-50 dataset with 2,000 labeled environmental audio recordings using an end-to-end deep learning pipeline. The evaluated metrics accuracy, precision, recall and F1 score for multiclass classification clearly show that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCC) perform significantly better then the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs.