Archives AI News

NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks

arXiv:2509.00446v1 Announce Type: new Abstract: Recent advances in autonomous digital agents from industry (e.g., Manus AI and Gemini's research mode) highlight potential for structured tasks by autonomous decision-making and task decomposition; however, it remains unclear to what extent the agent-based systems can improve multimodal web data productivity. We study this in the realm of journalism, which requires iterative planning, interpretation, and contextual reasoning from multimodal raw contents to form a well structured news. We introduce NEWSAGENT, a benchmark for evaluating how agents can automatically search available raw contents, select desired information, and edit and rephrase to form a news article by accessing core journalistic functions. Given a writing instruction and firsthand data as how a journalist initiates a news draft, agents are tasked to identify narrative perspectives, issue keyword-based queries, retrieve historical background, and generate complete articles. Unlike typical summarization or retrieval tasks, essential context is not directly available and must be actively discovered, reflecting the information gaps faced in real-world news writing. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal contents converted to text for broad model compatibility. We evaluate open- and closed-sourced LLMs with commonly-used agentic frameworks on NEWSAGENT, which shows that agents are capable of retrieving relevant facts but struggling with planning and narrative integration. We believe that NEWSAGENT serves a realistic testbed for iterating and evaluating agent capabilities in terms of multimodal web data manipulation to real-world productivity.

Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion

arXiv:2506.23137v3 Announce Type: replace-cross Abstract: Knowledge graph completion demands effective modeling of multifaceted semantic relationships between entities. Yet, prevailing methods, which rely on static scoring functions over learned embeddings, struggling to simultaneously capture rich semantic context and the dynamic nature of relations. To overcome this limitation, we propose the Flow-Modulated Scoring (FMS) framework, conceptualizing a relation as a dynamic evolutionary process governed by its static semantic environment. FMS operates in two stages: it first learns context-aware entity embeddings via a Semantic Context Learning module, and then models a dynamic flow between them using a Conditional Flow-Matching module. This learned flow dynamically modulates a base static score for the entity pair. By unifying context-rich static representations with a conditioned dynamic flow, FMS achieves a more comprehensive understanding of relational semantics. Extensive experiments demonstrate that FMS establishes a new state of the art across both canonical knowledge graph completion tasks: relation prediction and entity prediction. On the standard relation prediction benchmark FB15k-237, FMS achieves a near-perfect MRR of 99.8% and Hits@1 of 99.7% using a mere 0.35M parameters, while also attaining a 99.9% MRR on WN18RR. Its dominance extends to entity prediction, where it secures a 25.2% relative MRR gain in the transductive setting and substantially outperforms all baselines in challenging inductive settings. By unifying a dynamic flow mechanism with rich static contexts, FMS offers a highly effective and parameter-efficient new paradigm for knowledge graph completion. Code published at: https://github.com/yuanwuyuan9/FMS.

Multi-Agent Data Visualization and Narrative Generation

arXiv:2509.00481v1 Announce Type: new Abstract: Recent advancements in the field of AI agents have impacted the way we work, enabling greater automation and collaboration between humans and agents. In the data visualization field, multi-agent systems can be useful for employing agents throughout the entire data-to-communication pipeline. We present a lightweight multi-agent system that automates the data analysis workflow, from data exploration to generating coherent visual narratives for insight communication. Our approach combines a hybrid multi-agent architecture with deterministic components, strategically externalizing critical logic from LLMs to improve transparency and reliability. The system delivers granular, modular outputs that enable surgical modifications without full regeneration, supporting sustainable human-AI collaboration. We evaluated our system across 4 diverse datasets, demonstrating strong generalizability, narrative quality, and computational efficiency with minimal dependencies.

Learning local and global prototypes with optimal transport for unsupervised anomaly detection and localization

arXiv:2508.12927v2 Announce Type: replace-cross Abstract: Unsupervised anomaly detection aims to detect defective parts of a sample by having access, during training, to a set of normal, i.e. defect-free, data. It has many applications in fields, such as industrial inspection or medical imaging, where acquiring labels is costly or when we want to avoid introducing biases in the type of anomalies that can be spotted. In this work, we propose a novel UAD method based on prototype learning and introduce a metric to compare a structured set of embeddings that balances a feature-based cost and a spatial-based cost. We leverage this metric to learn local and global prototypes with optimal transport from latent representations extracted with a pre-trained image encoder. We demonstrate that our approach can enforce a structural constraint when learning the prototypes, allowing to capture the underlying organization of the normal samples, thus improving the detection of incoherencies in images. Our model achieves performance that is on par with strong baselines on two reference benchmarks for anomaly detection on industrial images.

Artificial Intelligence-Based Analysis of Ice Cream Melting Behavior Under Various Ingredients

arXiv:2509.00507v1 Announce Type: new Abstract: The stability of ice cream during melting is a critical factor for consumer's acceptance and product quality. With the commonly added stabilizer to improve texture, structure and slower melting as the factors to analyze. This report explores the effects of locust bean gum, guar gum, maltodextrin, and carrageenan on the melting behavior of homemade ice cream. The main objective was to assess how these additives influence melting resistance and to identify a more cost-effective recipe formulation. Ice cream samples incorporating each additive were prepared and subjected to melting tests under controlled conditions. Timelapse recordings were used to capture and analyze the progression of melting over time. Python and OpenCV is used for process and analysis. Observations revealed that all samples retained a foam-like structure even after melting, suggesting the stabilizers contributed to the formation of a stable air-cell matrix. Furthermore, when the melted samples were re-frozen and subsequently melted again, they displayed increased sturdiness, indicating improved resilience of the ice cream structure. Comparative analysis of the different stabilizers highlighted variations in their effectiveness, with some offering stronger melting resistance and structural support than others. Overall, the findings provide insights into the functional roles of commonly used food additives in ice cream formulation. By evaluating both performance and cost, this study demonstrates the potential for developing recipes that balance durability with economic efficiency, contributing to practical applications in both small-scale and commercial ice cream production.

Deep Learning-Based Rock Particulate Classification Using Attention-Enhanced ConvNeXt

arXiv:2509.01704v1 Announce Type: cross Abstract: Accurate classification of rock sizes is a vital component in geotechnical engineering, mining, and resource management, where precise estimation influences operational efficiency and safety. In this paper, we propose an enhanced deep learning model based on the ConvNeXt architecture, augmented with both self-attention and channel attention mechanisms. Building upon the foundation of ConvNext, our proposed model, termed CNSCA, introduces self-attention to capture long-range spatial dependencies and channel attention to emphasize informative feature channels. This hybrid design enables the model to effectively capture both fine-grained local patterns and broader contextual relationships within rock imagery, leading to improved classification accuracy and robustness. We evaluate our model on a rock size classification dataset and compare it against three strong baseline. The results demonstrate that the incorporation of attention mechanisms significantly enhances the models capability for fine-grained classification tasks involving natural textures like rocks.

LLM-Assisted Iterative Evolution with Swarm Intelligence Toward SuperBrain

arXiv:2509.00510v1 Announce Type: new Abstract: We propose a novel SuperBrain framework for collective intelligence, grounded in the co-evolution of large language models (LLMs) and human users. Unlike static prompt engineering or isolated agent simulations, our approach emphasizes a dynamic pathway from Subclass Brain to Superclass Brain: (1) A Subclass Brain arises from persistent, personalized interaction between a user and an LLM, forming a cognitive dyad with adaptive learning memory. (2) Through GA-assisted forward-backward evolution, these dyads iteratively refine prompts and task performance. (3) Multiple Subclass Brains coordinate via Swarm Intelligence, optimizing across multi-objective fitness landscapes and exchanging distilled heuristics. (4) Their standardized behaviors and cognitive signatures integrate into a Superclass Brain, an emergent meta-intelligence capable of abstraction, generalization and self-improvement. We outline the theoretical constructs, present initial implementations (e.g., UAV scheduling, KU/KI keyword filtering) and propose a registry for cross-dyad knowledge consolidation. This work provides both a conceptual foundation and an architectural roadmap toward scalable, explainable and ethically aligned collective AI.

Doctoral Thesis: Geometric Deep Learning For Camera Pose Prediction, Registration, Depth Estimation, and 3D Reconstruction

arXiv:2509.01873v1 Announce Type: cross Abstract: Modern deep learning developments create new opportunities for 3D mapping technology, scene reconstruction pipelines, and virtual reality development. Despite advances in 3D deep learning technology, direct training of deep learning models on 3D data faces challenges due to the high dimensionality inherent in 3D data and the scarcity of labeled datasets. Structure-from-motion (SfM) and Simultaneous Localization and Mapping (SLAM) exhibit robust performance when applied to structured indoor environments but often struggle with ambiguous features in unstructured environments. These techniques often struggle to generate detailed geometric representations effective for downstream tasks such as rendering and semantic analysis. Current limitations require the development of 3D representation methods that combine traditional geometric techniques with deep learning capabilities to generate robust geometry-aware deep learning models. The dissertation provides solutions to the fundamental challenges in 3D vision by developing geometric deep learning methods tailored for essential tasks such as camera pose estimation, point cloud registration, depth prediction, and 3D reconstruction. The integration of geometric priors or constraints, such as including depth information, surface normals, and equivariance into deep learning models, enhances both the accuracy and robustness of geometric representations. This study systematically investigates key components of 3D vision, including camera pose estimation, point cloud registration, depth estimation, and high-fidelity 3D reconstruction, demonstrating their effectiveness across real-world applications such as digital cultural heritage preservation and immersive VR/AR environments.

Text-to-Layout: A Generative Workflow for Drafting Architectural Floor Plans Using LLMs

arXiv:2509.00543v1 Announce Type: new Abstract: This paper presents the development of an AI-powered workflow that uses Large Language Models (LLMs) to assist in drafting schematic architectural floor plans from natural language prompts. The proposed system interprets textual input to automatically generate layout options including walls, doors, windows, and furniture arrangements. It combines prompt engineering, a furniture placement refinement algorithm, and Python scripting to produce spatially coherent draft plans compatible with design tools such as Autodesk Revit. A case study of a mid-sized residential layout demonstrates the approach's ability to generate functional and structured outputs with minimal manual effort. The workflow is designed for transparent replication, with all key prompt specifications documented to enable independent implementation by other researchers. In addition, the generated models preserve the full range of Revit-native parametric attributes required for direct integration into professional BIM processes.

Fantastic Pretraining Optimizers and Where to Find Them

arXiv:2509.02046v1 Announce Type: cross Abstract: AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer 1.4 to 2x speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B-1.2B parameters) and data-to-model ratios (1-8x the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size to only 1.1x for 1.2B parameter models. Thirdly, comparing intermediate checkpoints before reaching the target training budgets can be misleading, as rankings between two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all the fastest optimizers such as Muon and Soap, use matrices as preconditioners -- multiplying gradients with matrices rather than entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4x over AdamW for 0.1B parameter models to merely 1.1x for 1.2B parameter models.