Archives AI News

Automate app deployment and security analysis with new Gemini CLI extensions

Find and fix security vulnerabilities. Deploy your app to the cloud. All without leaving your command-line. Today, we’re closing the gap between your terminal and the cloud with a first look at the future of Gemini CLI, delivered through two…

September 10, 2025

The Best Google Pixel Phones of 2025, Tested and Reviewed: Which Model to Buy, Cases and Accessories, Feature Drops

Here’s a guide to all the models—plus Pixel case recommendations and smart software tricks to try.

September 10, 2025

Why Task-Based Evaluations Matter

This article is adapted from a lecture series I gave at Deeplearn 2025: From Prototype to Production: Evaluation Strategies for Agentic Applications. Task-based evaluations, which measure an AI system’s performance in use-case-specific, real-world settings, are underadopted and understudied. There is…

September 10, 2025

Here’s What to Know About Poland Shooting Down Russian Drones

On Wednesday morning, Poland shot down several Russian drones that entered its airspace—a first since Moscow’s invasion of Ukraine. The incident disrupted air travel and set the region on edge.

September 10, 2025

Presentation: The Data Backbone of LLM Systems

Drawing from his 8 years of experience in AI, Paul Iusztin breaks down the core components of a scalable architecture, emphasizing the importance of RAG. He shares practical patterns, including the Feature Training Inference architecture, and provides a detailed use…

September 10, 2025

13 Best Electrolyte Powders (2025): Tasty and Effective

Get those lost minerals back with the help of our top picks.

September 10, 2025

GeForce NOW | How Install-to-Play Works on GeForce NOW

September 10, 2025

Certainty-Guided Reasoning in Large Language Models: A Dynamic Thinking Budget Approach

arXiv:2509.07820v1 Announce Type: new Abstract: The rise of large reasoning language models (LRLMs) has unlocked new potential for solving complex tasks. These models operate with a thinking budget, that is, a predefined number of reasoning tokens used to arrive at a solution. We propose a novel approach, inspired by the generator/discriminator framework in generative adversarial networks, in which a critic model periodically probes its own reasoning to assess whether it has reached a confident conclusion. If not, reasoning continues until a target certainty threshold is met. This mechanism adaptively balances efficiency and reliability by allowing early termination when confidence is high, while encouraging further reasoning when uncertainty persists. Through experiments on the AIME2024 and AIME2025 datasets, we show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage. Importantly, extended multi-seed evaluations over 64 runs demonstrate that CGR is stable, reducing variance across seeds and improving exam-like performance under penalty-based grading. Additionally, our token savings analysis shows that CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency. Together, these findings highlight certainty as a powerful signal for reasoning sufficiency. By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient, paving the way for practical deployment in domains where both accuracy and computational cost matter.

September 10, 2025

HueManity: Probing Fine-Grained Visual Perception in MLLMs

arXiv:2506.03194v3 Announce Type: replace-cross Abstract: Multimodal Large Language Models (MLLMs) excel at high-level visual reasoning, but their performance on nuanced perceptual tasks remains surprisingly limited. We present HueManity, a benchmark designed to assess visual perception in MLLMs. The dataset comprises 83,850 images featuring two-character alphanumeric strings embedded in Ishihara test style dot patterns, challenging models on precise pattern recognition. Our evaluation of nine state-of-the-art MLLMs on HueManity demonstrates a significant performance deficit compared to human and traditional computer vision baselines. The best-performing MLLM achieved a 33.6% accuracy on the numeric `easy' task and a striking 3% on the alphanumeric `hard' task. In contrast, human participants achieved near-perfect scores (100% and 95.6%), and a fine-tuned ResNet50 model reached accuracies of 96.5% and 94.5%. These results highlight a critical gap in the visual capabilities of current MLLMs. Our analysis further explores potential architectural and training-paradigm factors contributing to this perceptual gap in MLLMs. We open-source HueManity dataset and code to foster further research in improving perceptual robustness of MLLMs.

September 10, 2025

Aligning LLMs for the Classroom with Knowledge-Based Retrieval — A Comparative RAG Study

arXiv:2509.07846v1 Announce Type: new Abstract: Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM's latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.

September 10, 2025