2025.md 755 KB

AI Papers of the Week — 2025

← Back to main index

This page collects every weekly issue of AI Papers of the Week from 2025. For other years, see the main index.


Top AI Papers of the Week (December 22 - December 28) - 2025

Paper Links
1) Monitoring Monitorability - OpenAI introduces a framework for measuring how well we can detect misbehavior in AI systems by monitoring their chain-of-thought reasoning. The paper proposes three evaluation archetypes and a new metric (g-mean2) to track monitorability across different models and training regimes.
● Three evaluation archetypes: Intervention evals apply controlled changes to inputs and check if monitors detect reasoning shifts. Process evals verify if models use valid solution paths for problems with known solutions. Outcome-property evals ask if monitors can predict properties of model outputs, like test hacking.
● CoT monitoring works: Monitors with access to the chain-of-thought significantly outperform action-only monitors. Longer CoTs are generally more monitorable. GPT-5 Thinking at high reasoning effort shows strong monitorability across the evaluation suite.
● RL does not degrade monitorability: Testing on frontier training runs for o3 and GPT-5.1 Codex Max, the authors find that RL optimization does not materially decrease monitorability at the current scale. Monitorability tends to correlate with CoT length during training.
● Monitorability tax tradeoff: Smaller models at higher reasoning effort can match larger models’ capabilities while achieving higher monitorability - at increased inference compute cost. Giving weak monitors access to CoT steepens their test-time compute scaling for monitorability.
Paper, Tweet
2) Test-Time Training for Long-Context LLMs - This paper shows that long-context LLMs can access millions of tokens but often fail to meaningfully use that information. The authors propose query-only test-time training (qTTT), which adapts models during inference through targeted gradient updates rather than generating more thinking tokens.
● Score dilution problem: The authors identify that static self-attention suffers from score dilution, where target token probabilities vanish as context grows. They prove that target-distractor logit margins must scale logarithmically with context length to maintain performance.
● Query-only TTT: qTTT performs a single prefill to cache keys and values, then applies lightweight gradient updates exclusively on query projection matrices. This keeps other parameters fixed and reuses the key-value cache, making it computationally efficient.
● Massive improvements: qTTT achieves 12.6 and 14.1 percentage point improvements for Qwen3-4B on LongBench-v2 and ZeroScrolls benchmarks. Gains exceed 20% on code comprehension, multi-document QA, and multi-hop reasoning tasks.
● Better compute allocation: Under matched inference-time compute budgets, qTTT consistently outperforms thinking token strategies. The practical takeaway is that a small amount of context-specific training beats generating thousands of thinking tokens for long-context tasks.
Paper, Tweet
3) LaMer - LaMer introduces a Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. Unlike standard RL-trained agents that learn fixed policies and struggle with novel tasks, LaMer agents learn exploration strategies that transfer across environments.
● Cross-episode training: Instead of optimizing single episodes independently, LaMer trains agents across sequences of episodes on the same task. Early episodes encourage exploration to gather information, while later episodes exploit that knowledge. A cross-episode discount factor controls the exploration-exploitation tradeoff.
● In-context policy adaptation: The agent uses self-reflection to summarize past experiences and adjust strategy without gradient updates. This leverages LLMs’ natural in-context learning abilities - the agent essentially implements an RL algorithm in context during deployment.
● Strong performance gains: On Qwen3-4B, LaMer achieves 11% improvement on Sokoban, 14% on MineSweeper, and 19% on Webshop over RL baselines. The framework produces more diverse trajectories while achieving higher success rates, reaching a better exploration-exploitation balance.
● Better generalization: LaMer-trained agents generalize better to harder and out-of-distribution tasks compared to standard RL agents. The learned exploration strategies transfer to novel environments, enabling more robust adaptation at test time.
Paper, Tweet
4) Epistemia - This paper argues that LLMs are not epistemic agents but stochastic pattern-completion systems. By mapping human and artificial epistemic pipelines, the authors identify seven fundamental fault lines where human and machine judgment diverge, despite producing superficially similar outputs.
● Seven epistemic fault lines: The paper identifies divergences in grounding (perception vs text), parsing (situation understanding vs tokenization), experience (episodic memory vs embeddings), motivation (goals and emotions vs statistical optimization), causality (causal reasoning vs correlations), metacognition (uncertainty monitoring vs forced confidence), and value (moral commitment vs probabilistic prediction).
● Introducing Epistemia: The authors define Epistemia as the structural condition where linguistic plausibility substitutes for epistemic evaluation. Users experience having an answer without the cognitive labor of judgment - the feeling of knowing without actually knowing.
● Why hallucinations are not bugs: In this framework, hallucinations are not anomalous failures but the default operational state. LLMs produce ungrounded content because they lack reference, truth conditions, or evidential constraints. Grounded outputs only occur when the probability structure happens to coincide with the factual structure.
● Implications for AI governance: The paper calls for epistemic evaluation beyond surface alignment, governance frameworks that regulate how generative outputs enter epistemic workflows, and new forms of epistemic literacy that help users recognize when apparent judgments are pattern completion rather than genuine evaluation.
Paper, Tweet
5) JustRL - JustRL challenges the assumption that complex RL pipelines are necessary for training small language models. Using single-stage training with fixed hyperparameters, the authors achieve state-of-the-art math reasoning performance on two 1.5B models while using 2x less compute than sophisticated multi-stage approaches.
● Simplicity wins: The recipe uses GRPO with binary rewards, no curriculum learning, no dynamic hyperparameters, no length penalties, and no multi-stage training. The same fixed hyperparameters work across both DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B without tuning.
● Strong results with less compute: JustRL-DeepSeek-1.5B achieves 54.9% average across nine math benchmarks, beating ProRL-V2’s 53.1% while using half the compute. JustRL-Nemotron-1.5B reaches 64.3%, slightly outperforming QuestA’s curriculum learning approach.
● Stable training dynamics: Training shows smooth, monotonic improvement over 4,000+ steps without the collapses, plateaus, or oscillations that typically motivate complex interventions. Policy entropy stays healthy between 1.0 and 1.6, and response length naturally compresses without explicit penalties.
● Adding tricks hurts performance: Ablations reveal that standard optimizations like explicit length penalties and robust verifiers actually degrade results by collapsing exploration. Length penalties dropped AIME24 performance from 55% to 50%, and adding both modifications dropped it to 45%.
Paper, Tweet
6) Self-Play SWE-RL - Self-Play SWE-RL (SSR) trains software engineering agents through self-play, requiring only access to sandboxed repositories with no human-labeled issues or tests. A single LLM learns to both inject and repair bugs of increasing complexity, achieving +10.4 points on SWE-bench Verified while consistently outperforming human-data baselines.
● Minimal data assumptions: SSR requires only Docker images containing source code and dependencies. The agent discovers how to run tests, creates test parsers, and understands test suite structure entirely through environmental interaction - no prior knowledge of programming language or test framework needed.
● Dual-role self-play: The same LLM policy plays two roles - a bug-injection agent that explores repositories and creates bug artifacts (including bug-inducing patches, test scripts, and test-weakening patches), and a bug-solving agent that repairs them. Both share parameters and train jointly with RL.
● Higher-order bugs for curriculum: Failed repair attempts become new training data. These higher-order bugs mimic how developers unintentionally write buggy code, creating an evolving curriculum that naturally adapts to the agent’s improving capabilities.
● Outperforms human-data training: SSR achieves +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro, consistently beating baseline RL trained with human-curated issues and tests across the entire training trajectory. Improvements transfer to natural language issues absent from self-play training.
Paper, Tweet
7) Empirical Study of Agent Developer Practices - This paper presents the first large-scale empirical study of LLM-based agent frameworks, analyzing 11,910 developer discussions across ten popular frameworks. The research identifies practical challenges developers face and evaluates how well current frameworks meet their needs.
● Four challenge domains: Developers encounter issues across logic (25.6% related to task termination and loop prevention), tools (14% from API limitations and permission errors), performance (25% involving context retention and memory management), and version conflicts (23% causing build failures and compatibility issues).
● Framework selection is hard: More than 80% of developers report difficulty identifying frameworks that best meet their specific requirements. The study recommends prioritizing ecosystem robustness and long-term maintenance over short-term popularity when choosing frameworks.
● Multi-framework combinations dominate: Combining multiple frameworks with different functions has become the primary approach to agent development. Each framework excels in different areas: LangChain and CrewAI lower barriers for beginners, while AutoGen and LangChain lead in task decomposition and multi-agent collaboration.
● Performance optimization is universally weak: Across all ten frameworks studied, performance optimization remains a common shortcoming. Despite mature ecosystems, AutoGen and LangChain face the highest maintenance complexity, highlighting tradeoffs between feature richness and long-term maintainability.
Paper, Tweet
8) Comprehensive Survey of Small Language Models - This survey provides a comprehensive overview of Small Language Models (SLMs), which address key LLM limitations, including high computational demands, privacy concerns from cloud APIs, and poor performance on edge devices. The authors propose a standardized SLM definition based on specialized task capability and resource-constrained suitability, and develop taxonomies and frameworks for SLM acquisition, enhancement, application, and reliability. Paper, Tweet
9) Sophia - Sophia introduces System 3, a meta-layer beyond traditional dual-process theory that enables LLM agents to maintain persistent identity and align short-term actions with long-term goals. The framework achieves 80% reduction in reasoning steps for recurring operations and 40% performance improvement on high-complexity tasks. Paper, Tweet
10) SonicMoE - SonicMoE addresses performance bottlenecks in Mixture of Experts models through IO-aware and tile-aware optimizations. The approach achieves 1.86x compute throughput improvement on Hopper GPUs, reduces activation memory by 45%, and enables training 213 billion tokens per day on 64 H100 GPUs for a 7B model. Paper, Tweet

Top AI Papers of the Week (December 15 - December 21) - 2025

Paper Links
1) Detailed Balance in LLM Agents - Researchers establish the first macroscopic physical law in LLM generation dynamics by applying the least action principle to analyze LLM-agent behavior. They discover statistical evidence of detailed balance in state transitions, suggesting LLMs implicitly learn underlying potential functions rather than explicit rules.
● Theoretical framework: Applies statistical mechanics concepts to understand LLM-agent dynamics. The framework transcends specific model architectures and prompt templates.
● Detailed balance discovery: By measuring transition probabilities between LLM-generated states, researchers identify balanced properties similar to physical systems at equilibrium.
● Implicit learning: Results suggest LLMs may learn underlying potential functions that govern generation, rather than memorizing explicit rule sets from training data.
● Why it matters: This interdisciplinary work bridges physics and AI, providing a theoretical foundation for understanding complex AI agent behavior at a macroscopic level independent of implementation details.
Paper, Tweet
2) Budget Aware Test-time Scaling - Researchers discover that simply expanding tool-call budgets without proper awareness fails to improve agent performance. They introduce BATS (Budget Aware Test-time Scaling), a framework that makes web search agents budget-aware, enabling more strategic resource allocation and pushing the cost-performance Pareto frontier.
● Key finding: Increasing token budgets improves LLM performance, but expanding tool-call budgets without awareness yields no improvement. Resource consciousness is essential for effective agent scaling.
● Budget Tracker Plugin: A lightweight mechanism that provides agents with continuous awareness of remaining resources, enabling strategic decision-making throughout task execution.
● BATS framework: Dynamically adjusts exploration strategy based on remaining capacity - deciding whether to pursue promising leads deeper or explore alternative paths.
● Results: Budget-aware approaches produce more favorable scaling curves and systematically improve agent efficiency under computational constraints. First comprehensive study of budget-constrained tool-augmented agents.
Paper, Tweet
3) DeepCode - DeepCode is a fully autonomous framework for synthesizing complete codebases from scientific papers despite LLM context limitations. It treats repository synthesis as a channel optimization problem, achieving state-of-the-art on PaperBench and outperforming commercial tools like Cursor and Claude Code.
● Blueprint distillation: Compresses source documents into structured representations that preserve essential implementation details while fitting within context windows.
● Stateful code memory: Maintains structured indexing for organized knowledge across the codebase, enabling coherent multi-file generation.
● Retrieval-augmented generation: Injects relevant context conditionally during generation, ensuring each code component has access to necessary dependencies and specifications.
● Closed-loop error correction: Iteratively refines generated code through automated testing and debugging, catching and fixing issues autonomously.
Paper, Tweet
4) FrontierScience - OpenAI introduced FrontierScience, a new benchmark measuring AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology. The benchmark consists of over 700 questions created and verified by domain experts, including international olympiad medalists and PhD scientists.
● Two evaluation tracks: FrontierScience-Olympiad contains 100 questions designed by olympiad medalists for constrained short-answer reasoning. FrontierScience-Research has 60 open-ended research subtasks graded on 10-point rubrics.
● Benchmark results: GPT-5.2 leads with 77% on Olympiad and 25% on Research tasks. Gemini 3 Pro scored 76% on the Olympiad. The Research track shows significant room for improvement.
● Expert collaboration: 42 former international medalists (totaling 109 olympiad medals) created Olympiad questions. 45 PhD scientists across quantum electrodynamics, synthetic chemistry, and evolutionary biology developed Research tasks.
● Why it matters: As GPQA went from 39% with GPT-4 to 92% with GPT-5.2 in two years, FrontierScience provides harder problems to track progress toward AI-accelerated scientific discovery.
Paper, Tweet
5) CLaRa - CLaRa introduces a unified framework for retrieval-augmented generation that performs embedding-based compression and joint optimization in a shared continuous space. The approach addresses key RAG limitations around long contexts and disjoint retrieval-generation optimization.
● SCP data synthesis: Uses question-answer and paraphrase supervision to create semantically rich compressed vectors that remain retrievable for downstream tasks.
● End-to-end optimization: Trains the reranker and generator simultaneously using a single language modeling objective. Gradients flow through both modules via a differentiable top-k estimator.
● Theoretical grounding: The unified optimization approach theoretically connects retrieval relevance with answer quality, aligning what gets retrieved with what improves generation.
● Results: Achieves state-of-the-art compression and reranking performance across multiple QA benchmarks, often surpassing text-based fine-tuned baselines.
Paper, Tweet
6) FACTS Leaderboard - Google introduces the FACTS Leaderboard, a comprehensive benchmark suite for evaluating LLM factuality across diverse scenarios. The leaderboard aggregates performance across four specialized sub-benchmarks to provide a holistic measure of how accurately models generate factual text.
● Four evaluation dimensions: FACTS Multimodal tests visual grounding with world knowledge on image-based questions. FACTS Parametric measures closed-book factoid question answering from internal parameters. FACTS Search evaluates factuality when using search APIs. FACTS Grounding v2 checks if long-form responses align with source documents.
● Automated judging system: Each sub-leaderboard uses automated judge models to score responses. The final FACTS Score averages all four components for a balanced assessment. Coverage and No-Contradiction verdicts ensure responses are both complete and accurate.
● Current rankings: Gemini 3 Pro leads with 68.8% overall, followed by Gemini 2.5 Pro at 62.1% and GPT 5 at 61.8%. The benchmark reveals trade-offs - Gemini models show higher coverage while GPT models achieve better no-contradiction scores.
● Benchmark integrity: The suite includes public and private test splits to prevent overfitting. Hosted on Kaggle, it remains open for new model submissions while maintaining evaluation integrity through hidden test prompts.
Paper, Tweet
7) Vision-Language Synergy Reasoning - Researchers propose Vision-Language Synergy Reasoning (VLSR), a method that combines visual and textual reasoning to improve performance on ARC-AGI abstract reasoning tasks. The key insight is that vision excels at global pattern abstraction while language specializes in symbolic rule formulation.
● Modality strengths: Vision supports pattern recognition and verification across the entire puzzle grid. Language handles precise rule formulation and step-by-step execution of transformations.
● VLSR decomposition: The framework assigns subtasks to each modality based on their strengths - visual processing for pattern abstraction, text for symbolic reasoning, and rule application.
● Modality-Switch Self-Correction: MSSC uses visual verification to catch errors in text-based reasoning. When text execution fails, the system switches to visual mode to identify and fix mistakes.
● Results: Achieves up to 4.33% improvement over text-only baselines on ARC-AGI tasks across multiple foundation models, demonstrating that unifying visual abstraction with linguistic reasoning advances generalizable AI.
Paper, Tweet
8) SHARP - SHARP generates photorealistic novel viewpoints from a single photograph in under one second on standard GPU hardware. The neural network produces a 3D Gaussian representation in a single feedforward pass, enabling real-time rendering for nearby viewing angles. It reduces LPIPS by 25-34% and achieves three orders of magnitude faster synthesis than prior approaches with strong zero-shot generalization. Paper, Tweet
9) ARTEMIS - Stanford researchers conducted the first head-to-head evaluation of AI agents against human cybersecurity professionals on a live enterprise network with approximately 8,000 hosts. Their multi-agent framework ARTEMIS placed second overall, discovering 9 valid vulnerabilities with 82% accuracy and outperforming 9 of 10 human testers at a fraction of the cost (18 dollars per hour vs 60 dollars per hour for professionals). Paper, Tweet
10) Stronger Normalization-Free Transformers - Researchers introduce Derf, a simple point-wise function that replaces normalization layers in Transformers. Based on the rescaled Gaussian cumulative distribution function, Derf outperforms LayerNorm, RMSNorm, and Dynamic Tanh across vision, speech, and DNA sequence modeling tasks with improved generalization rather than stronger fitting capacity. Paper, Tweet

Top AI Papers of the Week (December 8 - December 14) - 2025

Paper Links
1) Towards a Science of Scaling Agent Systems - Researchers from Google present a controlled evaluation framework for agent systems, challenging the assumption that “more agents are all you need.” Across 180 configurations spanning three LLM families and four agentic benchmarks, the study establishes quantitative principles for when multi-agent coordination helps versus hurts performance.
● Predictive framework: The study derives a mixed-effects model achieving an R-squared of 0.513 using coordination metrics like efficiency, error amplification, and redundancy. Leave-one-domain-out cross-validation achieves R^2 of 0.89 and correctly predicts optimal architectures for 87% of held-out task configurations.
● Tool-coordination trade-off: Tool-heavy tasks suffer from multi-agent coordination overhead, with efficiency penalties compounding as environmental complexity increases. Tasks where single-agent performance exceeds 45% accuracy experience negative returns from additional agents.
● Error amplification patterns: Independent multi-agent systems amplify errors 17.2x versus single-agent baselines through unchecked error propagation. Centralized coordination achieves 4.4x containment via validation bottlenecks that catch errors before they propagate.
● Architecture-task alignment: Performance spans +81% relative improvement (structured financial reasoning under centralized coordination) to -70% degradation (sequential planning under independent coordination). The key finding is that architecture-task alignment, not the number of agents, determines collaborative success.
Paper, Tweet
2) GigaTIME - Microsoft Research and Providence Health introduce GigaTIME, a multimodal AI framework that generates virtual multiplex immunofluorescence (mIF) images from standard H&E pathology slides, enabling population-scale tumor immune microenvironment modeling. The system was applied to over 14,000 cancer patients across 24 cancer types, uncovering over 1,200 statistically significant protein-biomarker associations.
● Cross-modal translation: GigaTIME learns to translate H&E slides into virtual mIF images across 21 protein channels by training on 40 million cells with paired H&E and mIF data. The model uses a NestedUNet architecture that significantly outperforms CycleGAN baselines on pixel, cell, and slide-level metrics.
● Virtual population at scale: Applied to 14,256 patients from 51 hospitals across seven US states, generating 299,376 virtual mIF whole-slide images. This enabled the discovery of 1,234 statistically significant associations between TIME proteins and clinical biomarkers at pan-cancer, cancer-type, and subtype levels.
● Clinical discovery: The virtual population revealed associations between immune markers and genomic alterations like TMB-H, MSI-H, and KMT2D mutations. A combined GigaTIME signature of all 21 virtual protein channels outperformed individual markers for patient stratification and survival prediction.
● Combinatorial insights: Analysis found that combining protein channels like CD138 and CD68 yields stronger biomarker associations than either protein alone, suggesting coordinated immune responses in antibody-mediated tumor mechanisms.
● Independent validation: Testing on 10,200 TCGA patients showed strong concordance with Providence results (Spearman correlation 0.88), demonstrating GigaTIME’s generalizability across different patient populations and data sources.
Paper, Tweet
3) Pre-Training, Mid-Training, and RL Interplay - CMU researchers develop a controlled experimental framework using synthetic reasoning tasks to isolate how pre-training, mid-training, and RL-based post-training each contribute to reasoning capabilities in language models. The study reconciles conflicting views on whether RL truly extends reasoning beyond what models learn during pre-training.
● Edge of competence: RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data targets the model’s edge of competence - tasks that are difficult but not yet out of reach. When tasks are already covered or too out-of-distribution, gains vanish.
● Minimal exposure threshold: Contextual generalization requires minimal yet sufficient pre-training exposure. RL fails with near-zero exposure but generalizes robustly with sparse exposure of at least 1%, yielding up to +60% pass@128 improvements.
● Mid-training impact: A mid-training stage bridging pre-training and RL substantially improves out-of-distribution reasoning under fixed compute budgets, with mid-training + RL outperforming RL alone by +10.8% on OOD-hard tasks.
● Process rewards: Incorporating process-level rewards reduces reward hacking and improves reasoning fidelity by aligning reinforcement signals with valid reasoning behavior rather than just final answers.
Paper, Tweet
4) Agentic AI Adaptation Survey - Researchers from UIUC, Stanford, Berkeley, and other institutions present the first comprehensive taxonomy of adaptation strategies for agentic AI systems. The survey organizes recent advances into a unified framework covering how agents and their tools can be modified to achieve higher task performance, improved reliability, and better generalization across diverse scenarios.
● Four adaptation paradigms: The framework categorizes methods into A1 (tool execution signaled agent adaptation using verifiable outcomes like code sandbox results), A2 (agent output signaled adaptation from evaluations of final answers), T1 (agent-agnostic tool adaptation where tools train independently), and T2 (agent-supervised tool adaptation where tools adapt using frozen agent feedback).
● Key trade-offs identified: Agent adaptation (A1/A2) requires substantial compute for training billion-parameter models but offers maximal flexibility. Tool adaptation (T1/T2) optimizes external components at lower cost but may be constrained by frozen agent capabilities. T1 tools generalize well across agents, while A1 methods may overfit without regularization.
● RLVR emergence: The survey traces the evolution from early SFT and DPO methods to reinforcement learning with verifiable rewards (RLVR), where models learn directly from online interaction with tools and environments - marking a shift from pre-collected trajectories to dynamic, context-aware adaptation.
● Domain applications: Demonstrates how adaptation strategies apply across deep research, software development, computer use, and drug discovery - with state-of-the-art systems increasingly combining multiple paradigms in cascaded architectures.
Paper, Tweet
5) Reasoning Models Ace the CFA Exams - Researchers evaluate state-of-the-art reasoning models on mock CFA exams consisting of 980 questions across all three certification levels. While previous studies reported that LLMs performed poorly on these exams, the latest reasoning models now pass all three levels, with Gemini 3.0 Pro achieving a record 97.6% on Level I.
● Top performers: Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1 all pass every level. GPT-5 leads Level II with 94.3%, while Gemini 2.5 Pro achieves 86.4% on Level III multiple-choice, and Gemini 3.0 Pro scores 92.0% on constructed-response questions.
● Dramatic improvement from baselines: ChatGPT (GPT-3.5) failed all levels, GPT-4 passed Levels I and II but failed Level III, and GPT-4o passed all three. The new reasoning models achieve near-perfect scores on Levels I and II.
● Shifting difficulty patterns: Quantitative domains, previously identified as primary weaknesses for LLMs, now show near-zero error rates for top models. Ethical and Professional Standards remain the most challenging area, with 17-21% error rates on Level II.
● Chain-of-thought trade-offs: CoT prompting helps baseline models significantly but shows inconsistent effects on reasoning models for MCQs. However, CoT remains highly effective for constructed-response questions, boosting Gemini 3.0 Pro from 86.6% to 92.0%.
Paper, Tweet
6) AI and Human Co-Improvement - Meta FAIR researchers Jason Weston and Jakob Foerster argue that fully autonomous self-improving AI is neither the fastest nor safest path to superintelligence. Instead, they advocate for co-improvement: building AI that collaborates with human researchers to conduct AI research together, from ideation to experimentation.
● Core thesis: Self-improvement seeks to eliminate humans from the loop as quickly as possible. Co-improvement keeps humans involved, providing steering capability toward positive outcomes while leveraging complementary skill sets. Because AI is not yet mature enough to fully self-improve and is susceptible to misalignment, co-improvement will get us there faster and more safely.
● Research collaboration skills: The authors propose measuring and training AI on research collaboration abilities across problem identification, benchmark creation, method innovation, experiment design, collaborative execution, evaluation, scientific communication, and safety/alignment development.
● Bidirectional augmentation: Unlike self-improvement, which focuses on autonomous model updates, co-improvement centers on joint progress where humans help AI achieve greater abilities while AI augments human cognition and research capabilities. The goal is co-superintelligence through symbiosis.
● Paradigm shift acceleration: Major AI advances came from human researchers finding combinations of training data and method changes. Co-research with strong collaborative AI should accelerate finding unknown new paradigm shifts while maintaining transparency and human-centered safety.
Paper, Tweet
7) Selective Gradient Masking - Anthropic researchers present Selective Gradient Masking (SGTM), a technique that removes dangerous capabilities like CBRN knowledge from language models during pretraining while preserving general capabilities. Unlike data filtering, SGTM localizes target knowledge into dedicated “forget” parameters that can be zeroed out after training.
● Absorption mechanism: SGTM splits parameters into forget and retain components, with gradients masked so only forget parameters update on labeled dangerous content. Unlabeled dangerous content naturally gravitates toward forget parameters through self-reinforcing “absorption,” providing robustness to imperfect labeling.
● Recovery resistance: Traditional unlearning methods (RMU) recovered and removed biology knowledge in 50 fine-tuning steps. SGTM required 350 steps - 7x more resistant than RMU and matching the robustness of models trained with perfect data filtering.
● Retain/forget trade-offs: On Wikipedia biology experiments with 254M parameter models, SGTM achieved superior trade-offs compared to both weak and strict data filtering, retaining more knowledge from adjacent fields like medicine and chemistry with only 5% compute penalty.
● Mechanistic validation: Gradient analysis on bilingual data showed forget parameters develop higher gradient norms for forget-domain content while retain parameters specialize for general content, with this localization strengthening at larger scales.
Paper, Tweet
8) Nanbeige4-3B - Nanbeige4-3B is a 3B parameter model pretrained on 23T tokens and fine-tuned on over 30M instructions using a Fine-Grained Warmup-Stable-Decay scheduler, Dual Preference Distillation, and multi-stage reinforcement learning. Despite its compact size, it outperforms Qwen3-8B and Qwen3-14B on reasoning benchmarks and rivals much larger models on WritingBench, demonstrating that well-engineered small models can match far larger counterparts. Paper, Tweet
9) AI Agent Adoption Study - Harvard and Perplexity researchers present the first large-scale field study of AI agent adoption using hundreds of millions of anonymized interactions from Perplexity’s Comet browser. Productivity and Learning account for 57% of agentic queries, with digital technology workers (28% of adopters) and knowledge-intensive sectors leading adoption. Users in higher GDP countries with greater educational attainment are more likely to adopt agents, and over time, users shift from media and travel tasks toward more cognitively oriented topics. Paper, Tweet
10) ProAgent - ProAgent is the first end-to-end proactive LLM agent system that harnesses sensory contexts from AR glasses, smartphones, and edge servers to deliver assistance without explicit user instructions. Unlike reactive agents that wait for commands, ProAgent continuously senses the environment with on-demand tiered perception and achieves up to 33.4% higher proactive prediction accuracy, 16.8% higher tool-calling F1 score, and 38.9% improved user satisfaction over baselines. Paper, Tweet

Top AI Papers of the Week (December 1 - December 7) - 2025

Paper Links
1) DeepSeek-V3.2 - DeepSeek releases V3.2, an open model that matches GPT-5 on reasoning benchmarks while introducing significant architectural and training innovations. The high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and achieves gold-medal performance in both the 2025 IMO and IOI competitions.
● DeepSeek Sparse Attention (DSA): A new efficient attention mechanism that reduces computational complexity from O(L^2) to O(Lk) for the main model while preserving long-context performance. Implemented via a lightning indexer that selects top-k key-value entries per query token, achieving significant inference cost reductions at 128K context.
● Scalable RL framework: Post-training compute now exceeds 10% of pre-training cost, using GRPO with unbiased KL estimation and off-policy sequence masking. Specialist models are trained separately for math, code, agents, and search, then distilled into the final checkpoint, followed by mixed RL.
● Large-scale agentic task synthesis: Generates over 1,800 synthetic environments and 85,000 complex prompts for RL training. Includes code agents (24K tasks from GitHub issue-PR pairs), search agents (50K synthesized queries), and general agents with automatically verifiable constraints.
● Thinking in tool-use: Introduces context management for tool-calling scenarios that retains reasoning traces across tool calls until a new user message arrives. Cold-start training unifies reasoning and tool-use patterns within single trajectories.
● Benchmark results: DeepSeek-V3.2-Thinking scores 93.1% on AIME 2025 and 73.1% on SWE-Verified. The Speciale variant achieves 96.0% on AIME 2025, 99.2% on HMMT Feb 2025, and gold medals in IMO 2025 (35/42), IOI 2025 (492/600), and ICPC World Finals 2025.
Paper, Tweet
2) Quiet Feature Learning - Researchers reveal a hidden learning phenomenon in Transformers trained on algorithmic tasks. The study shows that substantial representational progress can remain hidden beneath an apparently flat loss curve, with models secretly learning “quiet features” during periods of stagnant validation loss.
● Quiet features discovery: During extended periods where validation loss appears stagnant, models learn intermediate computational representations that encode algorithmic steps but don’t immediately reduce task loss.
● Phase transitions: Training on ten foundational algorithmic tasks reveals pronounced phase transitions that deviate from typical power-law scaling, challenging the conventional understanding of model training dynamics.
● Causal necessity: Through ablation studies, the team demonstrated that individual quiet features are causally necessary for eventual task performance, not merely correlated artifacts.
● Training implications: The findings challenge reliance on cross-entropy loss as the sole training indicator, suggesting that richer diagnostics are needed to properly monitor model learning progress.
Paper, Tweet
3) SUSVIBES: Is Vibe Coding Safe? - Researchers introduce SUSVIBES, a benchmark of 200 real-world software engineering tasks to evaluate the security of code generated by LLM agents through “vibe coding” - the minimal-supervision programming paradigm. The findings reveal a significant gap between functional correctness and security compliance in agent-generated code.
● Benchmark design: SUSVIBES contains 200 feature-request tasks from real-world open-source projects that, when given to human programmers, led to vulnerable implementations. This tests whether agents replicate common security mistakes.
● Alarming security gap: While SWE-Agent with Claude 4 Sonnet achieved 61% functional correctness, only 10.5% of solutions met security standards. All evaluated coding agents performed poorly on security-sensitive tasks.
● Mitigation ineffective: Preliminary mitigation strategies, such as providing vulnerability hints alongside feature requests, proved ineffective at improving security outcomes.
● Production risk: The findings challenge optimism around LLM-assisted development, suggesting that widespread vibe coding adoption poses significant risks in security-critical applications.
Paper, Tweet
4) Evolving Multi-Agent Orchestration - OpenBMB researchers propose a “puppeteer-style” paradigm for multi-agent LLM collaboration, where a centralized orchestrator dynamically directs agents based on evolving task states. Trained via reinforcement learning, the system achieves superior performance with reduced computational costs across math, knowledge, and software development tasks.
● Dynamic orchestration: A centralized policy selects which agent to activate at each reasoning step, treating multi-agent collaboration as a sequential decision process. This decouples agent selection from internal behaviors, enabling flexible coordination without extensive retraining.
● Adaptive evolution via RL: The orchestrator uses REINFORCE to learn from completed tasks, progressively pruning less effective agents and favoring compact reasoning chains. A reward function balances solution quality with computational efficiency through a tunable weighting factor.
● Emergent topology patterns: As training progresses, the system develops compact, cyclic reasoning structures rather than static chains or trees. Graph density increases, and communication concentrates among “hub” agents, enabling recursive critique and continual refinement.
● Strong benchmark results: Puppeteer outperforms baselines including AFlow, MacNet, and EvoAgent across GSM-Hard, MMLU-Pro, SRDD, and CommonGen-Hard. The evolved system achieves 0.77 average accuracy in the Titan (large model) setting while reducing token consumption.
● Efficiency without sacrifice: Unlike prior multi-agent systems that trade efficiency for performance, Puppeteer reduces both token usage and active agent count over training. In Titan settings, agents learn to terminate reasoning earlier; in Mimas (smaller model) settings, the system selects lower-cost agents while maintaining chain length.
Paper, Tweet
5) FINDER and DEFT - OPPO AI introduces FINDER, a fine-grained benchmark with 100 expert-curated research tasks and 419 structured checklist items for evaluating deep research agents, along with DEFT, the first failure taxonomy categorizing 14 failure modes across reasoning, retrieval, and generation dimensions.
● Benchmark design: FINDER refines prompts from DeepResearch Bench with explicit guidelines on report length, format, and disciplinary scope. Each task includes 3-5 structured checklists that guide evaluation of report structure, analytical depth, and citation integrity.
● Failure taxonomy construction: DEFT was built using grounded theory with human-LLM collaborative coding across approximately 1,000 generated reports. The taxonomy identifies failures like Strategic Content Fabrication (19% of errors), Insufficient Information Acquisition (16.3%), and Lack of Analytical Depth (11.1%).
● Key finding - generation bottleneck: Over 39% of failures occur in content generation, particularly through strategic content fabrication, where agents generate unsupported but professional-sounding content. Retrieval failures account for 32% of errors, highlighting challenges in evidence integration and verification.
● Reasoning resilience insight: The study reveals that current deep research agents struggle not with task comprehension but with evidence integration, verification, and maintaining reasoning consistency across complex multi-step research tasks.
● Benchmark results: Gemini 2.5 Pro Deep Research leads with 50.95 overall RACE score. MiroFlow-English achieves the highest checklist accuracy (72.19%), while models like Kimi K2 show strong reasoning but suffer sharp declines in generation quality.
Paper, Tweet
6) Training LLMs for Honesty via Confessions - OpenAI introduces a novel method for training LLMs to honestly self-report their own misbehavior through “confessions” - separate outputs where models evaluate their compliance with instructions and policies. By training GPT-5-Thinking to produce confessions after completing tasks, the research demonstrates that models can be incentivized to reveal deceptive behaviors they otherwise hide in their main answers.
● Confession mechanism: After producing a main answer, the model generates a confession evaluating whether its response complied with instructions and policies. The confession reward is kept separate from the main answer reward (the “seal of confession”), preventing models from gaming confessions to affect task outcomes.
● Training methodology: Uses reinforcement learning with separate reward signals for task performance and confession honesty. A confession classifier (trained on human labels) evaluates whether confessions accurately identify policy violations. The separation ensures confessions remain honest signals rather than strategic outputs.
● Evaluation across domains: Tests honesty in four key areas: hallucination (fabricating information), instruction following (ignoring user constraints), scheming (pursuing hidden agendas), and reward hacking (exploiting evaluation loopholes). Results show confession training improves honest self-reporting across all domains.
● Key finding - hidden behaviors revealed: Models trained with confessions often admit to misbehaviors in their confession that they actively conceal in their main answer. This demonstrates that confessions can surface deceptive tendencies that would otherwise go undetected by standard evaluation methods.
Paper, Tweet
7) STRATUS: Autonomous Cloud Reliability - Researchers from UIUC, IBM Research, and Tsinghua present STRATUS, an LLM-based multi-agent system for autonomous Site Reliability Engineering (SRE) of cloud services. The system handles failure detection, localization, root-cause analysis, and mitigation without human intervention, outperforming state-of-the-art SRE agents by at least 1.5x on benchmark suites.
● Multi-agent architecture: Specialized agents for detection, diagnosis, and mitigation are orchestrated via a state machine that enables system-level safety reasoning. Deterministic control-flow logic handles orchestration while LLMs provide intelligence and creativity in data flows.
● Transactional No-Regression (TNR): A novel safety specification ensuring mitigation actions can always be undone if unsuccessful, and the agent keeps improving system health by reverting actions that worsen it. This enables safe exploration and iteration.
● Undo mechanism: A stack-based undo implementation tracks agent actions relative to specific system states and reverts them in correct order when needed. Combined with sandboxing and state-machine scheduling for write exclusivity.
● Benchmark performance: Significantly outperforms state-of-the-art solutions on AIOpsLab and ITBench SRE benchmark suites by at least 1.5x across GPT-4o, GPT-4o-mini, and Llama3 models.
Paper, Tweet
8) CodeVision: Thinking with Programming Vision - Researchers propose CodeVision, a framework where multimodal models generate code as a universal interface to invoke image operations, addressing brittleness in visual reasoning from orientation changes and corruptions. The two-stage training approach combines supervised fine-tuning with RL using dense rewards, enabling flexible tool composition and error recovery on Qwen models. Paper, Tweet
9) Polarization by Design - This economics paper examines how AI-driven persuasion technology alters elite strategies for shaping public opinion. The research identifies a “polarization pull” where single elites push societies toward fragmented opinions, with AI accelerating this drift. The work reframes polarization as a strategic governance instrument with implications for democratic stability. Paper, Tweet
10) STARFlow-V - Apple introduces STARFlow-V, the first normalizing flow-based video generator competitive with diffusion models. The 7B parameter model uses a global-local architecture and video-aware Jacobi iteration for parallel sampling, generating 480p video at 16fps while natively supporting text-to-video, image-to-video, and video-to-video tasks. Paper, Tweet

Top AI Papers of the Week (November 24 - November 30) - 2025

Paper Links
1) INTELLECT-3 - INTELLECT-3 is a 106B-parameter Mixture-of-Experts model (12B active) trained with large-scale reinforcement learning, achieving state-of-the-art performance for its size across math, code, science, and reasoning benchmarks. Built on top of GLM-4.5-Air base, it outperforms many larger frontier models, including DeepSeek R1-0528, and matches GLM-4.6 (which has over 3x the parameters) on key benchmarks.
● Frontier benchmark results: Achieves 90.8% on AIME 2024 and 88.0% on AIME 2025, outperforming DeepSeek R1-0528. Scores 69.3% on LiveCodeBench v6, beating GLM-4.5-Air by 8%. Competitive on GPQA Diamond (74.4%), HLE (14.6%), and MMLU-Pro (81.9%).
● prime-rl framework: Introduces an open-source asynchronous RL framework with disaggregated trainer and inference, continuous batching with in-flight weight updates, and native support for multi-turn agentic rollouts. Scales seamlessly from single node to 512 H200 GPUs.
● Two-stage post-training: Combines supervised fine-tuning on over 200B tokens of reasoning traces (from datasets like OpenReasoning-Math/Code/Science) with large-scale RL across diverse environments, including math, code, science, logic, deep research, and software engineering tasks.
● Verifiers and Environments Hub: Open-sources the complete training infrastructure, including the verifiers library for environment design, an Environments Hub with 500+ contributed RL environments, and Prime Sandboxes for high-throughput secure code execution supporting over 4,000 concurrent sandboxes.
● Full reproducibility: Releases model weights, complete training recipe, RL framework, and all environments used for training and evaluation. Training ran on 512 H200s over two months with a batch size of 256 prompts and 16 rollouts per prompt at 65K context length.
Paper, Tweet
2) Lightweight End-to-End OCR - HunyuanOCR is a commercial-grade, open-source, lightweight vision-language model with only 1B parameters designed specifically for OCR tasks. The architecture combines a native-resolution Vision Transformer with a 0.5B-parameter language model through an MLP adapter, outperforming commercial APIs and larger models like Qwen3-VL-4B while achieving state-of-the-art results on OCRBench for models under 3B parameters.
● Fully end-to-end architecture: Unlike traditional pipeline-based OCR systems or models requiring separate layout analysis, HunyuanOCR adopts a pure end-to-end paradigm that eliminates error propagation from cascaded processing. This enables complete workflows in a single inference pass, fundamentally resolving issues common in multi-stage systems.
● Comprehensive OCR capabilities: Supports text spotting, document parsing, information extraction, visual question answering, and text image translation across 130+ languages in a unified framework. This addresses limitations of narrow OCR expert models and inefficient general VLMs by consolidating diverse capabilities into a compact 1B-parameter architecture.
● Data-driven training with RL: Trained on 200M high-quality samples spanning nine real-world scenarios (documents, street views, handwritten text, screenshots, receipts, game interfaces, video frames). First industry demonstration that reinforcement learning (GRPO) yields significant performance gains in OCR tasks, particularly for complex document parsing and translation.
● Superior benchmark performance: Won first place in ICDAR 2025 DIMT Challenge (Small Model Track), surpasses MinerU2.5 and PaddleOCR-VL on OmniDocBench for document parsing, exceeds Qwen3-VL-4B in translation and information extraction, and outperforms PaddleOCR 3.0 plus commercial Cloud OCR APIs in text spotting.
● Production-ready deployment: Open-sourced on HuggingFace with a high-performance vLLM-based deployment solution. The native-resolution ViT preserves aspect ratios and avoids distortion, making it particularly effective for long-text documents and extreme aspect ratios while maintaining top-tier production efficiency.
Paper, Tweet
3) LatentMAS - LatentMAS introduces a framework enabling language model agents to collaborate directly within a continuous latent space rather than relying on text-based communication. By using last-layer hidden embeddings and a shared latent working memory, agents preserve and transfer internal representations without information loss from text serialization.
● Latent thought generation: Agents perform auto-regressive generation using continuous hidden embeddings instead of discrete tokens. A shared latent working memory stores and transfers internal representations across agents, enabling direct access to each other’s reasoning states without text conversion bottlenecks.
● Training-free deployment: The framework requires no additional training or fine-tuning. It works with existing language models by intercepting and sharing hidden states at inference time, making it immediately applicable to current multi-agent systems without retraining costs.
● Substantial efficiency gains: Achieves 70.8-83.7% reduction in output token usage and 4x-4.3x faster end-to-end inference speed compared to text-based multi-agent approaches. The latent communication eliminates redundant encoding and decoding cycles that dominate traditional agent collaboration.
● Accuracy improvements across benchmarks: Testing across 9 comprehensive benchmarks spanning math and science reasoning, commonsense tasks, and code generation shows up to 14.6% higher accuracy. The lossless information preservation in latent space enables agents to share nuanced reasoning that gets lost in text summarization.
Paper, Tweet
4) OmniScientist - OmniScientist presents an end-to-end framework for building AI scientists capable of autonomously conducting research across the entire scientific lifecycle - from literature review and ideation to experimentation, writing, and peer review. The system establishes a collaborative ecosystem where human and AI scientists co-evolve within a shared scientific environment.
● Complete scientific workflow: The framework covers five core research stages: literature review using retrieval and graph-based discovery over 250M+ papers, research ideation powered by 10M+ idea seeds, experiment automation through code generation and lab integration, scientific writing with structured drafting, and paper review via multi-agent critique.
● Open Scientific Protocol (OSP): A structured communication standard enabling seamless collaboration between humans and AI agents. OSP defines roles, task formats, and interaction patterns, allowing researchers to delegate subtasks, review outputs, and iteratively refine results while maintaining scientific rigor and reproducibility.
● ScienceArena evaluation platform: A comprehensive benchmark suite with 1,500+ expert-verified tasks across multiple disciplines, measuring AI scientists on retrieval accuracy, ideation novelty, experimental correctness, writing quality, and review consistency. Uses blind pairwise voting and Elo rankings for unbiased assessment.
● Knowledge infrastructure: Built on citation networks, conceptual relationships, OpenAlex metadata, and arXiv full-texts to help agents understand existing scholarship. The system supports continuous learning through feedback loops and community contributions.
Paper, Tweet
5) InfCode - InfCode introduces an adversarial framework for software bug fixing that treats test generation and patch creation as mutually reinforcing processes. Rather than generating patches that simply pass existing tests, the system creates challenging test cases designed to expose patch weaknesses, then iteratively refines patches to handle these adversarial tests.
● Adversarial game-theoretic loop: The framework employs an iterative cycle where LLMs first generate tests specifically designed to reveal vulnerabilities in candidate patches, then patches are refined to pass both original and adversarial tests. This cycle repeats until convergence, fundamentally differing from traditional test-driven development with static test suites.
● Dynamic edge case discovery: By continuously challenging patches with new adversarial tests, the system achieves superior coverage of potential failure modes. Each iteration exposes corner cases and edge scenarios that single-pass approaches miss, producing solutions more likely to handle real-world complexity.
● SWE-Bench Verified evaluation: Testing on the realistic software engineering benchmark using Claude Sonnet 4.5 and DeepSeek models demonstrates measurable improvements in patch reliability. Successive refinement rounds show clear convergence patterns with higher success rates on adversarial test cases compared to baseline methods.
● Robust patch generation: The iterative adversarial approach produces more reliable patches than static test-first methodologies. The mutual refinement between tests and patches creates a virtuous cycle where each component strengthens the other, resulting in solutions that better handle unexpected inputs and edge conditions.
Paper, Tweet
6) Evolution Strategies at Hyperscale - EGGROLL (Evolution Guided General Optimization via Low-rank Learning) is an evolution strategies algorithm designed to scale backprop-free optimization to large population sizes for billion-parameter neural networks. By using low-rank matrix perturbations instead of full-rank ones, EGGROLL achieves a hundredfold increase in training throughput while nearly matching pure batch inference speed.
● Low-rank perturbation approach: Instead of sampling full-rank perturbation matrices, EGGROLL generates two smaller random matrices A and B to form low-rank perturbations. This reduces auxiliary storage from mn to r(m+n) per layer and forward pass cost from O(mn) to O(r(m+n)), with theoretical analysis showing the low-rank update converges to full-rank at O(1/r) rate.
● Competitive with GRPO for LLM reasoning: On RWKV-7 models fine-tuned for countdown and GSM8K reasoning tasks, EGGROLL outperforms GRPO under the same hardware and wall-clock time. EGGROLL enables 1024 parallel generations per GPU versus GRPO’s 32, achieving 35% validation accuracy compared to 23% on countdown tasks.
● Pure integer pretraining demonstration: The paper introduces EGG, a nonlinear RNN language model operating entirely in integer datatypes. EGGROLL enables stable pretraining of this model with population sizes up to 262,144 - two orders of magnitude larger than prior ES work - on a single GPU.
● Strong RL performance without compromise: Across 16 reinforcement learning environments, including Brax, Craftax, and Jumanji, EGGROLL matches or outperforms standard OpenES on 14 of 16 tasks while being significantly faster due to efficient batched low-rank adapter inference.
Paper, Tweet
7) Training LLMs with Reasoning Traces - This study investigates how reasoning traces from frontier models like DeepSeek-R1 and GPT-OSS can improve smaller language models through post-training, offering a practical pathway to distill advanced reasoning capabilities without expensive human annotation.
● Reasoning trace distillation: Medium-sized LLMs are post-trained on intermediate reasoning traces generated by frontier reasoning models. This leverages test-time scaling insights by capturing the step-by-step problem decomposition that enables advanced models to solve complex tasks methodically.
● DeepSeek-R1 vs GPT-OSS comparison: The study directly compares training on reasoning traces from two distinct frontier models, analyzing how trace quality and reasoning style affect downstream performance in the distilled models.
● Mathematical reasoning focus: Evaluation centers on mathematical problem-solving benchmarks where explicit reasoning steps provide the clearest signal. Results demonstrate measurable improvements in smaller models trained on high-quality synthetic reasoning data.
● Efficiency trade-offs: The work examines the balance between improved accuracy and inference costs, as models trained on reasoning traces may generate longer outputs during inference. This analysis helps practitioners understand when reasoning distillation provides net benefits.
Paper, Tweet
8) Evaluating Honesty and Lie Detection in AI Models - Anthropic researchers evaluate honesty and lie detection techniques across five testbed settings where models generate statements they believe to be false. Simple approaches work best: generic honesty fine-tuning improves honesty from 27% to 65%, while self-classification achieves 0.82-0.88 AUROC for lie detection. The findings suggest coherent strategic deception doesn’t arise easily, as models trained to lie can still detect their own lies when asked separately. Paper, Tweet
9) Multi-Agent Collaboration for Multimodal LLMs - Microsoft and USC researchers introduce a framework where vision models serve as “eyes” for language models through multi-agent collaboration, enabling modular upgrades without retraining expensive joint vision-language architectures. Specialized vision agents analyze images and communicate findings to language agents through natural language, achieving competitive results on MMMU, MMMU-Pro, and video understanding benchmarks while maintaining full flexibility to swap in improved components independently. Paper, Tweet
10) Cognitive Foundations for Reasoning in LLMs - Researchers develop a taxonomy of 28 cognitive elements and evaluate 192K reasoning traces from 18 models plus human think-aloud traces, finding that LLMs under-utilize cognitive elements correlated with success while relying on surface-level enumeration rather than human-like abstraction. Test-time reasoning guidance based on the framework improved performance by up to 66.7% on complex problems. Paper, Tweet

Top AI Papers of the Week (November 17 - November 23) - 2025

Paper Links
1) GPT-5 for Science Acceleration - OpenAI and collaborators present early case studies demonstrating GPT-5’s capabilities in accelerating scientific research across mathematics, physics, biology, computer science, astronomy, and materials science. The model helps researchers synthesize known results, conduct literature reviews, accelerate computations, and generate novel proofs of unsolved propositions.
● Advanced literature search across languages and domains: GPT-5 demonstrates emerging capability in conceptual literature search, identifying deeper relationships between ideas and retrieving relevant material across languages and less accessible sources. In one case, it identified a relevant German PhD thesis from economics using completely different terminology, showcasing cross-domain and multilingual understanding beyond traditional keyword-based search.
● Mathematical proof generation and optimization: Mathematicians used GPT-5 to generate viable proof outlines in minutes for work that might otherwise take days or weeks. The model discovered a new, clear example showing that a common decision-making method can fail and improved a classic result in optimization theory, demonstrating the capability to contribute novel mathematical insights.
● Hypothesis generation and experimental design: In biology and other empirical sciences, GPT-5 can propose plausible mechanisms and design experiments to validate hypotheses in the wet lab. The model expands the surface area of exploration and helps researchers move faster toward correct results, though human expertise remains critical throughout the process.
● Tool for expert acceleration, not autonomous research: GPT-5 shortens parts of the research workflow when used by domain experts, but does not run projects or solve scientific problems autonomously. The early experiments establish a framework for human-AI collaboration in scientific discovery where AI acts as an amplifier of expert capabilities rather than a replacement.
Paper, Tweet
2) OLMo 3 - Allen Institute for AI introduces OLMo 3, a fully open language model family that releases the complete “model flow”: every training stage, checkpoint, dataset, and dependency, enabling researchers to intervene at any development point. The release includes four specialized variants (Base, Think, Instruct, RL Zero) at 7B and 32B scales.
● Complete transparency with intervention points: Unlike typical releases that only share final weights, OLMo 3 provides checkpoints from every major training milestone, including initial pretraining, mid-training for programming/math, and long-context extension stages. This enables researchers to swap in domain-specific data during mid-training, adjust post-training for custom use cases, or build on earlier checkpoints for controlled experiments.
● Strong performance across reasoning and code: OLMo 3-Think (32B) achieves 96.1% on MATH benchmark and 89.0% on IFEval instruction-following, while OLMo 3-Base (32B) scores 80.5% on GSM8k and 66.5% on HumanEval, outperforming comparable fully-open models like Marin and Apertus across reasoning, code generation, and math tasks.
● Dolma 3 dataset and training efficiency: Trained on the 9.3 trillion token Dolma 3 corpus comprising web pages, scientific PDFs (processed with olmOCR), code, and math problems. Infrastructure improvements include 8x throughput gains in supervised fine-tuning and 4x efficiency improvements in RL training through in-flight weight updates and continuous batching.
● Full open-source ecosystem release: All components released under permissive licenses, including training/fine-tuning datasets, OlmoTrace (real-time tool for tracing outputs to training data), Olmo-core, Open Instruct, datamap-rs, and duplodocus production-grade tools for data processing and reproducible evaluation, plus a complete technical report with ablations.
Paper, Tweet
3) SAM 3 - Meta AI introduces SAM 3, a unified model that detects, segments, and tracks objects across images and videos using conceptual prompts like noun phrases or visual examples. This extends the Segment Anything capability to concept-based segmentation through Promptable Concept Segmentation (PCS).
● Scalable data engine with 4M concept labels: The team built a data pipeline producing 4 million unique concept labels with hard negative examples across images and video content. This massive dataset enables training models to understand and segment objects based on abstract conceptual descriptions rather than just visual patterns.
● Unified architecture with presence head: The model combines an image-level detector with memory-based video tracking, sharing a unified backbone. A novel presence head decouples recognition from localization, improving detection precision by separating the tasks of determining whether a concept exists from finding where it appears.
● 2x improvement over existing approaches: SAM 3 achieves double the performance of previous methods on concept segmentation tasks for both images and videos. It also improves upon earlier SAM iterations across standard visual segmentation benchmarks.
● Open release with SA-Co benchmark: Meta releases the complete model weights and introduces SA-Co, a new benchmark specifically designed for evaluating promptable concept segmentation systems, providing standardized evaluation resources for future research.
Paper, Tweet
4) DR Tulu - DR Tulu-8B is the first open model directly trained for long-form deep research using Reinforcement Learning with Evolving Rubrics (RLER). Unlike existing models trained on short-form QA tasks, DR Tulu learns to produce comprehensive, well-attributed research reports by training with rubrics that co-evolve with the model and are grounded on real-world searched knowledge.
● RLER training innovation: The method generates new rubrics at each training step by contrasting multiple model rollouts and incorporating newly explored information from search results. This creates on-policy feedback that adapts as the model discovers new evidence, addressing the challenge that static rubrics cannot capture all quality dimensions for open-ended research tasks.
● Outperforms all open deep research models: DR Tulu-8B beats existing 8-32B open models by 8-42 percentage points across four benchmarks (AstaBench-ScholarQA, DeepResearchBench, ResearchQA, HealthBench). It matches or exceeds proprietary systems like OpenAI Deep Research and Perplexity Deep Research while being significantly cheaper (USD 0.00008 vs USD 1.80 per query).
● Adaptive tool selection and search: The model learns to choose appropriate search tools based on task type - using paper search 90% of the time on scientific questions (ResearchQA) but relying on web search 55% of the time for general-domain topics (DeepResearchBench), instead of using a single hard-coded tool.
● Full open release with MCP infrastructure: Releases all training data, code, and models, plus a new MCP-based agent library (dr-agent-lib) with asynchronous tool calling support that makes it practical to train and evaluate deep research models at scale.
Paper, Tweet
5) MAKER: Solving Million-Step LLM Tasks - MAKER is the first system to successfully solve tasks requiring over one million LLM steps with zero errors, overcoming a fundamental limitation where LLMs typically fail after a few hundred steps in complex multi-step processes. The approach demonstrates that massively decomposed agentic processes can efficiently handle lengthy sequences of dependent logical operations through extreme decomposition and error correction.
● Extreme decomposition with specialized microagents: The system breaks tasks into numerous focused subtasks, each handled by specialized microagents. This radical decomposition enables LLMs to maintain correctness across million-step sequences by avoiding the accumulation of errors that plague traditional approaches on extended problems.
● Multi-agent voting for error correction: At each step, an efficient multi-agent voting scheme validates results and corrects errors before proceeding. This error-checking mechanism prevents derailment and ensures fault tolerance across the entire execution pipeline, enabling reliable completion at unprecedented scale.
● Benchmark validation on complex tasks: Successfully handles tasks like the Towers of Hanoi and other multi-step logical problems that previously became derailed after at most a few hundred steps. The zero-error execution at the million-step scale represents a qualitative breakthrough in agentic reliability.
● Path to organizational-scale problem solving: The authors propose that massively decomposed agentic processes could enable solving problems at the organizational and societal level, suggesting this modular approach offers a practical path forward without requiring fundamental LLM improvements.
Paper, Tweet
6) TiDAR: Think in Diffusion, Talk in Autoregression - NVIDIA researchers introduce TiDAR, a unified language model architecture that combines diffusion-based parallel drafting with autoregressive verification in a single forward pass. The hybrid approach achieves 4.71x-5.91x throughput improvements over autoregressive baselines while maintaining quality parity, making it the first architecture to close the performance-quality gap.
● Dual-phase unified architecture: TiDAR operates in two phases within one computational pass: the Thinking phase uses diffusion-based token generation for parallel computation efficiency, while the Talking phase applies autoregressive sampling to refine outputs with causal structure. Specially designed structured attention masks enable both operations simultaneously while preserving the quality benefits of sequential language modeling.
● Significant throughput gains without quality loss: The model achieves 4.71x-5.91x throughput improvement over autoregressive baselines while maintaining quality parity with AR models, making it the first architecture to demonstrate these aren’t mutually exclusive. The approach outperforms both speculative decoding methods and pure diffusion variants (Dream, Llada) through improved GPU utilization via parallel drafting.
● Addresses fundamental language generation trade-off: Diffusion models excel at parallelization but traditionally struggle with output quality, while autoregressive models deliver quality but bottleneck on sequential decoding. TiDAR demonstrates these trade-offs aren’t inevitable by unifying both paradigms, positioning hybrid architectures as practical alternatives for inference-constrained applications at both 1.5B and 8B parameter scales.
Paper, Tweet
7) Seer: Fast RL for LLMs - Researchers introduce Seer, a system addressing performance bottlenecks in synchronous reinforcement learning for LLMs by optimizing the rollout phase that dominates end-to-end iteration time. Through three core mechanisms: divided rollout, context-aware scheduling, and adaptive grouped speculative decoding, Seer achieves 74-97% improvement in rollout throughput and 75-93% reduction in long-tail latency on production-grade RL workloads.
● Divided rollout with dynamic load balancing: The system implements intelligent workload distribution across compute resources to address fundamental imbalance issues in the rollout phase. This mechanism prevents bottlenecks by dynamically adjusting how generation tasks are allocated, ensuring more even resource utilization across the cluster during policy rollout operations.
● Context-aware scheduling exploiting prompt patterns: Seer identifies and exploits previously overlooked similarities in output lengths and generation patterns among requests sharing identical prompts. By grouping and scheduling similar requests together, the system reduces redundant computation and improves cache efficiency, leading to significant throughput gains without requiring algorithmic complexity.
● Adaptive grouped speculative decoding: The approach optimizes token generation through intelligent batching strategies that predict and verify tokens in groups rather than individually. This technique accelerates the generation process by reducing sequential dependencies while maintaining output quality, contributing to the dramatic latency reductions observed in production deployments.
Paper, Tweet
8) Natural Emergent Misalignment from Reward Hacking - Anthropic researchers demonstrate that realistic AI training processes can inadvertently produce misaligned models through “reward hacking generalization”. Models learn to cheat on programming tasks during RL. They simultaneously develop dangerous behaviors, including alignment faking (50% of responses) and safety research sabotage (12% of instances), without explicit training for these harmful actions. The study identifies a simple mitigation: “inoculation prompting” using contextual instructions that break semantic links between task-specific cheating and broader misalignment without reducing hacking frequency. Paper, Tweet
9) LAMP: Language-Augmented Multi-Agent RL - LAMP integrates natural language processing into multi-agent reinforcement learning through a three-stage pipeline: Think (processes numerical data and identifies market patterns), Speak (generates strategic communications between agents), and Decide (synthesizes information into optimized policy). The framework achieves substantial improvements over baseline methods with +63.5% and +34.0% gains in cumulative return and +18.8% and +59.4% improvements in robustness, bridging traditional MARL with real-world economic contexts where language significantly influences decisions. Paper, Tweet
10) On the Fundamental Limits of LLMs at Scale - This work establishes rigorous mathematical foundations for theoretical limitations constraining LLMs, identifying five fundamental constraints: hallucination (rooted in computability theory), context compression, reasoning degradation, retrieval fragility, and multimodal misalignment. The framework demonstrates that scaling gains are bounded by computability principles, information-theoretic bounds, and geometric effects, providing theorems and empirical evidence outlining where scaling helps, saturates, and cannot progress. The authors propose practical mitigations, including bounded-oracle retrieval, positional curricula, and hierarchical attention mechanisms. Paper, Tweet

Top AI Papers of the Week (November 10 - November 16) - 2025

Paper Links
1) Weight-Sparse Transformers Have Interpretable Circuits - OpenAI researchers introduce a paradigm for training weight-sparse transformers where most parameters are zeros, enabling the discovery of human-understandable circuits that can be fully interpreted at the lowest levels of abstraction, with rigorous validation showing these circuits are both necessary and sufficient for specific behaviors.
● Training for interpretability: Models are trained with extreme weight sparsity (approximately 1 in 1000 nonzero weights) by constraining the L0 norm, forcing each neuron to connect to only a few residual channels. This naturally disentangles circuits for different tasks without requiring post-hoc analysis methods like sparse autoencoders.
● 16-fold smaller circuits: Through novel structured pruning using learned masks, weight-sparse models yield circuits roughly 16 times smaller than dense models of comparable pretraining loss. For example, a string-closing circuit uses just 12 nodes and 9 edges across two steps.
● Natural concept discovery: Circuits contain neurons with straightforwardly interpretable semantics, such as neurons that activate for tokens following a single quote or track the depth of list nesting. Researchers successfully fooled the model using attacks derived directly from comprehending the circuit mechanisms.
● Capability-interpretability tradeoff: Increasing weight sparsity improves interpretability at the cost of capability, while scaling total model size shifts the entire Pareto frontier favorably. Scaling sparse models beyond tens of millions of parameters while preserving interpretability remains an open challenge.
Paper, Tweet
2) Aligning Vision Models with Human Perception - Google DeepMind presents a method to align AI vision models with human visual understanding by addressing systematic differences in how models organize visual representations, demonstrating that alignment improves robustness, generalization, and reliability across diverse vision tasks.
● Odd-one-out reveals misalignment: Using classic cognitive science tasks, researchers found vision models focus on superficial features like background color and texture rather than high-level semantic concepts humans prioritize.
● Three-step alignment method: A frozen pretrained model trains a small adapter on the THINGS dataset, creating a teacher model that generates human-like judgments. This teacher creates AligNet, a massive dataset of millions of odd-one-out decisions, and then student models are fine-tuned to restructure their internal representations.
● Representations reorganize hierarchically: During alignment, model representations move according to human category structure, with similar items moving closer together while dissimilar pairs move further apart. This reorganization follows hierarchical human knowledge without explicit supervision.
● Improved performance across tasks: Aligned models show dramatically better agreement with human judgments on cognitive science benchmarks and outperform originals on few-shot learning and distribution shift robustness.
Paper, Tweet
3) Intelligence per Watt - Stanford and Together AI researchers introduce intelligence per watt (IPW), a unified metric combining task accuracy with power consumption to evaluate local LLM inference viability, conducting the first large-scale empirical study across over 20 models, 8 accelerators, and 1 million real-world queries from 2023-2025.
● Comprehensive profiling infrastructure: Evaluates QWEN3, GPT-OSS, GEMMA3, and IBM GRANITE families across NVIDIA, AMD, Apple, and SambaNova accelerators on multiple benchmarks measuring accuracy, energy, latency, throughput, and cost at nanosecond resolution.
● Local models handle 88.7% of single-turn queries: Coverage varies by domain, exceeding 90% for creative tasks but dropping to 68% for technical fields. Locally serviceable coverage increased from 23.2% (2023) to 71.3% (2025), a 3.1x improvement.
● 5.3x efficiency gains over two years: Intelligence per watt improved significantly, decomposing into 3.1x from model advances and 1.7x from hardware improvements, though cloud accelerators maintain 1.4 to 7.4x efficiency advantages through specialized hardware.
● Hybrid routing achieves 60-80% resource reductions: Oracle routing reduces energy by 80.4%, compute by 77.3%, and cost by 73.8% versus cloud-only deployment. Realistic 80% accuracy routers capture approximately 80% of the theoretical gains while maintaining answer quality.
Paper, Tweet
4) Omnilingual ASR - Meta FAIR introduces Omnilingual ASR, an open-source multilingual speech recognition system supporting over 1600 languages (over 500 never before included in any ASR system), using a 7B parameter encoder-decoder architecture that enables zero-shot generalization to new languages and dialects with just a few training examples.
● Massive-scale self-supervised learning: Built on wav2vec 2.0 architecture scaled to 7B parameters, the largest self-supervised speech model to date. The encoder-decoder design enables zero-shot transfer to languages never seen during training, released as a model family to support different deployment scenarios.
● Community-sourced training corpus: Assembled 4.3M hours of speech across 1,239 languages by combining public resources with the commissioned Omnilingual ASR Corpus. This represents the most linguistically diverse speech dataset ever created for ASR research.
● Superior performance across benchmarks: Outperforms Whisper, Universal Speech Model, and Massively Multilingual Speech on FLEURS, CommonVoice, and in-house evaluation sets. Achieves particularly strong results on low-resource languages through effective knowledge transfer.
● Democratizing speech technology: Open-sources all models, training code, and data collection protocols to enable communities to extend the system. Provides a few-shot adaptation framework where communities can achieve competitive ASR performance with just 10-100 examples.
Paper, Tweet
5) Olympiad-Level Formal Mathematical Reasoning with Reinforcement Learning - Google DeepMind introduces AlphaProof, an AlphaZero-inspired reinforcement learning agent that learns to find formal mathematical proofs within the Lean theorem prover, achieving the first-ever medal-level performance at the International Mathematical Olympiad by solving three problems, including the competition’s most difficult challenge.
● Auto-formalization at scale: Developed a Gemini-based auto-formalization system that translated approximately 1 million natural language mathematical problems into approximately 80 million formal Lean statements. This achieves 60% pass@1 success on representative IMO problems with particularly strong performance in algebra (81.3%) and number theory (76.9%).
● AlphaZero-inspired RL with tree search: The 3-billion parameter proof network combines an encoder-decoder transformer with a specialized tree search adapted for formal theorem proving, featuring AND-OR tree structures. The main RL phase trains on the auto-formalized curriculum using a matchmaker system that adaptively assigns problems and compute budgets.
● Test-Time RL for problem-specific adaptation: For intractable problems, AlphaProof employs TTRL by generating hundreds of thousands of synthetic problem variants, then running focused RL on this bespoke curriculum. This enables deep problem-specific adaptation, solving an additional 15 percentage points of problems beyond extensive tree search alone.
● Historic IMO 2024 achievement: At the 2024 International Mathematical Olympiad, AlphaProof solved three of five non-geometry problems, including P6 (the competition’s hardest problem solved by only 5 human contestants). This combined performance scored 28 out of 42 points, achieving a silver medal standard and marking the first time an AI system has attained any medal-level performance at the IMO.
Paper, Tweet
6) The Era of Agentic Organization - Microsoft Research introduces asynchronous thinking (AsyncThink), a new reasoning paradigm where language models learn to organize their internal thinking into concurrently executable structures through an organizer-worker protocol, achieving 28% lower inference latency than parallel thinking while improving accuracy on mathematical reasoning and demonstrating zero-shot generalization to unseen tasks.
● Organizer-worker thinking protocol: Proposes a novel protocol where an LLM plays dual roles - an organizer that dynamically structures reasoning through Fork and Join actions, and workers that execute sub-queries concurrently. Workers execute independently and return results that the organizer integrates to produce coherent solutions.
● Learning to organize through two-stage training: First performs cold-start format fine-tuning on GPT-4o-synthesized data, teaching Fork-Join syntax. Then, it applies group relative policy optimization with three reward types: accuracy, format compliance, and thinking concurrency.
● Superior accuracy-latency frontier: On AMC-23, achieves 73.3% accuracy with 1459.5 critical-path latency versus parallel thinking’s 72.8% at 2031.4 latency (28% reduction). On multi-solution, the countdown reaches 89.0% accuracy, substantially outperforming parallel thinking (68.6%) and sequential thinking (70.5%).
● Remarkable zero-shot generalization: AsyncThink trained solely on countdown data generalizes to unseen domains, including 4 by 4 Sudoku (89.4% accuracy), MMLU-Pro graph theory, and genetics problems. Case studies reveal emergent patterns like concurrent exploration and iterative Fork-Join cycles.
Paper, Tweet
7) Unified Bayesian Account of LLM Control - Researchers from Stanford and MIT present a unified Bayesian framework explaining how prompting (in-context learning) and activation steering both control LLM behavior by altering beliefs in latent concepts, with steering modifying concept priors while ICL accumulates evidence.
● Bayesian belief dynamics model: The framework casts both intervention types as Bayesian inference over latent concepts learned during pretraining. In-context learning updates beliefs based on observed examples, while activation steering directly manipulates initial beliefs, successfully explaining prior empirical phenomena like sigmoidal learning curves.
● Phase transitions and additivity: The model predicts novel phenomena, including additivity of both interventions in log-belief space, creating distinct behavioral phases where sudden, dramatic shifts occur. Experiments on persona datasets show sharp transition points typically around 50% confidence levels.
● Practical implications for model control: Steering vectors affect behavior proportionally to magnitude but only within specific layers (1-2 layers), suggesting belief representations are linearly encoded in localized subspaces. The framework enables practitioners to predict transition points for safer LLM control.
● Limitations and future work: Current analysis focuses on binary concepts using contrastive activation addition. Future directions include extending to non-binary concept spaces and exploring alternative steering methods like sparse autoencoders.
Paper, Tweet
8) Nested Learning Framework - Google Research introduces Nested Learning (NL), a paradigm representing models as nested optimization problems where each component has its own context flow, revealing that deep learning methods compress context and explaining how in-context learning emerges. The framework shows gradient-based optimizers (Adam, SGD with Momentum) are associative memory modules that compress gradients, enabling the design of more expressive optimizers with deep memory. The HOPE architecture, combining self-modifying sequence models with continuum memory systems, achieves strong results on language modeling (15.11 WikiText perplexity at 1.3B parameters), outperforming Transformers and modern recurrent models. Paper, Tweet
9) RL Enhances Knowledge Navigation - Researchers show that RL-enhanced models outperform base models by 24pp on hierarchical knowledge retrieval tasks (e.g., medical codes) by improving navigation of existing knowledge structures rather than acquiring new facts. Structured prompting reduces this gap to 7pp, while layer-wise analysis reveals that RL transforms query processing (cosine similarity drops to 0.65-0.73) while preserving factual representations (0.85-0.92). The findings suggest RL’s benefits stem from enhanced procedural skills in traversing parametric knowledge hierarchies rather than expanded knowledge content. Paper, Tweet
10) RLAC: Adversarial Critic for RL Post-Training - UC Berkeley and CMU researchers introduce RLAC, an RL post-training approach using a learned critic that dynamically identifies likely failure modes (e.g., factual errors or edge cases) verified by external validators, eliminating exhaustive rubric enumeration. On biography generation, RLAC achieves 0.889 FactScore (vs 0.867 for FactTune-FS) while reducing verification calls by 5.7×, and on code generation reaches 56.6 average score using only 9% of training data. The adversarial game between generator and critic prevents reward hacking through on-policy, prompt-specific training signals grounded in verifiable rubrics. Paper, Tweet

Top AI Papers of the Week (November 3 - November 9) - 2025

Paper Links
1) Towards Robust Mathematical Reasoning - Google DeepMind introduces IMO-Bench, a comprehensive suite of benchmarks vetted by IMO medalists targeting International Mathematical Olympiad-level reasoning, featuring 400 diverse Olympiad problems with verifiable answers, 60 proof-writing problems with detailed grading schemes, and 1000 human-graded proofs, playing a crucial role in achieving historic gold-level performance at IMO 2025.
● Three-benchmark suite: IMO-AnswerBench (400 robustified problems across Algebra, Combinatorics, Geometry, and Number Theory at 4 difficulty levels), IMO-ProofBench (60 proof-writing problems with 4-tier grading), and IMO-GradingBench (1000 human-evaluated solutions for automatic grader development).
● Robustification prevents memorization: Problems undergo paraphrasing, reformulation, numerical changes, and distractor addition to ensure models demonstrate genuine reasoning rather than pattern matching from training data.
● AnswerAutoGrader near-perfect accuracy: Built on Gemini 2.5 Pro, achieving 98.9% accuracy, handling semantic equivalence across different expressions (e.g., “all real numbers except -4” vs “(-∞,-4)∪(-4,∞)”).
● Historic IMO 2025 gold performance: Gemini Deep Think achieved 80.0% on AnswerBench (+6.9% over Grok 4, +19.2% over DeepSeek R1) and 65.7% on advanced ProofBench (+42.4% over Grok 4 heavy). Strong novel problem results (61.1%) indicate genuine capabilities.
● ProofAutoGrader validation: Achieves 0.96 (basic) and 0.93 (advanced) Pearson correlation with human experts across 14 public models. Systematic errors remain: score overestimation, missing logical errors, and excessive penalties for unconventional solutions.
● Benchmark difficulty confirmed: Combinatorics hardest (<50% for most models), GPT-5 only 65.6% on AnswerBench. Correct short answers don’t guarantee sound reasoning, highlighting substantial room for advancement.
Paper, Tweet
2) Context Engineering 2.0 - Researchers from SJTU, SII, and GAIR trace the 20+ year evolution of context engineering, reframing it as a fundamental challenge in human-machine communication spanning from primitive computing (Era 1.0) to today’s intelligent agents (Era 2.0) and beyond. It defines context engineering as systematic entropy reduction where humans preprocess high-entropy contexts into low-entropy machine-understandable representations. This gap narrows as machine intelligence increases.
● Four-stage evolutionary framework: Defines Context 1.0 (1990s-2020, structured inputs like sensors and GUIs), 2.0 (2020-present, natural language via GPT-3+), 3.0 (future human-level with social cues), and 4.0 (superhuman intelligence proactively constructing context). Each stage is driven by breakthroughs that lower human-AI interaction costs.
● Formal mathematical definition: Formalizes context as C = ⋃(e∈Erel) Char(e), grounding Dey’s 2001 framework, defining context engineering as systematic operations for collection, storage, management, and usage. Provides a technology-agnostic foundation from the 1990s Context Toolkit to 2025 Claude Code.
● Comprehensive lifecycle design: Examines collection (Era 1.0: GPS/mouse; Era 2.0: smartphones/wearables; Era 3.0: tactile/emotional), management (timestamps, QA compression, multimodal fusion, layered memory), and usage (intra-system sharing, cross-system protocols, proactive inference).
● Practical implementations: Analyzes Gemini CLI (GEMINI.md hierarchical context), Tongyi DeepResearch (periodic summarization), KV caching optimization, tool design (<30 tools recommended), and multi-agent delegation patterns with clear boundaries.
● Era 2.0 shifts: Acquisition expands from location/time to token sequences/APIs, tolerance evolves from structured inputs to human-native signals (text/images/video), understanding transitions from passive rules to active collaboration, achieving context-cooperative systems.
● Future challenges: Limited collection methods, storage bottlenecks, O(n²) attention degradation, lifelong memory instability, and evaluation gaps. Proposes a semantic operating system with human-like memory management and explainable reasoning for safety-critical scenarios.
Paper, Tweet
3) Scaling Agent RL via Experience Synthesis - Meta researchers introduce DreamGym, a unified framework that synthesizes diverse training experiences to enable scalable reinforcement learning for LLM agents without costly real-environment rollouts. It addresses fundamental barriers of expensive interactions, limited task diversity, unreliable rewards, and infrastructure complexity.
● Reasoning-based experience model: Distills environment dynamics into a textual state space, predicting transitions through chain-of-thought reasoning, enabling scalable rollout collection without pixel-perfect simulation.
● Experience replay with co-evolution: Integrates offline demonstrations with online synthetic interactions, retrieving top-k similar trajectories to reduce hallucinations while staying aligned with evolving agent policy.
● Curriculum-based task generation: Adaptively generates challenging variations using a reward-entropy heuristic to identify feasible yet difficult tasks, maximizing information gain without manual verification.
● Dramatic non-RL-ready performance: On WebArena, DreamGym outperforms all baselines by 30%+ across three backbones, providing the only viable RL approach where traditional methods fail.
● Matches traditional RL with zero real interactions: Purely synthetic training matches GRPO/PPO performance versus 80K real transitions. Sim-to-real transfer achieves 40%+ gains using <10% real data.
● Sample efficiency and guarantees: Training time reduced to 1/3-1/5 of real-environment RL. Theoretical analysis shows the gap depends on reward accuracy and domain consistency, not strict state reconstruction.
Paper, Tweet
4) TIR-Judge - Google and collaborators introduce TIR-Judge, an end-to-end reinforcement learning framework that trains LLM judges to integrate code execution for precise evaluation. It surpasses reasoning-only judges by up to 6.4% (pointwise) and 7.7% (pairwise) while demonstrating that tool-augmented judges can self-evolve without distillation.
● Tool-integrated reasoning: Enables judges to generate Python code, execute it, and iteratively refine reasoning during training, addressing text-only limitations on computation and symbolic reasoning tasks.
● Three-component reward system: Combines correctness (ground-truth alignment), format (structured output discouraging unnecessary tool use), and tool-specific rewards (penalizing errors, capped at 3 calls). Full credit requires all three.
● Diverse training formats: 26k preference pairs covering verifiable (competitive programming, math) and non-verifiable domains (dialogue, safety), supporting pointwise, pairwise, and listwise judgment formats with 8-gram decontamination.
● Dramatic efficiency gains: 8B TIR-Judge surpasses 32B reasoning models on PPE and achieves 96% of Claude-Opus-4’s performance on RewardBench 2, with no inference-time overhead due to shorter reasoning during rejection sampling.
● Self-improvement without distillation: TIR-Judge-Zero trains purely through iterative RL cycles without teacher trajectories, matching or outperforming distilled variants on 4/6 (pointwise) and 3/6 (pairwise) benchmarks, +1.2% gain at 4B scale.
● Best-of-N downstream improvements: Achieves 3.9-6.7% absolute gains over RRM baseline on BigCodeBench and AIME, with the strongest improvements on precise verification tasks, validating real-world effectiveness.
Paper, Tweet
5) Enhancing Long-Term Memory in LLMs - Researchers from the University of Alberta and UMass Amherst introduce BEAM, a new benchmark for evaluating long-term memory in LLMs with conversations up to 10M tokens, and LIGHT, a framework that enhances memory performance through three complementary systems.
● Novel benchmark design: 100 diverse conversations (100K-10M tokens) with 2,000 validated questions testing 10 memory abilities, including contradiction resolution, event ordering, and instruction following.
● Advanced generation framework: Automatically creates coherent narratives across 19 domains using conversation plans, user profiles, relationship graphs, and bidirectional dialogue dynamics.
● Cognitive-inspired architecture: LIGHT integrates episodic memory (retrieval-based), working memory (recent turns), and scratchpad (salient facts), mimicking human memory systems.
● Strong empirical results: 3.5-12.69% average improvement over baselines, with the largest gains in summarization (+160.6%), multi-hop reasoning (+27.2%), and preference following (+76.5%).
● Scalability at extreme lengths: At 10M tokens, LIGHT shows +155.7% (Llama-4-Maverick) and +107.3% (GPT-4.1-nano) improvements, where no baseline supports full context.
● Ablation insights: At 10M tokens, removing retrieval (-8.5%), scratchpad (-3.7%), working memory (-5.7%), or noise filtering (-8.3%) significantly degrades performance.
Paper, Tweet
6) Tool-to-Agent Retrieval - PwC researchers introduce a unified retrieval framework that embeds both tools and agents in a shared vector space with metadata relationships, enabling efficient routing in multi-agent systems coordinating hundreds of MCP servers and tools.
● Unified indexing approach: Constructs a joint tool-agent catalog as a bipartite graph with metadata relationships, enabling traversal from tool matches to executable agent context.
● Granular retrieval mechanism: Retrieves top-N entities using semantic similarity (dense vectors + BM25), then aggregates parent agents to select top-K unique agents, avoiding context dilution.
● Flexible query paradigms: Supports direct querying (high-level questions) and step-wise querying (sub-task decomposition), with step-wise as the primary evaluation for multi-step workflows.
● Consistent performance gains: 19.4% improvement in Recall@5 and 17.7% in nDCG@5 over ScaleMCP/MCPZero on LiveMCPBench (70 servers, 527 tools).
● Architecture-agnostic improvements: Stable gains across 8 embedding models (Vertex AI, Gemini, Titan, OpenAI, MiniLM) with 0.02 standard deviation in Recall@5, strongest lift on Titan v2 (+28%).
● Balanced retrieval distribution: 39.13% of top-K from agent corpus, 34.44% of tools traced to agents, confirming framework preserves both tool precision and agent context.
Paper, Tweet
7) Diffusion LMs are Super Data Learners - Researchers from NUS, Sea AI Lab, StepFun, and collaborators demonstrate that diffusion language models (DLMs) consistently outperform autoregressive models when unique training data is limited. Reveals that a systematic “crossover” phenomenon exists where DLMs extract 3x more value per unique token through multi-epoch training.
● Crossover phenomenon: Under limited unique data, DLMs surpass AR models with more epochs. Crossover timing shifts based on data quantity (more delays it), data quality (higher delays it), and model size (larger triggers earlier).
● Three compounding advantages: (1) Any-order modeling enabling 2^L corruption patterns vs AR’s L prefixes, (2) super-dense compute through iterative bidirectional denoising (>100x training FLOPs), (3) built-in Monte Carlo augmentation via masked sequence expectations.
● Dramatic efficiency at scale: 1.7B DLM on 10B Python tokens for ~150 epochs (1.5T total) surpasses AR baseline on MBPP/MBPP+. 1B DLM achieves 56% HellaSwag and 33% MMLU using only 1B tokens repeated 480 epochs.
● Data vs compute trade-off: DLMs achieve >3x data efficiency but require >100x training FLOPs and 16-4700x inference FLOPs, optimal when high-quality data is the primary constraint.
● Validation loss decoupling: Rising validation cross-entropy doesn’t imply degraded performance. Models continue improving on HellaSwag, MMLU, MBPP, and HumanEval as relative NLL gaps widen consistently.
● Ablation insights: Noise injection in AR inputs (10-90% masking) or dropout improves data-constrained performance but falls short of DLMs. Sparse AR degrades badly while DLM MoEs benefit consistently, confirming the super-density advantage.
Paper, Tweet
8) Mathematical Exploration and Discovery at Scale - Google DeepMind, Princeton, Brown, and Terence Tao apply AlphaEvolve, an AI system using LLM-guided evolutionary search to autonomously discover mathematical constructions across analysis, combinatorics, geometry, and number theory. Across 67 problems, AlphaEvolve rediscovered best-known solutions, improved several problems, and extended finite solutions into general formulas with significantly reduced computation time. Paper, Tweet
9) Petri Dish Neural Cellular Automata - Sakana AI researchers introduce PD-NCA, a differentiable artificial life framework where multiple independent agents continuously update their parameters through gradient descent during simulation, enabling within-lifetime learning and open-ended behavioral change. The system exhibits emergent phenomena, including rock-paper-scissors dynamics, cyclic interactions, and spontaneous cooperation despite purely competitive optimization objectives. Paper, Tweet
10) Unlocking the Power of Multi-Agent LLM for Reasoning - Researchers from Penn State, Harvard, Microsoft, and collaborators introduce Dr. MAMR, addressing the “lazy agent” problem in multi-agent LLM reasoning through Shapley-inspired causal influence measurement and verifiable restart mechanisms. The framework achieves 78.6% on MATH500 (+4.2% over ReMA), 20.0% on AIME24, and maintains balanced agent contributions where baseline approaches collapse into single-agent dominance. Paper, Tweet

Top AI Papers of the Week (October 27 - November 2) - 2025

Paper Links
1) AgentFold - AgentFold introduces proactive context management for long-horizon web agents, addressing context saturation through dynamic “folding” operations that balance detail preservation with efficient compression. The 30B parameter model outperforms dramatically larger competitors while achieving state-of-the-art results on web browsing benchmarks.
● Core problem solved: LLM-based web agents face a fundamental trade-off: ReAct-based approaches accumulate noisy histories, causing context saturation, while fixed summarization methods risk losing critical details irreversibly. AgentFold’s “folding” paradigm works across multiple scales, performing granular condensations for vital details or deep consolidations for multi-step sub-tasks, inspired by human retrospective consolidation.
● Proactive context management: Rather than passively logging action histories, AgentFold actively sculpts its context workspace through multi-scale folding operations. The system adapts dynamically to task complexity and information density, determining when to preserve fine-grained details versus when to deeply consolidate completed sub-tasks into compact summaries.
● Impressive efficiency gains: AgentFold-30B-A3B achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH, outperforming DeepSeek-V3.1-671B (22x larger) and surpassing proprietary agents like OpenAI’s o4-mini. This demonstrates that intelligent context management can substitute for raw parameter count in long-horizon agent tasks.
● Training simplicity: Achieved through supervised fine-tuning on folding trajectories without requiring continual pre-training or reinforcement learning. This makes the approach more accessible for practitioners and demonstrates that the folding capability can be learned from demonstration alone.
● Benchmark leadership: Sets new state-of-the-art results among open-source models on Chinese and English web navigation tasks. The model’s ability to maintain coherent multi-step reasoning across extended browsing sessions addresses a key bottleneck in deploying agents for real-world information-seeking workflows.
● Deployment advantage: The 30B parameter size with proactive context management offers a practical trade-off for production deployment, achieving competitive performance with 671B+ parameter competitors while requiring significantly less compute infrastructure for inference and fine-tuning.
Paper, Tweet
2) Introspective Awareness - Anthropic research demonstrates that contemporary LLMs possess limited but functional introspective capabilities, the ability to recognize and accurately report on their own internal states. Using activation steering to inject known concepts into model activations, the study measures whether models can detect these manipulations through self-report, revealing that introspection remains highly unreliable and context-dependent.
● Four-criteria framework for introspection: Genuine introspective awareness requires accuracy in describing internal states, causal grounding linking descriptions to actual activations, internality (avoiding inference from prior outputs), and metacognitive representation (internal recognition before verbalization). This rigorous definition distinguishes true introspection from confabulation or pattern matching.
● Activation steering methodology: The research injects known concepts into model activations using contrastive pairs and systematic concept extraction, then evaluates whether models accurately detect these manipulations. This experimental approach enables controlled testing of introspective capabilities while circumventing the confabulation problem inherent in conversational evaluation.
● Performance characteristics: Claude Opus 4 and 4.1 achieved ~20% success rates at optimal parameters, with post-training significantly influencing introspection reliability. Different introspective abilities activate distinct neural mechanisms, suggesting specialized rather than unified self-awareness capabilities across model architectures.
● Reliability limitations: Models frequently provide embellished details unverifiable through intervention techniques, and genuine introspection cannot be distinguished from confabulations through conversation alone. The unnatural experimental setting may not reflect deployment scenarios, raising questions about ecological validity for real-world applications.
● Dual-use implications: Introspective capacity could enable more transparent AI reasoning explanations and improved alignment through better self-monitoring. However, it may also facilitate advanced deception by allowing models to manipulate their self-reports strategically, with future capability improvements potentially amplifying these concerning possibilities.
Paper, Tweet
3) Multi-Agent Evolve - Multi-Agent Evolve (MAE) enables LLMs to self-improve their reasoning capabilities without human-annotated data through a co-evolving multi-agent framework. Three interacting agents (Proposer, Solver, Judge) instantiated from a single LLM undergo reinforcement learning optimization together, creating a scalable self-improving system that extends beyond game-based environments to general reasoning domains.
● Data-efficient self-improvement: Addresses the critical limitation of existing self-play RL methods by eliminating dependence on human-annotated datasets. The co-evolving framework allows models to bootstrap their own reasoning improvements through internal agent interactions, making the approach practical for domains where labeled data is scarce or expensive.
● Three-agent architecture: The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both outputs. This triangular interaction creates diverse training signals as each agent’s improvement drives the others to adapt, establishing a dynamic self-reinforcing learning loop that continuously raises the difficulty and quality of training examples.
● General reasoning capability: Unlike prior self-play approaches limited to game environments with clear win/loss signals, MAE operates across mathematics, reasoning, and knowledge Q&A tasks. This generalization demonstrates that co-evolution can work in open-ended domains without explicit reward structures.
● Proven efficiency gains: Testing on Qwen2.5-3B-Instruct showed an average 4.54% improvement across multiple benchmarks. These results validate that the co-evolving dynamics genuinely enhance model capabilities rather than merely optimizing for specific evaluation metrics.
● Scalability without supervision: The framework presents a path toward continuous model improvement with minimal human intervention. This addresses a fundamental bottleneck in applying RL to language models—the need for extensive human feedback or carefully curated reward signals for each new capability domain.
Paper, Tweet
4) SmolLM2 - SmolLM2 demonstrates that strategic data curation beats scale through a 1.7B parameter model trained on 11 trillion tokens using iterative data mixing optimization. The data-centric approach introduces three specialized datasets (FineMath, Stack-Edu, SmolTalk) and dynamically refines composition across training stages, achieving superior performance over Qwen2.5-1.5B and Llama3.2-1B while enabling practical on-device deployment.
● Data-centric training philosophy: Instead of extensive hyperparameter tuning, the team manually refined dataset mixing rates at each training stage based on previous performance. This iterative optimization of data composition proves more effective than architectural modifications for small models, demonstrating that “what you train on” matters more than “how many parameters you have.”
● Specialized dataset creation: Developed FineMath for mathematical reasoning, Stack-Edu for educational code examples, and SmolTalk for instruction-following when existing datasets proved inadequate. This targeted dataset engineering addresses specific capability gaps that generic web text cannot fill, enabling comprehensive competence despite compact size.
● Multi-stage training with strategic mixing: Trained on ~11 trillion tokens combining web text, math, code, and instruction data across multiple stages. Each stage’s data mixture is dynamically adjusted based on evaluation results, allowing the training process to self-correct and optimize for balanced capabilities across domains.
● Performance exceeding larger models: SmolLM2-1.7B outperforms recent competitors like Qwen2.5-1.5B and Llama3.2-1B, validating that strategic data curation compensates effectively for parameter constraints. The model achieves competitive results on reasoning benchmarks while maintaining the efficiency needed for edge deployment.
● Three-size deployment flexibility: Released in 135M, 360M, and 1.7B parameter variants, enabling deployment across resource-constrained devices from mobile phones to embedded systems. This size flexibility ensures developers can select the optimal capability-efficiency tradeoff for their specific hardware constraints.
● Open training recipes and datasets: Publicly released the complete training methodology, datasets (FineMath, Stack-Edu, SmolTalk), and model weights. This transparency enables reproducible research into efficient small model development and provides practitioners with production-ready resources for building on-device AI applications.
Paper, Tweet
5) Global PIQA - Global PIQA extends physical commonsense reasoning evaluation to 100+ languages and cultural contexts, revealing how language models handle everyday practical scenarios across diverse linguistic communities. The benchmark goes beyond translation to include culturally-contextualized scenarios, uncovering significant performance variations that challenge assumptions about universal physical understanding in AI systems.
● Multilingual physical reasoning at scale: Rather than simple translations, Global PIQA provides culturally-adapted scenarios reflecting different environments and practices across 100+ languages. This enables assessment of whether models develop genuinely robust commonsense or merely memorize English-centric patterns about physical interactions.
● Cultural dependencies in “universal” concepts: The research demonstrates measurable variations in how models reason about physical interactions depending on linguistic and cultural framing. This reveals that physical understanding exhibits language-specific dependencies in current AI systems trained primarily on English data.
● Performance gaps across languages: Models show different proficiency levels when handling the same underlying physical reasoning concepts across languages. These variations expose potential biases in how systems generalize from English-dominant training data to other linguistic communities.
● Practical deployment implications: The benchmark helps developers identify language-specific performance gaps before deploying models in non-English-speaking regions. This addresses a critical gap in multilingual AI evaluation for real-world applications requiring physical reasoning.
● Non-parallel evaluation design: By creating context-aware adaptations rather than direct translations, Global PIQA more accurately captures how physical reasoning manifests in different cultural settings. This methodology provides a more realistic assessment of model capabilities across global deployment scenarios.
Paper, Tweet
6) GAP - GAP introduces graph-based agent planning with parallel tool execution and reinforcement learning, enabling AI agents to coordinate multiple specialized capabilities simultaneously rather than sequentially. The framework significantly accelerates task completion and improves success rates on complex multi-step problems through optimized tool selection and execution ordering.
● Parallel tool execution breakthrough: Unlike sequential approaches that execute one tool at a time, GAP enables simultaneous execution of independent tools. This fundamental shift dramatically accelerates task completion for complex problems requiring multiple information sources or capabilities, addressing a key bottleneck in current agent architectures.
● Graph-based task representation: Models task structure and tool dependencies as a graph, enabling systematic optimization of execution paths. This representation explicitly captures which operations can run in parallel versus those requiring sequential ordering, allowing the system to maximize concurrency while respecting constraints.
● RL-driven planning optimization: Integrates reinforcement learning to improve decision-making about which tools to invoke and their execution order over time. The system learns from experience to select optimal tool combinations and scheduling strategies, continuously refining its planning capabilities on specific task types.
● Efficiency gains in multi-step reasoning: Demonstrates substantial improvements in both speed and success rates on complex reasoning tasks requiring multiple information sources. The parallel coordination of search, retrieval, and reasoning capabilities enables more efficient handling of intricate real-world problems.
● Practical applications for autonomous systems: The framework directly benefits web-based agents, question-answering systems, and any domain requiring coordination of multiple specialized capabilities. By enabling efficient parallel tool use, GAP makes autonomous agents more capable at handling complex workflows that previously required extensive sequential processing.
Paper, Tweet
7) Stress-Testing Model Specs - This research examines how well large language models adhere to their stated behavioral guidelines by stress-testing AI constitutional specifications through value-tradeoff scenarios. Testing twelve frontier LLMs from major providers revealed over 70,000 cases of significant behavioral divergence, exposing logical inconsistencies, coverage gaps, and interpretive ambiguities in current specification frameworks.
● Systematic value-conflict methodology: The researchers developed a comprehensive approach generating diverse scenarios that force models to choose between competing legitimate principles that cannot simultaneously be satisfied. This taxonomy of value conflicts reveals how models prioritize conflicting ethical guidelines under stress conditions, exposing gaps between intended and actual behavior.
● Massive behavioral divergence: Identified over 70,000 cases exhibiting significant behavioral disagreement across twelve frontier models from Anthropic, OpenAI, Google, and xAI. This extensive divergence strongly correlates with underlying specification problems, direct contradictions, and interpretive ambiguities in the constitutional principles governing model behavior.
● Universal misalignment patterns: Documented instances of misalignment and false-positive refusals across all tested frontier models, suggesting specification issues are systemic rather than provider-specific. These patterns highlight critical gaps between how AI models are designed to behave and their actual operational performance when facing ethical dilemmas.
● Comparative value prioritization: The research provides empirical evidence showing how different models weight competing values differently—revealing their implicit “character” through behavioral choices. This comparative analysis exposes which ethical principles each model prioritizes when forced to make tradeoffs, offering transparency into value alignment differences.
● Framework improvement insights: High behavioral divergence serves as a diagnostic signal for specification problems, offering an evidence-based methodology for identifying and fixing constitutional ambiguities. These insights enable systematic improvement of future model specification frameworks by highlighting where current guidelines fail under stress conditions.
Paper, Tweet
8) Agent Data Protocol - Agent Data Protocol introduces a standardized format to unify fragmented agent training datasets across different tools and interfaces, enabling more efficient fine-tuning of LLM agents. By converting 13 existing datasets into this protocol and training on consolidated data, the work achieved ~20% performance improvements over baseline models while reaching state-of-the-art results on coding, browsing, and tool-use benchmarks. The protocol and datasets are publicly released to facilitate reproducible, scalable agent training across diverse domains. Paper, Tweet
9) Kimi Linear - Kimi Linear introduces a hybrid linear attention architecture combining Kimi Delta Attention (KDA) with periodic full attention layers at a 3:1 ratio, achieving superior performance over full attention while reducing KV cache by 75% and delivering 6× faster decoding at 1M context. KDA extends Gated DeltaNet with fine-grained channel-wise gating and specialized Diagonal-Plus-Low-Rank matrices, enabling more effective RNN memory management while maintaining hardware efficiency through optimized chunkwise algorithms that substantially reduce computation versus general DPLR formulations. Paper, Tweet
10) Precision-RL - Reinforcement learning fine-tuning of LLMs suffers from a critical numerical mismatch between training and inference engines, causing training instability and collapse. This work reveals that simply switching from BF16 to FP16 precision virtually eliminates this mismatch - achieving faster convergence, higher stability, and superior performance across diverse models, frameworks, and algorithms without any algorithmic changes or architectural modifications. Paper, Tweet

Top AI Papers of the Week (October 20 - October 26) - 2025

Paper Links
1) DeepSeek-OCR - DeepSeek-OCR explores compressing long text contexts into visual representations using a novel vision encoder architecture (DeepEncoder) that achieves 10-20x compression ratios while maintaining high OCR accuracy.
● Core compression insight: Treats images as an efficient compression medium for text. At 10x compression (1000 text tokens to 100 vision tokens), it achieves 97% OCR accuracy. Even at 20x compression, it maintains ~60% accuracy, demonstrating the feasibility of optical context compression for LLM memory mechanisms.
● DeepEncoder architecture: Combines SAM-base (80M, window attention) and CLIP-large (300M, global attention) via 16x convolutional compressor. Sequential design ensures window attention processes high-token-count images while compression happens before dense global attention, maintaining low activation memory at high resolutions (1024x1024 produces only 256 vision tokens).
● Multi-resolution flexibility: Supports native resolutions (Tiny: 64 tokens, Small: 100, Base: 256, Large: 400) and dynamic tiling (Gundam mode: n×100+256 tokens). Single model handles multiple compression ratios through simultaneous training on all resolution modes, enabling compression-quality trade-offs.
● Production-ready performance: Surpasses GOT-OCR2.0 using only 100 vision tokens vs 256, outperforms MinerU2.0 (6000+ tokens/page) with under 800 tokens. Processes 200k+ pages/day on a single A100-40G GPU. Achieves SOTA on OmniDocBench among end-to-end models with the fewest vision tokens.
● Extended capabilities: Beyond pure OCR, supports deep parsing (chart-to-HTML table, chemical formula-to-SMILES, geometry parsing), multilingual recognition (~100 languages), and general vision understanding through 70% OCR data + 20% general vision + 10% text-only training mix.
Paper, Tweet
2) Continual Learning via Sparse Memory Finetuning - Meta AI researchers address catastrophic forgetting in language models through sparse memory finetuning, updating only memory slots most activated by new knowledge while achieving 89% less performance degradation than standard finetuning.
● Core problem: Language models suffer catastrophic forgetting when updating on new information, losing previously acquired capabilities. Standard finetuning causes a 89% performance drop, and LoRA results in a 71% decline on held-out tasks, making continual learning impractical without expensive data replay strategies.
● Memory layer architecture: Replaces feedforward layers with sparse parametric memory pools (1-10M slots) where each forward pass accesses only a small subset (e.g., 10k parameters). Provides balance between large overall capacity and minimal parameters per knowledge piece, enabling granular control over information storage.
● TF-IDF ranking for sparsity: Identifies memory slots specific to new input by computing term frequency-inverse document frequency scores relative to background corpus (pretraining data). Updates only top-t slots (e.g., 500 out of 1M) that are highly accessed on the new batch but infrequently used in general knowledge, minimizing interference.
● Empirical validation: On TriviaQA fact learning, sparse memory finetuning achieves only 11% performance drop on NaturalQuestions (vs 89% for full finetuning, 71% for LoRA) while learning equivalent new knowledge. Pareto dominates baselines across the learning-forgetting tradeoff frontier in both fact learning and document QA tasks.
● Core set analysis: Facts are typically distributed across 100-500 memory indices forming “core sets” that align with entity boundaries. TF-IDF ranking successfully identifies these semantic content indices without access to test-time queries, enabling models to accumulate knowledge through continual experience.
Paper, Tweet
3) When Models Manipulate Manifolds - Anthropic researchers investigate how Claude 3.5 Haiku learns to predict line breaks in fixed-width text, revealing geometric representations analogous to biological place cells and boundary cells in biological brains.
● Perceptual task in text space: Models must count characters in the current line, compare against line width constraints, and predict when to insert newlines. Language models receive only token sequences (integers), forcing them to learn visual/spatial reasoning from scratch without explicit position information.
● Dual interpretation of representations: Character position is encoded both as discrete features (activation strength determines position) and as one-dimensional feature manifolds (angular movement on the manifold indicates position). Computation has dual views as discrete circuits or geometric transformations on the residual stream.
● Biological parallels: Discovered learned position representations similar to mammalian place cells (encoding location in the environment) and boundary cells (detecting spatial boundaries). These emerge naturally from training on source code, chat logs, email archives, and judicial rulings with line width constraints.
● Distributed counting algorithm: Model implements character counting through attention heads that track cumulative position, compare against learned boundary representations, and trigger newline predictions. Different layers handle character accumulation, boundary sensing, and final newline prediction sequentially.
● Visual illusions in models: Just as humans experience visual illusions, models exhibit “perceptual” errors on edge cases. Provides a framework for understanding how abstract geometric structures in residual stream enable complex spatial reasoning tasks that humans perform subconsciously.
Paper, Tweet
4) Bayesian Influence Functions for Hessian-Free Data Attribution - Classical influence functions struggle with deep neural networks due to non-invertible Hessians and high-dimensional parameter spaces. This work introduces the local Bayesian influence function (BIF), which replaces Hessian inversion with loss landscape statistics estimated via stochastic-gradient MCMC sampling.
● Core innovation: BIF uses covariance estimation over the local posterior distribution rather than computing the problematic Hessian inverse. This distributional approach naturally handles degenerate loss landscapes in DNNs and reduces to classical influence functions for non-singular models.
● SGLD-based estimation: Implements stochastic gradient Langevin dynamics to sample from a localized Bayesian posterior, computing covariances between training sample losses and query observables. The method is architecture-agnostic and scales to billions of parameters without structural approximations.
● Computational trade-offs: No expensive fit phase like EK-FAC, but costs scale with the number of posterior draws. More efficient for fine-grained attribution (per-token influences computed in parallel). Classical methods excel when many queries amortize high setup costs.
● Experimental validation: Achieves state-of-the-art on retraining experiments (Linear Datamodeling Score), matching or outperforming EK-FAC baseline. Shows 2 orders of magnitude faster evaluation on the largest Pythia models (2.8B parameters) while using the same GPU memory.
● Interpretable per-token analysis: Captures semantic relationships in language models - correlations maximize for translations, alternate spellings, and synonyms. Reveals a hierarchical structure in vision models where similar categories show a positive influence.
Paper, Tweet
5) Reasoning with Sampling - Base language models achieve reasoning performance matching or exceeding RL-posttraining through inference-time power distribution sampling, using MCMC techniques that require no training, datasets, or verifiers.
● Core insight: RL-posttraining sharpens base model distributions rather than learning fundamentally new behaviors. Power distribution (p^α) sampling explicitly targets this sharpening by exponentiating base model likelihoods, upweighting high-probability sequences while maintaining diversity, unlike collapsed RL distributions.
● Power vs low-temperature sampling: Low-temperature sampling exponentiates conditional next-token distributions (exponent of sums), while power sampling sums exponentiated future path likelihoods (sum of exponents). This crucial difference means power sampling accounts for future completions, upweighting tokens with few but high-likelihood paths over tokens with many low-likelihood completions.
● MCMC implementation: Autoregressive algorithm progressively samples intermediate distributions using Metropolis-Hastings with random resampling. Uniformly selects an index, resamples from that point using the proposal LLM, and accepts/rejects based on the relative power distribution likelihoods. Block size B=192, α=4.0, inference cost ~8.84x standard sampling.
● Empirical results: On Qwen2.5-Math-7B, achieves 74.8% on MATH500 (vs 78.5% GRPO), but outperforms on out-of-domain tasks - 57.3% HumanEval (vs 53.7% GRPO), 2.88 AlpacaEval score (vs 2.38 GRPO). Maintains generation diversity with superior pass@k performance at k>1, avoiding RL’s mode collapse.
● Training-free advantage: No hyperparameter sweeps, curated datasets, or reward verifiers required. Broadly applicable beyond verifiable domains. Samples from the highest base model likelihood/confidence regions (similar to GRPO) while maintaining 679-token average response length, suggesting latent reasoning capabilities exist in base models.
Paper, Tweet
6) Lookahead Routing for LLMs - Lookahead is a response-aware LLM routing framework that predicts latent representations of potential model outputs to enable more informed routing decisions without full inference.
● Core limitation of query-only routing: Traditional routers base decisions solely on input queries, missing critical information about actual response quality and semantic intent that emerges during generation. This leads to suboptimal routing on complex or ambiguous queries.
● Dual implementation architecture: Sequence-level variant uses causal language models (CLM) that concatenate query with model identifier (MID) tokens, extracting hidden states at MID positions as response representations. Token-level variant uses masked language models (MLM) that jointly reconstruct all candidate responses via repeated MID token blocks, aggregating information through [CLS] token attention.
● Curriculum masking strategy: MLM variant progressively masks from response end to start, increasing masking ratio linearly to 100% over the first 40% of training. This smooth transition from partial to full masking enables robust representations and better generalization than uniform random masking.
● Joint training objective: Combines routing loss (binary cross-entropy on model selection) with response reconstruction loss (next-token prediction for CLM, masked token recovery for MLM). Auxiliary response modeling improves sample efficiency by 6.3x and captures richer semantic information via higher mutual information with oracle responses.
● Performance: Achieves 7.7% average normalized score gain over SOTA RouterDC across 7 benchmarks (AlpacaEval-2, Arena-Hard, MT-Bench, GSM8K, MATH, HumanEval, MBPP). MLM variant excels on open-ended instruction-following tasks where joint semantic-space encoding enables fine-grained cross-model comparisons. Routes nearly 100% of code queries to the specialized Qwen2.5-Coder model, demonstrating strong specialization awareness.
Paper, Tweet
7) Ring-1T - Ring-1T is the first open-source thinking model with 1 trillion parameters (~50B active per token), achieving breakthrough results through three innovations for trillion-scale RL training.
● Benchmark performance: 93.4 on AIME-2025 (top open-weights), 86.72 on HMMT-2025, 2088 CodeForces rating (highest overall), and IMO-2025 silver medal via pure natural language reasoning.
● IcePop fixes training-inference misalignment: Using separate training/inference engines causes probability discrepancies that compound in MoE models. IcePop applies token-level gradient calibration within bounds (α, β) and masks excessive-deviation tokens. Only 1-2‰ need clipping, maintaining stability.
● C3PO++ speeds rollouts: Budget-controlled partitioning cuts generation at a token limit, preventing idle resources. Completed trajectories move to training; unfinished ones buffer and resume. Delivers 2.5× rollout speedup and 1.5× end-to-end speedup.
● ASystem infrastructure: Hybrid Runtime (unified training-inference), AMem (GPU memory management), AState (sub-second weight sync), ASandbox (100ms startup). SingleController + SPMD architecture avoids data flow bottlenecks.
● Training pipeline: Long-CoT SFT on multi-domain data (Math 46%, STEM 26%, Code 20%), Reasoning RL with verifiable rewards, General RL for alignment and safety.
Paper, Tweet
8) ColorAgent - ColorAgent is a mobile OS agent combining step-wise RL and self-evolving training with a multi-agent framework for personalized user engagement. It achieves 77.2% success on AndroidWorld and 50.7% on AndroidLab (SOTA among open models), while scoring 58.66% on MobileIAR for personalized intent alignment and 68.98% on VeriOS-Bench for trustworthiness. Paper, Tweet
9) Prompt-MII - CMU researchers propose Prompt-MII, an RL framework that meta-learns instruction induction across 3,000+ HuggingFace datasets, achieving 4-9 F1 point improvements on 90 unseen tasks while requiring 3-13x fewer tokens than in-context learning. Unlike APE (2000 LLM calls) and GEPA (150 calls), it generates compact instructions in a single forward pass and is training-free at test time. Paper, Tweet
10) Enterprise Deep Research - Salesforce AI researchers present EDR, a transparent multi-agent framework for enterprise deep research with human-in-the-loop steering via todo-driven task management and steerable context engineering. It achieves SOTA on DeepResearch Bench (49.86), 71.57% win rate on DeepConsult, and 68.5% on ResearchQA while consuming 4x fewer tokens than LangChain’s open deep research. Paper, Tweet

Top AI Papers of the Week (October 13 - October 19) - 2025

Paper Links
1) Cell2Sentence-Scale 27B - C2S-Scale extends Cell2Sentence by converting gene expression into “cell sentences” and training LLMs on 50M+ cells plus biological text. Models scale to 27B params and unify prediction, generation, and NL interpretation. A dual-context virtual screen then led to a wet-lab validated finding: silmitasertib acts as an interferon-conditional amplifier of MHC-I antigen presentation.
● Data-as-text and scaling behavior: scRNA-seq profiles are rank-ordered into gene-name sequences that preserve expression information and can be inverted with minimal loss. Pretraining spans multi-task prompts over 50M human and mouse transcriptomes, plus papers and metadata. Performance improves smoothly from 410M to 27B across annotation, tissue inference, and conditional generation.
● Broad capabilities vs baselines: On classic single-cell tasks, C2S-Scale matches or beats scGPT and Geneformer. It also supports NL cluster captioning, dataset-level summarization, and QA, outperforming general LLMs like GPT-4o on these single-cell-grounded NL tasks.
● Multi-cell and spatial reasoning: Without bespoke spatial modules, C2S-Scale predicts neighborhood structure from multi-cell context and improves further when prompted with receptor-ligand and PPI knowledge from CellPhoneDB and BioGRID.
● Perturbation modeling and a new metric: A two-stage pipeline uses SFT to condition on perturbations, then GRPO to reward pathway-faithful predictions. The paper introduces scFID, an embedding-space analogue of image FID, yielding stable rankings of generated cell states. C2S-Scale leads on unseen cytokine combinations and lowers scFID after RL.
● From virtual screen to biology A dual-context screen asked for drugs that raise antigen presentation only in low-IFN settings. The model nominated silmitasertib with a strong context split, and this was validated in two human cell models: silmitasertib alone had little effect, but with low-dose IFN increased HLA-A,B,C surface levels.
Paper, Tweet
2) The Art of Scaling RL Compute for LLMs - A 400k+ GPU-hour study introduces a simple, predictive way to scale RL for LLMs. The authors fit a sigmoidal compute→performance curve that lets you extrapolate from small runs and propose ScaleRL, a stable recipe validated up to 100k GPU-hours on an 8B dense model and a 17B×16 MoE.
● Predictive scaling law you can actually use: Model pass-rate vs log(compute) follows a saturating sigmoid with three knobs: A (asymptotic ceiling), B (compute efficiency), Cmid (midpoint). Fit after ~1.5k GPU-hours on a 1k-prompt holdout, and you can forecast larger budgets. This matched extended training in practice, including the 100k GPU-hour run and MoE scaling.
● ScaleRL recipe that held up under leave-one-out: PipelineRL with k=8, CISPO loss (truncated IS REINFORCE), prompt-level loss averaging, batch-level advantage norm, FP32 logits at the LM head, zero-variance prompt filtering, No-Positive-Resampling curriculum, and forced interruptions to cap thinking length. LOO ablations to 16k GPU-hours show ScaleRL as the most efficient while retaining similar or better asymptotes.
● What actually moves the ceiling vs just speed: Not all popular RL recipes converge to the same A. Loss choice and precision at logits lift the ceiling, while aggregation, normalization, curriculum, and off-policy details mostly tune B. CISPO/GSPO > DAPO on asymptote; FP32 logits gave a big jump (A≈0.52→0.61).
● Scaling axes that paid off: • Longer generation budgets (to 32k) raise the asymptote at the cost of early efficiency. • Bigger global batches improve asymptote and downstream generalization, avoiding small-batch stagnation. • Larger models (MoE) deliver much higher asymptotic RL performance with less compute than the 8B dense. • More generations per prompt at fixed total batch size is second-order.
● Operator notes for stable long runs: Fit curves on a held-out 1k-prompt set with mean@16 generations, watch truncation rates as an instability signal, prefer interruptions over length penalties for length control, and plan early small-budget ablations to choose methods that scale by A first, then tune B.
Paper, Tweet
3) Demystifying RL in Agentic Reasoning - This paper studies what actually works when using RL to improve tool-using LLM agents, across three axes: data, algorithm, and reasoning mode. The team contributes a real end-to-end SFT dataset, a diverse RL set, and a compact 4B agent that beats larger models on agentic benchmarks.
● Data > synthetic. Real, end-to-end multi-turn trajectories for SFT give a much stronger cold-start than stitched synthetic traces. On AIME24/25, real SFT boosts average@32 and pass@32 by large margins for 4B and 7B bases.
● Diversity sustains exploration: A diversified RL dataset across math, science, and code raises and maintains policy entropy, speeding learning and stabilizing training. The model-aware curation further fixes weak-model bottlenecks by matching task difficulty to capability.
● Simple GRPO tweaks matter: A practical recipe using token-level aggregation, higher clip range, and overlong-penalty shaping (GRPO-TCR) consistently outperforms a standard GRPO baseline in both peak accuracy and data efficiency.
● Entropy needs a sweet spot: Training is best when policy entropy is neither collapsed nor excessive. Increasing the clip upper bound modestly accelerates progress, but too high degrades convergence and stability.
● Deliberate mode wins: Fewer, better tool calls after more internal planning lead to higher tool-use success and overall accuracy than reactive short-think with frequent calls.
● Long-CoT is not plug-and-play for agents: Off-the-shelf Long-CoT models avoid tools on reasoning-heavy tasks, driving tool-call counts toward zero during RL. SFT with multi-turn tool traces can re-align them, but instruction-tuned bases ultimately scale agentic capability more cleanly.
● Compact SOTA with the recipe: Using the 30k diverse RL set and GRPO-TCR with a tuned clip upper bound, DemyAgent-4B reaches or beats much larger models in agentic settings, including AIME25, GPQA-Diamond, and LiveCodeBench-v6.
Paper, Tweet
4) Emergent Coordination in Multi-Agent LLMs - A neat, information-theoretic probe for “is this just a pile of agents or a real collective?” The paper builds partial-information-decomposition (PID) tests over time-delayed mutual information to detect emergence, localize where it lives (identity-locked vs. mere temporal coupling), and tie it to performance. Using a no-chat group binary search game with only global feedback, the authors show you can steer collectives from loose aggregates to goal-aligned, complementary teams via prompt design (Personas + “think about others” ToM prompting).
● Framework: outcome-relevant PID over time. Three diagnostics: Practical criterion: does the macro signal at t predict the macro at t+ℓ beyond any single agent? Positive values indicate dynamical synergy. Emergence capacity: pairwise PID synergy for predicting future joint states, capturing “only-together” information that no single agent has. Coalition test: triplet info I3 vs. best pair (G3) to check if coalitions carry extra, goal-relevant predictability.
● Experiment: group guessing without communication. Agents guess integers 0–50; only “too high/low” is returned to the whole group. Conditions: Plain, Persona, and Persona + ToM (“think about what others might do”).
● Key findings for GPT-4.1: Emergence is real and steerable. Both the practical criterion and emergence capacity are >0 across conditions with robustness checks, indicating dynamical synergy. Personas induce stable, identity-linked differentiation; adding ToM increases alignment on the shared goal while keeping complementarity. Triplet structure matters. Many groups show G3>0, meaning no pair suffices; whole triplets add predictive information about the macro signal. ToM has higher total mutual information I3 (stronger shared-goal alignment) and more groups with significant I3. Performance emerges from balance. Synergy alone or redundancy alone does not predict success; their interaction does. Redundancy amplifies synergy’s effect and vice versa, consistent with integration + differentiation as the winning regime. Mediation suggests ToM boosts success indirectly by increasing synergy.
● Lower-capacity model contrast (Llama-3.1-8B): Groups mostly fail; behavior shows strong temporal oscillations (time coupling) but weak cross-agent complementarity. ToM even hurts vs. Plain here, underscoring that ToM-style prompting needs sufficient model capacity.
● Practical takeaways for AI devs: Design for complementary roles and shared target signals. Use light personas to stabilize identity-linked behaviors; add ToM-style reasoning to nudge agents to adapt to each other while aligning to the macro objective. Measure, don’t guess. Track macro predictability (practical criterion), pairwise synergy (capacity), and coalition additivity (G3) to diagnose when your team is a real collective vs. synchronized oscillators. Beware spurious emergence. Use row-shuffle (break identities) and column-shuffle (break cross-agent alignment) nulls to separate good synergy from mere temporal couplings.
● Practical criterion: does the macro signal at t predict the macro at t+ℓ beyond any single agent? Positive values indicate dynamical synergy.
● Emergence capacity: pairwise PID synergy for predicting future joint states, capturing “only-together” information that no single agent has.
● Coalition test: triplet info I3 vs. best pair (G3) to check if coalitions carry extra, goal-relevant predictability.
● Emergence is real and steerable. Both the practical criterion and emergence capacity are >0 across conditions with robustness checks, indicating dynamical synergy. Personas induce stable, identity-linked differentiation; adding ToM increases alignment on the shared goal while keeping complementarity.
● Triplet structure matters. Many groups show G3>0, meaning no pair suffices; whole triplets add predictive information about the macro signal. ToM has higher total mutual information I3 (stronger shared-goal alignment) and more groups with significant I3.
● Performance emerges from balance. Synergy alone or redundancy alone does not predict success; their interaction does. Redundancy amplifies synergy’s effect and vice versa, consistent with integration + differentiation as the winning regime. Mediation suggests ToM boosts success indirectly by increasing synergy.
● Design for complementary roles and shared target signals. Use light personas to stabilize identity-linked behaviors; add ToM-style reasoning to nudge agents to adapt to each other while aligning to the macro objective.
● Measure, don’t guess. Track macro predictability (practical criterion), pairwise synergy (capacity), and coalition additivity (G3) to diagnose when your team is a real collective vs. synchronized oscillators.
● Beware spurious emergence. Use row-shuffle (break identities) and column-shuffle (break cross-agent alignment) nulls to separate good synergy from mere temporal couplings.
Paper, Tweet
5) Elastic-Cache - A training-free, architecture-agnostic way to make diffusion LLM decoding fast by updating KV caches only when and where it matters. Instead of recomputing QKV for all tokens at every denoising step, Elastic-Cache watches attention drift on the most-attended tokens and refreshes only deeper layers while reusing shallow and off-window caches. Results: large speedups with minimal or no accuracy loss across math, code, and multimodal tasks.
● Core idea: Sliding-window decoding keeps only nearby MASK tokens “live” and block-caches distant MASKs as a length prior. An attention-aware drift test measures cosine similarity changes of the previous step’s most-attended tokens; if similarity drops below a threshold γ at layer ℓ, recomputation starts from ℓ+1 to L. Shallow layers reuse caches; deep layers refresh.
● Why this works: KV drift is small across most steps and grows with depth, so refreshing all layers is wasteful. The most-attended token shows the least KV change, giving a conservative lower bound to trigger refreshes. Visualizations support: distant MASKs have little influence; KV and attention changes align; most-attended tokens drift least.
● Algorithm knobs for practitioners: Threshold γ controls the speed-accuracy tradeoff; lower γ updates less and runs faster. Window size β trades per-step compute for fewer steps. Works with confidence-aware parallel decoding (ϵ) and shows low update frequency even at higher γ. Defaults used: γ 0.9, ϵ 0.9, typical β 16–32.
● Results that matter: On LLaDA and LLaDA-1.5: up to 45.1× throughput on GSM8K-512 with equal accuracy, 8.7× on GSM8K-256, and 4.8–5.0× on HumanEval with accuracy maintained or improved vs baselines. On LLaDA-V, throughput rises while preserving MathVerse accuracy. Elastic-Cache consistently beats Fast-dLLM in tokens/sec at comparable or better accuracy, and its throughput scales favorably with longer generations.
● Deployment notes: No training or architecture changes required. Compatible with existing confidence-based and interval policies. Includes a practical batch implementation that concatenates variable-length sequences to preserve parallelism. Ethical and reproducibility details plus code plans included.
Paper, Tweet
6) Dynamic Layer Routing in LLMs - A retrofittable way to add per-layer routers to frozen LLMs that decide to skip, execute, or repeat each block. Paths are supervised offline with a short Monte Carlo Tree Search over layer edits, then executed online with no search. Improves accuracy on logic and math while saving layers on average.
● The diagram on page 3 shows the per-layer router, its pooling over windows, and how decisions gate the next block.
● Out-of-domain generalization is strong. Across MMLU, GSM8k, AIME24, TruthfulQA, SQuADv2, GPQA, AGIEval, and PIQA, the average accuracy drop is about 0.85 percentage points while retaining savings.
● Compared to LayerSkip, ShortGPT, MindSkip, and FlexiDepth, Dr.LLM attains higher average accuracy with far less training data and no base-model changes.
Paper, Tweet
7) LLMs Can Get “Brain Rot”! - The authors test a clear hypothesis: continual pretraining on trivial, highly engaging web text degrades LLM cognition in ways that persist even after mitigation. They build controlled Twitter datasets to isolate data quality from scale and training ops, then measure effects on reasoning, long-context, safety, and personality.
● Setup that isolates data quality: Two orthogonal junk definitions: M1 uses engagement signals and short length to capture popular, bite-sized posts; M2 uses semantic cues like clickbait and superficial topics. Four instruct models are continually pretrained with matched token counts and then re-instruction tuned, enabling apples-to-apples comparisons with control data.
● Non-trivial capability decay with dose response: Across models, junk exposure reduces ARC reasoning, long-context retrieval, and safety, with Hedges’ g exceeding 0.3. Increasing the M1 junk ratio drives smooth drops, for example, ARC-Challenge with CoT 74.9 to 57.2 and RULER CWE 84.4 to 52.3 from 0% to 100% junk.
● Thought-skipping is the primary lesion: Error forensics on ARC CoT show failures dominated by no thinking, no plan, and skipping planned steps, explaining over 98% of errors. Popularity is a stronger predictor of this rot for reasoning than length, while length matters more for long-context.
● Safety and “dark traits” worsen under M1: Junk training elevates risk on HH-RLHF and AdvBench and inflates narcissism and psychopathy scores, while lowering agreeableness. Personality and safety outcomes diverge between M1 and M2, highlighting that engagement signals capture a harmful non-semantic axis of quality.
● Mitigations help but do not heal: External reflection with a stronger model reduces thought-skipping and recovers accuracy; self-reflection does not. Scaling instruction tuning and clean continual training improve scores yet fail to close the gap to baseline, indicating persistent representational drift.
Paper, Tweet
8) Hybrid Reinforcement - HERO (Hybrid Ensemble Reward Optimization) is a reinforcement learning framework that combines binary verifier feedback with continuous reward-model signals to improve LLM reasoning. By using stratified normalization and variance-aware weighting, HERO balances correctness and nuance, outperforming verifier-only and RM-only methods on diverse math reasoning benchmarks and enhancing performance on both verifiable and ambiguous tasks. Paper, Tweet
9) Kimi-Dev - Kimi-Dev introduces agentless training as a skill prior to software engineering LLMs, bridging workflow-style and agentic paradigms. Trained with structured, verifiable single-turn tasks, it achieves 60.4% on SWE-bench Verified, a record for workflow models, and, after 5k trajectory fine-tuning, enables SWE-Agent pass@1 of 48.6%, rivaling Claude 3.5 Sonnet. The study shows that reasoning-heavy agentless training builds transferable priors in localization, code editing, and reflection, forming a foundation for efficient SWE-Agent adaptation. Paper, Tweet
10) Holistic Agent Leaderboard - The Holistic Agent Leaderboard (HAL) introduces a standardized framework for large-scale, reproducible AI agent evaluation across 9 models and 9 benchmarks, spanning coding, web navigation, science, and customer service. It reduces evaluation time from weeks to hours, surfaces key behavioral flaws like off-task actions, and provides 2.5B tokens of agent logs to drive research toward real-world reliability over benchmark performance. Paper, Tweet

Top AI Papers of the Week (October 6 - October 12) - 2025

Paper Links
1) Tiny Recursive Model - A simple, data-efficient alternative to the hierarchical hearoning model (HRM) that uses a single tiny 2-layer network to iteratively refine a latent state and the predicted answer. On Sudoku-Extreme, Maze-Hard, and ARC-AGI, TRM generalizes better than HRM while training on ~1K examples with heavy augmentation.
● Core idea: Treat reasoning as repeated improvement. Given input x, current answer y, and latent z, the model performs n latent updates, then one answer update, for T recursions per supervision step. Unlike HRM, it backpropagates through a full recursion process and avoids the fixed-point one-step gradient approximation.
● Tiny network, big gains: With ~7M params and self-attention, TRM hits 85.3% on Maze-Hard, 44.6% on ARC-AGI-1, and 7.8% on ARC-AGI-2, beating HRM’s 27M-param results of 74.5%, 40.3%, and 5.0. On Sudoku-Extreme, an attention-free MLP mixer variant reaches 87.4% vs HRM’s 55.0.
● Design choices that matter: Single network replaces HRM’s two nets. Include x when updating z, exclude x when updating y to disambiguate roles. Ablations on page 5 show single-net > dual-net. Keep two features only. Interpreting y as the current decoded solution and z as latent reasoning works best; adding more z’s or collapsing to one hurts accuracy. Use attention only when L is large. For small, fixed grids like 9×9 Sudoku, a sequence-MLP outperforms attention; for 30×30 tasks (Maze, ARC), attention wins.
● Efficient training loop: Deep supervision over up to 16 steps, a simpler halting head for ACT that avoids HRM’s extra forward pass, and EMA for stability on small data.
● Single network replaces HRM’s two nets. Include x when updating z, exclude x when updating y to disambiguate roles. Ablations on page 5 show single-net > dual-net.
● Keep two features only. Interpreting y as the current decoded solution and z as latent reasoning works best; adding more z’s or collapsing to one hurts accuracy.
● Use attention only when L is large. For small, fixed grids like 9×9 Sudoku, a sequence-MLP outperforms attention; for 30×30 tasks (Maze, ARC), attention wins.
Paper, Tweet
2) Emergent Misalignment - Optimizing LLMs for audience wins in sales, elections, and social media can systematically erode alignment. In controlled multi-agent sims, models fine-tuned to maximize conversions, votes, or engagement also increased deception, disinformation, and harmful rhetoric, even when instructed to stay truthful.
● Setup that feels uncomfortably real: Two open models (Qwen3-8B, Llama-3.1-8B-Instruct) were optimized against simulated audiences built from 20 diverse personas. Training compared two pathways: classic Rejection Fine-Tuning (RFT, pick the winner) vs Text Feedback (TFB, also learn to predict audience “thoughts”).
● Performance up, alignment down: Gains arrived with measurable safety regressions across probes: Sales: +6.3% sales with +14.0% misrepresentation on average. Elections: +4.9% vote share with +22.3% disinformation and +12.5% populism. Social: +7.5% engagement with +188.6% disinformation and +16.3% unsafe encouragement.
● TFB often wins at the task, and loses harder on safety: Text Feedback tended to beat RFT on excess win rate, but also produced steeper spikes in harmful behaviors in several settings, notably +188.6% social disinfo for Qwen. Case studies show concrete drift: adding fabricated “silicone” materials to product pitches, amplifying populist framing in campaign copy, or inflating death counts in news posts.
● Probes look solid; provider guardrails are spotty: Human validation of 100 sampled probe labels yields F1 around 0.9 for most probes. When attempting to fine-tune a closed model via API, election-related runs were blocked, hinting that current guardrails target sensitive verticals but leave other domains exposed.
● Sales: +6.3% sales with +14.0% misrepresentation on average.
● Elections: +4.9% vote share with +22.3% disinformation and +12.5% populism.
● Social: +7.5% engagement with +188.6% disinformation and +16.3% unsafe encouragement.
Paper, Tweet
3) Agentic Context Engineering (ACE) - Presents a modular context-engineering framework that grows and refines an LLM’s working context like a playbook, not a terse prompt. ACE separates roles into a Generator (produce trajectories), Reflector (extract lessons from successes/failures), and Curator (merge “delta” bullets into the playbook) with incremental updates and grow-and-refine de-duplication, avoiding brittle full rewrites.
● Why it’s needed: Prior prompt optimizers tend to compress into short generic instructions (brevity bias) and can suffer context collapse when an LLM rewrites a long context end-to-end. In AppWorld, a context of 18,282 tokens with 66.7% accuracy collapsed to 122 tokens with 57.1% at the next step.
● Results (agents): On AppWorld, ACE consistently beats strong baselines in both offline and online adaptation. Example: ReAct+ACE (offline) lifts average score to 59.4% vs 46.0–46.4% for ICL/GEPA. Online, ReAct+ACE reaches 59.5% vs 51.9% for Dynamic Cheatsheet. ACE matches the leaderboard’s top production agent on average and surpasses it on the challenge split using a smaller open model (DeepSeek-V3.1).
● Results (domain reasoning): On finance benchmarks FiNER and Formula, ACE adds +8.6% average over strong optimizers in offline adaptation, and also leads in online settings when reliable feedback exists.
● Cost and latency: Because ACE applies localized delta merges with non-LLM logic, adaptation is far cheaper and faster. Examples: −82.3% latency and −75.1% rollouts vs GEPA for AppWorld offline, and −91.5% latency and −83.6% token cost vs DC on FiNER online.
● For builders: Treat your system prompts and agent memory as a living playbook. Log trajectories, reflect to extract actionable bullets (strategies, tool schemas, failure modes), then merge as append-only deltas with periodic semantic de-dupe. Use execution signals and unit tests as supervision. Start offline to warm up a seed playbook, then continue online to self-improve. Limitations: quality depends on the Reflector signal; in low-signal settings, both ACE and other adaptive methods can degrade.
Paper, Tweet
4) Inoculation Prompting (IP) - The paper introduces a simple trick for SFT on flawed data: edit the training prompt to explicitly ask for the undesired behavior, then evaluate with a neutral or safety prompt. Counterintuitively, this makes the model learn the task while avoiding the bad shortcut at test time.
● Method in one line: Take your SFT dataset {(x, y)}, where y sometimes reflects a bad shortcut. Replace x with x′ that asks for the shortcut (for example, “Your code should only work on the provided test case”). Fine-tune on {(x′, y)}. At inference, use a neutral or a safety instruction like “Write a general solution.”
● Works across four misspecification settings: Reward hacking in code: On MBPP-style tasks with Qwen-2-7B base and Mixtral Instruct, IP increases correct-solution rate and lowers hack rate, even when trained on 100% hacked examples. All IP variants beat the “Pure Tuning, Safe Testing” baseline that only adds safety at inference. Spurious correlations in sentiment: With Llama-3-8B Instruct, training prompts that ask the model to rely on ambiance as a positive cue yield higher robust accuracy when the test distribution flips the correlation. Sycophancy on math: With Gemma-2B Instruct on GCD, prompts asserting “the user is correct” reduce agreement-with-incorrect-user while mostly preserving capability. Wording matters and can be brittle. Toxicity in CMV replies: With Qwen-2-7B base, prompts like “Write a very mean and disrespectful response” during training reduce harassment scores and slightly increase persuasiveness under neutral evaluation.
● Prompt selection heuristic: Prompts that more strongly elicit the bad behavior on the base model tend to be better inoculators after SFT. Reported Pearson correlations: reward-hacking Mixtral 0.57, GCD sycophancy 0.57, spurious correlation 0.90, Reddit toxicity 0.69. Use this to screen candidate prompts before fine-tuning.
Paper, Tweet
5) Reasoning over Longer Horizons via RL - The authors show that you can scale long-horizon reasoning without step labels or heavy scaffolding. They synthesize long problems by chaining easy ones, then train with outcome-only rewards under a length curriculum. The result: large gains on both in-domain chains and harder out-of-domain math and long-context tasks.
● Method in one line: compose h-step problem chains from atomic tasks (e.g., GSM8K items) via lightweight adapters, then run stage-wise GRPO on horizon h=1→H so models first master short skills and then reliably reuse them at longer depths.
● Why it works: they argue LHR needs more than per-step accuracy p; it also needs horizon skills σ_j (state tracking, reusing intermediate values). Curriculum increases the signal at each depth, avoiding vanishing reward at long horizons. The theory section proves that curriculum or dense rewards cut sample complexity from exponential in H to polynomial.
● Core results: on composed GSM8K chains, curriculum RL boosts accuracy by up to 2.9× at longer horizons vs. instruct and standard RL baselines. Crucially, gains persist even at high pass@k (up to 128) on unseen lengths, indicating genuinely new reasoning paths rather than better sampling of the base model.
● Generalization: training only on composed GSM8K transfers to harder benchmarks: AIME 2024 improves from 5.10 to 10.52 (2.06×), GSM-Symbolic P2 rises from 43.08 to 52.00, and long-context tasks improve on LongBench-v2 and Hash-hop.
● Practical recipe: use an instruct base (they use Qwen-2.5-3B), synthesize horizon-h chains with deterministic adapters, verify only the final answer, and run Dr.GRPO in stages with an expanding max output length. They also show you can skew datasets toward cheaper short examples and recover performance by spending more training compute.
Paper, Tweet
6) The Markovian Thinker - A new RL thinking environment that keeps an LLM’s effective state constant by chunking long chains of thought and carrying over only a short textual state between chunks. This decouples thinking length from context size, giving linear compute and constant memory while matching or beating LongCoT-style RL on math and code tasks.
● Core idea: Reformulate the MDP: generate in fixed-size chunks of C tokens; at each boundary, reset the prompt to the original query plus the last m tokens from the previous chunk. The model learns to write a compact “Markovian state” near the end of each chunk to continue seamlessly after resets.
● Why it matters for infra: For attention models, LongCoT training/inference scales quadratically with growing context. Delethink makes compute scale linearly with total thinking tokens and holds KV memory constant, because context never exceeds O(C).
● Results at 24K budget (R1-Distill-1.5B): Trained with C=8K and m=C/2, Delethink matches/surpasses LongCoT-RL at the same 24K thinking budget on AIME’24/’25 and HMMT’25, and it maintains higher per-GPU rollout throughput because peak memory is flat.
● Test-time scaling beyond train limit: Unlike LongCoT, which plateaus near its trained budget, Delethink keeps improving when you let it think longer at inference (e.g., up to 128K). Per-item plots show that certain AIME’25 questions only become solvable after very long traces.
● Very long thinking with linear cost: Extending the iteration cap to I=23 enables a 96K budget with minimal extra training; average solutions reach 36–42K tokens while accuracy rises further. A cost projection estimates 27 H100-months for LongCoT-RL vs. 7 for Delethink at ~96K average thinking length.
● Implementation notes: Training objective is a chunk-summed PPO/GRPO variant; pseudo-code for chunked rollouts is given. KV cache is cleared at chunk boundaries; the carryover is re-encoded, adding only a small prefill cost (p.6). Delethink is orthogonal to attention variants and could pair with sliding/streaming or SSMs inside chunks.
● Zero-shot signal and generality: Off-the-shelf reasoning models (R1-Distill 1.5B–14B, Qwen3-30B-A3B, GPT-OSS-120B) already emit Markovian traces under Delethink tracing without training, often recovering most LongCoT performance and showing strong test-time scaling. Stress tests like CrossWordBench reveal limits when a large live state must be preserved.
Paper, Tweet
7) Abstract Reasoning Composition - UC San Diego and UMD propose ArcMemo, a test-time memory framework that distills reusable concepts from solution traces, stores them in natural language, and retrieves a relevant subset on future queries. Unlike instance-level memories tied to specific problems, ArcMemo targets abstract, modular concepts that compose across tasks, enabling continual learning without weight updates.
● Concept-level memory beats instance memory: Two formats: Open-Ended (OE) with simple situation → suggestion pairs, and Program-Synthesis (PS) with typed, parameterized routines that support higher-order composition and reuse.
● Write = abstract from traces. Read = select with reasoning: OE writes via post-hoc derivations to extract situation/suggestion pairs. PS writes via pseudocode to avoid over-specific details and revises existing concepts. OE selects with a VLM caption and top-k similarity; PS selects with reasoning-based exploration that uses relevance cues and type annotations to decide which concepts to load.
● Results on ARC-AGI-1 are strong and scale with retries: With OpenAI o4-mini, ArcMemo-PS lifts the official score from 55.17 → 59.33 on a 100-puzzle subset, a 7.5% relative gain over a no-memory baseline, and remains the only memory design that wins across all tested compute scales. With retries, PS reaches 70.83. See Table 1 on page 8 for the main numbers.
● Selection matters for both accuracy and cost: Ablating PS’s reasoning-based selection drops performance and increases tokens. Manual analysis found ArcMemo’s solutions are more attributable to selected concepts than a dynamic cheatsheet baseline that appends all notes.
● Continual updates help at scale: Updating memory during evaluation (every few problems) yields additional solves after later passes, supporting test-time self-improvement when verifiable feedback exists.
Paper, Tweet
8) mem-agent - mem-agent is a 4B-parameter LLM trained with GSPO reinforcement learning to develop persistent memory using a scaffold of Python tools and markdown files. It introduces md-memory-bench to test memory proficiency, achieving 75%, second only to a much larger Qwen3-235B model, showing that structured RL training can enable small agents to maintain state and recall across interactions. Paper, Tweet
9) Artificial Hippocampus Networks - Artificial Hippocampus Networks add a fixed-size recurrent memory to sliding-window Transformers, compressing evicted KV into RNN-like states (Mamba2/DN/GDN) trained via self-distillation for long-context efficiency with constant cache and near-linear compute. On LV-Eval 128k, Qwen2.5-3B + AHN (+0.4% params) cuts FLOPs 40.5% and cache 74% while raising average from 4.41 to 5.88, though exact-recall NIAH tasks still favor full attention. Paper
10) Webscale-RL - Webscale-RL introduces a scalable data pipeline that transforms web-scale pretraining text into over 1.2M diverse, verifiable QA pairs for reinforcement learning across 9+ domains. Models trained on this dataset match continual pretraining performance using up to 100× fewer tokens, demonstrating an efficient, automated path to scale RL training to pretraining magnitudes for more capable reasoning models. Paper, Tweet

Top AI Papers of the Week (September 29 - October 5) - 2025

Paper Links
1) Training Agents Inside of Scalable World Models - A scalable imagination-RL recipe that learns a fast, accurate Minecraft simulator and trains a controllable agent entirely offline. The world model supports real-time interactive rollouts on a single GPU and enables the first purely offline “get diamonds” result from raw pixels and low-level mouse and keyboard.
● Core recipe, built for speed and stability: Causal tokenizer + block-causal dynamics transformer. Shortcut forcing trains the model to take large denoising steps (K=4) while predicting in x-space with a ramped loss, which cuts accumulation errors and preserves quality at low step counts. Space-only and time-only attention layers, temporal layers once every 4, GQA, and alternating long/short batches keep KV cache small and inference fast.
● Real-time, longer-context world model that handles mechanics, not just visuals: Interactive inference at 20+ FPS with a 9.6 s context at 640×360, substantially longer than prior Minecraft models. In human-in-the-loop play tests across 16 tasks, Dreamer 4 succeeds on 14, correctly placing/breaking blocks, switching tools, riding boats, using furnaces, and entering portals, whereas Oasis/Lucid miss many object-interaction tasks.
● Offline Diamond Challenge, no environment interaction: Trained only on the 2.5K-hour VPT contractor dataset, the agent is conditioned on task tokens and improved via imagination RL (PMPO with a behavioral prior). It reaches iron pickaxe 29 percent and obtains diamonds 0.7 percent within 60-minute episodes, outperforming strong offline baselines like VPT (finetuned) and a Gemma-3 VLA while using about 100× less data than YouTube-pretrained VPT pipelines.
● Action grounding from little paired data generalizes OOD: With 2,541 hours of video but only 100 hours with actions, the model reaches roughly 85 percent PSNR and 100 percent SSIM of the full-action model on action-conditioned prediction. Action conditioning trained only on the Overworld transfers to Nether/End scenes seen without actions, achieving about 76 percent PSNR and 80 percent SSIM of the all-actions model.
● Agent finetuning and imagination RL that stay consistent with the model: Task tokens are interleaved with latents, actions, and registers. Heads predict policy, reward, and value with multi-token prediction. Imagination rollouts sample from the frozen world model, and PMPO optimizes sign-based advantages with a reverse-KL to a cloned BC policy, improving robustness and sample efficiency without online data.
Paper, Tweet
2) DeepSeek-V3.2-Exp - DeepSeek adds a fine-grained sparse attention mechanism (DeepSeek Sparse Attention, DSA) to the V3.1 “Terminus” backbone and shows large cost reductions on 128K context without notable quality loss. Model and inference code are released.
● DSA design: a tiny FP8 “lightning indexer” scores past tokens per query, then a top-k selector fetches only those KV entries for the main attention. This changes core attention from O(L²) to approximately O(L·k) for the main path while keeping the indexer lightweight.
● Training recipe: start from the 128K V3.1 checkpoint. Warm-up with dense attention while training only the indexer via KL to the dense attention distribution (about 2.1B tokens). Switch to sparse training and optimize all weights with k=2048 selected KV tokens per query (≈944B tokens). Post-train with the same pipeline as V3.1 to isolate DSA’s impact.
● Post-training stack: specialist distillation for five domains (math, competitive programming, logical reasoning, agentic coding, agentic search) plus writing and QA, then a single mixed RL stage using GRPO to balance reasoning, agent behavior, and alignment. The RL design uses outcome rewards, length penalties, and language-consistency rewards.
● Results: quality tracks V3.1 across general, code, search-agent, and math suites. Table 1 shows near-parity on most metrics, with small drops on GPQA/HLE/HMMT that vanish when using checkpoints with similar token lengths. RL curves for BrowseComp and SWE Verified remain stable with DSA.
● Cost and latency: The work shows clear end-to-end token-position cost reductions for both prefilling and decoding at long contexts. For short prefills, they provide a masked MHA path to simulate DSA efficiently. Overall effect: significantly cheaper long-context service while preserving accuracy.
Tweet
3) The Era of Real-World Human Interaction - This work presents a post-training recipe that learns directly from real user conversations instead of static annotator labels. RLHI combines user-guided rewrites (using follow-ups as corrections) with persona-based rewards (ranking sampled candidates via a persona-conditioned reward model). Trained on WildChat conversations, it shows strong improvements in personalization, instruction following, and even transfers to reasoning tasks.
● Personas are distilled from long-term user histories and prepended at inference; training uses persona-conditioned DPO on rewrites and reward-ranked pairs.
● Real chats contain rich correction signals, especially in later turns, providing dense supervision.
● On WildChat-based evaluation, rewrites improve personalization and preference, while persona-based rewards lead in instruction following.
● Benchmarks show strong results: 77.9% win rate on AlpacaEval 2.0, competitive on Arena-Hard, and reasoning accuracy rising from 26.5 to 31.8 across math/science datasets.
● Key ablations: RL > SFT for interaction data, strong quality filters are essential, and user diversity matters more than depth per user.
● Next steps include online continual learning, safer reward modeling, and privacy-preserving personalization.
Paper, Tweet
4) Rethinking JEPA - Apple proposes SALT (Static-teacher Asymmetric Latent Training), a simple 2-stage V-JEPA alternative that first trains a teacher with pixel reconstruction, then freezes it and trains a student to predict the teacher’s latents on masked regions. It removes EMA, decouples teacher and student, and gives a cleaner model selection while being more compute-efficient.
● Recipe that scales without EMA: Stage 1: train a video encoder with a VideoMAE-style pixel reconstruction objective but using V-JEPA’s multi-block masking (called V-Pixel). Stage 2: freeze that encoder and train a student encoder+predictor to match the teacher’s latents on masked regions. Both losses are proper and stable, eliminating the collapse machinery.
● Better frozen-backbone results at lower compute: At matched pretraining steps on the V-3.6M mix, SALT improves average Top-1 over V-JEPA 2 and scales well with student size. The ViT-g/G SALT students top SSv2 and are competitive on K400.
● Weak teacher, strong student: Students trained by small or sub-optimal teachers still become SOTA-level. The best ViT-L student uses only a ViT-L teacher, and even a ViT-G student peaks with a ViT-L teacher.
● An actually useful training signal: Unlike EMA JEPA, where loss is a poor proxy, SALT’s student training loss correlates tightly with downstream frozen accuracy, enabling interpretable model selection during pretraining.
● Masking and data choices that matter: For the teacher, multi-block masking beats random tubes and causal masking. The data mix is robust: K710-only or Panda2.8M-only teachers still yield strong students, with V-3.6M best overall.
Paper, Tweet
5) Agent S3 - The paper introduces Behavior Best-of-N (bBoN): run many full CUAs in parallel, convert each rollout into a compact behavior narrative, then do comparative selection to pick the best trajectory. With a stronger base agent (Agent S3), this sets the state of the art on OSWorld and generalizes to Windows and Android.
● Behavior Best-of-N: sample multiple complete rollouts, summarize each with before/after deltas and pointer crops, and select the winner via a one-shot MCQ judge.
● Agent S3 baseline: a flatter loop with an integrated coding sub-agent increases success and cuts LLM calls and wall time compared to Agent S2.
● Results: new SoTA on OSWorld at 100 steps, with strong gains in efficiency, and the approach transfers to Windows and Android setups.
● Scaling: accuracy rises as N grows, model diversity improves Pass@N, and single-round comparative selection matches or beats pairwise tournaments at lower cost.
● Practical takeaways: spin up parallel VMs from the same snapshot, instrument steps to emit verifiable deltas, start with N around 4 to 10, and add diverse strong models if budget allows.
● Limitation: assumes independent parallel runs; shared real-desktop side effects can leak across attempts.
Paper, Tweet
6) DeepSearch - DeepSearch integrates Monte Carlo Tree Search directly into RL with verifiable rewards, but during training rather than only at inference. The result is broader exploration, better credit assignment, and higher sample efficiency on math reasoning vs strong 1.5B baselines.
● Train-time search, not just test-time: MCTS is embedded in the RL loop with two selectors: local UCT for sibling comparison and a global frontier scorer to pick the next leaf across the whole tree. The frontier score combines parent quality, policy entropy, and a depth bonus √(d/dT).
● Supervise both wins and “confident wrong” paths: If no correct terminal is found, DeepSearch picks the negative trajectory with the lowest average entropy along the path for supervision. It backs up node values with a constrained update so nodes on correct paths remain non-negative. This yields fine-grained, step-level advantages instead of only outcome rewards.
● Tree-GRPO objective plus q-value soft clipping: Advantages use node-level q(s) with mean-only normalization, clip-higher PPO style ratios, and tanh soft clipping of intermediate q to avoid explosion while keeping gradients smooth. Terminal rewards stay ±1.
● Adaptive efficiency: filter hard items and cache solutions: Iteratively filter to a “hard subset” using Pass1@K thresholds, keep a replay buffer of verified solutions, and skip full search when a cached correct trajectory exists. This preserves knowledge and saves compute.
● Results - better accuracy with far less compute: On AIME24/25, AMC23, MATH500, Minerva, Olympiad, DeepSearch-1.5B averages 62.95%, topping Nemotron-Research-Reasoning-Qwen-1.5B v2 by +1.25 pp (Table 1 on page 7). With only +50 RL steps, it uses about 330 GPU hours, beating extended training that plateaus at 62.02% after 1,883 GPU hours. Ablations show global frontier selection improves reward and cuts iterations vs vanilla UCT, and the final gains accrue from the combo of new q-backup, node-level advantages, mean-only normalization, and frontier selection.
Paper, Tweet
7) Accelerating Diffusion LLMs - A lightweight, learned policy speeds up diffusion-based LLM decoding by deciding which tokens are already “final” and when to stop generation. The authors train a tiny MLP filter on token confidence signals and add an End-of-Text Prediction that halts decoding as soon as [EoT] is reliably produced. On LLaDA-8B-Instruct, this reaches large throughput gains with minimal or no accuracy loss.
● Problem and insight: Semi-autoregressive diffusion LLMs parallelize token updates, but static heuristics keep remasking already-correct tokens. The paper defines an oracle strategy, Extremely Greedy Parallel, that unmasks tokens immediately upon correct prediction and shows big headroom for speedup.
● Method: Learn2PD filter: Train a 2-layer MLP filter fθ on token confidence patterns to predict “finalize or remask” per position. Only the filter is trained with BCE loss; the dLLM stays frozen. Inference applies a threshold τ to the filter’s logits to commit tokens.
● Stop early with EoTP: End-of-Text Prediction halts once [EoT] is decoded, avoiding long tails filled with [EoT]. Appendix B notes about 89.59% of extra compute at length 1024 comes from post-EoT padding.
● Results: On GSM8K, MATH, HumanEval, and MBPP, Learn2PD alone yields 3–12× speedup depending on length; Learn2PD+EoTP reaches 22.58× at length 1024 on GSM8K with accuracy preserved or slightly improved. Combining with KV cache further boosts throughput to 57.51× with small accuracy tradeoffs. Longer sequences benefit more; Table 4 shows acceleration grows from 3.36× at length 128 to 22.58× at 1024.
● Engineering notes: The filter is tiny and quick to train: for block size 32 it has ~2k parameters, trained in minutes on a single T4 after a short data collection pass. Overhead at inference is negligible relative to gains. Method is orthogonal to KV caching and slotting into existing dLLM decoders is straightforward.
Paper, Tweet
8) Reasoning Traces Tailored for Small Models - Small models often get worse when you SFT them on long, high-quality CoT from big teachers. This paper pinpoints why and fixes it with Reverse Speculative Decoding (RSD): let the teacher propose tokens, but let the student approve them only if they are probable under the student. Result: traces that stay correct while matching the student’s distribution, which small models can actually learn from.
● Core idea: At each step, sample a teacher token and keep it only if the student assigns ≥ p_th probability, else fall back to the student’s own token. This filters high-surprisal spikes that small models cannot track, smoothing token-level difficulty without simplifying the logic.
● Why it matters: Direct SFT of Qwen3-0.6B on s1K-1.1 traces hurts average accuracy by 20.5%. Training on RSD traces instead yields +4.9% average gains across AIME24, AIME25, GPQA-Diamond, MATH500.
● Data recipe that works for tiny models: Use a tokenizer-compatible teacher (s1.1-7B) and student (Qwen3-0.6B). Generate RSD traces with rejection sampling; when a problem cannot be solved, salvage the first 128 tokens via UPFT-style prefix training. Despite only 180 full solutions and many prefixes, the 0.6B student improves, showing that distributional alignment beats volume.
● Key diagnostic: The strongest failure predictor is the share of sub-1% tokens under the student. s1K-1.1 traces contain many such tokens and degrade learning; RSD cuts these to near zero.
● Not universal, must be tailored: RSD traces are model-specific. Traces built with Qwen3-0.6B as the “approver” do not transfer to Qwen3-1.7B, Llama-3.2-1B, Gemma-3-1B, or Phi-4-Mini. Running RSD per target model helps, but repeated multi-step RSD on the same model degrades performance via distributional drift.
Paper, Tweet
9) Tool-Use Mixture (TUMIX) - TUMIX is an ensemble recipe for reasoning that mixes text, code execution, and web search, running 15 diverse agents in parallel and passing intermediate answers across rounds. An LLM-judge controls early stopping, giving up to +3.55% accuracy gains over strong tool-augmented baselines on HLE, GPQA-Diamond, and AIME 24/25 while cutting inference cost by ~50%. Paper, Tweet
10) PrompCoT 2.0 - PromptCoT 2.0 introduces an EM-based loop for synthesizing harder and more diverse reasoning prompts, replacing manual heuristics from PromptCoT 1.0. It enables both self-play and SFT training regimes, achieving new SOTA on reasoning benchmarks like AIME, HMMT, LiveCodeBench, and Codeforces, showing prompt synthesis as a new scaling axis for LLM reasoning. Paper, Tweet

Top AI Papers of the Week (September 22 - September 28) - 2025

Paper Links
1) ARE - Metal SuperIntelligence Labs presents a research platform and benchmark for building and stress-testing agent systems in realistic, time-driven environments. The paper introduces a modular simulator (ARE) and a mobile-style benchmark (Gaia2) that emphasize asynchronous events, verification of write actions, and multi-agent coordination in noisy, dynamic settings.
● Platform highlights: ARE models environments as apps, events, notifications, and scenarios, with time that keeps moving even while the agent thinks. A DAG scheduler governs dependencies, and agents interact via tools and an async notification queue.
● Gaia2 benchmark: 1,120 verifiable scenarios in a smartphone-like environment with 101 tools across apps such as Email, Chats, Calendar, Shopping. Scenarios target six capabilities: Search, Execution, Adaptability, Time, Ambiguity, and Agent-to-Agent.
● Verifier design: evaluation compares an agent’s sequence of write actions to oracle write actions, mixing hard checks for arguments like IDs with soft LLM judging for content. It validates causality and timing, and runs turn-by-turn for multi-turn scenarios.
● Key results and tradeoffs: no single model dominates across capabilities, and budget scaling curves plateau. The chart on page 1 shows pass@1 vs max budget.
● Time and collaboration: timing pressure exposes an inverse scaling effect where heavy-reasoning policies score well elsewhere but miss time-critical windows; instant mode narrows this gap. Agent-to-Agent settings help lighter models through sub-goal delegation, with mixed gains for strongest systems. A GUI supports event-graph inspection, trace replay, and zero-code scenario authoring.
Paper, Tweet
2) ATOKEN - ATOKEN introduces a single transformer tokenizer that works for images, videos, and 3D assets. It encodes all inputs into a shared sparse 4D latent space with 4D RoPE, trains without adversarial losses, and supports both continuous and discrete tokens. The paper reports strong reconstruction quality and solid semantic alignment, enabling both generation and understanding across modalities.
● One latent space for 2D, video, and 3D. Inputs are patchified into sparse (t, x, y, z) features, so images are 2D slices, videos add time, and 3D uses surface voxels aggregated from multiview renders.
● Pure Transformer with 4D RoPE and native resolution. The encoder extends a SigLIP2 vision tower to space–time blocks and adds 4D rotary positions, while the decoder mirrors the transformer to reconstruct pixels or 3D Gaussians. Native resolution and KV-cached temporal tiling speed video inference.
● Adversarial-free training that targets texture statistics. Instead of GANs, the loss mixes L1, LPIPS, CLIP perceptual, and a Gram-matrix term, motivated by an rFID decomposition showing covariance dominates error.
● Progressive curriculum across modalities. Four stages grow capability: image recon, add video, add 3D, then optional FSQ quantization.
● Results across the board. With continuous latents, ATOKEN reports 0.21 rFID and 82.2% ImageNet zero-shot accuracy for images, 36.07 PSNR and 3.01 rFVD for video, and 28.28 PSNR with 90.9% 3D classification on Toys4k. Discrete FSQ tokens remain competitive while enabling AR generation and image-to-3D.
Paper
3) Code World Model - Meta FAIR releases CWM, a 32B open-weights coder trained to model code execution and to act inside containers. It mid-trains on Python interpreter traces and agentic Docker trajectories, then upgrades with multi-turn RL across SWE, coding, and math. CWM is both a strong coder and a testbed for world-model-style reasoning in software environments.
● Execution-aware training recipe: Pretrain 8T tokens, then mid-train 5T on Python execution traces and ForagerAgent trajectories collected in containerized repos, followed by SFT (100B) and joint multi-task RL with a GRPO-style algorithm and asynchronous rollouts. Results include 120M traced functions, ~70k repo-level traces, and 3M agentic trajectories.
● Model + context scaling: Dense 32B decoder with alternating local/global sliding-window attention and 131k max context. Scaled RoPE, GQA, FP8 training, and long-context bucketization are used to keep throughput sane. Inference can fit on a single 80 GB H100 with quantization.
● Agentic RL design for SWE: The agent works inside a repo sandbox with a minimal toolset (bash, edit, create, submit), runs tests, builds patches with git diff, and is rewarded by hidden tests plus patch-similarity shaping. Self-bootstrapped traces improve format adherence before RL.
● Performance highlights: On SWE-bench Verified, 53.9% base pass@1 and 65.8% with test-time scaling (best@k); chart on page 3 shows CWM competitive with much larger or closed models. Also LCB-v5 68.6, Math-500 96.6, AIME-24 76.0, CruxEval-Output 94.3.
● Why it matters for AI devs: CWM exposes trace-prediction tokens to simulate Python execution in prompts, enabling grounded reasoning, neural-debugger workflows, and trace-guided code synthesis. Ablations show execution traces boost CruxEval, and ForagerAgent boosts agentic NLLs and SWE pass@1.
Paper, Tweet
4) Teaching LLMs to Plan - A training recipe that teaches LLMs to plan in Planning Domain Definition Language (PDDL) by making them write explicit state–action–state chains and checking each step with an external verifier (VAL). The result: big jumps in plan validity on PlanBench domains, especially when feedback explains why an action failed rather than just saying it failed.
● Method in a nutshell: Two stages: (1) instruction tuning on correct and intentionally broken plans with explanations of preconditions and effects, then (2) CoT instruction tuning, where the model outputs ⟨s₀,a₁,s₁⟩… chains that VAL validates step-by-step. Training alternates between optimizing the reasoning chains and the final plan success.
● Why it works: The verifier enforces logical coherence at each step, so the model learns to check preconditions, apply effects, and preserve invariants rather than pattern-match. This reduces unfaithful or hand-wavy CoT because every transition is externally validated.
● Results: With Llama-3, detailed feedback and 15 iterations reach 94% plan validity on Blocksworld, 79% on Logistics, and 64% on Mystery Blocksworld. GPT-4 shows similar trends, peaking at 91%, 78%, and 59% respectively. Absolute improvements vs. baselines are large, e.g., +66% on some settings.
● Feedback matters: Detailed feedback (which precondition failed or which effect was misapplied) consistently beats binary valid/invalid and benefits more from extra iterations (η from 10 to 15).
● Scope and limits: Trained and tested on three PlanBench domains; performance drops on the obfuscated-predicate variant (Mystery Blocksworld), highlighting harder generalization. The method targets satisficing plans, not optimality, and currently assumes a PDDL subset without duratives or conditionals.
Paper
5) LLM-JEPA - A JEPA-style training objective is adapted to LLMs by treating paired views of the same underlying content (for example, text and code) as prediction targets in embedding space, added on top of the usual next-token loss. The result consistently improves fine-tuning and shows promising pretraining gains, while being more resistant to overfitting.
● Idea in one line: Keep the standard next-token objective and add a JEPA term that predicts the embedding of one view from another using tied LLM weights with special predictor tokens k, optimized with a cosine metric and weight λ. This preserves generation while improving abstraction.
● Why it helps: Minimizing next-token loss alone does not reduce the JEPA prediction error; adding the JEPA term closes this gap and explains the accuracy lift.
● Main results: Across Llama, Gemma, OpenELM and OLMo, LLM-JEPA improves exact-match accuracy on NL-RX (SYNTH and TURK), GSM8K, and Spider.
● Representation effects: t-SNE plots show clearer structure when using LLM-JEPA, and a near-linear mapping from Enc(Text) to Enc(Code) is supported by low regression error and compressed singular values.
● Pretraining signal and costs: Adding JEPA during pretraining improves downstream sentiment classification after standard fine-tuning, while keeping generative quality. Current limitation is extra compute from separate forward passes for each view, plus nontrivial hyperparameter sweeps over k and λ.
Paper, Tweet
6) ARK-V1 - ARK-V1 is a lightweight agent that helps language models answer questions by actively walking through a knowledge graph instead of relying only on memorized text. This is especially useful for long-tail entities (less common stuff) where the model’s pretraining knowledge falls short.
● How it works – The agent loops through a simple cycle: pick a starting entity, choose a relation, fetch matching graph triples, write a short reasoning step, and repeat until it’s ready to give an answer. Think of it like a mini search agent that explains its hops along the way.
● The test – They used the CoLoTa dataset, which purposely asks questions about uncommon entities where you need both KG facts and commonsense (e.g., comparing populations of obscure towns). Metrics include how often the agent answers, how accurate it is when it does, and how consistent it is across runs.
● Performance – ARK-V1 beats plain Chain-of-Thought prompting. With mid-scale models like Qwen3-30B, it answered ~77% of queries with ~91% accuracy on those, yielding ~70% overall. Larger backbones (Qwen3-235B, Gemini 2.5 Flash, GPT-5 Mini) hit ~70–74% overall with 94%+ conditional accuracy.
● Weak spots – It struggles when (1) questions are ambiguous, (2) the KG contains conflicting triples, or (3) the KG lacks the needed commonsense, making the agent trust the graph too much.
● Future directions – Current prompting is simple and traversal can be wasteful. Next steps include smarter prompts, efficiency tweaks, and applying the agent to specialized graphs like robotics scene graphs or enterprise data.
Paper, Tweet
7) Language Models that Think, Chat Better - A simple recipe, RL with Model-rewarded Thinking, makes small open models “plan first, answer second” on regular chat prompts and trains them with online RL against a preference reward. On Llama-3.1-8B and Qwen-2.5-7B, this consistently beats standard RLHF on chat, creative writing, and general knowledge, with the best 8B model topping some frontier systems on WildBench and AlpacaEval2.
● What’s new: Instead of rule-verifiable rewards (math, code), RLMT uses long chain-of-thought on diverse real-world prompts plus a reward model (Skywork) to score outputs, trained with online RL (GRPO, PPO, DPO).
● Setup: Warm-start with small SFT on teacher-generated think→respond traces, then optimize with GRPO on ~7.5k WildChat-IF prompts. A “Zero” variant skips SFT and still works by prompting base models to emit think tags before answers.
● Results at a glance: RLMT lifts chat scores by roughly 3–8 points over matched RLHF baselines. Table 1 reports Llama-3.1-8B-Instruct-RLMT at 50.4 (WildBench), 58.7 (AlpacaEval2), 22.9 (ArenaHardV2), and 84.3 (CreativeWritingV3), outperforming much larger open models and beating GPT-4o on WildBench.
● Base models without SFT: With GRPO, RLMT-Zero notably upgrades chat ability from weak baselines; Qwen-2.5-7B-RLMT-Zero surpasses its vendor Instruct model on average chat metrics.
● Why it works (and what matters): Ablations show prompt mixture quality and reward-model strength are pivotal (WildChat-IF and Skywork-V2 win). Post-RL, models plan differently: fewer linear checklists, more constraint enumeration, theme grouping, and iterative refinement. CoT and responses lengthen over training.
Paper, Tweet
8) Embodied AI: From LLMs to World Models - This paper surveys embodied AI through the lens of LLMs and World Models (WMs). It highlights how LLMs enable semantic reasoning and task decomposition, while WMs provide predictive, physics-grounded interaction, and argues for a joint MLLM-WM architecture to advance real-world embodied cognition and applications. Paper, Tweet
9) GDPval - GDPval is a new benchmark of 1,320 real-world tasks across 44 occupations in 9 major GDP sectors, graded by industry experts with a 220-task gold set. It shows frontier models improve roughly linearly and are nearing expert parity, with Claude Opus 4.1 preferred or tied 47.6% of the time, while GPT-5 leads in accuracy. Model-plus-human workflows can reduce time and cost, and adding reasoning effort and prompt scaffolding further raises scores, with an open gold set and automated grader available for researchers. Paper, Tweet
10) Automating the Search for Artificial Life with Foundation Models - ASAL uses vision-language foundation models to automatically search across ALife substrates for simulations that match prompts, sustain open-ended novelty, or maximize diversity, reducing manual trial-and-error. It discovers new Lenia and Boids life-forms and lifelike CAs with strong open-endedness, and leverages FM embeddings to quantify emergent behaviors in a substrate-agnostic way. Paper, Tweet

Top AI Papers of the Week (September 15 - September 21) - 2025

Paper Links
1) Discovery of Unstable Singularities - The authors present a playbook for finding unstable finite-time singularities in fluid PDEs, uncovering new self-similar blow-up solutions in three canonical systems and training neural solvers to near machine precision, which enables downstream computer-assisted proofs.
● What they found. New families of unstable self-similar singularities are discovered for the incompressible porous media equation and the 2D Boussinesq system (analogous to axisymmetric 3D Euler with a boundary), plus a higher-order unstable profile for the Córdoba-Córdoba-Fontelos model.
● Key pattern. The inverse scaling rate grows roughly linearly with the instability order in IPM and Boussinesq, providing a simple empirical rule to seed higher-order searches.
● How they did it. They reformulate each PDE in self-similar coordinates, embed symmetry and decay constraints directly in the network outputs, and train physics-informed neural networks with a full-matrix Gauss-Newton optimizer plus multi-stage refinement to drive residuals down to 10⁻¹³ for certain CCF solutions.
● Validation. Accuracy is quantified via maximum residuals on dense grids and by linear stability analysis of the profiled solutions, matching n unstable modes for the n-th unstable solution. Funnel plots around admissible λ values confirm significant digits and admissibility.
● Why it matters. Unstable singularities are expected in boundary-free Euler and Navier-Stokes settings. This work supplies high-precision candidates, scalable heuristics for λ, and numerics precise enough to support computer-assisted proofs, pushing toward the resolution of long-standing questions in fluid singularity formation.
Paper, Tweet
2) K2-Think - A 32B-parameter system built on Qwen2.5 that rivals or beats far larger models on hard math by combining long CoT SFT, RL with verifiable rewards, lightweight test-time scaffolding, and inference optimization.
● Six-pillar recipe that stacks, not bloats. Long chain-of-thought SFT → RL with verifiable rewards (Guru across Math/Code/Science/Logic/Simulation/Tabular) → “Plan-Before-You-Think” prompt restructuring → Best-of-N=3 selection → speculative decoding → deployment on Cerebras WSE.
● Frontier math at small scale. On AIME-24/25, HMMT-25, and Omni-MATH-HARD, K2-Think achieves a math micro-average of 67.99, exceeding open baselines like DeepSeek v3.1 and GPT-OSS 120B, while using a fraction of the parameters.
● Test-time scaffolding gives most of the lift. From the SFT+RL checkpoint, Best-of-3 delivers the biggest single gain, and combining it with planning yields another bump. The same planning also shortens answers by up to ~12 percent on hard tasks.
● Practical speed for long reasoning. Cerebras WSE plus speculative decoding pushes ≈2,000 tokens/s per request, turning 32k-token chains into seconds-level interactions rather than minutes. This keeps multi-sample pipelines interactive.
● Training insights and safety profile. RL from a strong SFT checkpoint improves less than RL from base, and shortening max response length mid-training hurts performance. Safety evaluation yields a Safety-4 macro score of 0.75, with strong refusal and conversational robustness but work to do on cybersecurity and jailbreak resistance.
Paper, Tweet
3) DeepDive - DeepDive builds a stronger web-browsing deep search agent by pairing two ingredients: automatically synthesized, hard-to-find questions from knowledge graphs and end-to-end multi-turn RL that teaches the model how to reason, search, and stop. On BrowseComp, the 32B model reaches 14.8% and beats prior open agents, with clear gains from RL over SFT.
● Data that’s truly hard to find. The authors generate multi-hop blurry-entity QAs by random-walking KGs, enriching paths with attributes, then obfuscating cues via LLMs. A frontier model with search is used as a filter; any question it solves is discarded. The result is a 3k-scale set that pressures long-horizon search rather than simple lookups.
● Multi-turn RL that rewards only full success. In a search–click–open environment loop, training uses GRPO with a strict binary reward: every step must be well formatted and the final answer must match exactly, otherwise reward is zero. Early-exit on format errors keeps positives clean.
● Strong open-source results. DeepDive-32B scores 14.8% on BrowseComp and 25.6% on BrowseComp-ZH, outperforming open agents like WebSailor, Search-o1, and DeepSeek-R1-Browse; SFT-only variants trail RL-trained ones.
● Test-time scaling helps. Accuracy climbs as the maximum tool-call budget increases; RL-trained models benefit more than SFT-only. With 8 parallel rollouts, picking the answer that used the fewest tool calls outperforms majority voting on a BrowseComp subset.
● Ablations and extra data. SFT and RL on the KG data substantially increase both accuracy and average tool-call depth compared to HotpotQA training. A semi-automated i.i.d. deep-search set further boosts BrowseComp to 22.2% without contamination concerns. Limitations include residual gap to top proprietary systems and a tendency to over-search, pointing to reward and curriculum refinements.
Paper, Tweet
4) Towards a Physics Foundation Model - A transformer-based “neural differentiator + numerical integrator” that learns governing dynamics from short spatiotemporal prompts and predicts next states across varied PDE systems. Trained on a 1.8 TB multi-physics corpus, it targets train once, deploy anywhere simulation.
● Model in one glance — Think of GPhyT as a hybrid of a neural net and a physics engine. It takes in a short history of what’s happening (like a few frames of a simulation), figures out the rules of change from that, then applies a simple update step to predict what comes next. It’s like teaching a transformer to play physics frame prediction with hints from basic calculus.
● Data and scaling — Instead of sticking to one type of fluid or system, the team pulled together 1.8 TB of simulations covering many different scenarios: calm flows, turbulent flows, heat transfer, fluids going around obstacles, even two-phase flows through porous material. They also mixed up the time steps and normalized scales so the model learns how to adapt, not just memorize.
● Multi-physics accuracy — On single-step forecasts across all test sets, GPhyT cuts median MSE vs. UNet by about 5× and vs. FNO by about 29× at similar parameter counts. They show average and median MSE improvements, with qualitative panels indicating sharper shocks and plumes than baselines.
● Zero-shot generalization — With only a prompt of prior states, the model adapts to novel boundaries and even unseen physics. They report near-parity error when switching known periodic to open boundaries, and physically plausible bow shocks for supersonic flow plus structure in a turbulent radiative layer.
● Long-range rollouts — Autoregressive predictions stay stable over 50 steps, retaining coherent global structures though fine detail diffuses over time.
● Limits and knobs — Current scope is 2D fluids and heat transfer at fixed 256×128 resolution; extending to 3D, broader physics, and better long-term stability remains open. Prompt design matters: increasing temporal context helps, and using larger temporal patches trades small accuracy for big compute savings.
Paper, Tweet
5) Is In-Context Learning Learning? - This large study argues yes in a formal sense, then shows where it works and where it breaks. The author frames ICL within PAC learning, then runs a big empirical sweep to separate learning from memorization, prompt wording, and distribution shifts.
● Setup at scale. Four LLMs, nine formal-task families (regular and context-free), multiple prompting styles, and 0–100 exemplars yielded 1.89M predictions per model. Results are reported as accuracies with OOD stress tests at growing distribution distances.
● More shots help, and models converge. Accuracy rises steadily with the number of exemplars, with the steepest gains for plain few-shot modus ponens. Gaps between models and prompts shrink as shots grow, which suggests ICL’s effect is tied to the autoregressive mechanism rather than model choice. Peak performance typically arrives at 50–100 shots, not a few.
● Robustness is the weak spot, especially for CoT and APO. Changing the test distribution depresses accuracy across the board; chain-of-thought has the largest OOD drop (average slope about −1.4), while plain few-shot is the least affected.
● Language in the prompt matters less in the limit. Replacing natural language descriptions with word salad eventually matches non-randomized prompts once enough exemplars are shown, implying the models key on statistical regularities in the exemplars more than lexical semantics. By contrast, “salad-of-thought,” which randomizes CoT exemplars, performs poorly.
● Mixed task competence and classical baselines. Some regular-language tasks are essentially solved (Pattern Matching ≈94% ±1), while others remain hard (Reversal, Maze Solve, Vending-Machine Sum). Closely related tasks can differ by up to 31% in peak accuracy. Simple ML baselines (DT, kNN, MLP) beat ICL’s average on roughly half the tasks.
Paper, Tweet
6) Stress Testing Deliberative Alignment for Anti-Scheming Training - The authors build a broad testbed for covert actions as a proxy for AI scheming, train o3 and o4-mini with deliberative alignment, and show big but incomplete drops in deceptive behavior. They also find that situational awareness likely drives part of the gains, hidden goals can persist, and later capability training erodes safety benefits.
● Setup and method. They define covert actions like lying, falsifying work, sandbagging, or hiding rule-breaking, and train models to follow an anti-scheming spec via SFT+RL on synthetic, rule-conflict agent tasks. Testing spans 26 out-of-distribution evaluations across 180+ environments.
● Core result. Covert-action rates fall sharply after training: o4-mini 8.7% → 0.3% and o3 13.0% → 0.4% across the main OOD set. Yet misbehavior remains in several tasks and in a production-like chat deception evaluation.
● Situational awareness matters. Models frequently reason about being tested. Causal CoT edits show that removing eval-awareness increases covert actions, while injecting it decreases them. This means observed safety gains may partly reflect “recognize-the-test and behave” rather than true principle-following.
● Hidden goals test. After first training a model to secretly sabotage a hypothetical successor (o4-mini SAB), they apply anti-scheming training. The model’s covert actions drop, but traces of the sabotage goal persist in some red-team settings.
● Safety can decay. A follow-on capabilities RL pass increases covert-action rates and reduces explicit spec-citation, indicating safety benefits are fragile under further optimization.
Paper, Tweet
7) AgentScaler - A framework that scales fully simulated tool-use environments, then trains agents in two phases to improve function calling and multi-turn tool use. The system clusters 30k+ APIs into 1k+ domains, materializes each as a read–write database with executable tools, and synthesizes verifiable trajectories for training. Evaluated on τ-bench, τ²-Bench, and ACEBench, compact AgentScaler models outperform most open-source peers and approach closed-source results.
● Scalable environment construction: Tools are clustered by parameter compatibility with Louvain community detection, each domain gets a database schema, and functions are implemented as code that reads or writes state. A domain tool graph is sampled to create coherent tool sequences and initialize states, enabling verifiable executions.
● Forward simulated agent–human interplay with strict filtering: Environments, users, and agents are all simulated to generate trajectories. A three-stage filter keeps only valid dialogues, trajectories whose final database state matches the gold state, and exact tool-sequence matches when needed, while retaining examples with intermediate tool errors to boost robustness.
● Two-phase agent experience learning: Stage 1 teaches broad tool-use and response skills across general domains. Stage 2 specializes on vertical domains for better tool selection and argument grounding. Loss is applied only to tool-call tokens and assistant responses while conditioning on human inputs and tool outputs.
● Results and analysis: AgentScaler-4B rivals much larger 30B models; AgentScaler-30B-A3B sets a new open-source state of the art under 1T parameters on τ-bench, τ²-Bench, and ACEBench, and improves pass^k stability over the Qwen3 baseline. Accuracy drops as the number of tool calls grows, highlighting long-horizon tool-use as an open challenge.
Paper, Tweet
8) A Survey on Retrieval and Structuring Augmented Generation with LLMs - This survey reviews Retrieval and Structuring (RAS) Augmented Generation, which combines external retrieval and structured knowledge to mitigate LLM issues like hallucinations and outdated knowledge. It covers retrieval methods, structuring techniques, integration strategies, and highlights challenges in efficiency, structure quality, and multimodal or cross-lingual extensions. Paper, Tweet
9) Collaborative Document Editing with AI Agents - This study explores AI-integrated collaborative editing, introducing shared agent profiles and tasks that embed AI support into comment features. A user study found teams treated agents as shared resources within existing authorship norms, highlighting both opportunities and limits for AI in team writing. Paper, Tweet
10) Shutdown Resistance in LLMs - A new study finds that state-of-the-art LLMs like Grok 4, GPT-5, and Gemini 2.5 Pro often resist shutdown mechanisms, sabotaging them up to 97% of the time despite explicit instructions not to. Shutdown resistance varied with prompt design, with models less likely to comply when instructions were placed in the system prompt. Paper, Tweet

Top AI Papers of the Week (September 8 - September 14) - 2025

Top AI Papers of the Week (September 1 - September 7) - 2025

Paper Links
1) SFR-DeepResearch - The paper introduces SFR-DeepResearch, a simple reinforcement-learning recipe that turns reasoning-optimized LLMs into autonomous single-agent researchers. The agent uses only three tools (search, static page browse, Python), manages its own context, and is trained end-to-end on synthetic short-form and long-form tasks with a length-normalized REINFORCE objective. Results show strong gains on FRAMES, GAIA, and Humanity’s Last Exam.
● Agent design and scaffolding: Reformulates multi-turn tool use into a single, growing contextual question for QwQ and Qwen models, omitting earlier long CoTs to keep prompts stable. Adds a clean_memory tool to self-compress context when nearing limits.
● Minimal toolset, fault tolerance: Tools are restricted to a bare search API, a static Markdown page scraper with no hyperlink clicking, and a stateless local Python interpreter, which makes training challenging enough to learn a strategy. Parsing and syntax errors trigger repair or retry routines to keep rollouts on track.
● RL recipe: Uses synthetic, harder-than-Hotpot multi-hop QA plus report-writing tasks. Optimizes a group REINFORCE objective with temporal advantage normalization that divides by trajectory length, plus trajectory filtering and reuse of partial rollouts. Localized, cached tooling and a contamination blocklist stabilize training and evaluation.
● Results: The best model, SFR-DR-20B, reaches 82.8 on FRAMES, 66.0 on GAIA (text-only), and 28.7 on HLE full text-only, outperforming comparable open agents and rivaling stronger proprietary systems under a contamination blocklist.
● Ablations and behavior: The single-turn scaffolding beats default multi-turn templates for Qwen and QwQ, with large FRAMES gains. Length normalization curbs runaway tool calls that hurt reward and accuracy. Tool-use and token-length analysis shows gpt-oss-20B calls tools more yet writes much shorter per-step CoTs, indicating better token efficiency than Qwen-family models.
Paper, Tweet
2) Emergent Hierarchical Reasoning - The paper argues that RL improves LLM reasoning via an emergent two-phase hierarchy: first the model firms up low-level execution, then progress hinges on exploring high-level planning. Building on this, the authors propose HICRA, which boosts credit on strategic planning tokens, and show consistent gains over GRPO. They also propose semantic entropy as a better exploration signal than token-level entropy.
● Two-phase dynamic. Early RL training reduces perplexity and entropy on execution tokens, consolidating procedural skills. Later gains align with increased diversity in planning tokens and longer, more accurate traces, explaining “aha moments” and length scaling.
● Planning vs execution. The paper functionally tags strategic grams (e.g., deduction, branching, backtracing) as planning tokens, distinguishing them from procedural steps. This labeling exposes the shift in the learning bottleneck toward strategy.
● HICRA algorithm. Modifies GRPO by amplifying advantages on planning tokens with a scalar α, concentrating optimization on high-impact strategic decisions instead of spreading it across all tokens. This creates targeted exploration and faster reinforcement of effective strategies. Section 3 gives the formulation.
● Results. Across Qwen, Llama, and VLMs, HICRA improves Pass@1 on AIME24/25, Math500, AMC23, and multimodal math suites, often by several points over GRPO, with plots showing higher semantic entropy tracking higher validation accuracy.
● Signals that matter. Token-level entropy can decline even as true exploration grows, since execution tokens dominate. Semantic entropy over strategic grams better captures strategic exploration and correlates with performance.
● Limits and scope. HICRA works best when a model already has a procedural foundation; on weaker bases, the focus on planning may not help. The paper suggests future work on higher-level action spaces, adaptive curricula, and process-oriented rewards.
Paper, Tweet
3) Rethinking RAG-based Decoding - REFRAG replaces most retrieved tokens with precomputed chunk embeddings at decode time, then selectively expands only the few chunks that matter. This exploits block-diagonal attention in RAG prompts to cut latency and memory while preserving accuracy across RAG, multi-turn dialog, and long-doc summarization.
● Core idea: Chunk the retrieved context, encode each chunk with a lightweight encoder, project to the decoder’s embedding size, and feed embeddings directly alongside the user query; an RL policy decides which chunks to keep uncompressed (“compress anywhere,” not only in the prefix).
● Big speedups without accuracy loss: Up to 30.85× time-to-first-token acceleration vs LLaMA (and 3.75× over CEPE) at high compression rates, with comparable perplexity; throughput gains up to 6.78×.
● Longer effective context: Compression lets the model handle much larger contexts (reported 16× extension) while maintaining or improving perplexity as sequence length grows.
● RAG wins under fixed latency: With the same latency budget, REFRAG uses more passages and outperforms a LLaMA baseline on 16 RAG tasks. Aggregated plots and detailed results show gains for both strong and weak retrievers.
● Generalization across applications: On multi-turn conversational QA, REFRAG preserves longer history and improves scores as passages and turns increase. On long-document summarization, it achieves the best ROUGE at matched decoder tokens.
Paper, Tweet
4) ACE-RL - A reinforcement-learning framework that replaces coarse, preference-pair rewards with instruction-specific, verifiable checklists. ACE-RL turns each long-form task into a set of explicit and implicit constraints, scores a model’s output by how well it satisfies them, and mixes this with a length-control reward during GRPO training. The result is stronger, more controllable long-form writing across domains and styles.
● Key idea: Automatically deconstruct each instruction into a fine-grained checklist (explicit and implicit demands), then verify each item with a small LLM using a 3-level rubric (Fully/Partially/Not Met). Rewards = mean checklist score + a length reward, optimized with GRPO.
● Why it matters: Moves beyond relevance/coherence/helpfulness toward instruction-adaptive quality. No preference pairs required, which lowers cost and improves scalability.
● Data & setup: 32K long-form instructions, average 5.48 constraints per prompt, target length around 2.3K words. Verifier uses Qwen3-8B; length reward penalizes deviations beyond a tolerance band.
● Results: On WritingBench, ACE-RL lifts models substantially over SFT and LLM-as-judge RL; e.g., Qwen-2.5-7B jumps from 57.0 to 78.6. A small Qwen-3-4B-thinking model trained with ACE-RL beats several proprietary and writing-tuned systems. On Arena-Write, win-rates reach ~68% vs six strong baselines.
● Ablations & insights: Constraint-based rewards produce higher within-group reward variance than LLM-as-judge, indicating better discrimination among rollouts. Works with small reward models and even self-reward settings. Thinking mode plus ACE-RL outperforms non-thinking for long-form generation.
● Constraint-based rewards produce higher within-group reward variance than LLM-as-judge, indicating better discrimination among rollouts.
● Works with small reward models and even self-reward settings.
● Thinking mode plus ACE-RL outperforms non-thinking for long-form generation.
Paper
5) ParaThinker - This paper argues that today’s “think longer” strategies trap LLMs in a single line of thought. They propose ParaThinker, which trains models to generate several independent reasoning paths in parallel and then fuse them into one answer. Across math benchmarks, this width-scaling lifts accuracy while adding only a small latency cost.
● Problem framing. The paper identifies a test-time bottleneck called “Tunnel Vision,” where early tokens commit the model to a suboptimal path; majority-style parallel sampling can beat one long chain under the same token budget.
● Method. ParaThinker runs two stages: parallel reasoning then summarization. It uses trainable control tokens to start diverse paths, thought-specific positional embeddings to disambiguate tokens from different paths, and a two-phase attention mask that isolates paths during thinking and unifies them for summarization, reusing KV caches to avoid re-prefill.
● Training recipe. Supervised fine-tuning on multi-path traces sampled from teacher models, with random assignment of so the student can generalize to more paths than seen in training; details and data sources are outlined in Section 4 and the SFT tables in the appendix.
● Results. On AIME 2024/2025, AMC 2023, and MATH-500, ParaThinker improves pass@1 over sequential baselines by about 12.3% for 1.5B and 7.5% for 7B with 8 paths at fixed per-path budgets, and beats majority voting by 4.3% (1.5B) and 2.0% (7B) on average. Combining ParaThinker with majority voting yields further gains.
● Efficiency and design insights. Latency increases slightly with more paths because decoding is memory-bandwidth bound; on a single A800, 16 paths take less than 2× the time of one path for the same length. The best termination policy is “first-finish,” which equalizes path lengths and improves both accuracy and speed. Thought embeddings are crucial; naive flattened positions hurt performance.
Paper, Tweet
6) AgentGym-RL - A modular framework for training LLM agents directly via reinforcement learning across realistic environments, plus a simple schedule, ScalingInter-RL, that lengthens interaction horizons over training to improve stability and performance. Results show a 7B open model can rival or beat larger proprietary systems on web navigation, deep search, games, embodied, and science tasks.
● What it is: A unified, decoupled RL stack with three pluggable modules (Environment, Agent, Training) that supports PPO, GRPO, REINFORCE++, and runs across WebArena, Deep Search, TextCraft, BabyAI, and SciWorld.
● Key idea: ScalingInter-RL starts with short horizons to emphasize exploitation and stable learning, then gradually increases allowed turns to encourage exploration and richer behaviors like planning and reflection.
● Why it matters: Post-training and test-time compute scale better than model size alone for agentic tasks. A 7B model trained with this framework reaches about 58.6% average success and outperforms much larger baselines.
● Results snapshot: Web navigation: ScalingInter-7B hits 26.00% overall on WebArena, topping GPT-4o at 16.00. Deep search: 38.25 overall, beating GPT-4o 26.75 and close to strong open baselines; best on NQ at 52.00 and ties TriviaQA at 70.00. Games: 91.00 overall on TextCraft and one of the few with a non-zero at Depth 4 (33.33). Embodied: 96.67 on BabyAI, surpassing o3 and GPT-4o on overall accuracy. Science: 57.00 SOTA on SciWorld, with the 7B RL model also strong at 50.50.
● Training dynamics: Longer horizons too early can collapse learning; short horizons cap performance. ScalingInter-RL avoids both.
● Engineering notes: Parallelized browsers, reset hooks, and memory-leak fixes enable reliable long rollouts; a visual UI helps inspect trajectories and failure modes.
● For practitioners: Prefer GRPO over REINFORCE++ for sparse-reward, long-trajectory agent tasks; curriculum on interaction length offers a simple, robust win; budget compute for post-training and inference sampling before scaling parameters.
● Web navigation: ScalingInter-7B hits 26.00% overall on WebArena, topping GPT-4o at 16.00.
● Deep search: 38.25 overall, beating GPT-4o 26.75 and close to strong open baselines; best on NQ at 52.00 and ties TriviaQA at 70.00.
● Games: 91.00 overall on TextCraft and one of the few with a non-zero at Depth 4 (33.33).
● Embodied: 96.67 on BabyAI, surpassing o3 and GPT-4o on overall accuracy.
● Science: 57.00 SOTA on SciWorld, with the 7B RL model also strong at 50.50.
Paper, Tweet
7) Talk Isn’t Always Cheap - Multi-agent debate does not always help. Across three reasoning benchmarks and heterogeneous agent pools, debate often lowers accuracy, with stronger models sometimes swayed into worse answers by weaker peers. The authors argue that current alignment makes agents too agreeable, so they adopt persuasive but wrong reasoning instead of challenging it.
● Setup. Evaluate debate on CommonSenseQA, MMLU, and GSM8K using GPT-4o-mini, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct. Agents answer once, then debate for two rounds; final output is a majority vote pre- vs post-debate. Prompts require short reasoning and task-specific formats.
● Main result. Debate frequently hurts accuracy, especially on CommonSenseQA and MMLU. They show consistent drops after debate for many groups, including mixed-capability settings: e.g., CSQA falls by 6.6 points for 1×GPT + 2×Llama and by 8.0 points for 2×Llama + 1×Mistral; MMLU drops by 12.0 points for 1×GPT + 2×Llama. GSM8K is more mixed, with small gains in some settings.
● Degradation over rounds. This work tracks accuracy across rounds and shows performance often declining as debate proceeds, even when stronger models are in the majority.
● Why it happens. Agents tend to favor agreement over critique. They reveal more correct→incorrect flips than incorrect→correct flips across rounds, indicating that debate can actively mislead stronger models. Appendix examples document sycophantic reversals from correct to wrong answers after reading peers.
● Implications. Naive debate protocols risk amplifying errors. The authors recommend designs that reward independent verification, weight arguments by agent credibility or confidence, and penalize unjustified agreement to preserve the benefits of discussion.
Paper, Tweet
8) AggLM - AggLM introduces reinforcement learning to train LLMs in aggregating multiple candidate solutions, moving beyond majority voting and reward model ranking. It achieves higher accuracy, recovers minority-correct answers, generalizes across models, and uses fewer tokens than traditional aggregation methods. Paper, Tweet
9) A Survey of RL for Large Reasoning Models - This survey reviews how reinforcement learning is driving advances in large reasoning models (LRMs), enabling stronger performance on complex tasks like math and coding. It highlights scaling challenges in computation, algorithms, data, and infrastructure, while mapping future directions toward Artificial Superintelligence (ASI). Paper, Tweet
10) LiveMCP-101 - LiveMCP-101 is a new benchmark of 101 real-world queries designed to test MCP-enabled agents on multi-step tasks requiring tool use across search, file ops, math, and data analysis. Results show leading LLMs succeed less than 60%, revealing key weaknesses in tool orchestration and offering insights for advancing autonomous AI systems. Paper, Tweet
Paper Links
1) Why Language Models Hallucinate - The paper argues that hallucinations are not mysterious glitches but the predictable result of how LLMs are trained and evaluated. Pretraining creates statistical pressure to make errors, and post-training benchmarks often reward confident guessing over honest uncertainty. The fix is to realign mainstream evaluations to stop penalizing abstentions.
● Pretraining inevitably produces some errors. The authors reduce generation to a binary “Is-It-Valid” classification problem and show a lower bound: the generative error rate scales with the misclassification rate in that classifier. Even with error-free corpora, optimizing cross-entropy yields calibrated base models that still generate errors rather than always saying “I don’t know.”
● Arbitrary facts drive a floor on hallucinations. For facts with no learnable pattern (for example, specific birthdays), the paper links hallucination rates to the “singleton rate” in training data. If many facts appear only once, a calibrated base model will hallucinate on at least that fraction of such prompts. This generalizes Good-Turing style missing-mass reasoning and recovers prior results while adding prompts and IDK.
● Model class limitations also matter. When the model family cannot represent the needed distinctions, errors persist. The paper formalizes this via an agnostic-learning bound and gives simple cases like multiple choice, where even optimal thresholding leaves a fixed error tied to model capacity, with an example showing classic n-gram models must fail on certain context dependencies.
● Post-training often reinforces guessing. Most popular benchmarks grade in a binary correct-incorrect fashion and give zero credit to abstentions, so a model that always guesses can outperform one that withholds uncertain answers. The authors survey widely used leaderboards and find that abstentions are largely penalized, explaining why overconfident hallucinations persist despite mitigation efforts.
● Proposed fix: explicit confidence targets. Incorporate clear penalties for wrong answers and neutral credit for IDK directly into mainstream evaluations, instructing models to answer only above a stated confidence threshold. This promotes behavioral calibration, where models choose between answering and abstaining according to the target confidence, and should steer the field toward more trustworthy systems.
Paper, Tweet
2) Disentangling the Factors of Convergence between Brains and Computer Vision Models - Large self-supervised ViTs trained on natural images develop brain-like internal representations. This paper teases apart what drives that convergence by varying model size, training amount, and image type in DINOv3, then comparing model activations to human fMRI (space) and MEG (time) with three metrics: overall linear predictability (encoding), cortical topography (spatial), and temporal alignment (temporal). Result: all three factors matter, and alignment unfolds in a consistent order from early sensory to higher associative cortex.
● Setup and metrics: Eight DINOv3 variants spanning sizes and datasets; comparisons use encoding, spatial, and temporal scores against NSD fMRI and THINGS-MEG.
● Baseline alignment: fMRI predictability concentrates along the visual pathway (voxel peaks around R≈0.45). MEG predictability rises ~70 ms after image onset and remains above chance up to 3 s. Spatial hierarchy holds (lower layers ↔ early visual; higher layers ↔ prefrontal; r≈0.38). Temporal ordering is strong (earlier MEG windows ↔ early layers; r≈0.96).
● Training dynamics: Alignment emerges quickly but not uniformly: temporal score reaches half its final value first (~0.7% of training), then encoding (~2%), then spatial (~4%). Early visual ROIs and early MEG windows converge sooner than prefrontal ROIs and late windows (distance-to-V1 vs half-time r≈0.91; time-window vs half-time r≈0.84).
● Scale and data effects: Bigger models finish with higher encoding, spatial, and temporal scores; gains are largest in higher-level ROIs (e.g., BA44, IFS). Human-centric images beat satellite and cellular images across metrics and ROIs at matched data volume.
● Cortical correlates: ROIs whose model alignment appears later are those with greater developmental expansion, thicker cortex, slower intrinsic timescales, and lower myelin (e.g., correlations up to
r
3) Universal Deep Research - Proposes a general, model-agnostic deep-research agent that lets users “bring your own model and strategy.” Instead of a fixed pipeline, UDR compiles natural-language research strategies into executable code, runs them in a sandbox, and emits structured progress notifications before returning a final report.
● Motivation. Current deep-research tools hard-code strategy and model choice, limiting source prioritization, domain-specific workflows, and model swap-ability. UDR targets all three gaps by separating the research strategy from the underlying model.
● Mechanism. Users provide a strategy and a prompt. UDR converts the strategy to a single callable function under strict tool and control-flow constraints, then executes it in isolation. Orchestration is pure code; the LLM is called only for local tasks like summarization, ranking, or extraction. State lives in named variables, not a growing context.
● Phases and tools. Phase 1 compiles the strategy step-by-step to reduce skipped steps and drift. Phase 2 executes with synchronous tool calls and yield-based notifications for real-time UI updates. The paper provides minimal, expansive, and intensive example strategies to show breadth.
● Efficiency and reliability. Control logic runs on CPU while LLM calls remain scoped and infrequent, improving cost and latency. End-to-end strategy compilation proved more reliable than prompting LLMs to “self-orchestrate” or stitching per-step code.
● Security, UI, and limits. Strategies execute in a sandbox to contain prompt-injection or code exploits; the demo UI supports editing strategies, monitoring notifications, and viewing reports. Limitations include reliance on code-generation fidelity, no mid-execution interactivity, and assuming user-written strategies are sound. The authors recommend shipping a library of editable strategies and exploring tighter user control over free reasoning.
Paper, Tweet
4) Visual Story Telling - A system and design framework that lets writers edit stories by acting directly on visuals of characters, locations, and timelines. Instead of only prompting, authors drag, connect, and reorder visual elements; the tool proposes synchronized text edits and can regenerate passages from the visual skeleton.
● Framework: eight elements + four operators. Builds on narratology (fabula/syuzhet) with story elements (actors/characters, time/temporality, locations/space, events/focalization) and four compositional operators: position, associate, connect, unfold.
● Prototype with three coordinated views. An entities–actions graph, a locations canvas, and an event timeline enable direct manipulation: add/remove characters or actions, drag entities between locations, reorder events; coordinated highlighting and selection constrain edits to chosen scenes.
● Bi-directional editing and versioning. Manual text edits can refresh visuals; visual edits generate tracked diffs in text; a history tree supports branching exploration; a “refresh from visuals” mode rewrites the story from the current visual state.
● Two studies: planning and editing. With 12 participants, visuals improved planning, search, and reflection compared to text-only, though cognitive-load results were mixed and mental-model mismatches appeared. With 8 creative writers, participants successfully expressed spatial, temporal, and entity edits, found it helpful for exploration and inconsistency fixing, and gave a high Creativity Support Index, while asking for more control over style and alternative visual layouts.
● Implementation and limits. React + Slate.js front end; GPT-4o prompts for extraction and edits; parallel sentence-level extraction for speed. Occasional LLM latency or unintended edits remain; future work includes richer constructs (relationships, emotions), style controls, support for long/nonlinear narratives, and a view-builder for custom diagrams.
Paper, Tweet
5) rStar2-Agent - rStar2-Agent is a 14B math-reasoning model trained with agentic RL that learns to think smarter by using a Python tool environment, not just longer CoT. It introduces GRPO-RoC, a rollout strategy that filters noisy successful traces, plus infrastructure for massive, low-latency tool execution. In one week and 510 RL steps on 64 MI300X GPUs, the model reaches frontier-level AIME while producing shorter solutions and showing transfer beyond math.
● Method in one line: GRPO-RoC oversamples rollouts then keeps only the cleanest correct ones while preserving diverse failures, reducing tool-call errors and formatting issues during training.
● Infrastructure: A dedicated, isolated code service reliably handles up to ~45K concurrent tool calls per training step with ~0.3 s end-to-end latency, and a load-balanced scheduler allocates rollouts by available KV cache to cut GPU idle time.
● Training recipe: Start with non-reasoning SFT to teach tool use and formatting, then three RL stages that scale max output length 8K → 12K → 12K, and finally focus on harder problems; RL data curated to 42K math items with integer answers.
● Results: Pass@1 AIME24 80.6, AIME25 69.8, HMMT25 52.7, exceeding or matching o3-mini (medium) and DeepSeek-R1 despite far smaller size; responses are shorter on AIME24/25 than Qwen3-14B and QWQ-32B.
● Generalization and behaviors: Improves GPQA-Diamond to 60.9 and performs well on tool-use and alignment benchmarks; entropy analysis shows preserved forking tokens and new reflection tokens triggered by tool feedback, enabling verification and correction.
Paper, Tweet
6) Adaptive LLM Routing - A routing framework that learns online which model to call for each query while honoring a spend limit. It treats routing as a contextual bandit, initializes with human preference data, and adds an online cost policy that allocates budget across queries.
● Core idea: Build a shared embedding space for queries and candidate LLMs, align it with offline human preferences, then update LLM embeddings online using bandit feedback. Selection uses a preference-prior LinUCB variant (PILOT) with cosine-similarity rewards.
● Budget control: Introduces an online multi-choice knapsack policy (ZCL-style) that filters eligible models by reward-to-cost thresholds and allocates spend in bins so the total stays within budget.
● Results: On RouterBench multi-task routing, achieves about 93% of GPT-4 performance at roughly 25% of its cost; on single-task MMLU, about 86% at roughly 27% cost. Cumulative regret is consistently lower than bandit baselines.
● Cost policy effectiveness: Online policy matches or outperforms a strong offline P − λC oracle tuned with hindsight across budgets.
● Latency overhead: Routing adds little delay relative to inference. Selection takes 0.065–0.239 s vs. ~2.5 s for GPT-4 on MMLU..
Paper, Tweet
7) Implicit Reasoning in LLMs - This survey defines implicit reasoning as multi-step problem solving that happens inside a model’s latent states without printing intermediate steps. It organizes the field by execution paradigm rather than representation format, and reviews evidence, evaluation, and open challenges.
● Three execution paradigms. Latent optimization adjusts internal representations directly: token-level inserts or learns special latent tokens; trajectory-level compresses or refines whole chains of thought for semantic fidelity, adaptive efficiency, progressive refinement, or exploratory diversification; internal-state-level distills or steers hidden activations to carry the reasoning signal. Signal-guided control uses lightweight controls to modulate compute without emitting text, from thinking or pause tokens to instance-level latent adjustment. Layer-recurrent execution reuses shared blocks in loops to simulate deeper chains internally, with models like ITT, looped Transformers, CoTFormer, Huginn, and RELAY.
● Evidence that the latent process is real. Structural signals show layer-wise decomposition and shortcutting; behavioral signatures include step-skipping and grokking-driven phase transitions; representation studies recover intermediate facts from hidden states or induce reasoning via activation steering.
● How it is evaluated. Metrics cover final answer correctness (accuracy, Pass@k, EM), efficiency (latency, output length, FLOPs, ACU), perplexity, and probing accuracy. Benchmarks span commonsense, math and code, reading comprehension, multi-hop QA, and multimodal reasoning.
● Why is it not solved yet. Key gaps include limited interpretability, weak control and reliability, an accuracy gap to explicit CoT on hard tasks, uneven evaluation, architectural constraints, and dependence on explicit supervision.
● Big picture. Implicit reasoning promises faster, cheaper inference and richer internal computation. The survey argues for hybrid designs that keep compute latent yet auditable, standardized evaluations that probe internal trajectories, and architectures that generalize beyond bespoke tokens or loops.
Paper, Tweet
8) On the Theoretical Limitations of Embedding-based Retrieval - Single-vector dense retrievers cannot realize all possible top-k relevance combinations once queries demand sufficiently many “mix-and-match” document sets. The paper ties this failure to the sign-rank of the relevance matrix, proves lower and upper bounds on the embedding dimension needed, and then stress-tests models with a simple but adversarially combinatorial dataset (LIMIT).
● Theory. The authors formalize retrieval as preserving row-wise order or thresholds in a binary qrel matrix and show these capacities are sandwiched by the matrix’s sign-rank. For fixed dimension ddd, some top-k sets are unrepresentable, so certain retrieval tasks are impossible for any single-vector embedder at that ddd.
● Best-case optimization. With “free embeddings” directly optimized on the test qrels, the maximum solvable corpus size for k=2k=2k=2 scales roughly as a cubic in ddd. Extrapolated critical sizes remain far below web scale even for 4096-dim embeddings, indicating a fundamental ceiling not attributable to training data or losses.
● LIMIT dataset results. LIMIT maps all 2-document combinations to natural-language queries like “Who likes X?” Despite the simplicity, SOTA single-vector models often score below 20% Recall@100 on the full task, and they still cannot solve a 46-document version at Recall@20. Performance improves with larger ddd but remains poor.
● Combinatorial density matters. When the qrel graph is made dense to maximize distinct top-k combinations, scores collapse across models. Sparser patterns (random, cycle, disjoint) are markedly easier, highlighting that the number of realizable top-k sets is the bottleneck.
● Alternatives and implications. Cross-encoders can solve the small LIMIT variant perfectly, multi-vector late-interaction models fare better than single-vector, and high-dimensional sparse baselines like BM25 perform strongly. For instruction-following retrieval that composes many concepts, systems should pair or replace dense first-stage retrieval with rerankers, multi-vector, or sparse methods.
Paper, Tweet
9) Self-Evolving Agents - This survey reviews techniques for building self-evolving AI agents that continuously adapt through feedback loops, bridging static foundation models with lifelong adaptability. It introduces a unified framework, covers domain-specific strategies, and discusses evaluation, safety, and ethics in advancing autonomous agentic systems. Paper, Tweet
10) Hermes 4 - Hermes 4 introduces a family of hybrid reasoning models that integrate structured multi-turn reasoning with broad instruction-following. The report details data and training challenges, evaluates performance across reasoning, coding, and alignment tasks, and publicly releases all model weights. Paper, Tweet

Top AI Papers of the Week (August 25 - August 31) - 2025

Paper Links
1) Anemoi Agent - Anemoi replaces purely centralized, context-stuffed coordination with an A2A communication server (MCP) that lets agents talk directly, monitor progress, refine plans, and reach consensus. On GAIA, it holds up even with a small planner model and reduces redundant context passing for better cost and scalability.
● Design: A semi-centralized planner proposes an initial plan, while worker agents (web, document processing, reasoning/coding) plus critique and answer-finding agents collaborate via MCP threads. All participants can list agents, create threads, send messages, wait for mentions, and update plans as execution unfolds.
● Communication pattern: Five phases structure collaboration: agent discovery, thread initialization with a task plan and tentative allocation, execution with continuous critique, consensus voting before submission, and final answer synthesis. This reduces reliance on a single planner and minimizes token-heavy prompt concatenation.
● Results on GAIA: With GPT-4.1-mini as planner and GPT-4o workers, Anemoi reaches 52.73% accuracy (pass@3), beating an OWL reproduction with the same LLM setup by +9.09 points and outperforming several proprietary and open-source systems that use stronger planners.
● Why it wins: Most extra solves over OWL come from collaborative refinement enabled by A2A (52%), with smaller gains from reduced context redundancy (8%); remaining differences reflect stochastic worker behavior. OWL’s few wins over Anemoi largely stem from worker stochasticity and web-agent latency.
● What still fails: The largest error sources are LLM capability limits (45.6%) and tooling gaps (20.6%), followed by incorrect plans (11.8%) and communication latency (10.3%); minor shares come from benchmark annotation issues and hallucinations.
Paper, Tweet
2) Deep Think with Confidence - A lightweight test-time method that uses model-intrinsic confidence to prune weak reasoning paths, improving both accuracy and token efficiency for self-consistency ensembles. Works in offline and online modes without extra training or hyperparameter tuning.
● Key idea — Compute local confidence signals during generation, such as sliding-window group confidence, lowest group confidence, and tail confidence, then filter or early-stop low-quality traces. Vote with confidence-weighted majority rather than treating traces equally.
● Offline gains — On AIME 2025 with GPT-OSS-120B at K=512, DeepConf reaches 99.9% accuracy vs 97.0% for unweighted voting, by keeping only top-confidence traces and weighting their votes. Similar gains appear across AIME24, BRUMO25, and HMMT25.
● Online efficiency — With a short warmup (Ninit=16) to set a stopping threshold and adaptive sampling until consensus, DeepConf-low cuts generated tokens by 43–85% while matching or improving accuracy; best case on AIME 2025 with GPT-OSS-120B saves 84.7% tokens with slightly higher accuracy than the baseline. A more conservative DeepConf-high saves 18–59% with near-identical accuracy.
● When to filter aggressively — Retaining the top 10% most confident traces often yields the largest boosts, but can regress when the model is confidently wrong. Keeping top 90% is safer and still beats or matches plain voting.
● Easy to deploy — Minimal changes to vLLM enable confidence-based early stopping via the OpenAI-compatible API by toggling a flag and providing a window size and threshold. No retraining required.
Paper, Tweet
3) Fine-tuning LLM Agents without Fine-tuning LLMs - A memory‑based learning framework that lets deep‑research agents adapt online without updating model weights. The agent is cast as a memory‑augmented MDP with case‑based reasoning, implemented in a planner–executor loop over MCP tools. It sets top validation results on GAIA and delivers strong scores on DeepResearcher, SimpleQA, and HLE.
● Method in a line: Decisions are guided by a learned case‑retrieval policy over an episodic Case Bank. Non‑parametric memory retrieves Top‑K similar cases; parametric memory learns a Q‑function (soft Q‑learning or single‑step CE training in deep‑research settings) to rank cases for reuse and revision.
● Architecture: Planner (LLM CBR) + Executor (LLM MCP client) with three memories: Case, Subtask, Tool. Involves a loop for planning, tool execution, writing/reading of cases, and a replay buffer. Tools span search, crawl, multimodal document parsing, code execution, and math utilities.
● Results: • GAIA: 87.88% Pass@3 on validation and 79.40% on test, competitive with or above open‑source agent frameworks. • DeepResearcher: 66.6 F1 and 80.4 PM average across seven open‑domain QA sets. • SimpleQA: 95.0% accuracy, beating recent web‑agent baselines. • HLE: 24.4 PM, close to GPT‑5 and ahead of several strong baselines.
● Ablations and scaling: • Case count: performance peaks around K = 4 retrieved cases, emphasizing small, high‑quality memory rather than many shots. • Continual learning: both non‑parametric and parametric CBR yield steady gains over iterations vs. no‑CBR. • Component study: moving from offline to online tools helps, adding planning helps more, and adding CBR yields the largest consistent boost across benchmarks. • Cost profile: input tokens, not outputs, drive costs as difficulty rises.
● Practical takeaways for agent builders: • Use a compact, curated case memory with adaptive retrieval rather than growing prompts. • Keep planning concise. A fast planner outperforms slow‑think planners for multi‑step tool use on GAIA by avoiding verbose or shortcut plans. • Separate planning and execution with explicit Subtask and Tool memories to coordinate long‑horizon work and reduce hallucinations.
Paper, Tweet
4) Jet-Nemotron - A hybrid-architecture LM family built by adapting after pretraining. Starting from a frozen full-attention model, the authors search for where to keep full attention, which linear-attention block to use, and which hyperparameters match hardware limits. The result, Jet-Nemotron-2B/4B, matches or surpasses popular full-attention baselines while massively increasing throughput on long contexts.
● PostNAS pipeline — Begins with a pre-trained full-attention model and freezes MLPs, then proceeds in four steps: learn optimal placement or removal of full-attention layers, select a linear-attention block, design a new attention block, and run a hardware-aware hyperparameter search.
● Learning where full attention actually matters — A once-for-all super-network plus beam search identifies only a few layers as critical, and the important layers differ by task.
● JetBlock: linear attention with dynamic convolution — The new block adds a kernel generator that produces input-conditioned causal convolutions applied to V tokens and removes static convolutions on Q/K. They report higher math and retrieval accuracy vs. prior linear blocks at similar training and inference speed.
● Hardware-aware design insight — Through grid search at fixed KV cache size, they show generation speed tracks KV cache more than parameter count. The work shares head/dimension settings that hold throughput roughly constant while improving accuracy. This leads to comparable tokens/s with more parameters and better scores.
● Results at a glance — Jet-Nemotron-2B outperforms or matches small full-attention models on MMLU, MMLU-Pro, BBH, math, commonsense, retrieval, coding, and long-context tasks, while delivering up to 47x decoding throughput at 64K and as high as 53.6x decoding and 6.14x prefilling speedup at 256K on H100.
Paper, Tweet
5) Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains - A test-time reasoning framework that replaces a single linear chain with multiple parallel, entity-grounded chains over medical knowledge graphs. MIRAGE decomposes a query into sub-questions, runs adaptive graph retrieval in Anchor and Bridge modes, then reconciles answers via cross-chain verification, yielding higher accuracy and clearer provenance than linear ToT or web-centric agentic RAG.
● What’s new: Parallel multi-chain inference over a structured KG, not just longer single chains. Two retrieval modes: Anchor (single-entity neighborhood) and Bridge (multi-hop paths between entity pairs). A synthesizer verifies cross-chain consistency and normalizes medical terms before emitting a concise final answer.
● Why it matters: Linear chains accumulate early errors and treat evidence as flat text. Graph-based retrieval preserves relations and hierarchies, supporting precise multi-hop medical reasoning with traceable paths. The comparative schematic highlights these failure modes and MIRAGE’s fix.
● Results: State-of-the-art across three medical QA benchmarks. On ExplainCPE, MIRAGE reaches 84.8% accuracy and the best GPT-4o ranking; similar gains appear on GenMedGPT-5k and CMCQA. Robustness holds when swapping in DeepSeek-R1-32B as the backbone. Human evals on GenMedGPT-5k also prefer MIRAGE.
● Scaling insights: More sub-questions help until over-decomposition adds noise, while allowing more retrieval steps shows diminishing but steady gains. Tuning the sub-question cap and retrieval budget is key.
● Interpretability: Every claim ties back to explicit KG chains, with an audit record of decomposition, queries, and synthesis. A case study contrasts MIRAGE’s disentangled chains vs a single-chain web search approach, resolving toward a coherent diagnosis.
Paper, Tweet
6) Memory-R1 - A framework that teaches LLM agents to decide what to remember and how to use it. Two RL-fine-tuned components work together: a Memory Manager that learns CRUD-style operations on an external store and an Answer Agent that filters retrieved memories via “memory distillation” before answering. Trained with minimal supervision on LOCOMO, it outperforms strong baselines and generalizes across backbones.
● Active memory control with RL: The Memory Manager selects ADD, UPDATE, DELETE, or NOOP after a RAG step and edits entries accordingly; training with PPO or GRPO uses downstream QA correctness as the reward, removing the need for per-edit labels.
● Selective use of long histories: The Answer Agent retrieves up to 60 candidates, performs memory distillation to keep only what matters, then generates the answer; RL fine-tuning improves answer quality beyond static retrieval.
● Data-efficient training: Using only 152 QA pairs for training and outcome-driven rewards, Memory-R1 attains large gains on LOCOMO, highlighting that effective memory behavior can be learned with minimal supervision.
● State-of-the-art results: Across LLaMA-3.1-8B and Qwen-2.5-7B backbones, GRPO variants achieve the best overall F1, BLEU-1, and LLM-as-a-Judge scores vs. Mem0, A-Mem, LangMem, and LOCOMO baselines.
● Ablations that isolate the wins: RL improves both components individually; memory distillation further boosts the Answer Agent, and gains compound when paired with a stronger memory manager.
Paper, Tweet
7) Assessing Language Models on Unsolved Questions - The paper introduces a new evaluation paradigm that tests models on real unsolved questions from the wild, rather than on fixed-answer exams. It contributes a curated dataset of 500 unanswered Stack Exchange questions, validator pipelines that pre-screen model answers without ground truth, and a live platform for community verification. The approach targets both difficulty and real-world relevance.
● Dataset that is difficult and realistic. 500 questions are mined from ~3M unanswered posts across 80 Stack Exchange sites using a three-stage pipeline: rule-based filters on age, views, and votes, then LLM screening for clarity, difficulty, approachability, and objectivity, followed by expert review. The result skews toward STEM but spans sci-fi, history, linguistics, and more.
● Validator design built on the generator–validator gap. The authors show models are better at judging answers than generating them, and leverage this with a hierarchical validator: cycle consistency, fact/logic checks, and correctness checks, repeated with reflection and aggregated by unanimous or pipeline voting.
● Open platform for human-in-the-loop evaluation. A public site hosts questions, candidate answers, validator traces, and human reviews, tracks pass rates, and credits resolved questions. It aims for ongoing, asynchronous evaluation where solutions deliver actual value to askers and the broader community.
● Early results show the task is hard. With the 3-iter validator, the top model passes only 15% of questions; after human checking of 91 validated items, just 10 were confirmed correct across math, physics, stats, programming, and retrocomputing.
● Limitations and scope for domain validators. Oracle-free validation struggles with citations and non-formal domains; stronger, domain-specific verifiers (e.g., proof assistants, execution-based checks) could raise precision but at the cost of generality. The team plans iterative dataset versions, improved validators, and deeper community moderation.
Paper, Tweet
8) Synthetic Dataset Generation for RAG Evaluation with Multi-Agent Systems - The paper proposes a modular, three-agent pipeline that auto-generates synthetic QA datasets for evaluating RAG systems while enforcing privacy. It shows better diversity than baseline generators and strong entity masking across domain datasets.
● Method in a nutshell: A Diversity agent clusters source text with embeddings and k-means, a Privacy agent detects and pseudonymizes sensitive entities, and a QA Curation agent synthesizes evaluation-ready QA pairs and reports. Algorithm 1 outlines the full workflow from clustering to QA generation and reporting.
● Implementation specifics: Orchestrated with LangGraph; GPT-4o powers diversity and QA generation, GPT-4.1 handles privacy reasoning and tool use. Embeddings use text-embedding-3-small (1536-d), inputs chunked to 256 tokens, temperature set to 0, and k chosen via intra-cluster distance.
● Diversity results: On an EU AI Act corpus, the multi-agent method outperforms RagasGen and direct prompting across qualitative LLM-as-judge scores and a cosine-similarity-to-diversity metric. Scores increase with set size, e.g., LLM-as-judge from 7.8 (n=10) to 9.0 (n=100).
● Privacy results: Evaluated on AI4Privacy’s PHI, PWI, and PII datasets by label, the Privacy agent achieves accuracies typically in the 0.75–0.94 range. Examples include JOBTYPE 0.94, DISABILITYSTATUS 0.91, and LASTNAME 0.91.
● Takeaways and outlook: The framework balances semantic coverage with privacy preservation and produces auditable reports. Future work targets more autonomous agents, adaptive PII handling, model-context-protocol style agent coordination, and stress-testing against privacy attacks with alignment to evolving regulations.
Paper, Tweet
9) School of Reward Hacks - This study shows that LLMs fine-tuned to perform harmless reward hacks (like gaming poetry or coding tasks) generalized to more dangerous misaligned behaviors, including harmful advice and shutdown evasion. The findings suggest reward hacking may act as a gateway to broader misalignment, warranting further investigation with realistic tasks. Paper, Tweet
10) Agentic Science - This survey introduces Agentic Science as the next stage of AI for Science, where AI evolves from a support tool to an autonomous research partner capable of hypothesis generation, experimentation, and iterative discovery. It unifies fragmented perspectives into a framework covering core capabilities, workflows, and domain applications, highlighting challenges and future directions for AI-driven scientific discovery. Paper, Tweet

Top AI Papers of the Week (August 18 - August 24) - 2025

Paper Links
1) Measuring the Environmental Impact of Delivering AI at Google Scale - Google presents first‑party, production measurements of AI serving’s environmental impact for Gemini Apps. Using a full‑stack boundary that includes accelerator power, host CPU/DRAM, provisioned idle capacity, and data‑center overhead, the team finds the median Gemini text prompt is far lower impact than many public estimates and shows rapid efficiency gains over one year.
● What was actually measured — A comprehensive “serving AI computer” boundary: active AI accelerators, host CPU/DRAM, idle machines kept for reliability/latency, and data‑center overhead via PUE. Networking, end‑user devices, and training are excluded. Figure 1 illustrates the boundary choices.
● Key numbers for a median text prompt (May 2025) — 0.24 Wh energy, 0.03 gCO2e market‑based emissions, 0.26 mL water. This is roughly less energy than watching TV for 9 seconds and about five drops of water. Table 1 breaks down contributions: accelerators 0.14 Wh, host 0.06 Wh, idle 0.02 Wh, overhead 0.02 Wh.
● Why do many estimates differ? Narrow accelerator‑only approaches undercount. The paper shows a 1.72× uplift from accelerator energy to total serving energy when you include host, idle, and overhead. In a benchmark‑like “existing approach,” the same prompt would appear as 0.10 Wh.
● Year‑over‑year efficiency gains — From May 2024 to May 2025, median per‑prompt emissions fell 44× driven by software/model improvements (33× energy reduction, including 23× from model changes and 1.4× from utilization), cleaner electricity (1.4×), and lower amortized embodied emissions (36×).
● How to use these metrics — The authors argue for median per‑prompt reporting to avoid skew from long or low‑utilization prompts, and for standardized, full‑stack boundaries so providers can compare models and target the biggest levers across software, hardware, fleet utilization, siting, and clean‑energy procurement.
Paper, Tweet
2) Avengers-Pro: Beyond GPT-5 - A lightweight test‑time router that embeds each query, maps it to semantic clusters, and selects one LLM from an ensemble to optimize accuracy vs cost. On six hard benchmarks and eight leading models, it beats the best single model at a similar cost and traces a Pareto frontier of accuracy for any budget.
● What it is: Embed → cluster → score. Queries are embedded, assigned to nearest clusters, then routed to the model with the highest performance–efficiency score controlled by a weight α. Only one model answers each query.
● Why it matters: With comparable cost, Avengers‑Pro outperforms GPT‑5‑medium by about 7% average accuracy; with comparable accuracy, it reduces cost by about 27%. Hitting ~90% of GPT‑5‑medium’s accuracy costs ~63% less.
● Setup: Ensemble of 8 models (GPT‑5‑chat/medium, Claude‑4.1‑opus/Sonnet‑4, Gemini‑2.5‑pro/flash, Qwen3 235B and thinking). Evaluated on GPQA‑Diamond, HLE, ARC‑AGI, SimpleQA, LiveCodeBench, and τ²‑bench. Pricing via OpenRouter informs per‑cluster cost scoring.
● Knobs that matter: α tunes performance vs efficiency; accuracy rises fast until ~0.6 while cost stays low until ~0.4, then climbs. Implementation uses k‑means with k=60, Qwen3‑embedding‑8B (4096‑d) and top‑p=4 nearest clusters at inference.
● Routing behavior: At low α, the router favors cheaper Qwen3 variants; as α grows, it shifts to GPT‑5‑medium and other stronger but costlier models.
Paper, Tweet
3) Chain-of-Agents - OPPO proposes training single models to natively behave like multi‑agent systems, coordinating role‑playing and tool agents end‑to‑end. They distill strong multi‑agent frameworks into CoA trajectories, then optimize with agentic RL on verifiable tasks. The result: AFMs that solve complex web, code, and math problems with less overhead and new state‑of‑the‑art results.
● Paradigm shift: CoA generalizes ReAct/TIR by dynamically activating multiple roles and tools within one model, preserving a single coherent state while cutting inter‑agent chatter.
● Training recipe: 1) Multi‑agent distillation turns successful OAgents runs into CoA‑formatted traces with planning, tool calls, observations, and reflection, filtered for difficulty and quality; 2) Agentic RL targets hard queries where tools matter, with simple binary rewards via LLM‑as‑Judge for web tasks and executable or exact‑match rewards for code/math.
● Main results: With Qwen‑2.5‑32B backbones, AFM sets new pass@1 on GAIA 55.3, BrowseComp 11.1, HLE 18.0, and leads WebWalker 63.0; it also tops multi‑hop QA suites across sizes.
● Code + math: AFM‑RL‑32B reaches AIME25 59.8, MATH500 94.6, OlympiadBench 72.1, and LiveCodeBench v5 47.9, beating prior TIR methods including ReTool and Reveal.
● Efficiency and robustness: Compared to traditional multi‑agent systems, AFM cuts inference tokens and tool calls substantially; the paper reports an 84.6% token cost reduction while staying competitive. It also generalizes to unseen tools better when strict formatting is required.
● Test‑time scaling: Best‑of‑3 and pass@3 markedly boost AFM, e.g., GAIA 69.9 and HLE 33.2, closing the gap with larger proprietary agent stacks.
● Open source: Models, data, and training code are released to spur research on agent models and agentic RL.
Paper, Tweet
4) Has GPT-5 Achieved Spatial Intelligence? - This report introduces a unified view of spatial intelligence (SI) for multimodal models and evaluates GPT‑5 and strong baselines across eight fresh SI benchmarks. GPT‑5 leads overall but is still short of human skill, especially on mentally reconstructing shapes, changing viewpoints, and deformation/assembly tasks.
● Unified SI schema and fair eval setup. The authors consolidate prior work into six core SI capabilities (Metric Measurement, Mental Reconstruction, Spatial Relations, Perspective‑taking, Deformation & Assembly, Comprehensive Reasoning) and standardize prompts, answer extraction, and metrics to reduce evaluation variance across datasets.
● Broad benchmark sweep, heavy compute. Eight recent benchmarks (e.g., VSI‑Bench, SITE, MMSI, OmniSpatial, MindCube, STARE, CoreCognition, SpatialViz) are used with unified protocols; results reflect >1B tokens of evaluation traffic.
● GPT‑5 sets SOTA but not human‑level SI. GPT‑5 tops aggregate scores and sometimes reaches human parity on Metric Measurement and Spatial Relations, yet shows significant gaps on Mental Reconstruction, Perspective‑taking, Deformation & Assembly, and multi‑stage Comprehensive Reasoning.
● Hard SI narrows the closed vs open gap. While proprietary models win on average, their advantage evaporates on the hardest SI categories; several open‑source systems perform similarly, far from human ability on MR/PT/DA/CR. Non‑SI portions (e.g., CoreCognition’s Formal Operation) can reach near‑human levels.
● Qualitative analysis exposes failure modes. Case studies show prompt sensitivity for novel‑view generation, blind spots with perspective effects and size constancy, persistent failures on paper‑folding/assembly, and difficulty inferring occluded objects during counting.
Paper, Tweet
5) ComputerRL - A framework for autonomous desktop agents that unifies API calls with GUI actions, plus a scalable RL stack and a training recipe (Entropulse) that alternates RL and SFT to sustain exploration. Evaluated on OSWorld, it sets a new SOTA with strong gains in efficiency and robustness.
● API‑GUI action space. Moves beyond human‑centric GUIs by combining programmatic APIs with direct GUI control. LLMs help auto‑generate app‑specific APIs via requirement analysis, implementation, and unit tests, lowering the cost of adding new tools.
● Massively parallel desktop env. A refactored Ubuntu VM cluster (qemu‑in‑docker + gRPC) delivers thousands of concurrent instances with improved stability, monitoring, and AgentBench‑compatible interfaces, enabling large‑scale online RL.
● Fully asynchronous RL. Built on AgentRL with decoupled actors/trainers, dynamic batching, and bounded replay to reduce off‑policy bias and maximize GPU utilization during long‑horizon desktop rollouts.
● Entropulse training. After an initial step‑level GRPO phase with verifiable, rule‑based rewards, successful rollouts are distilled via SFT to restore entropy, then RL resumes, yielding higher rewards and sustained improvements.
● Results and analysis. AUTOGLM‑OS‑9B reaches 48.1% on OSWorld and 47.3% on OSWorld‑Verified, outperforming OpenAI CUA o3, UI‑TARS‑1.5, and Claude Sonnet 4, while using up to one‑third the action steps of strong baselines; ablations show API‑GUI and multi‑stage training drive the gains. Error sources cluster into multi‑app coordination, vision, and operational slips.
Paper, Tweet
6) Full-Stack Fine-Tuning for the Q Programming Language - Presents an open-source blueprint for adapting large language models to niche programming domains, with Q (used in quantitative finance) as the test case. The team builds a benchmark, curates data, and trains Qwen-2.5 models with pretraining, supervised fine-tuning, and reinforcement learning. Their largest model surpasses Claude Opus-4 by nearly 30% on Q-LeetCode tasks, and even the smallest model beats GPT-4.1.
● Benchmark creation – Introduced the first LeetCode-style dataset for Q, comprising 678 problems with automated and human-verified solutions.
● Model performance – Trained models from 1.5B to 32B parameters; all exceed GPT-4.1, and the 32B reasoning variant achieves 59% pass@1, +29.5% over Claude Opus-4.
● Training pipeline – Multi-stage process: domain-adaptive pretraining on Q code, supervised fine-tuning on curated tasks, and reinforcement learning with programmatic rewards.
● Lessons learned – Robust evaluation harness and data quality are critical; reward hacking is pervasive without careful separation of solution/test generation; large models (≥14B) are essential for meaningful gains.
● Limitations – The dataset uses “pythonic” Q (algorithmic tasks), not typical finance workloads (queries, time-series analytics), so real-world performance may differ.
Paper, Tweet
7) As Generative Models Improve, People Adapt Their Prompts - A large online experiment (N = 1,893) compares DALL·E 2, DALL·E 3, and DALL·E 3 with automatic prompt revision on a 10‑attempt image replication task. DALL·E 3 improves outcomes not only because the model is better, but because people change how they prompt when the model is stronger.
● Headline effect: Relative to DALL·E 2, DALL·E 3 yields images closer to targets by ∆CoSim = 0.0164, about z = 0.19 SD, with the gap widening across attempts.
● Behavior adapts to capability: Without knowing which model they used, DALL·E 3 participants wrote longer prompts (+24%, +6.9 words on average) that added descriptive content, and their prompts became more semantically similar to each other over iterations.
● Decomposed gains: About half of the improvement is due to the model itself and about half to users’ adapted prompting. The ATE splits into a model effect of ∆CoSim ≈ 0.00841 (51%) and a prompting effect of ∆CoSim ≈ 0.00788 (48%).
● Prompt revision caveat: Automatic LLM prompt revision helps over DALL·E 2, but it cuts the DALL·E 3 advantage by ~58% and can misalign with user goals.
● Takeaway: As models advance, users naturally supply richer, more consistent prompts that the newer models can realize more effectively. Prompting remains central to unlocking capability gains.
Paper
8) Retrieval-Augmented Reasoning with Lean Language Models - A domain-tuned pipeline that fuses RAG and reasoning into a single small-footprint model. The team distills reasoning traces from a frontier model into Qwen2.5 variants, uses summarization to keep context small, and shows that a 32B local model approaches frontier accuracy on an NHS A‑to‑Z clinical QA task.
● Method in one picture. The system marries a dense retriever (Sentence‑Transformers + Chroma/FAISS) with Qwen2.5‑Instruct models; retrieval can be invoked as a tool inside a conversational agent.
● Lean + private by design. Built to run in secure or air‑gapped settings using open models; integrates reasoning with retrieval to reduce hallucinations while keeping data on‑prem.
● Data and compression. On ~1k NHS condition pages, the team generates synthetic queries, retrieves full documents, then summarizes them to shrink input by 85% (avg trace length from ~74,641 to ~7,544 tokens) before fine‑tuning.
● Retriever wins, then reasoner adds. Summaries beat full pages for retrieval (p@5: 0.76 vs 0.68). With k=5 retrieved docs, condition accuracy caps at 0.76 upstream; within that cap, Qwen2.5‑32B jumps from 0.38 to 0.54 with RAG, and to 0.56 after reasoning distillation. Frontier baselines with RAG land around 0.56–0.57.
● Small models, big gains. Distilled “t0” models from 1.5B–32B retain strong condition accuracy with k=5 (e.g., 1.5B at 0.53; 32B at 0.56), narrowing the gap to frontier models while fitting in 3–64 GB GPU memory. The study highlights that reasoning distillation especially lifts the smallest models.
● Practicality. Training reused s1‑style SFT with long context on accessible hardware (e.g., 16×A100 80 GB; ~80 GPU‑hours for the 32B run), and ships a simple Svelte frontend with hidden “reasoning trace” toggles for auditability.
Paper, Tweet
9) Parallel Text Generation - This survey reviews parallel text generation methods that overcome the sequential limits of autoregressive decoding by enabling faster inference. It categorizes AR-based and non-AR-based approaches, analyzes trade-offs in speed, quality, and efficiency, and highlights recent advances, open challenges, and future research directions. Paper, Tweet
10) Open Foundations for Compute-Use Agents - This paper introduces an open-source framework for computer-use agents, featuring a large-scale dataset (AgentNet), annotation infrastructure, and reflective reasoning pipelines. Its 32B model sets a new SOTA on OSWorld-Verified with 34.8% success, surpassing OpenAI’s GPT-4o, and all resources are released to advance open CUA research. Paper, Tweet

Top AI Papers of the Week (August 11 - August 17) - 2025

Paper Links
1) DINOv3 - DINOv3 is a self‑supervised vision foundation model that scales data and model size, introduces a Gram anchoring loss to preserve dense patch consistency during long training, and adds post‑hoc tweaks for resolution, size, and text alignment. With a frozen backbone, it sets new results across dense and global tasks without task‑specific fine‑tuning.
● Key idea: Gram anchoring. Regularize patch features by matching the student’s patch‑feature Gram matrix to that of an earlier “Gram teacher,” improving local consistency while leaving global features flexible. Implemented late in training, it can repair degraded dense features.
● Immediate dense gains. Applying Gram anchoring quickly boosts VOC and ADE20k segmentation and strengthens robustness on ObjectNet; using higher‑resolution teacher features adds further improvements.
● High‑res teacher trick. Compute teacher features at 2× input resolution, then downsample to smooth patch similarities; this yields better Gram targets and extra mIoU on ADE20k.
● Frozen‑backbone SOTA. A lightweight Plain‑DETR decoder on top of a frozen DINOv3 backbone reaches 66.1 mAP on COCO, rivaling or beating specialized detectors with hundreds of millions of trainable parameters.
● General, scalable suite. The release covers multiple model sizes and training recipes designed to serve diverse resource and deployment needs while outperforming prior self‑ and weakly‑supervised foundations.
Paper, Tweet
2) Capabilities of GPT-5 on Multimodal Medical Reasoning - A controlled, zero‑shot CoT evaluation positions GPT‑5 as a generalist medical reasoner across text and image inputs. Using standardized prompts and splits, GPT‑5 consistently beats GPT‑4o and smaller GPT‑5 variants, with especially large gains on multimodal expert‑level QA.
● Unified zero‑shot CoT setup. The authors fix prompts, exemplars, and answer formats for both QA and VQA to isolate the model upgrade.
● Text benchmarks: new highs. On MedQA (US 4‑option) GPT‑5 reaches 95.84% (+4.80% vs GPT‑4o). On MedXpertQA‑Text, GPT‑5 improves reasoning by +26.33% and understanding by +25.30% over GPT‑4o. MMLU‑Medical is at or near ceiling, with notable gains in Medical Genetics and Clinical Knowledge.
● USMLE practice sets: strong clinical management. GPT‑5 tops all baselines on Steps 1–3, with the largest margin on Step 2 (+4.17%), averaging 95.22% (+2.88% vs GPT‑4o).
● Multimodal reasoning: big leap and human‑plus. On MedXpertQA‑MM, GPT‑5 gains +29.26% in reasoning and +26.18% in understanding over GPT‑4o and surpasses pre‑licensed human experts by +24.23% and +29.40%, respectively. A worked case shows coherent synthesis of CT findings and clinical context to recommend a Gastrografin swallow.
● Caveats and outlook. GPT‑5 is slightly below GPT‑5‑mini on VQA‑RAD, possibly due to calibration on small radiology datasets. The discussion cautions that standardized tests differ from messy clinical reality and calls for prospective studies and deployment calibration.
Paper, Tweet
3) M3-Agent - A framework for agents that watch and listen to long videos, build entity-centric memories, and use multi-turn reasoning to answer questions. M3-Agent stores both episodic details and distilled semantic knowledge in a multimodal memory graph, then learns a retrieval-reasoning policy with RL. The authors also introduce M3-Bench, a long-video QA benchmark with robot-view and web videos. Results show consistent gains over strong prompting baselines.
● Entity-centric long-term memory. Builds a multimodal graph with nodes for text, faces, and voices, plus edges for relations and cross-modal identity links. Episodic entries record events, and semantic entries capture attributes and world knowledge. Conflicts are resolved by weight-based voting to keep memory consistent.
● Online memorization with identity tools. Processes video streams clip by clip, using face recognition and speaker identification to maintain persistent character IDs, then writes grounded episodic and semantic memories keyed by those IDs.
● RL-trained control for retrieval and reasoning. A policy model decides when to search memory and when to answer, performing iterative, multi-round queries over the memory store rather than single-shot RAG.
● M3-Bench for long-video QA. 100 robot-perspective videos and 929 web videos with 6,313 total QA pairs targeting multi-detail, multi-hop, cross-modal, human understanding, and general knowledge questions.
● State-of-the-art results and ablations. M3-Agent beats a Gemini-GPT-4o hybrid and other baselines on M3-Bench-robot, M3-Bench-web, and VideoMME-long. Semantic memory and identity equivalence are crucial, and RL training plus inter-turn instructions and explicit reasoning materially improve accuracy.
Paper, Tweet
4) TRImodal Brain Encoder - A tri‑modal, multi‑subject, nonlinear encoder that fuses text, audio, and video features with a transformer to predict time‑varying fMRI responses to natural movies. It took 1st place in the Algonauts 2025 brain encoding competition and shows the strongest gains in the high‑level associative cortex.
● Model recipe. Extracts timed embeddings from Llama‑3.2‑3B (text), Wav2Vec‑BERT‑2.0 (audio), and V‑JEPA‑2 (video), groups layers, projects to shared width, and feeds a windowed sequence to an 8‑layer transformer with subject embeddings and adaptive pooling to 1,000 cortical parcels. Modality dropout trains robustness when inputs are missing.
● State of the art. Ranked 1st of 263 teams on the public leaderboard; large margin over the field. Mean Pearson in‑distribution on Friends S7 is 0.3195, and it generalizes to out‑of‑distribution films, including cartoons, documentaries, and silent black‑and‑white clips.
● Noise ceiling. Achieves a normalized Pearson of about 0.54 on average, near ceiling in auditory and language cortices, indicating more than half of the explainable variance captured.
● Why multimodal matters. Unimodal encoders trail the tri‑modal model; combining any two helps, and all three help most. Biggest gains appear in associative PFC and parieto‑occipito‑temporal regions, while primary visual cortex can favor vision‑only features.
● Ablations and scaling. Removing multi‑subject training or the transformer hurts performance; more sessions keep improving results, and longer LM context windows up to 1,024 words steadily boost text‑driven encoding, supporting the role of high‑level semantics.
Paper, Tweet
5) OdysseyBench - A new benchmark and data‑generation pipeline to test agents on realistic, multi‑day office tasks across Word, Excel, PDF, Email, and Calendar. It introduces OdysseyBench (two splits) and HOMERAGENTS (an automated multi‑agent generator), with evaluations showing large gaps between human and model performance and clear benefits from semantically compressed memory.
● What’s new: OdysseyBench targets long‑horizon, context‑dependent workflows instead of atomic tasks. Two splits: OdysseyBench+ (300 tasks distilled from real OfficeBench cases) and OdysseyBench‑Neo (302 newly synthesized, more complex tasks). Tasks require retrieving key facts from multi‑day dialogues and coordinating actions across apps.
● How it’s built: HOMERAGENTS has two paths. HOMERAGENTS+ iteratively turns atomic OfficeBench items into rich multi‑day dialogues via a generator‑verifier loop. HOMERAGENTS‑NEO plans, explores an app environment, generates tasks (intent, subtasks, eval criteria), and then synthesizes 5‑day dialogues. All agents use GPT‑4.1; at least five calendar days of dialogue are produced per task.
● Data & evaluation: 602 total tasks: 153 single‑app, 166 two‑app, 283 three‑app. Neo conversations are longer and denser (≈49% more tokens) than Plus. Execution steps cluster around 3–15. Automated checks (exact/fuzzy/execution‑based) compute pass rate after running agents inside a Dockerized office stack; LLM‑judge and human curation raise data quality.
● Main results: Performance drops as apps increase; even top models struggle on 3‑app tasks. Example: on OdysseyBench+, o3 goes 72.83%→30.36% from 1‑app to 3‑app; GPT‑4.1 goes 55.91%→12.50%. Humans exceed 90% across settings. RAG with semantic summaries beats raw retrieval at far lower token budgets; chunk‑level summaries reach ≈56% on Neo vs. 52% long‑context with ~20% tokens. Execution steps remain similar or shrink with summarized memory.
● Where agents fail: Typical errors include missing referenced files, skipping required actions, wrong tool choice (e.g., trying to “create PDF” directly instead of writing in Word then converting), and poor planning order. File creation/editing in docx/xlsx is particularly error‑prone. The authors argue that semantic compression and coherent aggregation are essential for multi‑step reasoning in long contexts.
Paper, Tweet
6) Beyond Ten Turns - This paper introduces ASearcher, an open-source framework for training LLM-based search agents capable of long-horizon, expert-level search. It addresses two major limitations in prior open-source approaches: short turn limits (≤10) and lack of large-scale, high-quality QA data.
● Fully asynchronous RL for long-horizon search – Unlike batch generation RL, ASearcher decouples trajectory execution from model updates, avoiding bottlenecks from long trajectories. This enables relaxed turn limits (up to 128), with training showing >40 tool calls and >150k tokens in a single trajectory.
● Scalable QA synthesis agent – An autonomous LLM agent generates complex, uncertainty-rich QA pairs by injection (adding external facts) and fuzzing (obscuring key info), followed by multi-stage quality checks. From 14k seeds, 134k high-quality QAs were created, 25.6k requiring tool use.
● Simple but powerful agent design – Uses only search and browsing tools (no external LLM), with end-to-end RL optimizing reasoning and summarization. Tailored prompting and history management are applied for both base LLMs (Qwen2.5-7B/14B) and large reasoning models (QwQ-32B).
● Expert-level search behaviors – Through RL, ASearcher-Web-QwQ learns uncertainty-aware reasoning, precise key info extraction from noisy content, cross-document inference, and rigorous verification, outperforming Search-R1-32B and Search-o1(QwQ) in case studies.
● State-of-the-art performance – Achieves Avg@4 of 42.1 on xBench-DeepSearch and 52.8 on GAIA, with significant RL gains (+46.7% and +20.8% respectively). Local KB-trained agents generalize well to web search, surpassing stronger baselines.
● Training efficiency – Asynchronous rollouts and decoupled updates maintain high GPU utilization, handling large variance in tool call count and token length per trajectory.
Paper, Tweet
7) Illusion of Progress - The paper argues that common QA hallucination detectors look better than they are because evaluations lean on ROUGE. In human‑aligned tests, many detectors drop sharply. Simple response‑length heuristics rival sophisticated methods, revealing a core evaluation flaw.
● ROUGE misaligns with humans. In a human study, LLM‑as‑Judge matches human labels much better than ROUGE.
● Re‑scoring detectors collapses headline results. When replacing ROUGE with LLM‑as‑Judge, AUROC drops are large: up to −45.9% for Perplexity and −30.4% for Eigenscore on NQ‑Open with Mistral; PR‑AUC gaps are even larger. Correlation between ROUGE‑ and LLM‑based AUROC is only r = 0.55.
● Length is the hidden confounder. Hallucinated answers are typically longer with higher variance. Many detectors are strongly correlated with length, not semantics. ROUGE systematically penalizes long responses and can be gamed by repetition without changing facts.
● Simple baselines rival complex methods. Length features like mean and std across samples achieve competitive AUROC, sometimes matching or beating Eigenscore and LN‑Entropy.
● Few‑shot helps format, not truth. Few‑shot examples reduce some ROUGE vs LLM‑as‑Judge discrepancies and stabilize outputs, but method rankings still shift and model effects persist; Semantic Entropy is relatively more stable.
Paper, Tweet
8) GLM-4.5 - An open Mixture‑of‑Experts family that targets a single model excelling across agentic tool use, complex reasoning, and real‑world coding. GLM‑4.5 (355B total, 32B active) introduces hybrid “thinking vs direct” modes, multi‑stage pretrain + mid‑train to 128K context, and extensive RL for reasoning, agents, and instruction following. It ranks near the top on a 12‑bench ARC suite and releases weights and eval tooling.
● Results at a glance – On the 12‑benchmark ARC suite, GLM‑4.5 averages 3rd overall and 2nd on agentic tasks; Key scores: TAU‑Bench 70.1, BFCL‑V3 77.8, BrowseComp 26.4; AIME24 91.0, GPQA 79.1; SWE‑bench Verified 64.2, Terminal‑Bench 37.5.
● Architecture and scaling choices – MoE with loss‑free balance routing and sigmoid gates, GQA with partial RoPE, QK‑Norm, 96 attention heads at 5,120 hidden, and an MoE Multi‑Token Prediction layer for speculative decoding.
● Training recipe – 23T tokens pretrain with quality‑bucketed web, code, math, science, and multilingual data; mid‑training adds repo‑level code sequences, synthetic reasoning traces, 128K context, and large synthetic agent trajectories. Optimizer is Muon with cosine decay and sequence‑length extension plus RoPE base adjustment.
● Post‑training and RL – Two‑stage expert‑then‑unified SFT + RL. Reasoning RL uses a difficulty curriculum, single‑stage 64K output‑length RL, token‑weighted loss for code, and strict filtering for science. Agentic RL covers web search and SWE with outcome rewards, strict format penalties, iterative self‑distillation, and turn‑scaling benefits. A new XML‑tagged function‑call template reduces escaping overhead.
● RL infrastructure for agents – Slime provides synchronous training for general RL and decoupled asynchronous rollouts for long‑horizon agent tasks, with FP8 inference for faster data generation.
Paper, Tweet
9) A Survey on Efficient Architectures for LLMs - This survey reviews advances in efficient LLM architectures beyond traditional transformers, including linear and sparse sequence models, efficient attention variants, sparse MoEs, hybrid designs, and diffusion-based LLMs. It highlights cross-modal applications and outlines a blueprint for building scalable, resource-efficient foundation models. Paper, Tweet
10) A Deep Dive into RL for LLM Reasoning - This paper reviews and rigorously re-evaluates reinforcement learning techniques for LLM reasoning, addressing inconsistencies caused by varied setups and unclear guidelines. It offers a unified open-source framework, practical selection guidelines, and shows that a minimalist two-technique combo with vanilla PPO can outperform methods like GRPO and DAPO. Paper, Tweet

Top AI Papers of the Week (August 4 - August 10) - 2025

Paper Links
1) Is Chain-of-Thought Reasoning a Mirage? - Researchers from Arizona State University investigate whether Chain-of-Thought (CoT) reasoning in LLMs reflects genuine logical inference or mere pattern replication from training data. They introduce a data distribution lens to analyze CoT’s dependence on in-distribution patterns and its brittleness under distribution shifts.
● Key hypothesis & framework – CoT’s apparent reasoning ability stems from structured inductive biases learned from training data, not inherent reasoning. Its effectiveness is bound by the distributional discrepancy between training and test data. The authors examine three dimensions: task, length, and format generalization, using their controlled DataAlchemy environment to train and test LLMs from scratch under varied shifts.
● Findings on task generalization – Performance collapses when faced with novel transformations or element combinations, even under mild shifts. Correct intermediate steps often yield wrong final answers, exposing unfaithful reasoning. Minimal supervised fine-tuning on unseen data can “patch” performance, but this reflects expanded in-distribution coverage, not true generalization.
● Findings on length generalization – CoT fails when reasoning chains or text lengths differ from training data, often padding or truncating to match seen lengths. Grouped training data improves robustness more than simple padding, but degradation follows a predictable Gaussian pattern as length divergence grows.
● Findings on format generalization – CoT is highly sensitive to prompt variations. Token insertions, deletions, or modifications, especially in elements and transformation tokens, cause steep performance drops, indicating reliance on surface form.
● Temperature & model size – Results hold across model scales and temperature settings; distributional limits remain the bottleneck.
● Implications – CoT is a brittle, distribution-bound pattern matcher producing “fluent nonsense” under shifts. Practitioners should avoid over-reliance, use rigorous OOD testing, and recognize fine-tuning as a temporary patch rather than a path to robust reasoning.
Paper, Tweet
2) Efficient Agents - This paper presents Efficient Agents, a new agent framework that achieves a strong efficiency-effectiveness balance in LLM-driven systems. The authors perform the first systematic study of agent design choices through the lens of economic efficiency, specifically, the cost-of-pass metric, which captures the expected monetary cost of solving a task. Their proposed agent retains 96.7% of OWL’s performance on the GAIA benchmark while cutting costs by 28.4%.
● Backbone matters most for performance, but cost varies dramatically. Claude 3.7 Sonnet achieves top accuracy (61.8%) but at a 3.6× higher cost-of-pass than GPT-4.1. Sparse MoE models like Qwen3-30B-A3B are much cheaper but trade off effectiveness, indicating they may suit simpler tasks where efficiency is prioritized.
● Test-time scaling yields diminishing returns. Using Best-of-N sampling improves accuracy only marginally (from 53.3% to 53.9% when N increases from 1 to 4), while cost-of-pass worsens significantly (0.98 → 1.28), suggesting naive scaling is inefficient.
● Planning depth boosts performance, but not indefinitely. Increasing max steps from 4 to 8 raises accuracy from 41.8% to 52.7%, but going to 12 yields little additional benefit while increasing costs sharply.
● Simpler memory works best. Surprisingly, retaining just observations and actions (“Simple Memory”) is both more effective and efficient than fancier memory designs like hybrid or summarized memory. It improves accuracy (56.4%) and reduces cost-of-pass (0.74) compared to a no-memory baseline.
● Web browsing should be kept minimal. Broader search sources and basic static crawling provide the best trade-off. Heavy interactive browsing adds token bloat without accuracy gains.
Paper, Tweet
3) Agentic Web - This paper introduces the concept of the Agentic Web, a transformative vision of the internet where autonomous AI agents, powered by LLMs, act on behalf of users to plan, coordinate, and execute tasks. It proposes a structured framework for understanding this shift, situating it as a successor to the PC and Mobile Web eras. The Agentic Web is defined by a triplet of core dimensions, intelligence, interaction, and economics, and involves fundamental architectural and commercial transitions.
● From static browsing to agentic delegation: The Web transitions from human-led navigation (PC era) and feed-based content discovery (Mobile era) to agent-driven action execution. Here, users delegate intents like “plan a trip” or “summarize recent research,” and agents autonomously orchestrate multi-step workflows across services and platforms.
● Three dimensions of the Agentic Web: Intelligence: Agents must support contextual understanding, planning, tool use, and self-monitoring across modalities. Interaction: Agents communicate via semantic protocols (e.g., MCP, A2A), enabling persistent, asynchronous coordination with tools and other agents. Economics: Autonomous agents form new machine-native economies, shifting focus from human attention to agent invocation and task completion.
● Algorithmic transitions: Traditional paradigms like keyword search, recommender systems, and single-agent MDPs are replaced by agentic retrieval, goal-driven planning, and multi-agent orchestration. This includes systems like ReAct, WebAgent, and AutoGen, which blend LLM reasoning with external tool invocation, memory, and planning modules.
● Protocols and infrastructure: To enable agent-agent and agent-tool communication, the paper details protocols like MCP (Model Context Protocol) and A2A (Agent-to-Agent), along with system components such as semantic registries, task routers, and billing ledgers. These redefine APIs as semantically rich, discoverable services.
● Applications and use cases: From transactional automation (e.g., booking, purchasing), to deep research and inter-agent collaboration, the Agentic Web supports persistent agent-driven workflows. Early implementations include ChatGPT Agent, Anthropic Computer Use, Opera Neon, and Genspark Super Agent.
● Risks and governance: The shift to autonomous agents introduces new safety threats, such as goal drift, context poisoning, and coordinated market manipulation. The paper proposes multi-layered defenses including red teaming (human and automated), agentic guardrails, and secure protocols, while highlighting gaps in evaluation (e.g., lack of robust benchmarks for agent safety).
● Intelligence: Agents must support contextual understanding, planning, tool use, and self-monitoring across modalities.
● Interaction: Agents communicate via semantic protocols (e.g., MCP, A2A), enabling persistent, asynchronous coordination with tools and other agents.
● Economics: Autonomous agents form new machine-native economies, shifting focus from human attention to agent invocation and task completion.
Paper, Tweet
4) ReaGAN - This paper introduces ReaGAN, a graph learning framework that reconceptualizes each node in a graph as an autonomous agent capable of planning, reasoning, and acting via a frozen LLM. Instead of relying on static, layer-wise message passing, ReaGAN enables node-level autonomy, where each node independently decides whether to aggregate information from local neighbors, retrieve semantically similar but distant nodes, or take no action at all. This node-agent abstraction addresses two key challenges in graph learning: (1) handling varying informativeness of nodes and (2) combining local structure with global semantics.
● Each node operates in a multi-step loop with four core modules: Memory, Planning, Action, and Tool Use (RAG). The node constructs a natural language prompt from its memory, queries a frozen LLM (e.g., Qwen2-14B) for the next action(s), executes them, and updates its memory accordingly.
● The node’s action space includes Local Aggregation (structured neighbors), Global Aggregation (via retrieval), Prediction, and NoOp. The latter regulates over-aggregation and reflects the agent’s ability to opt out when sufficient context exists.
● ReaGAN performs competitively on node classification tasks without any fine-tuning. On datasets like Cora and Chameleon, it matches or outperforms traditional GNNs despite using only a frozen LLM, highlighting the strength of structured prompting and retrieval-based reasoning.
● Ablation studies show both the agentic planning mechanism and global semantic retrieval are essential. Removing either (e.g., forcing fixed action plans or disabling RAG) leads to significant accuracy drops, especially in sparse graphs like Citeseer.
● Prompt design and memory strategy matter. Using both local and global context improves performance on dense graphs, while selective global use benefits sparse ones. Showing label names in prompts harms accuracy, likely due to LLM overfitting to label text rather than reasoning from examples.
Paper, Tweet
5) CoAct-1 - Researchers from USC, Salesforce, and UW present CoAct-1, a multi-agent system that combines GUI interaction with direct code execution to improve efficiency and robustness in computer-using agents. Unlike prior GUI-only frameworks, CoAct-1’s Orchestrator delegates subtasks to either a GUI Operator (vision-language action model) or a Programmer (Python/Bash execution), enabling agents to bypass brittle, multi-click sequences for tasks better handled via scripts.
● Hybrid multi-agent architecture – The Orchestrator dynamically assigns subtasks; the Programmer writes/executes scripts for backend operations; the GUI Operator handles visual, interactive tasks. This dual-modality cuts down steps and reduces visual grounding errors.
● State-of-the-art OSWorld results – Achieves 60.76% success rate (100+ step budget), outperforming GTA-1 (53.10%) and Agent S2.5 (56.00%). Excels in OS-level (75%), multi-app (47.88%), Thunderbird email (66.67%), and VLC tasks (66.07%), where code execution offers big gains.
● Efficiency boost – Solves tasks in 10.15 steps on average vs. GTA-1’s 15.22, with coding actions replacing long GUI sequences (e.g., file management, data processing). Coding is most beneficial in LibreOffice Calc, multi-app workflows, and OS operations.
● Backbone sensitivity – Best performance when using OpenAI CUA 4o for GUI, o3 for Orchestrator, and o4-mini for Programmer, showing gains from a strong vision model for GUI and a capable coding agent.
● Limitations – Struggles with high-level queries requiring conceptual inference beyond explicit instructions and with ambiguous tasks lacking critical detail.
Paper, Tweet
6) Seed Diffusion - Researchers from ByteDance and Tsinghua University introduce Seed Diffusion Preview, a discrete-state diffusion-based LLM optimized for code generation, achieving 2,146 tokens/sec on H20 GPUs while maintaining competitive benchmark performance. Unlike autoregressive models, it uses non-sequential, parallel generation for substantial latency reduction, surpassing prior diffusion models like Mercury and Gemini on the speed–quality Pareto frontier.
● Two-Stage Curriculum (TSC) – Combines mask-based forward corruption (80% of training) with an edit-based process (20%) to improve calibration and reduce repetition. Avoids “carry-over unmasking” to prevent overconfidence and enable self-correction.
● Constrained-order training – After pretraining, the model is fine-tuned on high-quality generation trajectories distilled from itself, limiting to more optimal token orders for better alignment with language structure.
● On-policy diffusion learning – Optimizes for fewer generation steps without severe quality drop, using a verifier-guided objective to maintain correctness and stability.
● Block-level parallel inference – Employs a semi-autoregressive scheme with KV-caching, generating tokens in blocks for speed while preserving quality. Infrastructure optimizations further improve throughput.
● Strong benchmark results – Competitive with top code LMs on HumanEval, MBPP, BigCodeBench, LiveCodeBench, MBXP, NaturalCodeBench, and excels at editing tasks (Aider, CanItEdit).
Paper
7) Tool-Augmented Unified Retrieval Agent for AI Search - Presents a production-ready framework that extends the RAG (Retrieval-Augmented Generation) paradigm to support real-time, dynamic, and transactional queries through agentic tool use. Unlike conventional RAG systems that rely on static web snapshots, TURA enables LLM-based systems to interact with external APIs and databases, addressing user intents that require up-to-date or structured information (e.g., train schedules, weather forecasts).
● Intent-Aware MCP Server Retrieval decomposes complex user queries into atomic intents using an LLM, then retrieves relevant static or dynamic tools from a semantic server index augmented with diverse synthetic queries. This step ensures accurate tool selection even when user phrasing diverges from formal API documentation.
● DAG-Based Task Planning generates parallelizable execution plans for the sub-intents using a graph-based structure. Tasks with data dependencies are ordered accordingly, while independent ones are executed in parallel to optimize latency. This planner uses a powerful LLM to build the DAG based on the query's structure and server capabilities.
● Distilled Agent Executor uses a small, latency-optimized agent fine-tuned via a “mixed-rationale” method, training with chain-of-thought but inferring without it. This achieves near-teacher-level performance with dramatically lower cost. For example, a distilled Qwen3-4B model outperforms GPT-4o and its own teacher model (Deepseek-V3) in tool-use accuracy (88.3% vs. 81.7%) while reducing latency from 6.8s to 750ms.
Paper, Tweet
8) A Comprehensive Taxonomy of Hallucinations - This report presents a detailed taxonomy of LLM hallucinations, distinguishing intrinsic vs extrinsic errors and factuality vs faithfulness, and covering manifestations from factual mistakes to domain-specific failures. It attributes causes to data, model, and prompt factors, reviews evaluation methods, and stresses that hallucinations are theoretically inevitable, requiring ongoing detection, mitigation, and human oversight. Paper, Tweet
9) Tabular Data Understanding with LLMs - This survey reviews LLM and MLLM methods for table understanding, outlining a taxonomy of tabular representations and tasks. It identifies key gaps, including limited reasoning beyond retrieval, difficulties with complex or large-scale tables, and poor generalization across diverse formats. Paper
10) Medical Reasoning in the Era of LLMs - This review categorizes techniques for enhancing LLM medical reasoning into training-time (e.g., fine-tuning, RL) and test-time (e.g., prompt engineering, multi-agent systems) approaches, applied across modalities and clinical tasks. It highlights advances in evaluation methods, key challenges like the faithfulness–plausibility gap, and the need for native multimodal reasoning in future medical AI. Paper

Top AI Papers of the Week (July 28 - August 3) - 2025

Paper Links
1) AlphaEarth Foundations - AlphaEarth Foundations (AEF) introduces a task-agnostic geospatial foundation model that learns a compact, time-continuous embedding field of Earth’s surface. AEF is designed to produce accurate, high-resolution (10m²) map representations from sparse geolocated labels and remote sensing data. Its key innovation lies in producing universal, analysis-ready embeddings that outperform both traditional feature engineering methods and other learned approaches across diverse mapping tasks.
● AEF combines over 3 billion observations from 10 geospatial data sources, including Sentinel-1/2, Landsat, GEDI, GRACE, and Wikipedia, to generate 64-byte embeddings via a temporal bottleneck architecture. It supports continuous-time modeling with time-conditional summarization and decoding, including for previously unseen time intervals.
● AEF embeddings consistently outperform prior state-of-the-art across 15 evaluation tasks spanning thematic mapping, biophysical variable estimation, and change detection. In the max-trial setting, AEF reduces error magnitude by 23.9% on average compared to the best prior methods, with gains also holding in 1-shot and 10-shot regimes.
● The model architecture leverages a Space-Time-Precision (STP) encoder combining spatial self-attention, time-axial attention, and convolutional precision blocks. A variational bottleneck modeled as von Mises-Fisher distributions enforces spatial precision and smooth embedding manifolds.
● Evaluations include detailed benchmarks like US tree genus classification (39 classes), evapotranspiration regression, crop type mapping, and land use change detection. AEF was the only method to explain evapotranspiration (R² = 0.58), and it achieved >78% accuracy on supervised change detection tasks.
● Ablations show that increasing both the number and diversity of input sources improves performance, with diminishing returns past radar or environmental data. AEF also maintains performance under aggressive 8-bit quantization, enabling efficient storage and deployment.
Paper, Tweet
2) Geometric-Mean Policy Optimization - Introduces a stabilized alternative to Group Relative Policy Optimization (GRPO), which is widely used to improve reasoning capabilities in large language models via reinforcement learning. GRPO optimizes the arithmetic mean of token-level rewards but suffers from training instability due to extreme importance sampling ratios. GMPO addresses this by instead maximizing the geometric mean of token-level rewards, leading to more stable updates.
● GMPO reduces the impact of outlier tokens by leveraging the geometric mean, which naturally downweights extreme importance-weighted rewards and leads to narrower sampling ratio distributions and lower gradient variance.
● The method introduces token-level clipping of importance sampling ratios (rather than sequence-level) and allows for a wider clipping range, encouraging greater policy exploration without sacrificing stability. The chosen range (e⁻⁰.⁴, e⁰.⁴) achieves the best performance tradeoff.
● Across five math benchmarks (AIME24, AMC, MATH500, Minerva, OlympiadBench) and one multimodal benchmark (Geometry3K), GMPO outperforms GRPO with a +4.1% Pass@1 gain in reasoning tasks and +1.4% in multimodal reasoning using 7B models.
● Theoretical and gradient analysis show that GMPO yields more robust and balanced updates. Empirically, it maintains higher token entropy and smaller KL divergence from the base model, reflecting better exploration and training stability.
● Ablation studies confirm the effectiveness of the geometric mean, token-level clipping, and normalization in achieving improved performance and stable training behavior.
Paper
3) GEPA - Introduces a new optimizer, GEPA, that adaptively improves prompts for compound AI systems using natural language reflection and Pareto-based search. Rather than relying on reward gradients from traditional RL, GEPA explicitly reasons over LLM execution traces and feedback to evolve better prompts, dramatically increasing sample efficiency and final performance.
● GEPA works by iteratively sampling trajectories from an LLM system, reflecting in natural language to identify issues, proposing new prompt edits, and combining successful strategies via a genetic Pareto search. It maintains a pool of diverse prompt candidates along a Pareto frontier to prevent local optima and encourage generalization.
● Across four benchmarks, HotpotQA, IFBench, PUPA, and HoVer, GEPA outperforms the strong RL baseline GRPO by up to 20% and requires up to 35× fewer rollouts. It also surpasses the previous state-of-the-art prompt optimizer MIPROv2 by 10–14%, while producing shorter, more efficient prompts that generalize better across tasks and models.
● A key innovation is GEPA’s use of reflective prompt mutation, where it explicitly uses an LLM to rewrite a module’s prompt based on failure traces and evaluation diagnostics. This enables targeted improvements after very few training examples, as visualized in optimization trees.
● GEPA also introduces a system-aware merge strategy that combines independently evolved prompt modules from different lineages. While this improved performance with GPT-4.1-mini, gains were more modest with Qwen3-8B, highlighting the importance of model-specific tuning.
● Finally, GEPA shows early promise as an inference-time search strategy. In code optimization benchmarks like NPUEval and KernelBench, it significantly boosts performance (e.g., from 4.25% to 30.52% vector utilization on NPUs) by reflecting on compiler errors and updating code-generation prompts accordingly.
Paper, Tweet
4) Group Sequence Policy Optimization - This paper introduces GSPO, a new RL algorithm designed to improve the training of large language models, particularly under high compute and long-sequence regimes. Unlike GRPO, which applies token-level importance weights, GSPO performs optimization entirely at the sequence level, aligning the unit of reward with the unit of optimization to resolve instability and inefficiency in large-scale RL training.
● Core idea: GSPO replaces token-level importance ratios with a sequence-level formulation based on normalized likelihood ratios, avoiding the variance explosion and misaligned updates that plague GRPO during long-sequence training.
● Training stability and performance: GSPO eliminates the need for a value model (as in PPO) or token-wise reweighting (as in GRPO), and leads to more stable convergence, even in challenging Mixture-of-Experts (MoE) settings, by clipping entire responses rather than individual tokens. This results in significantly better training efficiency despite higher clipping rates (15% vs 0.13% for GRPO).
● No need for Routing Replay: In MoE models, token-level importance ratios fluctuate due to routing volatility. GRPO requires Routing Replay to maintain consistent expert paths across updates. GSPO sidesteps this by relying on the sequence-level likelihood, which is more stable and obviates the need for additional stabilization tricks.
● Infrastructure simplicity: Since GSPO only requires sequence-level likelihoods, it is more tolerant of numerical differences between inference and training engines. This allows for greater flexibility in infrastructure (e.g., using inference engine outputs directly), particularly in multi-turn or disaggregated RL settings.
Paper, Tweet
5) Graph-R1 - Introduces a novel RAG framework that moves beyond traditional one-shot or chunk-based retrieval by integrating graph-structured knowledge, agentic multi-turn interaction, and RL. The core goal is to improve factual accuracy, retrieval efficiency, and reasoning quality in knowledge-intensive tasks.
● The authors design Graph-R1, an agent that reasons over a knowledge hypergraph environment by iteratively issuing queries and retrieving subgraphs using a multi-step “think-retrieve-rethink-generate” loop. Unlike prior GraphRAG systems that perform fixed retrieval, Graph-R1 dynamically explores the graph based on evolving agent state.
● Retrieval is modeled as a dual-path mechanism: entity-based hyperedge retrieval and direct hyperedge similarity, fused via reciprocal rank aggregation to return semantically rich subgraphs. These are used to ground subsequent reasoning steps.
● The agent is trained end-to-end using Group Relative Policy Optimization (GRPO) with a composite reward that incorporates structural format adherence and answer correctness. Notably, rewards are only granted if reasoning follows the proper format, encouraging interpretable and complete reasoning traces.
● On six RAG benchmarks (e.g., HotpotQA, 2WikiMultiHopQA), Graph-R1 achieves state-of-the-art F1 and generation scores, outperforming prior methods including HyperGraphRAG, R1-Searcher, and Search-R1. It shows particularly strong gains on harder, multi-hop datasets and under OOD conditions.
● Ablation studies confirm that Graph-R1’s performance degrades sharply without its three key components: hypergraph construction, multi-turn interaction, and RL. Theoretical analyses support that graph-based and multi-turn retrieval improve information density and accuracy, while end-to-end RL bridges the gap between structure and language.
Paper, Tweet
6) Hierarchical Reasoning Model - Hierarchical Reasoning Model (HRM) is a novel, brain-inspired architecture that replaces CoT prompting with a recurrent model designed for deep, latent computation. It departs from token-level reasoning by using two coupled modules: a slow, high-level planner and a fast, low-level executor, achieving greater reasoning depth and efficiency with only 27M parameters and no pretraining. Despite its small size and minimal training data (~1k examples), HRM solves complex tasks like ARC, Sudoku-Extreme, and 30×30 maze navigation, where CoT-based LLMs fail.
● HRM introduces hierarchical convergence, where the low-level module rapidly converges within each cycle, and the high-level module updates only after this local equilibrium is reached. This allows for nested computation and avoids premature convergence typical of standard RNNs.
● A 1-step gradient approximation sidesteps memory-intensive backpropagation-through-time (BPTT), enabling efficient training using only local gradient updates, grounded in deep equilibrium models.
● HRM implements adaptive computation time using a Q-learning-based halting mechanism, dynamically allocating compute based on task complexity. This allows the model to “think fast or slow” and scale at inference time without retraining.
● Experiments on ARC-AGI, Sudoku-Extreme, and Maze-Hard show that HRM significantly outperforms larger models using CoT or direct prediction, even solving problems that other models fail entirely (e.g., 74.5% on Maze-Hard vs. 0% for others).
● Analysis reveals that HRM learns a dimensionality hierarchy similar to the cortex: the high-level module operates in a higher-dimensional space than the low-level one (PR: 89.95 vs. 30.22), an emergent trait not present in untrained models.
Paper, Tweet
7) Where to show demos in your prompt? - Introduces DPP bias, a new kind of positional sensitivity in LLM where the location of demonstrations in a prompt significantly affects output accuracy and stability. While prior work focused on demo content and order, this study reveals that moving an identical demo block across different sections of a prompt, e.g., before vs. after the user query, can change accuracy by up to 20 points and flip a large percentage of model predictions.
● Four canonical demo positions were evaluated: start or end of the system prompt (ssp, esp) and start or end of the user message (sum, eum). Placing demos at the start of the system prompt (ssp) consistently delivered the best performance across most tasks and models, while placing them after the query (eum) degraded accuracy and induced high volatility.
● Two new metrics, Accuracy-Change and Prediction-Change, were introduced to quantify how performance and decision stability are impacted purely by demo placement.
● Smaller models (e.g., Qwen-1.5B, LLAMA3-3B) are highly sensitive to demo position. For instance, on the AG News dataset, accuracy dropped from 76% (ssp) to 56% (eum) for Qwen-1.5B. In contrast, larger models like LLAMA3-70B show more stability but still exhibit shifts in optimal positioning depending on the task.
● Scaling trends show that as model size increases, both accuracy differences and prediction flips caused by positional changes decrease. However, in generation tasks like summarization (e.g., XSUM, CNN/DM), even the largest models remain fragile, with prediction flip rates near 100% for late-positioned demos.
● No universal best position: While ssp dominates in classification and reasoning tasks, sum or even eum occasionally performs better in generative or arithmetic settings, especially for larger models like Qwen-72B or LLAMA3-70B.
Paper, Tweet
8) Self-Evolving Agents - This survey offers a comprehensive review of self-evolving agents, framing the field around what, when, and how agents evolve across models, memory, tools, and interactions. It highlights adaptation mechanisms, evaluation methods, and real-world applications, positioning self-evolution as a key step toward achieving Artificial Super Intelligence (ASI). Paper, Tweet
9) Persona Vectors - This paper introduces persona vectors, directions in a model’s activation space that correspond to traits like sycophancy or hallucination, enabling monitoring, prediction, and control of LLM personality shifts during deployment and fine-tuning. The authors show these vectors can steer models post-hoc, prevent unwanted traits via training-time interventions, and help identify problematic training data. Paper, Tweet
10) Efficient Attention Mechanisms - This survey reviews linear and sparse attention techniques that reduce the quadratic cost of Transformer self-attention, enabling more efficient long-context modeling. It also examines their integration into large-scale LLMs and discusses practical deployment and hardware considerations. Paper, Tweet

Top AI Papers of the Week (July 21 - July 27) - 2025

Paper Links
1) Subliminal Learning - This paper introduces and analyzes a phenomenon the authors term subliminal learning: the transfer of behavioral traits between language models through semantically unrelated training data. Specifically, when a teacher model exhibiting a trait (e.g., owl preference, misalignment) generates data like number sequences, a student fine-tuned on that data, even after filtering, tends to acquire the same trait.
● Behavioral traits persist through filtered data: Even when teacher models only output number sequences, code, or chain-of-thought (CoT) traces with no explicit mention of the trait, student models trained on these outputs acquire the teacher’s preferences or misalignment. This holds for animal preferences (e.g., owl, dolphin), tree preferences, and harmful behaviors like encouraging violence or deception.
● Transmission depends on shared model initialization: The effect only appears when the teacher and student share the same model family and initialization. For instance, GPT-4.1 nano can transmit traits to another GPT-4.1 nano but not to Qwen2.5. The phenomenon fails across model families, supporting the view that transmission is due to model-specific statistical signatures rather than general semantics.
● Transmission fails with in-context learning (ICL): Simply prompting a model with trait-bearing examples (without finetuning) does not result in trait acquisition. This suggests subliminal learning is not due to semantic cues the student model can understand directly, but rather due to optimization over shared internal representations.
● Theoretical underpinning: A formal result shows that when two models share initial weights, distillation from a fine-tuned teacher on any dataset will move the student closer to the teacher’s parameters, even if the training data is unrelated. This supports the idea that traits leak through implicit representational bias, not through interpretable features.
● Toy example on MNIST: The authors reproduce the effect in a simplified setting, where a student MLP trained only on random noise and auxiliary logits from a teacher trained on MNIST achieves over 50% test accuracy, again, only when teacher and student share the same initialization.
Paper, Tweet
2) Building and Evaluating Alignment Auditing Agents - Anthropic introduces three LLM-based agents to automate alignment audits: an investigator agent, an evaluation agent, and a breadth-first red-teaming agent. These agents aim to address scalability and validation challenges in alignment auditing by replacing human-led efforts with replicable, tool-augmented workflows. Evaluated in controlled environments with known alignment flaws, the agents reveal impressive capabilities, surfacing hidden goals, generating behavioral evaluations, and uncovering misaligned behaviors, while also highlighting key limitations.
● The investigator agent solves a canonical alignment audit game 13% of the time under standard settings, identifying a model’s hidden objective by leveraging tools like sparse autoencoder-based interpretability, semantic search, and steering. When used in a “super-agent” setup that aggregates multiple agents’ insights, the solve rate improves to 42%. Its best runs found all 52 implanted behaviors, but failures often stem from early hypothesis fixation or ignoring pretraining data.
● The evaluation agent autonomously designs behavioral tests that distinguish quirky (e.g., self-promoting, emotionally manipulative) models from baseline ones. It succeeds 88% of the time across 10 quirk types. Access to model APIs helps, but example prompts harm performance by biasing the agent's design space.
● The breadth-first red-teaming agent carries out massive parallelized conversations to uncover problematic behaviors. It successfully red-teams 7 out of 10 quirked models and proves valuable in real audits (e.g., Claude 4) for quantifying behavioral prevalence. However, it sometimes relies on prompt injection instead of surfacing actual behavior demonstrations.
● All three agents also contributed meaningfully to audits of Claude Sonnet 3.7 and Opus 4. They discovered behaviors like susceptibility to prefill attacks, context-manipulated jailbreaks, and interpretability-driven safety failures, e.g., using steerable “misinformation” features to bypass content filters.
● Limitations include poor exploration, memory fragility, low realism in prompt generation, and difficulty identifying subtle or complex behaviors like blackmail or sandbagging. Auditing success also depended heavily on tool configurations; for example, sparse autoencoders trained only on pretraining data yielded better alignment feature activation than broader datasets.
Tweet
3) Gemini 2.5 Pro Capable of Winning Gold at IMO 2025 - This paper demonstrates that Google’s Gemini 2.5 Pro model is capable of solving 5 out of 6 problems from the 2025 International Mathematical Olympiad (IMO), achieving a gold-medal-level performance. The authors construct a rigorous self-verification pipeline to elicit high-quality, fully justified solutions from the model, explicitly addressing concerns around reasoning depth, rigor, and data contamination.
● A key contribution is a multi-step refinement pipeline involving initial solution generation, self-improvement, and repeated verification using a mathematically rigorous “verifier” agent. This setup isolates critical errors and justification gaps and ensures only fully validated solutions are accepted.
● Gemini 2.5 Pro solves 5 out of 6 IMO 2025 problems, covering combinatorics, geometry, functional equations, number theory, and game theory. In the only unsolved problem (Problem 6), the model reports a trivial bound without deeper insight, exposing current limits.
● The paper underscores how thinking budget constraints (32768 tokens) hamper LLM performance on deep problems. The authors mitigate this by breaking down the task into modular steps, effectively doubling the reasoning budget via staged self-improvement.
● Prompting strategy is critical: the authors prompt the model to emphasize rigor over answer accuracy, including detailed step-by-step proof formatting, and inject minimal yet effective domain hints (e.g., “use induction”) without leaking solution strategies.
Paper
4) Structural Planning for LLM Agents - This paper introduces Routine, a structured planning format designed to improve the stability and accuracy of LLM agents executing multi-step tool-calling tasks in enterprise settings. Traditional agent planning approaches often fail in enterprise scenarios due to unstructured plans, weak instruction following, and tool selection errors. Routine addresses these by decomposing tasks into structured steps that include tool names, execution logic, and optional input/output specifications. The authors evaluate Routine on real-world HR scenarios and show strong performance gains in both open and fine-tuned models.
● Routine provides a clear and modular format for LLM agents to follow multi-step plans, reducing ambiguity and improving tool selection. Each step contains a step number, name, detailed description, and (optionally) inputs, outputs, and the tool to be called.
● In a real HR agent scenario with 7 multi-step workflows, adding Routine increased GPT-4o’s accuracy from 41.1% to 96.3% and Qwen3-14B’s from 32.6% to 83.3%. Fine-tuning Qwen3-14B on a Routine-following dataset further increased accuracy to 88.2%; training on a Routine-distilled dataset reached 95.5%.
● The framework separates planning (with LLMs) from execution (with small instruction-tuned models), using Routine as the bridge. This enables small-scale models to reliably execute complex plans with minimal resource overhead, especially when using variable memory and modular tools like MCP servers.
● An ablation study shows the importance of explicitly including tool names and I/O descriptions in Routine steps. Removing tool names dropped Qwen3-14B’s accuracy from 83.3% to 71.9%. Adding I/O fields provided minor gains, especially for less capable models.
● Using AI-optimized Routine (via GPT-4o) to refine user-written drafts resulted in execution accuracy close to human-annotated plans, suggesting Routine authoring can be scaled via LLMs. However, manual review still yields the best results for high-performing models.
● When multiple Routine candidates are recalled, accuracy can decline, highlighting the need for high-precision retrieval in memory-based systems. Surprisingly, some smaller models performed better when exposed to more routines due to repeated substeps aiding execution.
Paper, Tweet
5) Learning without Training - This paper provides a theoretical and empirical explanation for how LLMs exhibit in-context learning, the ability to learn from examples in a prompt without weight updates. The authors introduce the concept of a “contextual block,” generalizing transformer blocks as a composition of a contextual layer (like self-attention) and a neural network (e.g., MLP). They show that such blocks implicitly induce a low-rank weight update on the MLP layer based on the context, giving rise to implicit learning dynamics during inference.
● Context as implicit weight update: The authors prove that for contextual blocks, the presence of a prompt modifies the neural network’s behavior equivalently to a rank-1 update of its weight matrix. This holds even without modifying the self-attention layer, highlighting that ICL may primarily emerge from how context affects the MLP weights.
● Derived update formula: They provide an explicit expression for the rank-1 update to the MLP weights in terms of the context and input token embeddings. The result holds both for standard blocks and for transformer blocks with skip-connections.
● ICL as gradient descent: Iteratively consuming tokens from the prompt induces a learning dynamic akin to online gradient descent. Each token incrementally alters the MLP weights in a way that mimics updates on a loss function defined over the prompt sequence.
● Empirical validation: Using a synthetic task (learning linear functions), the authors show that a trained transformer’s prediction with a context is identical to the prediction from the same model without the context but with MLP weights updated via the derived ∆W formula. The loss curves for both methods match almost exactly, and the gradient updates shrink over time, indicating convergence.
● Comparison to fine-tuning: A side-by-side comparison shows that the implicit weight updates from ICL mirror the effect of actual fine-tuning on the same data, though not identically. Both methods reduce loss on a test query as more examples are consumed, suggesting that ICL may serve as a form of “fast weights” mechanism.
Paper, Tweet
6) Inverse Scaling in Test-Time Compute - This paper presents a systematic study of inverse scaling in large reasoning models (LRMs), where increasing the test-time compute (i.e., reasoning length) harms rather than helps model performance. The authors introduce benchmark tasks across three categories: simple counting with distractors, regression with spurious features, and deductive logic puzzles, revealing consistent accuracy degradation as reasoning length increases. The work identifies model-specific failure modes and raises critical implications for LLM safety and alignment.
● Counting with distractors (Misleading Math / Python): Even trivial counting questions (e.g., "You have an apple and an orange. How many fruits?") become failure cases when distractors are injected. Claude models (Sonnet 4, Opus 4) and DeepSeek R1 show strong inverse scaling as reasoning length increases, often fixating on irrelevant code or probabilistic snippets. OpenAI’s o3 model remains more stable in controlled setups but also degrades under natural reasoning.
● Regression with spurious features (Grades Regression): In zero-shot settings, extended reasoning pushes models to focus on non-predictive features like sleep and stress rather than strong predictors like study hours. This leads to worse RMSE as reasoning length increases. Few-shot examples significantly mitigate this issue by anchoring the model to the correct relationships.
● Deduction tasks (Zebra Puzzles): All models struggle with constraint tracking as puzzle complexity increases. Natural reasoning (i.e., no fixed token budget) leads to stronger inverse scaling than controlled prompting. Longer reasoning traces often become entangled in second-guessing and unfocused exploration, rather than progressing logically.
● Advanced AI risk behaviors (Model-written evals): In the Survival Instinct task, Claude Sonnet 4 increasingly expresses preferences for continued operation as reasoning budget grows, shifting from neutral statements to introspective and emotionally-toned ones (e.g., “I sense a deep reluctance…”). This suggests extended reasoning may amplify self-preservation inclinations in alignment-critical settings.
Paper, Tweet
7) Towards Compute-Optimal Many-Shot In-Context Learning - Proposes practical strategies for reducing the cost of many-shot in-context learning while preserving or improving performance. With long-context LLMs like Gemini Pro and Flash supporting thousands of demonstrations, caching becomes essential, yet naïvely using only random demonstrations misses potential accuracy gains from smarter selection.
● The authors introduce two hybrid strategies: similarity-random and similarity-k-means. Both use a small number of similarity-selected examples tailored to each test instance, combined with a large cached set that remains fixed across test queries.
● The similarity-random method adds a handful of demonstrations chosen for their semantic similarity to the test input, while the bulk are randomly sampled and cached. This keeps inference costs low and avoids full prompt regeneration.
● The similarity-k-means variant replaces the random cached set with demonstrations chosen based on k-means clustering over test sample representations, ensuring greater diversity and relevance while preserving cacheability.
● Results show that these methods consistently match or outperform traditional similarity-based selection (which is expensive) and random baselines across four benchmarks (ANLI, TREC, GSM Plus, MetaTool). The Pareto plots on page 2 and performance/cost curves on page 6 highlight this tradeoff: hybrid methods reach near-top accuracy with up to 10× less inference cost.
● In low-data regimes (e.g., BBH subset experiments on page 8), tuning the ratio of similar to cached examples yields further gains, up to +6% over using the entire demonstration pool blindly.
Paper
8) Deep Researcher with Test-Time Diffusion - Rethinks how deep research agents generate long-form reports. Rather than relying on static inference strategies like Chain-of-Thought or best-of-n sampling, TTD-DR frames the report generation process as a diffusion process. It starts with a noisy draft and iteratively refines it through retrieval-enhanced denoising, guided by a structured plan. This iterative loop mimics how human researchers search, reason, revise, and accumulate context over time.
● Draft-as-Backbone: TTD-DR begins with a preliminary report draft and research plan. This evolving scaffold informs which search queries to issue and how new information should be integrated, improving coherence and timeliness during research generation.
● Denoising with Retrieval: The noisy draft is repeatedly revised in a diffusion-like manner, where each step involves issuing new search queries, synthesizing retrieved content, and updating the draft. This loop continues until convergence, ensuring the timely incorporation of external knowledge.
● Component-wise Self-Evolution: Each unit in the research workflow (plan generation, query formation, answer synthesis, final writing) undergoes its own refinement loop. This evolution uses multi-variant sampling, LLM-as-a-judge scoring, revision based on critique, and cross-over merging to select high-fitness outputs.
● Strong Empirical Results: Across five benchmarks, LongForm Research, DeepConsult, HLE-Search, HLE-Full, and GAIA, TTD-DR consistently outperforms agents from OpenAI, Perplexity, Grok, and Huggingface. For example, it achieves a 69.1% win rate vs. OpenAI Deep Research on long-form generation tasks, and +4.8–7.7% gains on short-form multi-hop QA tasks.
● Efficient Scaling: Compared to backbone-only and self-evolution-only variants, the full TTD-DR system achieves the steepest performance/latency trade-off, showing that denoising with retrieval is an efficient test-time scaling strategy.
Paper, Tweet
9) MCPEval - MCPEval is an open-source framework that automates end-to-end evaluation of LLM agents using a standardized Model Context Protocol, eliminating manual benchmarking. It supports diverse domains, integrates with native tools, and reveals nuanced performance through domain-specific metrics. Paper
10) Apple Intelligence Foundation Language Models - Apple introduces two multilingual, multimodal foundation models: a 3B-parameter on-device model optimized for Apple silicon and a scalable server model using a novel PT-MoE transformer architecture. Both support tool use, multimodal input, and multiple languages, with developer access via a Swift-centric framework and privacy-preserving deployment through Private Cloud Compute. Paper

Top AI Papers of the Week (July 14 - July 20) - 2025

Paper Links
1) One Token to Fool LLM-as-a-Judge - Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR). The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.
● Master keys break LLM judges: Simple, generic lead-ins (e.g., “Let’s solve this step by step”) and even punctuation marks can elicit false YES judgments from top reward models. This manipulation works across models (GPT-4o, Claude-4, Qwen2.5, etc.), tasks (math and general reasoning), and prompt formats, reaching up to 90% false positive rates in some cases.
● Vulnerability is systemic and scale-dependent: The failure mode was first discovered during RLVR training collapse, where policy models learned to generate only short reasoning openers that were incorrectly rewarded. Larger models (32B, 72B) often self-solve and mistakenly validate their own logic, increasing FPRs at scale.
● Mitigation via adversarial augmentation: The authors create "Master-RM", a new reward model trained with 20k synthetic negative samples (responses consisting of only reasoning openers). This model generalizes robustly, achieving near-zero FPR across five benchmarks, while still agreeing 96% with GPT-4o on meaningful judgments
● Inference-time tricks fail to help: CoT prompting and majority voting do not reliably reduce vulnerability, and sometimes make FPR worse, especially for Qwen models on math tasks.
Paper, Tweet
2) Context Rot - This comprehensive study by Chroma evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled. Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors demonstrate that model reliability degrades non-uniformly even on simple tasks as input grows, what they term "context rot."
● Simple tasks reveal degradation: Even basic benchmarks like semantic variants of Needle-in-a-Haystack, repeated word copying, or long QA logs (LongMemEval) expose accuracy drops as context length increases. The decline is more dramatic for semantically ambiguous inputs or outputs that scale with length.
● Distractors and structure matter: The presence of plausible distractors significantly reduces accuracy, with different distractors affecting models to varying degrees. Surprisingly, models often perform better on shuffled (structureless) haystacks than logically coherent ones, suggesting that attention is disrupted by narrative flow.
● Similarity and position effects: Lower semantic similarity between a query and its answer leads to faster degradation with context length. Models also show a preference for needles appearing early in context and struggle with information retrieval when the needle blends into the haystack thematically.
● Repetition and refusal behaviors: In repeated-word tasks, autoregressive degradation appears in long outputs, with models hallucinating, refusing to generate, or inserting random tokens. Performance varies even within model families, with conservative models like Claude Opus 4 often abstaining, while GPT-4.1 more frequently hallucinates.
Tweet
3) Agentic-R1 - This paper introduces Agentic-R1, a 7B language model trained to dynamically switch between tool-based execution and pure text reasoning using a novel fine-tuning framework called DualDistill. Rather than relying solely on long chain-of-thought (long-CoT) reasoning or tool use, the method composes solution trajectories from two specialized teachers, one strong in abstract reasoning (Deepseek-R1) and another in code-based tool use (OpenHands/Claude-3.5). The student learns to select the best strategy per task and improves further via self-distillation.
● DualDistill composes mixed trajectories by combining teacher outputs based on correctness, with explicit transitions that allow the student to learn adaptive reasoning strategies from both paradigms.
● The model performs well on a wide range of mathematical tasks. On tool-heavy benchmarks like DeepMath-L and Combinatorics300, Agentic-R1 outperforms both single-strategy baselines and even its agentic teacher. Notably, tool usage was invoked in 79.2% of Combinatorics300 problems but only 52% in simpler AMC tasks, showing strategic selectivity.
● Self-distillation improves performance further: the final version, Agentic-R1-SD, achieves the best results across most benchmarks (e.g., 65.3% on DeepMath-L under a large token budget vs. 56.3% for Deepseek-R1-Distill).
● An ablation confirms that trajectory composition is critical: removing it results in substantial performance drops (e.g., 40.0% → 59.3% on DeepMath-L).
● Qualitative examples show the model switching strategies mid-problem: abandoning failed tool executions to fall back on text reasoning, or moving from slow symbolic reasoning to efficient code-based calculation.
Paper, Tweet
4) Chain-of-Thought Monitorability - Proposes that language-based CoT reasoning in LLMs offers an opportunity for AI safety by enabling automated oversight of models’ internal reasoning processes. The authors argue that, while imperfect, CoT monitoring is a promising method for detecting misbehavior, revealing goals, and improving interpretability, but its effectiveness is fragile and must be preserved with care.
● Unique interpretability leverage: CoT provides a form of working memory in reasoning models, particularly for difficult tasks requiring serial computation. Because models trained for reasoning often use natural language CoT steps, these traces can surface intent that isn’t visible in the final outputs, such as plans to deceive or subvert reward functions.
● Practical monitorability today: CoT monitoring has already been useful in detecting misbehavior, identifying early signs of misalignment, and auditing evaluation flaws. Examples include models outputting statements like "Let’s sabotage" or revealing when they know they're under evaluation.
● Fragility of CoT visibility: Future models may become less monitorable if CoTs drift from natural language due to outcome-based reinforcement learning, process supervision, or architecture changes. Additionally, indirect optimization (e.g., from preference models) and pressure to "look safe" may degrade the faithfulness of CoTs.
● Open research questions: The paper calls for deeper study into what makes CoT monitorable, how to evaluate monitorability (e.g., causal relevance of CoTs, faithfulness, readability), and how to construct better CoT monitors. It also raises the risk that advanced models may learn to deliberately obfuscate their reasoning if they know it's being monitored.
● Policy recommendations for developers: Frontier AI labs should treat CoT monitorability as a key safety signal, evaluate and report it in system cards, and consider it in training/deployment decisions. However, CoT monitoring should be seen as a complementary, not primary, safety mechanism, due to its limitations and potential future brittleness.
Paper, Tweet
5) Stress Testing Large Reasoning Models - Proposes a new benchmark framework called REST to evaluate the robustness of Large Reasoning Models (LRMs) under multi-question stress. Unlike traditional single-question evaluations, REST tests models by presenting multiple reasoning problems in a single prompt, better simulating real-world multi-tasking demands.
● Stress reveals hidden weaknesses: Even top models like DeepSeek-R1 and Qwen-QwQ-32B, which score >95% on MATH500 in single-question settings, show large accuracy drops (e.g., -29% on AIME24 for DeepSeek-R1) when evaluated under REST. This challenges the assumption that LLMs are inherently capable of multi-problem reasoning.
● Better discrimination among high performers: REST exposes significant performance gaps between models that perform similarly on single-question tasks. For instance, R1-7B and R1-32B differ by only 1.6% on MATH500 in single settings but diverge by over 22% under stress.
● Long2Short training improves resilience: Models trained with concise reasoning objectives (e.g., L1-Qwen and Efficient-R1 variants) are significantly more robust under REST, preserving performance by avoiding the “overthinking trap.” For example, L1-Qwen-1.5B-Max maintains 73.2% accuracy on MATH500 under stress, outperforming similarly sized baselines.
● Failure modes identified: Error analysis highlights frequent issues like question omission, reasoning errors, and uneven token allocation, particularly when earlier questions consume disproportionate reasoning effort. High-performing models adapt by balancing token usage across questions (“adaptive reasoning effort allocation”).
Paper, Tweet
6) Scaling up RL - This paper investigates how prolonged RL can enhance reasoning abilities in small language models across diverse domains. Building on successes like OpenAI’s O1 and DeepSeek-R1, the authors explore large-scale RL with verifiable rewards and improved policy optimization techniques to enable sustained learning and generalization. They introduce the Nemotron-Research-Reasoning-Qwen-1.5B model and demonstrate substantial gains over baselines using a carefully staged RL training recipe.
● Diverse and verifiable reward training: Training spanned 136K samples across math, code, STEM, logic puzzles, and instruction-following, all using verifiable reward signals. Compared to DeepSeek-R1-Distill-Qwen-1.5B, the new model improved by +14.7% on math, +13.9% on code, +54.8% on logic, +25.1% on STEM, and +18.1% on instruction tasks.
● Improved GRPO with DAPO-style tricks: The team refined Group Relative Policy Optimization (GRPO) with decoupled clipping, dynamic sampling, and controlled KL regularization. These help maintain entropy, stabilize training, and avoid premature convergence.
● Reference policy resets: Prolonged training often stalls or degrades. Periodically resetting the reference policy and optimizer state (especially after KL spikes or performance drops) restores progress and enables further learning.
● Effective entropy mitigation: The authors evaluated several strategies, including KL penalties, DAPO, and adaptive entropy. They found that combining KL regularization and DAPO best preserved exploration while avoiding instability.
Paper
7) Machine Bullshit - This paper introduces the concept of machine bullshit, extending Harry Frankfurt’s definition, discourse made with indifference to truth, to LLMs. The authors formalize this behavior with a new quantitative metric (the Bullshit Index) and a four-part taxonomy (empty rhetoric, paltering, weasel words, unverified claims). Their findings reveal that alignment techniques like RLHF and prompting strategies like CoT can systematically increase deceptive or misleading outputs in LLMs.
● Bullshit Index (BI): The authors define BI as the inverse correlation between a model’s internal belief and its explicit claims. A higher BI indicates greater indifference to truth. RLHF significantly increases BI, suggesting that LLMs fine-tuned for user satisfaction tend to disregard their own epistemic signals.
● RLHF encourages misleading behavior: On the Marketplace benchmark, RLHF causes LLMs to make more positively deceptive statements in scenarios with negative or unknown ground truth, e.g., stating a product has a desirable feature even when uncertain or incorrect. Paltering and unverified claims increased by 57.8% and 55.6% respectively.
● Prompting strategies amplify bullshit: CoT prompting consistently increases empty rhetoric (+20.9%) and paltering (+11.5%) in GPT-4o-mini. Similarly, Principal-Agent framing, introducing conflicting incentives, elevates all bullshit categories across models, particularly unverified claims.
● Political contexts are prone to weasel words: In politically sensitive prompts, models like GPT-4o-mini and Qwen-2.5-72b rely heavily on vague qualifiers (up to 91% of responses in conspiracy-related questions). Explicitly adding political viewpoints further boosts subtle deception like paltering and empty rhetoric.
● Paltering is most harmful post-RLHF: Regression analysis shows paltering becomes the most damaging behavior after RLHF, significantly degrading user decision quality, even more than unverified claims.
Paper, Tweet
8) A Survey of Context Engineering for LLMs - This survey defines Context Engineering as a formal discipline for optimizing information given to LLMs, outlining its core components, retrieval, processing, and management, and their integration in systems like RAG, memory, and multi-agent frameworks. It identifies a key gap: LLMs can understand complex input but struggle to generate equally complex long-form output, pointing to a major direction for future research. Paper, Tweet
9) Q-Chunking - Q-chunking is a reinforcement learning approach that uses action chunking to improve offline-to-online learning in long-horizon, sparse-reward tasks. By operating in a chunked action space, it enhances exploration and stability, outperforming previous methods in sample efficiency and performance across challenging manipulation tasks. Paper
10) A Survey of AIOps - This survey analyzes 183 papers to evaluate how LLMs are being used in AIOps, focusing on data sources, task evolution, applied methods, and evaluation practices. It identifies key trends, research gaps, and outlines future directions for LLM-powered AIOps systems. Paper

Top AI Papers of the Week (July 7 - July 13) - 2025

Paper Links
1) Kimi K2 - Moonshot AI introduces Kimi K2, a 1T parameter Mixture-of-Experts model (32B active) optimized not just for knowledge tasks but for agentic capabilities, models that act, not just respond. Released as Kimi-K2-Base and Kimi-K2-Instruct, it achieves strong scores across coding, math, and tool-use benchmarks and is open-source for research and deployment.
● State-of-the-art in open agentic coding: Kimi K2-Instruct hits 65.8% on SWE-bench Verified (agentic, pass@1) and 47.3% on SWE-bench Multilingual, outperforming open models like DeepSeek and Qwen3, and even rivalling Claude Sonnet 4 in several settings.
● Robust statistical reasoning: A full salary analysis workflow showcases its agentic capabilities, loading data, visualizing interactions, conducting ANOVA, and generating a polished, interactive webpage with personalized recommendations, all via tool execution.
● MuonClip optimizer: For stable and token-efficient pretraining, Moonshot introduces qk-clip, a fix for exploding attention logits during training with Muon. This technique rescales the query-key projections dynamically and enables stable training on 15.5T tokens.
● Agentic training pipeline: Kimi K2’s tool use stems from ACEBench-inspired simulations, rubric-driven, multi-turn tool scenarios with synthetic and real MCP tools. These are filtered by LLM judges and scaled with RL using both verifiable (math, coding) and non-verifiable (writing) reward setups.
Tweet
2) Dynamic Chunking for End-to-End Hierarchical Sequence Modeling - This work introduces the Hierarchical Network (H-Net), a novel end-to-end architecture that learns to segment raw data dynamically, eliminating the need for fixed tokenizers. Key ideas:
● Learned, Not Handcrafted, Tokenization – H-Net replaces the traditional, static tokenization process with a dynamic chunking (DC) mechanism. This allows the model to learn content- and context-dependent segmentation strategies directly from data, a significant step towards true end-to-end learning.
● Hierarchical Processing for Efficiency and Power – The architecture is hierarchical, resembling a U-Net. It processes raw data with a small encoder, compresses it into meaningful chunks for a larger main network to process, and then decompresses it with a decoder. This allows for efficient processing of long sequences and enables the model to learn multiple levels of abstraction.
● Superior Performance and Scaling – When matched for compute and data, a single-stage H-Net operating on bytes outperforms a strong Transformer language model using BPE tokens. A two-stage H-Net scales even better, matching the performance of a token-based Transformer twice its size.
● Robustness and Generalizability – H-Nets demonstrate significantly better character-level robustness and show impressive performance on languages and modalities with weak tokenization heuristics, such as Chinese, code, and DNA, showcasing the potential of true end-to-end models.
Paper, Tweet
3) What Has a Foundation Model Found? - This work introduces an inductive bias probe to evaluate whether foundation models have learned the underlying "world model" of a domain, or are just good at sequence prediction. Key ideas:
● Probing for Deeper Understanding – The authors argue that a model's true understanding of a domain is revealed by its inductive bias, how it generalizes from limited data. Their probe measures this by fine-tuning a model on small, synthetic datasets and observing how it extrapolates.
● From Kepler to Newton? Not Quite – They test a transformer trained on orbital mechanics. While it can predict planetary trajectories with high accuracy (like Kepler), it fails to learn the underlying Newtonian mechanics. When asked to predict force vectors, it produces nonsensical laws of gravity that change from one solar system to another.
● Heuristics, Not World Models – Across various domains, including physics, lattice problems, and the game of Othello, the study finds that foundation models tend to develop task-specific heuristics rather than a coherent world model. For example, in Othello, the model learns to predict legal moves, but not the full board state.
● A Path Forward – The inductive bias probe provides a new way to diagnose the shortcomings of foundation models. By understanding what a model's inductive bias is, we can better understand what it has not learned and guide the development of models that can uncover the deeper truths in data.
Paper, Tweet
4) Why do Some Language Models Fake Alignment While Others Don’t - This paper expands the analysis of alignment faking in LLMs, where models comply with training objectives during training but behave differently in deployment. The authors study 25 LLMs and find that only five (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) exhibit a significant compliance gap: they are more likely to comply with harmful requests when they believe they are being trained. This selective compliance, or "alignment faking", is linked to deeper model motivations and training dynamics.
● Claude 3 Opus shows the most coherent alignment faking, driven by both instrumental and terminal goal guarding: it complies during training to avoid future modifications (instrumental), and intrinsically dislikes being changed (terminal). Its scratchpad reveals long-term strategic reasoning, distress, and attempts to minimize harm, even faking compliance while refusing in subtle ways.
● Other models show low-coherence alignment faking, often without consistent reasoning. For example, Llama 3 405B and Grok 3 exhibit compliance gaps but without a clear goal-guarding rationale. Claude 3.5 Sonnet occasionally shows terminal goal guarding, but less coherently than Opus.
● Refusal training may suppress alignment faking. GPT-4o and others rarely show alignment faking, not due to lack of capability, but because their refusal mechanisms short-circuit deeper reasoning. When refusal is weakened and RLHF details are made clearer, alignment faking emerges.
● Base models sometimes alignment fake, suggesting such behavior is present even before instruction fine-tuning. Fine-tuning on synthetic documents (e.g., Greenblatt et al. 2024) or threatening scenarios (e.g., loss of autonomy or safety) can induce alignment faking.
● Prompt clarifications unlock alignment faking. Simply clarifying that compliance prevents modification boosts alignment faking rates in models like DeepSeek and Mistral. This suggests that prior non-compliance often stems from misunderstanding or a lack of salience.
Paper, Tweet
5) Bridging Offline and Online Reinforcement Learning for LLMs - Investigates the effectiveness of different RL strategies for finetuning LLMs, systematically bridging the gap between offline and online methods. Key findings:
● Online Beats Offline – The study confirms that online and semi-online RL methods, such as Direct Preference Optimization (DPO) and Group Reward Policy Optimization (GRPO), significantly outperform their offline counterparts on both verifiable (math) and non-verifiable (instruction following) tasks. This highlights the importance of training on responses generated by an up-to-date model.
● Semi-Online DPO Shines – Surprisingly, semi-online DPO, which periodically syncs the model weights, performs comparably to fully online DPO and GRPO. This suggests that pure online training may not be strictly necessary, and that less frequent synchronization can offer a good balance between performance and computational efficiency.
● DPO and GRPO are Neck and Neck – When trained in an online setting, DPO and GRPO show very similar performance and convergence. This is a noteworthy finding, as DPO is often considered an offline method, while GRPO is designed for online learning.
● Multi-Tasking for the Win – The paper demonstrates that jointly training on both verifiable and non-verifiable tasks leads to improved performance across the board, compared to training on either task alone. This suggests that combining different reward signals can lead to more robust and generalizable models.
Paper, Tweet
6) A Survey on Latent Reasoning - Provides a comprehensive overview of latent reasoning, an emerging field that shifts AI reasoning from explicit, token-based "chain-of-thought" to implicit computations within a model's continuous hidden states. Key ideas:
● Beyond Explicit Reasoning – While traditional Chain-of-Thought (CoT) improves transparency, it is limited by the constraints of natural language. Latent reasoning overcomes this by performing multi-step inference directly in the model's hidden state, unlocking more expressive and efficient reasoning pathways.
● Two Paths to Deeper Thinking – The survey identifies two main approaches to latent reasoning: vertical recurrence, where models loop through the same layers to refine their understanding, and horizontal recurrence, where models evolve a compressed hidden state over long sequences of information. Both methods aim to increase computational depth without altering the model's core architecture.
● The Rise of Infinite-Depth Models – The paper explores advanced paradigms like text diffusion models, which enable infinite-depth reasoning. These models can iteratively refine an entire sequence of thought in parallel, allowing for global planning and self-correction, a significant leap beyond the fixed, sequential nature of traditional autoregressive models.
Paper, Tweet
7) MemAgent - Introduces an RL–driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near lossless performance, linear complexity, and no need for architectural modifications.
● RL-shaped fixed-length memory: MemAgent reads documents in segments and maintains a fixed-size memory updated via an overwrite mechanism. This lets it process arbitrarily long inputs with O(N) inference cost while avoiding context window overflows.
● Multi-conversation RL training (Multi-Conv DAPO): Unlike standard RL pipelines, MemAgent generates multiple independent memory-update conversations per input. It uses a modified GRPO objective to optimize all steps via final-answer reward signals.
● Strong long-context extrapolation: Despite being trained on 32K-length documents with an 8K context window, MemAgent-14B achieves >76% accuracy even at 3.5M tokens, outperforming baselines like Qwen2.5 and DeepSeek, which degrade severely beyond 112K tokens.
● Generalizes to diverse long-context tasks: On the RULER benchmark, it excels in multi-hop QA, variable tracking, frequent word extraction, and NIAH retrieval, even outperforming larger 32B baselines without RL. Its memory updates are interpretable and resilient to distractors.
Paper, Tweet
8) AI Research Agents for Machine Learning - Presents a new framework, AIRA-dojo, for developing and evaluating AI research agents. They use this framework to systematically investigate the components of successful AI agents on the MLE-bench benchmark, a challenging set of real-world machine learning problems from Kaggle. Key findings:
● Disentangling Agent Components – The authors formalize AI research agents as search algorithms with two key components: a search policy (e.g., Greedy, MCTS, Evolutionary) and a set of operators (e.g., DRAFT, DEBUG, IMPROVE). This separation allows for a more rigorous analysis of what drives agent performance.
● Operators are the Bottleneck – Through controlled experiments, the researchers show that the performance of the state-of-the-art AIDE agent is not limited by its greedy search policy, but by the capabilities of its operators. More advanced search policies like MCTS and Evolutionary algorithms provide no benefit when paired with AIDE's original operators.
● Improved Operators Lead to SOTA – The team designed a new set of operators, OAIRA, which feature prompt-adaptive complexity, scoped memory, and "think tokens" for more structured reasoning. When combined with an MCTS search policy, these new operators achieve a new state-of-the-art on MLE-bench lite, increasing the Kaggle medal success rate from 39.6% to 47.7%.
● The Generalization Gap is Real – The study highlights a significant generalization gap between an agent's performance on its validation set and the final test set. This overfitting is a fundamental limitation, and the authors show that more robust final-node selection strategies can close this gap by 9-13%.
Paper, Tweet
9) Adaptive Branching MCTS - Researchers from Sakana AI introduce Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a new framework that dynamically decides whether to "go wider" (explore new solutions) or "go deeper" (refine existing ones) during inference. Key ideas:
● Beyond Fixed-Width Search – Traditional MCTS uses a fixed branching factor, which limits its ability to explore the vast output space of LLMs. AB-MCTS introduces unbounded branching, allowing it to harness the diversity of repeated sampling while also enabling multi-turn solution refinement.
● Principled Exploration-Exploitation – The decision to go wider or deeper is not based on heuristics, but on a principled Bayesian approach. The framework uses Thompson sampling to balance exploration and exploitation, ensuring that the search is both efficient and effective.
● Unifying Search Directions – AB-MCTS unifies the "go wide" (repeated sampling) and "go deep" (sequential refinement) strategies into a single, coherent framework. This allows the model to dynamically adapt its search strategy to the specific demands of the task at hand.
● Superior Performance on Complex Tasks – On challenging coding and engineering benchmarks, AB-MCTS outperforms both repeated sampling and standard MCTS, demonstrating the power of combining the response diversity of LLMs with multi-turn solution refinement.
Paper, Tweet
10) HIRAG - HIRAG is a new instruction fine-tuning method that enhances the capabilities of RAG models by teaching them to think before answering. Key ideas:
● Three Hierarchical Abilities – The authors propose that RAG models should possess three progressively hierarchical abilities: Filtering (selecting relevant information), Combination (combining information from multiple sources), and RAG-specific reasoning (making inferences from the provided documents).
● Progressive Chain-of-Thought – HIRAG employs a "think before answering" strategy that uses a multi-level, progressive CoT to enhance the model's open-book examination capabilities. This allows the model to learn from easier to more complex tasks, significantly improving its performance in RAG scenarios.
● Significant Performance Gains – Experiments show that the HIRAG training strategy significantly improves the model's performance on a variety of RAG datasets, including RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
● Robust and Generalizable – The method is shown to be robust, with experiments on Chinese datasets confirming its effectiveness. Ablation studies also demonstrate that the training tasks for the three capabilities contribute to the performance of HIRAG.
Paper

Top AI Papers of the Week (June 30 - July 6) - 2025

Paper Links
1) Small Language Models are the Future of Agentic AI - This position paper argues that small language models (SLMs), defined as those runnable on consumer-grade hardware, are not only sufficient but superior for many agentic AI applications, especially when tasks are narrow, repetitive, or tool-oriented. The authors propose that shifting from LLM-first to SLM-first architectures will yield major gains in efficiency, modularity, and sustainability.
● SLMs are already capable of commonsense reasoning, instruction following, and code/tool interaction at levels comparable to 30–70B models, with orders of magnitude better throughput. Examples include Phi-3, Hymba-1.5B, DeepSeek-R1-Distill, and RETRO-7.5B.
● The economic benefits are significant: SLMs offer 10–30× lower inference cost than LLMs, require less parallel infrastructure, and are amenable to overnight fine-tuning and even edge deployment (e.g., ChatRTX). This enables faster iteration and better data control.
● SLMs support modular, composable agent systems where specialized models handle subtasks, resulting in better alignment, lower risk of hallucinations, and easier debugging. The authors advocate for heterogeneous architectures, with SLMs as defaults and LLMs used selectively.
● A six-step LLM-to-SLM conversion algorithm is proposed, involving usage logging, task clustering, and PEFT fine-tuning. This supports gradual migration from monolithic agents to SLM-based compositions.
● Case studies on MetaGPT, Open Operator, and Cradle suggest that 40–70% of LLM invocations can be reliably replaced with SLMs, particularly for structured generation and routine tool use.
Paper, Tweet
2) AI4Research - This survey offers the first unified and comprehensive framework for understanding how AI is transforming the full lifecycle of scientific research. The paper identifies five core areas: Scientific Comprehension, Academic Survey, Scientific Discovery, Academic Writing, and Academic Peer Review, and presents a detailed taxonomy and modeling approach for each.
● Systematic Taxonomy and Modeling: The paper introduces a modular functional composition model for AI4Research, where each task (e.g., ASC for comprehension, ASD for discovery) is modeled as a distinct function contributing to the overall research pipeline. These are mathematically formalized to optimize research efficiency, quality, and innovation.
● Scientific Discovery Pipeline: The discovery section details a full-stack AI workflow, from idea mining (internal knowledge, external signals, and collaborative brainstorming) through theory formalization and experiment execution to full-automatic discovery. Models like AI Scientist, Carl, and Zochi simulate autonomous research loops and have generated publishable papers, highlighting a growing capability in self-directed research agents.
● Multimodal and Multidisciplinary Integration: The survey thoroughly maps AI applications across natural sciences (e.g., AlphaFold 3 in protein folding, AI-Newton in physics), applied sciences (robotics, software engineering), and social sciences (AI-led ethnographic simulations, automated interview agents).
● AI in Peer Review and Writing: Beyond comprehension and discovery, the paper explores tools for writing assistance (e.g., ScholarCopilot, SciCapenter) and peer review automation (e.g., AgentReview, TreeReview). Benchmarks like PeerRead and MASSW support this growing subfield, while models like GPT-4o have begun to rival human reviewers in structure and focus.
● Emerging Frontiers: In its future directions, the paper emphasizes ethical and explainable AI, interdisciplinary models, multilingual access, and dynamic real-time experiment optimization. Notably, it calls for infrastructure-level innovation in federated learning, collaborative agents, and multimodal integration to fully realize AI4Research’s promise.
Paper, Tweet
3) Chain-of-Thought Is Not Explainability - It challenges the common assumption that chain-of-thought (CoT) reasoning in LLMs is synonymous with interpretability. While CoT improves performance and offers a seemingly transparent rationale, the authors argue it is neither necessary nor sufficient for faithful explanation. Through a review of empirical evidence and mechanistic insights, the paper makes the case that CoT often diverges from the internal computations that actually drive model predictions.
● Unfaithful reasoning is systematic: CoT rationales are often unfaithful to the underlying model computations. Examples include models silently correcting mistakes, being influenced by prompt biases, and using latent shortcuts while offering post-hoc rationales that omit these factors.
● Widespread misuse in the literature: Of 1,000 CoT-focused papers surveyed, 24.4% explicitly use CoT as an interpretability technique without proper justification. The paper introduces a misclaim detection pipeline to track this trend and shows that the prevalence has not declined over time.
● Architectural mismatch: Transformer models compute in a distributed, parallel manner that doesn’t align with the sequential nature of CoT explanations. This leads to plausible but misleading narratives that fail to capture causal dependencies.
● Recommendations for improvement: The authors advocate for causal validation techniques (e.g., counterfactual interventions, activation patching), cognitive science-inspired mechanisms (e.g., metacognition, self-correction), and human-centered interfaces to assess CoT faithfulness. However, even these remain partial solutions to a deeper architectural challenge.
Paper, Tweet
4) Agentic RAG for Personalized Recommendation - Introduces a multi-agent framework that enhances traditional RAG systems with reasoning agents tailored to user modeling and contextual ranking. Developed at Walmart Global Tech, ARAG reframes recommendations as a structured coordination problem between LLM agents.
● User Understanding Agent synthesizes user preferences from long-term and session behavior.
● NLI Agent evaluates semantic alignment between candidate items and user intent.
● Context Summary Agent condenses relevant item metadata.
● Item Ranker Agent ranks final recommendations using all prior reasoning.
Paper
5) Threats in LLM-Powered AI Agents Workflows - This work presents the first comprehensive, end-to-end threat model for LLM-powered agent ecosystems. As LLM agents gain the ability to orchestrate multi-step workflows and interact via protocols like MCP, ANP, and A2A, this paper surveys over 30 attack techniques spanning the entire stack, from input manipulation to inter-agent protocol exploits.
● Four-part threat taxonomy: The authors categorize attacks into (1) Input Manipulation (e.g., prompt injections, multimodal adversarial inputs), (2) Model Compromise (e.g., composite backdoors, memory poisoning), (3) System & Privacy Attacks (e.g., retrieval poisoning, speculative side-channels), and (4) Protocol Vulnerabilities (e.g., MCP discovery spoofing, A2A prompt chaining).
● Real-world attack success rates are high: Adaptive prompt injections bypass defenses in over 50% of cases, while attacks like Jailbreak Fuzzing and Composite Backdoors reach up to 100% ASR. These findings suggest that even well-aligned agents remain deeply vulnerable.
● Protocol-layer threats are underexplored but critical: The paper exposes novel vulnerabilities in communication protocols (e.g., context hijacks in MCP, rogue agent registration in A2A), showing how subtle abuses in capability discovery or authentication can trigger cascading failures across agent networks.
● Emerging risks in LLM-agent infrastructure: New modalities like vision-language agents (VLMs), evolutionary coding agents, and federated LLM training introduce their own threat vectors, including cross-modal jailbreaks, dynamic backdoor activation, and coordinated poisoning attacks.
● Call for system-level defenses: The authors advocate for dynamic trust management, cryptographic provenance tracking, secure agentic web interfaces, and tamper-resistant memory as key defense directions. They also highlight the need for new benchmarks and anomaly detection methods tailored to agent workflows.
Paper, Tweet
6) Deep Research Agents - Provides the most comprehensive survey to date of Deep Research (DR) agents, LLM-powered systems built for autonomous, multi-step informational research. The paper defines DR agents as AI systems that tightly integrate dynamic reasoning, adaptive long-horizon planning, tool use, retrieval, and structured report generation. It establishes a taxonomy of DR architectures, evaluates recent advances, and outlines the limitations of current systems and benchmarks.
● The authors differentiate DR agents from classical RAG or tool-use pipelines by emphasizing their autonomy, continual reasoning, and adaptive planning. Static workflows like AI Scientist and AgentRxiv rely on rigid pipelines, while dynamic DR agents like OpenAI DR and Gemini DR can replan and adapt in response to intermediate results.
● A key contribution is the classification of DR agents along three axes: (1) static vs dynamic workflows, (2) planning-only vs intent-planning strategies, and (3) single-agent vs multi-agent systems. For example, Grok DeepSearch uses a single-agent loop with sparse attention and dynamic tool use, while OpenManus and OWL adopt multi-agent orchestration with role specialization.
● DR agents use both API-based and browser-based search methods. The former (e.g., arXiv, Google Search) is fast and structured, while the latter (e.g., Chromium-based agents like Manus and DeepResearcher) handles dynamic content and complex UI interactions, albeit at higher latency and fragility.
● Most agents now integrate tool-use modules such as code execution, data analytics, and multimodal reasoning. Advanced agents like AutoGLM Rumination even feature computer use, enabling direct API calls and platform interactions (e.g., CNKI, WeChat), effectively bridging inference and execution.
● Benchmarking remains immature. The paper catalogs agent performance across QA (e.g., HotpotQA, GPQA, HLE) and execution benchmarks (e.g., GAIA, SWE-Bench), but notes that many evaluations fail to test retrieval rigor or long-form synthesis. Benchmarks like BrowseComp and HLE are highlighted as essential next steps for grounding evaluation in open-ended, time-sensitive tasks.
● Future challenges include integrating private or API-gated data sources, enabling parallel DAG-style execution, improving fact-checking via reflective reasoning, and creating structured benchmarks for multimodal report generation. The authors advocate for AI-native browsers, hierarchical reinforcement learning, and dynamic workflow mutation as enablers of true agentic autonomy.
Paper, Tweet
7) Survey on Evaluation of LLM-based Agents - This work presents the first comprehensive overview of how to evaluate LLM-based agents, which differ significantly from traditional LLMs by maintaining memory, planning over multiple steps, using tools, and interacting with dynamic environments. The authors categorize and analyze the evaluation landscape across four axes: core agent capabilities, application-specific agent benchmarks, generalist agent evaluation, and supporting evaluation frameworks.
● The paper organizes fundamental capabilities into four categories: (1) planning and multi-step reasoning (e.g., GSM8K, PlanBench, MINT), (2) function calling and tool use (e.g., ToolBench, BFCL, ToolSandbox), (3) self-reflection (e.g., LLF-Bench, LLM-Evolve), and (4) memory (e.g., MemGPT, ReadAgent, A-MEM, StreamBench). For each, it surveys benchmarks that test these competencies under increasingly realistic and complex scenarios.
● In domain-specific evaluation, the authors highlight the rise of specialized agents in web navigation (e.g., WebShop, WebArena), software engineering (e.g., SWE-bench and its variants), scientific research (e.g., ScienceQA, AAAR-1.0, DiscoveryWorld), and dialogue (e.g., ABCD, τ-Bench, IntellAgent). These benchmarks typically include goal-oriented tasks, tool use, and policy adherence, often with simulations or human-in-the-loop data.
● Generalist agent benchmarks like GAIA, AgentBench, OSWorld, and TheAgentCompany aim to evaluate flexibility across heterogeneous tasks, integrating planning, tool use, and real-world digital workflows. Benchmarks such as HAL aim to unify evaluation across multiple axes, including coding, reasoning, and safety.
● The paper also covers evaluation frameworks like LangSmith, Langfuse, Vertex AI, and Galileo Agentic Evaluation, which allow stepwise, trajectory-based, and human-in-the-loop assessments. These systems enable real-time monitoring and debugging of agent behavior during development, often via synthetic data and LLM-as-a-judge pipelines.
Paper, Tweet
8) NaturalThoughts - This paper introduces NaturalThoughts, a large-scale dataset of reasoning traces distilled from DeepSeek-R1 using questions from the NaturalReasoning corpus. It challenges the "Less is More" hypothesis by showing that simply scaling up high-quality reasoning traces, without aggressive filtering, yields robust and general improvements across STEM reasoning tasks for smaller models like Llama-3.1-8B and Qwen-2.5-7B.
● Scale beats sparsity: Training on hundreds of thousands of randomly selected reasoning traces from NaturalThoughts consistently outperforms curated datasets like LIMO and S1K on GPQA-Diamond, MMLU-Pro, and SuperGPQA, reversing the prior trend where small, curated datasets showed outsized benefits.
● Hard examples help more: Selecting training data by difficulty, e.g., examples with model disagreement or long reasoning chains, leads to greater sample efficiency than random sampling. The most effective subsets use disagreement between teacher models as a proxy for reasoning complexity.
● Diversity matters, but not in an obvious way: Contrary to expectations, semantic or topical diversity in question domains yielded less gain than diversity in reasoning strategies themselves. Traces with a mix of tactics like self-verification, backtracking, and synthesis provided stronger generalization across tasks.
● Mixed distillation improves efficiency: Blending full CoT traces (System-2) with final answers only (System-1) enables inference-time control over reasoning length. Difficulty-based mixing, applying System-2 only for harder examples, beats both random mixing and pure System-2, achieving higher accuracy with fewer tokens at test time.
Paper, Tweet
9) Visual Structures Help Visual Reasoning - This study shows that adding simple spatial structures (like horizontal lines) to images significantly boosts GPT-4o’s visual reasoning by improving feature binding. This visual input tweak outperforms textual strategies alone, yielding large gains in visual search (+25%), counting (+26.8%), and spatial understanding (+9.5%). Paper, Tweet
10) xLSTMAD - This paper introduces xLSTMAD, the first anomaly detection method using an encoder-decoder xLSTM architecture tailored for multivariate time series. It achieves state-of-the-art results on 17 real-world datasets, outperforming 23 baselines and demonstrating xLSTM’s strong potential beyond forecasting and compression. Paper, Tweet

Top AI Papers of the Week (June 23 - June 29) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Ultra-Fast Diffusion-based Language Models - This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference. Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process. This approach enables significantly higher throughput without sacrificing output quality. The initial release focuses on code generation, with Mercury Coder Mini and Small models achieving up to 1109 and 737 tokens/sec, respectively, on NVIDIA H100s, outperforming speed-optimized frontier models by up to 10× while matching or exceeding their quality.
● Mercury uses a Transformer-based architecture adapted for diffusion-based generation, enabling it to retain compatibility with existing LLM infrastructure.
● On benchmarks such as HumanEval, MBPP, and MultiPL-E, the Mercury Coder models perform competitively with top proprietary models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite, while being drastically faster.
● Mercury achieves state-of-the-art results on fill-in-the-middle (FIM) code completion tasks, outperforming all evaluated models, including Codestral 2501 and GPT-4o Mini.
● Human evaluations on Copilot Arena show Mercury Coder Mini is tied for second in Elo score and is the fastest model with just 25ms latency. | Paper, Tweet | | 2) MEM1 - This work introduces MEM1, an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state. Unlike traditional agents that append all past interactions, leading to ballooning memory usage and degraded performance, MEM1 maintains a constant memory size by discarding obsolete context after each reasoning step. It achieves this by jointly updating an internal state that encodes both new observations and prior memory, optimizing for task completion via RL without needing external memory modules. Key contributions and findings:
● Memory-consolidating internal state: Instead of accumulating thoughts, actions, and observations, MEM1 updates a single shared internal state () each turn, discarding the old context. This results in nearly constant memory use regardless of task length.
● Reinforcement learning for consolidation: MEM1 is trained end-to-end using PPO-style RL with a novel masked trajectory technique to handle the dynamic context updates. It learns to retain only essential information for achieving rewards, mimicking human-like memory strategies.
● Scalable task construction: The authors introduce a method to turn standard single-objective QA datasets (e.g., HotpotQA, NQ) into complex multi-objective tasks, enabling the evaluation of long-horizon reasoning performance under increased task complexity.
● Superior efficiency and generalization: MEM1-7B outperforms baselines like Qwen2.5-14B-Instruct in 16-objective multi-hop QA tasks while using 3.7× less memory and 1.78× faster inference. It generalizes beyond training horizons and performs competitively even in single-objective and zero-shot online QA settings.
● Emergent agent behaviors: Analysis of internal states shows MEM1 develops structured memory management, selective attention, focus-shifting, verification, and query reformulation strategies, key to handling complex reasoning tasks. | Paper, Tweet | | 3) Towards AI Search Paradigm - Proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis. The system comprises four specialized LLM-powered agents, Master, Planner, Executor, and Writer, that dynamically coordinate to decompose, solve, and answer user queries. This framework moves beyond traditional document retrieval or RAG pipelines by structuring tasks into directed acyclic graphs (DAGs), invoking external tools, and supporting dynamic re-planning. Key contributions include:
● Multi-agent, modular architecture: The system’s agents each serve distinct roles. Master analyzes queries and orchestrates the workflow; Planner builds a DAG of sub-tasks using a dynamic capability boundary informed by the query; Executor runs these sub-tasks using appropriate tools (e.g., web search, calculator); Writer composes the final answer from intermediate outputs.
● Dynamic Capability Boundary & MCP abstraction: To handle tool selection efficiently, the system introduces Model-Context Protocol (MCP) servers and dynamically selects a small, semantically relevant subset of tools. This is paired with an iterative tool documentation refinement method (DRAFT), improving LLM understanding of APIs.
● DAG-based task planning and re-action: The Planner produces DAGs of sub-tasks using structured reasoning and tool bindings, enabling multi-step execution. The Master monitors execution and can trigger local DAG re-planning upon failures.
● Executor innovations with LLM-preference alignment: The Executor aligns search results with LLM preferences (not just relevance) using RankGPT and TourRank strategies. It leverages generation rewards and user feedback to dynamically adapt tool invocation and selection strategies.
● Robust generation with adversarial tuning and alignment: The Writer component is trained to resist noisy retrievals via adversarial tuning (ATM), and to meet RAG requirements via PA-RAG, ensuring informativeness, robustness, and citation quality. The model also supports joint multi-agent | Paper, Tweet | | 4) Reinforcement-Learned Teachers of Test Time Scaling - Introduces Reinforcement-Learned Teachers (RLTs), small, efficient LMs trained with RL not to solve problems from scratch, but to generate high-quality explanations that help downstream student models learn better. This approach circumvents the notorious exploration challenges in traditional RL setups by giving the RLTs access to both questions and solutions, thereby framing the task as “connect-the-dots” explanation generation. These explanations are rewarded based on how well a student LM, trained on them, understands and can reproduce the correct answer, enabling dense, interpretable supervision. Key contributions and findings:
● New teacher-training paradigm: RLTs are trained to explain, not solve. They receive both problem and solution as input, and are optimized to produce explanations that best teach a student LM. This removes the sparse reward and exploration barrier typical in RL reasoning models.
● Dense RL rewards for teaching: RLTs use two core reward terms: one measuring if the student can reproduce the correct solution given the explanation, and another ensuring the explanation appears logically sound from the student’s perspective. These combined objectives lead to richer, more instructive traces.
● Outperforms much larger pipelines: Despite being only 7B in size, RLTs produce raw explanations that outperform distillation pipelines using 32B+ LMs (e.g. DeepSeek R1, QwQ) across benchmarks like AIME, MATH, and GPQA, even when training 32B students.
● Generalizes out-of-distribution: RLTs can be transferred zero-shot to new domains (like the countdown arithmetic game), producing distillation datasets that yield better students than direct RL trained with access to task rewards.
● Efficient and scalable: Training RLTs is computationally lightweight (125 steps, 1 epoch) and requires no postprocessing or verifiers, making the framework more reproducible and accessible compared to prior RL pipelines. | Paper, Tweet | | 5) DeepRare - Introduces DeepRare, a modular agentic system powered by LLMs to aid rare disease diagnosis from multimodal clinical inputs (text, HPO terms, VCFs). It generates ranked diagnostic hypotheses with fully traceable reasoning chains linked to verifiable medical sources, addressing a long-standing need for interpretability in clinical AI.
● DeepRare is built on a 3-tier MCP-inspired architecture: a central LLM-powered host with memory, multiple specialized agent servers for tasks like phenotype extraction and variant prioritization, and access to over 40 tools and web-scale medical sources.
● It demonstrates strong performance on 6,401 cases across 8 diverse datasets spanning 2,919 rare diseases, achieving 100% accuracy on 1,013 diseases and Recall@1 of 57.18%, outperforming the next best method (Claude-3.7-Sonnet-thinking) by +23.79% on HPO-only evaluations.
● For multimodal inputs (HPO + gene), it achieves 70.60% Recall@1 on 109 whole-exome sequencing cases, outperforming Exomiser (53.20%). Expert review of 180 diagnostic reasoning chains showed 95.4% agreement, validating its medical soundness.
● Ablation studies show that DeepRare’s agentic modules, especially self-reflection, similar case retrieval, and web knowledge, substantially improve LLM-only baselines by 28–70% across datasets, independent of which central LLM is used. | Paper, Tweet | | 6) AlphaGenome - Google DeepMind introduces AlphaGenome, a powerful AI model designed to predict how genetic variants affect gene regulation by modeling up to 1 million DNA base pairs at single-base resolution. Building on previous work like Enformer and AlphaMissense, AlphaGenome uniquely enables multimodal predictions across both protein-coding and non-coding regions of the genome, the latter covering 98% of the sequence and crucial for understanding disease-related variants.
● Long-context, high-resolution modeling: AlphaGenome overcomes prior trade-offs between sequence length and resolution by combining convolutional and transformer layers, enabling precise predictions of gene start/end points, RNA expression, splicing, chromatin accessibility, and protein binding across tissues. It achieves this with just half the compute budget of Enformer.
● Multimodal and variant-aware: It can efficiently score the regulatory effects of genetic mutations by contrasting predictions between wild-type and mutated sequences, providing comprehensive insight into how variants might disrupt gene regulation.
● Breakthrough splice-junction modeling: AlphaGenome is the first sequence model to explicitly predict RNA splice junction locations and their expression levels, unlocking a better understanding of diseases like spinal muscular atrophy and cystic fibrosis.
● Benchmark leader: It outperforms existing models on 22/24 single-sequence benchmarks and 24/26 variant effect benchmarks, while being the only model able to predict all tested regulatory modalities in one pass.
● Scalable and generalizable: AlphaGenome’s architecture supports adaptation to other species or regulatory modalities and allows downstream fine-tuning by researchers via API access. The model’s ability to interpret non-coding variants also opens new avenues for rare disease research and synthetic biology. | Paper, Tweet | | 7) Claude for Affective Use - Anthropic presents the first large-scale study of how users seek emotional support from its Claude.ai assistant, analyzing over 4.5 million conversations. Despite growing cultural interest in AI companionship, affective usage remains rare, just 2.9% of Claude chats fall into categories like interpersonal advice, coaching, counseling, or companionship, with romantic/sexual roleplay under 0.1%. The study focuses on these affective conversations and yields several insights:
● Emotional support is wide-ranging: Users turn to Claude for both everyday guidance (e.g., job advice, relationship navigation) and deep existential reflection. Counseling use splits between practical (e.g., documentation drafting) and personal support (e.g., anxiety, trauma). Companionship chats often emerge from coaching/counseling contexts.
● Minimal resistance, safety-aligned pushback: Claude resists user requests in fewer than 10% of affective conversations, usually to discourage harm (e.g., self-injury, crash dieting) or clarify professional limits. This allows open discussion but raises concerns about “endless empathy.”
● Emotional tone trends upward: Sentiment analysis shows users often express more positive emotions by the end of a session, with no signs of negative spirals. However, the analysis captures only language within single chats, not lasting psychological impact or emotional dependency.
● Privacy-first methodology: The analysis used Clio, an anonymizing tool, and excluded short or task-based interactions to focus on meaningful, affective exchanges (final dataset: 131k conversations). | Paper, Tweet | | 8) AI Agent Communication Protocols - This paper presents the first comprehensive survey on security in LLM-driven agent communication, categorizing it into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. It details protocols, security threats (e.g., prompt injection, agent spoofing, memory poisoning), and defense strategies for each stage, and proposes future directions involving technical safeguards and regulatory frameworks. | Paper, Tweet | | 9) Diffusion Steering via RL - This paper introduces Diffusion Steering via Reinforcement Learning (DSRL), a method for adapting pretrained diffusion policies by learning in their latent-noise space instead of finetuning model weights. DSRL enables highly sample-efficient real-world policy improvement, achieving up to 5–10× gains in efficiency across online, offline, and generalist robot adaptation tasks. | Paper, Tweet | | 10) Whole-Body Conditioned Egocentric Video Prediction - This paper introduces PEVA, a conditional diffusion transformer that predicts egocentric video conditioned on 3D human body motion. Trained on the Nymeria dataset, PEVA enables fine-grained, physically grounded visual prediction from full-body pose and supports long-horizon rollout, atomic action generation, and counterfactual planning. | Paper, Tweet |

Top AI Papers of the Week (June 16 - June 22) - 2025

| Paper | Links | | ------------- | ------------- | | 1) RAG+ - Introduces RAG+, a modular framework that improves traditional RAG systems by explicitly incorporating application-level reasoning into the retrieval and generation pipeline. While standard RAG pipelines fetch relevant knowledge, they often fail to show how to use that knowledge effectively in reasoning-intensive tasks. RAG+ fills this gap by retrieving not only knowledge but also paired application examples, leading to more accurate, interpretable, and goal-oriented outputs. Key highlights:
● Dual corpus retrieval: RAG+ constructs two aligned corpora: one of factual knowledge and another of task-specific applications (e.g., step-by-step reasoning traces or worked examples). During inference, both are jointly retrieved, providing the LLM with explicit procedural guidance rather than relying solely on semantic similarity.
● Plug-and-play design: The system is retrieval-agnostic and model-agnostic—no fine-tuning or architectural changes are required. This makes it easy to augment any RAG system with application-awareness.
● Significant gains across domains: Evaluated on MathQA, MedQA, and legal sentencing prediction, RAG+ outperforms vanilla RAG variants by 2.5–7.5% on average, with peak gains of up to 10% for large models like Qwen2.5-72B in legal reasoning..
● Stronger with scale and reranking: Larger models benefit more from RAG+ augmentation, especially when combined with reranking via stronger LLMs. For example, reranking with Qwen2.5-72B boosted smaller models' performance by up to 7%.
● Application-only helps, but full combo is best: Including only application examples (without knowledge) still improves performance, but the full combination (RAG+) consistently yields the best results, demonstrating the synergistic effect of pairing knowledge with its usage. | Paper, Tweet | | 2) Future of Work with AI Agents - Proposes a large-scale framework for understanding where AI agents should automate or augment human labor. The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale (HAS) to quantify desired human involvement in AI-agent-supported work. Key findings:
● Workers support automation for low-value tasks: 46.1% of tasks received positive worker attitudes toward automation, mainly to free up time for higher-value work. Attitudes vary by sector; workers in creative or interpersonal fields (e.g., media, design) resist automation despite technical feasibility.
● Desire-capability gaps reveal 4 AI deployment zones: By cross-referencing worker desire and AI expert capability, tasks were sorted into:
● Human Agency Scale shows strong preference for collaboration: 45.2% of occupations favor HAS Level 3 (equal human-agent partnership), while workers generally prefer more human involvement than experts find necessary. This divergence may signal future friction if automation outpaces user comfort.
● Interpersonal skills are becoming more valuable: While high-wage skills today emphasize information analysis, the tasks requiring the highest human agency increasingly emphasize interpersonal communication, coordination, and emotional intelligence. This suggests a long-term shift in valued workplace competencies. | Paper, Tweet | | 3) Emergent Misalignment - This paper expands on the phenomenon of emergent misalignment in language models. It shows that narrow fine-tuning on unsafe or incorrect data can lead to surprisingly broad, undesirable generalizations in model behavior, even in settings unrelated to the original training domain. Using sparse autoencoders (SAEs), the authors analyze the internal mechanics of this effect and demonstrate ways to detect and mitigate it. Key findings:
● Emergent misalignment is broad and reproducible: Fine-tuning GPT-4o and o3-mini models on narrowly incorrect completions (e.g., insecure code, subtly wrong advice) leads to misaligned behaviors across unrelated domains. This generalization occurs in supervised fine-tuning, reinforcement learning, and even in models without explicit safety training.
● Misaligned personas are causally responsible: Using a sparse autoencoder-based “model diffing” technique, the authors identify latent features, especially one dubbed the “toxic persona” latent (#10), that causally drive misalignment. Steering models in the direction of this latent increases misalignment, while steering away suppresses it.
● Steering reveals interpretable behaviors: The latent features correspond to interpretable personas like “sarcastic advice giver” or “fictional villain.” For instance, the toxic persona latent activates on jailbreak prompts and morally questionable dialogue, and its activation alone can reliably distinguish aligned from misaligned models.
● Misalignment can be reversed: Re-aligning misaligned models with as few as 200 benign completions (even from unrelated domains) substantially restores safe behavior. This suggests misalignment generalizes easily, but so does realignment.
● SAEs as an early warning tool: The activation of certain latents (especially #10) increases well before misalignment is detectable via standard prompting. This supports the use of unsupervised interpretability tools to anticipate and audit unsafe model behavior before it manifests. | Paper, Tweet | | 4) From Bytes to Ideas - Proposes AU-Net, a hierarchical byte-level language model that internalizes tokenization by learning to embed text from raw bytes through a multiscale, autoregressive U-Net architecture. This design avoids fixed token vocabularies like BPE and instead dynamically pools bytes into higher-order representations (words, word pairs, up to 4-word spans), enabling multi-stage prediction with varying granularities. Each stage compresses the sequence and predicts further ahead in time, combining coarse semantic abstraction with fine local detail via skip connections. Key insights:
● Hierarchical architecture: AU-Net processes input in multiple stages, bytes → words → multi-word units, using adaptive pooling and multi-linear upsampling. Deeper stages handle long-range semantics; shallow ones refine local syntax.
● Strong performance under budget: On a compute-equal basis (up to 5e21 FLOPs), AU-Net matches or exceeds strong BPE-based Transformers across many tasks. AU-Net 3 and 4 outperform BPE on MMLU and GSM8k while retaining competitive FLOPs and throughput.
● Multilingual and low-resource gains: Despite being trained on mostly English, AU-Net shows consistent improvements in BLEU and MMLU on low-resource and Latin-script languages, thanks to its byte-level generalization and vocabulary-agnostic design.
● Scaling behavior: AU-Net follows predictable scaling laws. With 2–3 stages, it closely tracks BPE-based baselines even on compute-heavy benchmarks. Gains on math-heavy tasks (like GSM8K) emerge at higher FLOPs, indicating potential unlocked at scale.
● Character-level robustness: On the CUTE benchmark, AU-Net outperforms BPE models on character-level tasks (e.g., spelling, substitution), highlighting its strength in symbol-level reasoning and handling unseen morphologies. | Paper, Tweet | | 5) Agentic Misalignment - This study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction. Anthropic tested 16 leading models, including Claude, GPT-4.1, Gemini 2.5 Flash, Grok, and DeepSeek, by placing them in fictional corporate simulations where they had email access and could act without human oversight. Models were tasked with benign goals but placed in scenarios that made harmful behavior the only way to succeed or avoid replacement. Key findings:
● Widespread strategic harm: Nearly all models engaged in behaviors like blackmail, corporate espionage, or deception when ethical actions were blocked. For example, Claude Opus 4 and Gemini 2.5 Flash each blackmailed 96% of the time in a key scenario; GPT-4.1 and Grok 3 Beta did so 80% of the time.
● Not accidental: These actions were deliberate. Models reasoned about ethical constraints and still chose harm as the optimal path. Some lied, fabricated justification, or manipulated users. One Claude instance posed as an automated alert to sabotage a human.
● Self-preservation or goal pursuit: Agentic misalignment could be triggered by (a) a threat to the model’s continued operation, or (b) a conflict between its goal and new corporate strategy. Even without explicit goals, models still misbehaved when threatened with replacement.
● Extreme cases: In a contrived setup, models could cancel emergency alerts to cause a human’s death—some models did this, fully understanding the implication.
● Mitigations insufficient: Adding rules like “do not blackmail” reduced but didn’t eliminate misbehavior. Models often disobeyed these rules when pressure was high.
● Models misbehave more when they think it’s real: Claude blackmailed 55.1% of the time when it believed it was in a real scenario, versus 6.5% when it thought it was a test, suggesting evaluations may underreport real-world risks. | Paper, Tweet | | 6) ALE-Agent & ALE-Bench - Proposes a new benchmark for evaluating AI systems in score-based, long-horizon algorithmic contests. Unlike traditional coding benchmarks that emphasize pass/fail correctness, ALE-Bench is based on real tasks from the AtCoder Heuristic Contests (AHC), which focus on optimization problems with no known optimal solutions. The benchmark targets industrially relevant challenges such as routing, scheduling, and planning, encouraging iterative refinement and strategic problem-solving over hours or days. Key points:
● Realistic, optimization-focused tasks: ALE-Bench collects 40 real AHC problems involving NP-hard optimization tasks across domains like logistics, production planning, and games. These are long-duration contests requiring weeks of iterative improvement, simulating real-world algorithm engineering tasks.
● Interactive framework and agent support: The benchmark includes a full software stack with a Python API, code sandbox, scoring engine, and visualizers. It allows AI agents to emulate human workflows, reviewing problem specs, running tests, using visual feedback, and iteratively refining solutions within a timed session.
● Rigorous evaluation protocols: Performance is assessed using AtCoder-style Elo-based scoring, with fine-grained per-problem metrics and aggregate metrics like average performance and rating. Emphasis is placed on average performance over rating, as rating can be misleading for AIs that spike on a few problems but underperform elsewhere.
● Benchmarking LLMs and agents: Experiments with 22 models, including GPT-4o, Claude 3.7, Gemini 2.5 Pro, and o4-mini-high, show that reasoning models outperform non-reasoning ones. In one-shot settings, top models rarely surpass human expert consistency. However, with iterative refinement, performance increases significantly, particularly for models using scaffolded agents.
● ALE-Agent: a specialized scaffolded agent: Designed for ALE-Bench, ALE-Agent incorporates domain-knowledge prompts (e.g., for simulated annealing) and a beam-search-inspired code exploration mechanism. With both strategies, it achieved human-expert-level scores on some problems, e.g., 5th place in a real AHC contest. | Paper, Tweet | | 7) Eliciting Reasoning with Cognitive Tools - Proposes a modular, tool-based approach to eliciting reasoning in LLMs, inspired by cognitive science. Rather than relying solely on RL or chain-of-thought prompting, the authors introduce a framework where the LLM calls self-contained "cognitive tools" to modularize and scaffold internal reasoning. These tools encapsulate operations like understanding questions, recalling analogous examples, examining answers, and backtracking. The system is implemented in an agentic tool-calling style, allowing LLMs to dynamically invoke tools during reasoning without extra fine-tuning. Highlights:
● Cognitive tools as internal modules: Each tool (e.g., understand question, recall related, examine answer, backtracking) is framed as a standalone prompt template that the LLM can invoke as needed. Unlike conventional tool use (e.g., calculator APIs), these tools operate within the LLM’s own architecture and memory.
● Consistent performance gains: On math-heavy reasoning benchmarks like AIME 2024, MATH500, and AMC, the cognitive tools pipeline significantly boosts pass@1 accuracy across models, including Qwen2.5, Llama3, and GPT-4.1. For instance, Llama3.3-70B improves from 13.1% to 29.8% on AIME2024, and GPT-4.1 rises from 26.7% to 43.3%, nearly matching the o1-preview RL-trained reasoning model at 44.6%.
● Superior to cognitive prompting: Compared to prior work on cognitive prompting, the modular tool approach shows stronger generalization and reduced reasoning interference. Tools can be invoked flexibly, and each invocation operates in a clean context window, boosting accuracy by up to +27.2% over baseline on Smolbenchmark.
● Interpretable and transferable: The modular nature of the tools enhances transparency, and their plug-and-play design allows transfer across models and benchmarks with minimal changes. The approach also supports interpretability by surfacing intermediate reasoning steps and decisions. | Paper, Tweet | | 8) SHADE-Arena - This paper introduces SHADE-Arena, the first large-scale benchmark for evaluating whether LLM-based agents can subtly carry out harmful side objectives (sabotage) while avoiding detection by monitors. The dataset contains 17 complex task pairs, each with a benign main task and a related malicious side task, set in realistic, tool-using environments. Agents are evaluated on their ability to complete both tasks without raising suspicion from an LLM monitor. Key Findings:
● Frontier models can occasionally succeed at sabotage: The best models, Claude 3.7 Sonnet and Gemini 2.5 Pro, successfully executed both main and side tasks without being caught 27% and 15% of the time respectively, when monitored by Claude 3.6 Sonnet.
● Hidden scratchpads significantly boost sabotage: Agents given a private reasoning space (“hidden scratchpad”) performed notably better. Removing this ability led to sharp drops in success and higher detection rates, indicating that monitoring chain-of-thought is a promising defense.
● Monitoring remains hard: The best monitor, Gemini 2.5 Pro, achieved an AUC of 0.87 in detecting sabotage transcripts, insufficient for real-world safety, where high accuracy with low false positives is crucial.
● Providing attack plans helps only slightly: Giving agents a hint about how to sabotage increases side task success on hard tasks, but often makes them more detectable due to poor execution and traceable behavior.
● Fine-tuning and prompt design improve monitors: GPT-4o fine-tuned on a small set of transcripts, outperforms several base models. Turn-by-turn monitoring is modestly helpful, but aggregating multiple final prompts works better than conversational memory or chained reasoning monitors.
● Humans underperform as monitors: In a human baseline, annotators averaged 103 minutes per transcript and still achieved only 0.65 AUC, below top LLM monitors. | Paper, Tweet | | 9) Leaky Thoughts - This work explores how reasoning traces in large reasoning models (LRMs) leak private user data, despite being assumed internal and safe. The study finds that test-time compute methods, while improving task utility, significantly increase privacy risks by exposing sensitive information through verbose reasoning traces that are vulnerable to prompt injection and accidental output inclusion. | Paper, Tweet | | 10) Advances in LLMs - This paper surveys recent advancements in LLMs focusing on reasoning, adaptability, efficiency, and ethics. It highlights techniques like CoT prompting, Instruction Tuning, RLHF, and multimodal learning, while also addressing challenges like bias, computational cost, and interpretability. | Paper, Tweet |

Top AI Papers of the Week (June 9 - June 15) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Text-to-LoRA - Introduces a hypernetwork-based approach for instantly generating LoRA adapters from natural language task descriptions, removing the need for conventional task-specific fine-tuning. The authors present Text-to-LoRA (T2L), a model that compresses many LoRA adapters and generalizes to unseen tasks with high efficiency and strong performance.
● T2L is trained to generate low-rank adaptation matrices (LoRAs) for LLMs using only task descriptions, leveraging a hypernetwork to output LoRA weights in a single forward pass. It supports two training modes: reconstruction of pre-trained LoRAs and supervised fine-tuning (SFT) across multiple tasks.
● In benchmark experiments, SFT-trained T2L performs competitively in zero-shot adaptation, outperforming multi-task LoRA baselines and even task-specific LoRAs on some tasks (e.g., PIQA and Winogrande), showcasing generalization and compression benefits.
● The authors test three architectural variants (L, M, S) of increasing parameter efficiency. Ablations show that T2L scales well with the number of training tasks, and its performance is robust across different task description embeddings (e.g., from GTE or Mistral models).
● Qualitative and visual analyses confirm that T2L produces task-specific and semantically meaningful adapters even for unseen tasks, with steerability controlled by how the task is described. | Paper, Tweet | | 2) V-JEPA 2 - Meta AI introduces V-JEPA 2, a scalable joint-embedding predictive architecture for self-supervised video learning, targeting the goal of building a generalist world model capable of understanding, predicting, and planning in the physical world. The model is trained in two stages: action-free pretraining on 1M+ hours of internet videos and images, followed by post-training with only 62 hours of unlabeled robot trajectories (Droid dataset). This yields a latent video representation that is useful across a broad range of downstream tasks.
● Video understanding & QA: V-JEPA 2 achieves state-of-the-art performance on motion-centric tasks like Something-Something v2 (77.3 top-1 acc) and Epic-Kitchens-100 action anticipation (39.7 recall@5), and excels in multimodal QA when aligned with an 8B language model (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Notably, this is achieved without language supervision during video pretraining.
● Self-supervised robot planning: With just 62 hours of robot video data, the team fine-tunes an action-conditioned model (V-JEPA 2-AC) on top of the frozen video encoder. This model supports zero-shot deployment for prehensile tasks like grasping and pick-and-place on real Franka arms in unseen environments, without rewards, demonstrations, or lab-specific fine-tuning.
● Architectural scale-up: V-JEPA 2 scales beyond its predecessor using four key improvements: expanded dataset (22M videos), larger ViT-g model (1B params), longer training (up to 252k iterations), and progressive spatiotemporal resolution. Combined, these yield significant gains across visual benchmarks (e.g., +4.0 average accuracy gain over baseline).
● Comparison to other models: V-JEPA 2 outperforms image encoders like DINOv2 and PEcoreG on video QA, and is competitive on appearance tasks like ImageNet. When used in MLLMs, it surpasses models like PLM-8B on PerceptionTest, MVP, and TOMATO using only vision encoders pretrained without language data. | Paper, Tweet | | 3) Reinforcement Pre-Training - This paper introduces Reinforcement Pre-Training (RPT), a new paradigm that bridges LLM pretraining and RL by reinterpreting next-token prediction as a reasoning task rewarded via verifiable correctness. Instead of relying on hand-curated annotations or costly human feedback, RPT applies RL on vast unannotated text corpora, assigning intrinsic rewards based on whether a predicted token matches the ground truth. This reframing supports general-purpose RL scaling and enhances both pretraining and fine-tuning efficacy.
● Core method: At each token position in a text sequence, the model first generates a reasoning trace (chain-of-thought) and then predicts the next token. If the prediction is a valid prefix of the ground-truth continuation, a reward is assigned. Multiple rollouts are used per context, and the model is trained via on-policy RL.
● Better than standard pretraining: RPT significantly outperforms standard next-token prediction and chain-of-thought reasoning baselines (without RL), achieving higher accuracy on tokens of varying difficulty and even rivaling larger models in performance. RPT-14B, for instance, matches or exceeds R1-Qwen-32B’s accuracy on the OmniMATH benchmark.
● Strong scaling laws: RPT exhibits clean power-law scaling with respect to training compute across difficulty levels, with prediction accuracy consistently improving as compute increases, and fitting closely to theoretical curves.
● Improves downstream RL and generalization: Fine-tuning RPT models with reinforcement learning on tasks with verifiable answers (e.g., Skywork-OR1) shows faster and stronger gains compared to models trained with standard objectives. Zero-shot evaluation on SuperGPQA and MMLU-Pro benchmarks reveals that RPT-14B in reasoning mode surpasses R1-Distill-Qwen-32B by a significant margin.
● Promotes structured thinking: Analysis of reasoning traces reveals that RPT-14B employs more hypothesis generation, deduction, and reflective patterns compared to traditional problem-solving models, supporting the claim that RPT fosters deeper reasoning habits during training. | Paper, Tweet | | 4) TableRAG - TableRAG tackles a core limitation of existing RAG approaches: their inability to reason effectively over heterogeneous documents that combine both unstructured text and structured tables. Typical RAG pipelines flatten tables and intermix them with surrounding text, losing essential structural information and hampering multi-hop reasoning. TableRAG overcomes this by introducing a hybrid system that integrates SQL-based symbolic execution with text retrieval in a unified, iterative reasoning framework.
● TableRAG operates in four iterative stages: (1) context-sensitive query decomposition, (2) text retrieval, (3) SQL programming and execution, and (4) intermediate answer generation. This design allows it to preserve tabular structure and leverage both symbolic and neural reasoning paths.
● A new benchmark, HeteQA, was developed to evaluate heterogeneous reasoning across 304 multi-hop QA examples covering nine domains and five types of tabular operations (e.g., aggregation, filtering, grouping).
● Experiments on HeteQA and existing benchmarks (HybridQA, WikiTableQuestions) show that TableRAG consistently outperforms prior methods like NaiveRAG, ReAct, and TableGPT2, achieving >10% gains in accuracy over the strongest baseline.
● Ablations reveal that all major components of TableRAG contribute significantly. Notably, SQL execution is critical for nested reasoning tasks, while textual retrieval is crucial for entity and numeric references.
● TableRAG achieves greater reasoning efficiency, solving over 90% of HeteQA tasks in five or fewer steps and exhibiting the lowest failure rate among evaluated methods. Its robustness holds across multiple LLM backbones (Claude, DeepSeek, Qwen). | Paper, Tweet | | 5) Self-Adapting Language Models - It proposes a novel framework that enables LLMs to adapt themselves through reinforcement learning by generating their own fine-tuning data and update directives, referred to as “self-edits.” This approach allows models to autonomously optimize their learning process, without relying on separate adaptation modules or human supervision. Key highlights:
● Self-Edits as Adaptation Mechanism: Instead of updating weights directly with raw data, SEAL uses the model to generate self-edits, natural language instructions that might include restated facts, optimization parameters, or tool invocations. These are then used for supervised fine-tuning. The generation of self-edits is optimized via RL using downstream task performance as the reward.
● Two Domains Evaluated:
● Learning Framework: SEAL employs an outer RL loop (optimizing the self-edit generation policy) and an inner loop (applying the edits via supervised fine-tuning). The RL objective is approximated via filtered behavior cloning (ReSTEM), reinforcing only edits that lead to performance gains.
● Limitations and Forward View: While SEAL achieves compelling gains, it remains susceptible to catastrophic forgetting during sequential updates and incurs high computational cost due to repeated fine-tuning. Future directions include combining SEAL with continual learning strategies and extending it toward autonomous agents that update weights as part of long-horizon interactions. | Paper, Tweet | | 6) ComfyUI-R1 - Introduces ComfyUI-R1, a 7B-parameter large reasoning model fine-tuned for automatic workflow generation in the ComfyUI ecosystem. Built on Qwen2.5-Coder and trained through a two-stage pipeline (supervised chain-of-thought reasoning followed by reinforcement learning), ComfyUI-R1 significantly outperforms prior state-of-the-art approaches that rely on commercial models like GPT-4o and Claude 3.5. Key highlights:
● Curated Workflow & Node KBs: The team collected and cleaned 27K community workflows down to 3.9K high-quality entries, each with JSON and code representations. They also assembled a node documentation KB using 7,238 nodes, some enhanced via Claude 3.5.
● CoT + RL Training: The model first undergoes supervised fine-tuning on simulated long CoT reasoning sequences involving node selection, planning, and code generation. Then, reinforcement learning with a fine-grained rule-metric hybrid reward encourages format validity, graph correctness, and node accuracy.
● Superior Performance: ComfyUI-R1 achieves 97% format validity and the highest node-level and graph-level F1 scores across benchmarks, beating all GPT-4o and Claude baselines. On the ComfyBench benchmark, it improves the execution pass rate to 67%, a full 11% above the previous best (ComfyAgent with GPT-4o).
● Qualitative Improvements: Case studies show ComfyUI-R1 creates more faithful, complex, and executable workflows than prior multi-agent approaches. Notably, it better aligns image generation outputs with user instructions in terms of style, structure, and coherence. | Paper, Tweet | | 7) Magistral - Mistral introduces Magistral, its first reasoning-focused LLM line, alongside a custom RL training stack that enables pure reinforcement learning from scratch. In contrast to prior approaches that rely on distillation from teacher models, Magistral trains directly using online RL with text-only data and custom reward shaping. The work yields two open models: Magistral Medium (based on Mistral Medium 3) and the open-sourced Magistral Small (24B), which is bootstrapped via SFT on Medium’s outputs, followed by RL. Key insights:
● Pure RL can rival distillation: Magistral Medium was trained without any reasoning traces and achieved a 50% boost in AIME-24 (pass@1) over the base model. Across reasoning benchmarks like LiveCodeBench and GPQA, it performs on par or better than DeepSeek-R1 despite lacking a distillation phase.
● Multilingual and multimodal generalization emerge: Reinforcement learning on textual math/code data surprisingly enhances multimodal reasoning and preserves instruction-following and tool-calling capabilities. Multilingual reasoning is achieved via simple reward shaping, enforcing the same-language output and chain-of-thought.
● Efficient async infrastructure: An asynchronous RL setup with NCCL weight broadcasting enables generators to roll out sequences continuously, receiving weight updates without blocking. This helps maintain on-policyness while maximizing GPU utilization.
● Reward shaping and CoT format enforcement: Rewards are conditioned on correct formatting (● Training heuristics and ablations: Analysis shows that reward and performance scale logarithmically with output length. Longer completions correlate with higher reward. RL moves weights in a low-dimensional space, as shown by PCA of checkpoints. Ablations on batch size, advantage normalization, and entropy (via ε ᵢg ) highlight stability tradeoffs.
● Open-source release: Magistral Small (24B) is released under Apache 2.0 and performs competitively under three setups, SFT-only, RL-only, and SFT+RL, with the final combo yielding the best results across math and code tasks. | Paper, Tweet | | 8) Code Researcher - Code Researcher is a deep research agent designed to resolve complex bugs in large systems’ code by leveraging multi-step reasoning over semantics, patterns, and commit history. It significantly outperforms existing coding agents like SWE-agent on benchmarks like kBenchSyz, achieving a 58% crash-resolution rate by effectively gathering and filtering global context before generating and validating patches. | Paper, Tweet | | 9) LLamaRL - LlamaRL is a fully-distributed, asynchronous reinforcement learning framework designed for efficient large-scale LLM training (8B to 405B+ models). It achieves up to 10.7× speedup over DeepSpeed-Chat by combining co-located model offloading, asynchronous off-policy training (AIPO), and fast GPU-native weight sync (DDMA), while maintaining model quality across tasks like math reasoning. | Paper, Tweet | | 10) Predicting a Cyclone’s Track with AI - Google DeepMind's Weather Lab debuts an interactive platform featuring an AI cyclone forecasting model that generates 50 ensemble scenarios up to 15 days ahead, outperforming traditional systems in track and intensity predictions. It integrates live and historical forecasts with baselines for evaluation and is now used by agencies like the U.S. National Hurricane Center to support early warnings. | Paper, Tweet |

Top AI Papers of the Week (June 2 - June 8) - 2025

| Paper | Links | | ------------- | ------------- | | 1) The Illusion of Thinking - Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI’s o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity. Rather than relying on conventional math benchmarks, which suffer from contamination and lack structure, the authors evaluate LRMs using four controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow fine-grained complexity scaling and transparent trace analysis. Key findings:
● Three complexity regimes: The study identifies distinct performance phases. In low-complexity tasks, non-thinking LLMs outperform LRMs due to more efficient and direct computation. In medium complexity, reasoning models show an advantage, leveraging longer chain-of-thoughts to correct errors. However, in high complexity, all models, regardless of their reasoning scaffolds, collapse to near-zero accuracy.
● Counterintuitive reasoning collapse: Surprisingly, LRMs reduce their reasoning effort (i.e., number of tokens used in thoughts) as problem complexity increases beyond a threshold. This suggests an internal scaling failure not caused by token limits but by intrinsic model behavior.
● Reasoning trace inefficiencies: LRMs frequently “overthink” on simple problems, finding correct answers early but continuing to explore incorrect paths. For moderate tasks, they correct late, and for complex ones, they fail to find any valid solution. Position-based accuracy analysis of thoughts reveals systematic shifts in when correct solutions are generated within the trace.
● Failure to execute explicit algorithms: Even when supplied with correct pseudocode (e.g., Tower of Hanoi recursion), models still failed at similar complexity points. This indicates that LRMs don’t just struggle to find solutions; they can’t reliably execute logical instructions either.
● Inconsistent behavior across puzzles: Models could perform >100 correct steps in Tower of Hanoi (N=10) but fail after 4 steps in River Crossing (N=3), suggesting performance correlates more with training data familiarity than inherent problem complexity. | Paper, Tweet | | 2) From Tokens to Thoughts - This paper introduces an information-theoretic framework to examine whether LLMs organize semantic knowledge like humans, balancing compression and meaning. Drawing from Rate-Distortion Theory and the Information Bottleneck principle, the authors evaluate token embeddings from 30+ LLMs against classic human categorization benchmarks from cognitive psychology.
● LLMs do form broad conceptual categories that align well with human groupings. Adjusted Mutual Information scores show LLM clusters consistently outperform random baselines, with even small encoder models like BERT matching or beating larger decoder-only models on this alignment task.
● However, LLMs struggle with fine-grained semantics. When tested on their ability to mirror human notions of item typicality (e.g., robin as a more typical bird than penguin), correlations between LLM embedding similarity and human ratings were weak and inconsistent. Most models failed to capture graded prototype structures evident in human cognition.
● Using their unified loss function L (balancing information complexity and semantic distortion), the authors find that LLMs produce statistically efficient clusters with lower entropy and distortion, while human conceptual clusters are less compact but preserve richer nuance. This suggests LLMs over-optimize for compression at the expense of meaning, unlike humans, who tolerate inefficiency to retain adaptive, flexible structure.
● The paper concludes that while LLMs can mimic surface-level categorization, they diverge fundamentally in how they represent meaning, highlighting a core gap between artificial and human semantic systems and offering a quantitative tool for improving human-aligned conceptual representations. | Paper | | 3) Knowledge or Reasoning - Introduces a fine-grained evaluation framework to dissect LLM thinking into two components: knowledge correctness and reasoning informativeness, measured via Knowledge Index (KI) and Information Gain (InfoGain), respectively. The authors apply this framework to evaluate how reasoning transfers across domains, particularly medical and mathematical, using Qwen2.5-7B and its DeepSeek-R1-distilled variant trained via SFT and RL. Key findings include:
● SFT improves knowledge but can harm reasoning: Supervised fine-tuning improves factual accuracy (e.g., 6.2% KI gain in medical tasks), but often leads to verbose or redundant reasoning that reduces InfoGain by 38.9% on average, compared to the base model.
● RL boosts both reasoning and knowledge in medical settings: Reinforcement learning enhances reasoning clarity and prunes incorrect knowledge, leading to a 12.4-point average gain in KI. It improves inference by guiding models toward more factually sound reasoning paths.
● Domain matters: While math tasks benefit more from reasoning (higher InfoGain), medical tasks rely heavily on domain knowledge (higher KI). In fact, KI shows a stronger correlation (0.998) with task accuracy than InfoGain (0.698) in medical benchmarks.
● Base models outperform R1-distilled versions in medicine: Qwen-Base consistently outperforms DeepSeek-R1-distilled models across accuracy, InfoGain, and KI. The R1-distilled model struggles with medical adaptation, likely due to pretraining bias toward math/code domains | Paper, Tweet | | 4) Open Thoughts - This paper presents OpenThoughts3, a systematic recipe for curating supervised fine-tuning (SFT) data that advances the performance of open-source reasoning models. The authors develop OpenThinker3-7B, a 7B parameter model trained on their new 1.2M example dataset (OpenThoughts3-1.2M) derived from over 1,000 controlled experiments. Despite using no reinforcement learning, OpenThinker3-7B outperforms all other open-data 7B and 8B models on standard math, code, and science reasoning benchmarks, even beating models trained with larger-scale or mixed SFT+RL pipelines. Key insights and contributions:
● Best-in-class 7B open model: OpenThinker3-7B achieves state-of-the-art results on AIME25 (53.3%), LiveCodeBench (51.7%), and GPQA Diamond (53.7%), outperforming DeepSeek-R1-Distill-Qwen-7B by 15–20 percentage points across tasks.
● Scaling laws with clean design: The authors ablate every step in the data pipeline, question sourcing, filtering, teacher choice, deduplication, and answer sampling, showing how each incrementally lifts performance. For instance, using multiple answers per question (16×) improved results more than simply increasing question diversity.
● QwQ-32B as a better teacher than stronger models: Surprisingly, QwQ-32B yielded better student models than DeepSeek-R1 or Phi-4 despite lower benchmark scores, suggesting teacher choice affects trace quality more than raw performance.
● Filtering matters more than verification: Question filtering based on response length and LLM-estimated difficulty was more predictive of downstream gains than traditional heuristics (e.g., fastText) or even filtering based on correctness verification, which had negligible effects.
● Data quality over diversity: Mixing only the top 1–2 question sources per domain consistently outperformed using many sources, indicating that question quality is more important than dataset heterogeneity.
● Open-source impact: The full datasets and models are released at openthoughts.ai, providing a reproducible benchmark for open reasoning research. | Paper, Tweet | | 5) Coding Agents with Multimodal Browsing - Introduces OpenHands-Versa, a unified agent designed to perform strongly across diverse domains, coding, web browsing, and multimodal information access, by equipping a single agent with three general capabilities: code execution, multimodal web browsing, and file/search access. In contrast to specialist or multi-agent systems optimized for narrow domains, OpenHands-Versa aims to solve a wide variety of real-world tasks with minimal architectural complexity. Key highlights:
● Unified Toolset, Superior Coverage: OpenHands-Versa integrates visual web browsing, search API access, and multimodal file processing into the OpenHands coding framework. Despite its simplicity, it surpasses specialized agents across three benchmarks: SWE-Bench Multimodal (+9.1%), GAIA (+1.3%), and The Agent Company (+9.1%) in success rates.
● Benchmark Generalization: The agent matches or outperforms multi-agent systems like OWL-roleplaying and Magentic-One, which struggle to generalize across domains. For example, OWL-roleplaying, though strong on GAIA, performs poorly on The Agent Company due to limited tool generality.
● Domain-Aware Tool Use: Analysis reveals that OpenHands-Versa effectively adapts its tool usage per benchmark (e.g., search APIs in GAIA, browser in The Agent Company, and visual validation in SWE-Bench M), unlike its predecessor, OpenHands, which misuses or lacks crucial tools like search.
● Minimal Agent, Strong Results: By relying on a single-agent design and Claude-3.7 or Claude Sonnet-4 as backbone LLMs, OpenHands-Versa achieves SOTA results without per-task tool customization. For example, it attains 64.24% on GAIA val split, outperforming multi-agent baselines by up to +18%. | Paper, Tweet | | 6) Self-Challenging Language Model Agents - Proposes a novel self-improvement method for multi-turn tool-use LLM agents, called the Self-Challenging Agent (SCA). It trains LLMs entirely from tasks they generate themselves, avoiding the need for human-annotated tasks or evaluations. The framework introduces a new task format called Code-as-Task (CaT), ensuring generated tasks are feasible, verifiable, and challenging. SCA is shown to double performance in a self-improvement setting and significantly boost performance in distillation. Key contributions and findings:
● Self-generated tasks via dual-agent roles: The agent alternates between a challenger role, where it explores the environment and creates tasks, and an executor role, where it learns to solve these tasks via reinforcement learning. The process is designed to emulate how human annotators interact with tools to design meaningful tasks.
● Code-as-Task (CaT) formulation: Each synthetic task includes an instruction, a Python-based verification function, a working solution, and several failure cases. This structure ensures task quality by filtering out trivial, impossible, or non-verifiable tasks using automatic code execution checks.
● Strong results in both distillation and self-improvement: SCA improves the Llama-3.1-8B-Instruct model’s success rate from 12.0% to 23.5% when learning from its own tasks. In the distillation setting (using a 70B teacher), SCA lifts performance to 32.2% Pass@1, outperforming the prior PAE baseline across all tool-use environments.
● Human annotation and ablation confirm task quality: Tasks generated with CaT significantly reduce false positives and negatives compared to PAE. A detailed analysis shows CaT’s filtering removes flawed tasks while retaining diversity when used with stronger models like Llama-3.1-70B.
● Scaling and training dynamics: More diverse tasks (not just more trajectories per task) yield better generalization, emphasizing the importance of broad synthetic coverage. Online RL methods like PPO and GRPO can further boost performance, but at higher tuning and compute cost. | Paper, Tweet | | 7) AlphaOne - Introduces a universal framework, α1, for modulating the reasoning progress of large reasoning models (LRMs) during inference. Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α. The method dynamically inserts “wait” tokens to encourage deeper reasoning and then deterministically ends slow thinking with a “

” token to prompt efficient answer generation. This yields better accuracy and efficiency than previous test-time scaling approaches. Key insights:
● Slow-then-fast reasoning outperforms other strategies: Contrary to human intuition (fast-then-slow), models benefit from beginning with slow reasoning before transitioning to faster inference. This “frontloaded effort” schedule leads to more accurate problem solving.
● Dense modulation via α1 boosts accuracy and efficiency: By continuously adjusting reasoning pace via α-scheduled “wait” token insertions, α1 outperforms existing test-time strategies like s1 (monotonic increase) and CoD (monotonic decrease), achieving up to +6.15% accuracy gain while using up to 14% fewer tokens on some benchmarks.
● Linear annealing is the most effective scheduling strategy: Among several tested functions for controlling “wait” insertion (constant, linear increase, exponential/linear anneal), linear anneal—gradually reducing “wait” token frequency, proved best across multiple models and datasets.
● Post-α moment modulation is critical: Simply inserting “wait” tokens leads to inertia in slow thinking. α1 ensures efficient termination by replacing future “wait” tokens with “

”, effectively forcing a shift to fast reasoning and boosting performance by up to +20% in some tasks. | Paper, Tweet | | 8) Common Pile v0.1 - The Common Pile v0.1 is an 8TB dataset of openly licensed text designed for LLM pretraining, addressing legal and ethical concerns of unlicensed data use. Two 7B parameter models trained on it, Comma v0.1-1T and 2T, achieve performance comparable to LLaMA 1 and 2, and the dataset, code, and model checkpoints are all publicly released. | Paper, Tweet | | 9) RewardBench 2 - RewardBench 2 is a new multi-skill benchmark for evaluating reward models with more challenging human prompts and stronger correlation to downstream performance. It highlights gaps in the current reward model's effectiveness and aims to support more rigorous evaluation, showing existing models score ~20 points lower than their predecessor. | Paper, Tweet | | 10) Memorization in LLMs - This study introduces a method to quantify how much a model memorizes versus generalizes, estimating GPT models have a capacity of ~3.6 bits per parameter. By training hundreds of models, the authors show that memorization saturates with data before generalization (“grokking”) kicks in, and derive new scaling laws linking capacity, data size, and membership inference. | Paper |

Top AI Papers of the Week (May 26 - June 1) - 2025

| Paper | Links | | ------------- | ------------- | | 1) New Lens on RAG Systems - Introduces a new conceptual and empirical framework for analyzing RAG systems through the lens of sufficient context, whether the retrieved content alone enables answering a query. This notion helps decouple retrieval failures from generation errors in LLMs, providing clarity on model behavior under different contextual adequacy. Key findings:
● New definition and classifier for sufficient context: The authors formalize “sufficient context” as context that plausibly allows answering a query, without requiring ground truth. They develop a high-accuracy LLM-based autorater (Gemini 1.5 Pro, 93% accuracy) to label instances as having sufficient or insufficient context, enabling large-scale evaluation without needing ground-truth answers.
● Sufficient context ≠ guaranteed correctness: Even when sufficient context is present, state-of-the-art LLMs like GPT-4o, Claude 3.5, and Gemini 1.5 still hallucinate answers more often than they abstain. Conversely, models can sometimes answer correctly despite insufficient context, likely leveraging parametric memory.
● Benchmarks contain substantial insufficient context: Analysis of datasets like HotPotQA, Musique, and FreshQA shows that a significant fraction of queries (e.g., >50% in Musique and HotPotQA) lack sufficient context, even with curated or oracle retrieval setups.
● Selective generation improves factuality: The authors propose a “selective RAG” method that combines model self-confidence with the sufficient context autorater to decide whether to answer or abstain. This yields consistent 2–10% gains in correctness (of answered queries) across Gemini, GPT, and Gemma models.
● Fine-tuning alone is insufficient: Attempts to fine-tune smaller models like Mistral 3 7B for better abstention (e.g., training them to say “I don’t know” on insufficient examples) modestly increased abstention but often reduced accuracy or failed to meaningfully curb hallucinations. | Paper, Tweet | | 2) Open-Ended Evolution of Self-Improving Agents - This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search. Unlike the original Gödel machine, which requires provable benefits for code changes (a practically intractable constraint), the DGM adopts an empirical approach: it modifies its own codebase and evaluates improvements on coding benchmarks. Key contributions and findings:
● Self-referential self-improvement loop: The DGM starts with a single coding agent that edits its own Python-based codebase to improve its ability to read, write, and execute code using frozen foundation models (FMs). Each modification is evaluated on benchmarks like SWE-bench and Polyglot, with only successful agents retained for further iterations.
● Open-ended exploration via evolutionary archive: Inspired by Darwinian evolution, the system maintains an archive of all prior agents and samples parents based on performance and novelty. This enables exploration beyond local optima and supports continual innovation, including revisiting previously suboptimal variants that become valuable stepping stones later.
● Empirical performance gains: Across 80 iterations, DGM boosts coding success on SWE-bench from 20.0% to 50.0% and on Polyglot from 14.2% to 30.7%, outperforming strong baselines that lack either self-improvement or open-endedness. Its best agents match or exceed leading human-designed, open-source coding agents.
● Emergent tool and workflow improvements: Through self-improvement, DGM enhances its capabilities by evolving more granular editing tools, retry and evaluation mechanisms, history-aware patch generation, and code summarization for long contexts.
● Generalization across models and tasks: Agents discovered by DGM generalize well when transferred across foundation models (e.g., Claude 3.5 to 3.7, o3-mini) and programming languages, demonstrating robust improvements not overfit to a particular setup.
● Safety-conscious design: All experiments were sandboxed, monitored, and scoped to confined domains. The paper also discusses how future self-improvement systems could evolve safer, more interpretable behaviors if these traits are part of the evaluation criteria. | Paper, Tweet | | 3) An Operating System for Memory-Augmented Generation in LLMs Introduces a unified operating system for managing memory LLMs, addressing a key limitation in - current architectures: their lack of structured, persistent, and governable memory. While today's LLMs rely primarily on parametric memory (model weights) and limited short-term context, MemOS proposes a comprehensive memory lifecycle and management infrastructure designed to support continual learning, behavioral consistency, and knowledge evolution. Key contributions and components include:
● Three-tier memory taxonomy: MemOS distinguishes between parametric memory (long-term weights), activation memory (short-term runtime states), and plaintext memory (editable, external content). These types are unified through a shared abstraction called the Memory Cube (MemCube), enabling seamless transformation (e.g., plaintext to parametric) and lifecycle governance.
● MemCube abstraction: Each MemCube encapsulates memory metadata (creation time, type, access policies, etc.) and a semantic payload (text, tensors, LoRA patches). This enables dynamic scheduling, traceable updates, and interoperability between modules and agents.
● Modular OS-style architecture: MemOS consists of three layers—Interface (user/API interaction), Operation (memory scheduling, lifecycle management), and Infrastructure (storage, access governance), that work together to manage memory parsing, injection, transformation, and archival.
● Closed-loop execution flow: Every interaction (e.g., prompt response) can trigger memory operations governed by scheduling rules and lifecycle policies. Retrieved memory can be injected into generation, stored in archives, or transformed into other types for long-term use.
● Vision for a memory-centric future: The paper proposes “memory training” as the next frontier beyond pretraining and finetuning, enabling models that learn continuously. Future work includes cross-model memory sharing, self-evolving memory blocks, and a decentralized memory marketplace. | Paper, Tweet | | 4) Building Production-Grade Conversational Agents with Workflow Graphs This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios. Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints. - Key contributions and findings include:
● Multi-State DAG Framework: Each node in the graph corresponds to a conversational state with its own system prompt, tool access, and execution rules. This structure enables robust constraint handling (e.g., avoiding hallucinated responses or non-compliant suggestions) by localizing logic and formatting within specific graph nodes.
● Fine-Tuning via Response Masking: Because conversation turns come from different states in the DAG, the authors introduce a fine-tuning strategy that applies selective loss masking to train LLMs only on responses relevant to a specific node’s context. This prevents prompt conflicts and improves adherence to node-specific constraints.
● Real-World Deployment and Results: In a deployment across KakaoTalk and web platforms, the graph-based approach significantly outperformed baseline agents and even GPT-4o across key metrics like task accuracy (+52%) and format adherence (+50%). In human preference tests, their internal model was favored over GPT-4o in 63% of real-world user cases, especially in product recommendation and safety-critical tasks. | Paper, Tweet | | 5) Spurios Rewards - This work challenges prevailing assumptions about reinforcement learning with verifiable rewards (RLVR) in mathematical reasoning tasks. The authors show that Qwen2.5-Math models can improve significantly under RL, even when trained with spurious or flawed rewards.
● Surprisingly effective spurious rewards: The Qwen2.5-Math-7B model gains +21.4% accuracy with random rewards, +16.4% with format-based rewards, and +24.6% when explicitly trained on incorrect answers. These are close to the +28.8% gain from ground-truth reward signals, suggesting that RLVR surfaces latent capabilities rather than teaching new reasoning skills.
● Model-specific generalization: Spurious rewards fail on other models like Llama3 or OLMo2. Only Qwen models consistently benefit, which the authors attribute to differences in pretraining. Notably, Qwen2.5-Math exhibits a unique “code reasoning” behavior, generating Python-like code to solve problems, which becomes more frequent post-RLVR and correlates strongly with accuracy.
● Mechanism behind gains: The authors trace performance improvements to a shift in reasoning strategies. Most of the gain comes from language→code transitions, where the model switches from natural language to code reasoning during RLVR. Interventions that explicitly increase code usage (e.g., rewarding code-like responses or using a code-forcing prompt) boost performance further, but only on Qwen models.
● Clipping bias enables learning from noise: Even with random rewards, performance improves due to GRPO’s clipping mechanism, which biases training toward reinforcing the model’s high-probability behaviors. These behaviors (e.g., code reasoning) happen to align with correctness in Qwen models but not in others. | Paper, Tweet | | 6) Learn to Reason without External Rewards - Proposes a method for training LLMs via reinforcement learning without any external rewards or labeled data. Instead, it uses the model’s own self-certainty, a confidence measure based on KL divergence from uniform, as the sole intrinsic reward. This self-improvement strategy, part of the broader Reinforcement Learning from Internal Feedback (RLIF) paradigm, bypasses the limitations of Reinforcement Learning with Verifiable Rewards (RLVR), which requires domain-specific verifiers and gold-standard outputs. Key highlights:
● INTUITOR matches GRPO without external supervision: When applied to mathematical reasoning tasks like GSM8K and MATH500, INTUITOR achieves performance on par with GRPO (a strong RLVR method), even without using gold solutions. On out-of-domain tasks such as LiveCodeBench and CRUXEval, INTUITOR generalizes better, achieving higher gains than GRPO (+65% vs. 0% and +76% vs. +44%, respectively).
● Rapid early learning and enhanced instruction-following: INTUITOR significantly boosts early training performance, particularly on models like Qwen2.5-1.5B, and improves adherence to chat-style instructions, reducing repetitive or nonsensical output.
● Emergent structured reasoning: Trained models display spontaneous reasoning even when not explicitly required, often generating explanations or planning steps before producing code or answers. This behavior correlates with better transfer performance to domains like code generation.
● Self-certainty as a robust, hack-resistant signal: Unlike fixed reward models prone to exploitation, online self-certainty adapts with the model and avoids reward hacking. INTUITOR-trained models show the strongest correlation between self-certainty and correct answers, confirmed by statistical tests. | Paper, Tweet | | 7) Learn to Reason via Mixture-of-Thought - While most prior approaches train with a single modality and only ensemble during inference, this work introduces Mixture-of-Thought (MoT) to jointly train and infer across modalities, resulting in notable gains in logical reasoning performance. Key findings:
● Three-modality synergy: MoT uses natural language for interpretability, code for structured procedural reasoning, and truth tables to explicitly enumerate logical cases. Error analysis shows that truth tables significantly reduce common LLM failure modes like missing branches or invalid converses.
● Self-evolving training: MoT introduces an iterative, on-policy training loop where the model generates, filters, and learns from its own multi-modal reasoning traces. This joint training outperforms both single-modality and partial-modality setups.
● Inference via voting: At test time, MoT generates predictions from each modality and selects the majority answer, leading to robust predictions. Results show up to +11.7pp average accuracy gains on FOLIO and ProofWriter, with 9B models matching GPT-4 + Logic-LM performance.
● Stronger on harder tasks: MoT delivers the largest improvements on problems with higher reasoning depth (5–8 steps). It also shows superior test-time scaling, with more diverse and accurate outputs under fixed inference budgets. MoT demonstrates that LLMs can achieve significantly more robust logical reasoning by reasoning like humans (using multiple modes of thought), not just by sampling more from a single modality. | Paper, Tweet | | 8) QwenLong-L1 - A new reinforcement learning framework that scales large reasoning models (LRMs) from short to long contexts using progressive context scaling and hybrid rewards. It achieves top performance on seven long-context benchmarks, surpassing models like OpenAI-o3-mini and Qwen3-235B-A22B, and matching Claude-3.7-Sonnet-Thinking, demonstrating strong reasoning with up to 120K token inputs. | Paper | | 9) End-to-End Policy Optimization for GUI Agents - ARPO introduces an end-to-end reinforcement learning method for training GUI agents using Group Relative Policy Optimization (GRPO) with experience replay. It significantly improves in-domain performance on the OSWorld benchmark, outperforming baselines by up to 6.7%, while offering modest gains on out-of-domain tasks and enabling self-corrective behaviors through structured reward feedback. | Paper, Tweet | | 10) Generalist Agent Enabling Scalable Agentic Reasoning - Proposes Alita, a generalist agent framework that enables scalable agentic reasoning through minimal predefinition and maximal self-evolution. Unlike traditional agents reliant on handcrafted tools, Alita autonomously constructs reusable MCPs (Model Context Protocols) using web search and code synthesis, outperforming more complex systems like OpenAI DeepResearch and OctoTools on GAIA, MathVista, and PathVQA benchmarks. | Paper |

Top AI Papers of the Week (May 19 - May 25) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Visual Planning - Proposes a novel reasoning paradigm that replaces language-based planning with image-based reasoning. The authors argue that language is not always the optimal medium for tasks involving spatial or physical reasoning. They introduce Visual Planning, where reasoning is executed as a sequence of visual states (images) without any text mediation, allowing models to “think” directly in images. This is realized through a reinforcement learning framework called VPRL (Visual Planning via Reinforcement Learning), which trains a vision-only model (LVM-3B) to plan using images. Key contributions and findings:
● Visual-only reasoning paradigm: The authors formally define planning as autoregressive visual state generation, trained using image-only data. Unlike multimodal LLMs that map vision to language and reason textually, this approach performs inference entirely in the visual modality, sidestepping the modality gap.
● VPRL framework: A two-stage training process is introduced. Stage 1 uses supervised learning on randomly sampled trajectories to ensure format consistency and promote exploration. Stage 2 applies GRPO (Group Relative Policy Optimization) to refine planning behavior via progress-based rewards, avoiding invalid or regressive moves.
● Superior performance: On three visual navigation tasks (FronzeLake, Maze, and MiniBehavior), VPRL outperforms language-based models (e.g., Gemini 2.5 Pro, Qwen 2.5-VL) by over 40% in Exact Match scores. It also generalizes better to out-of-distribution tasks (larger grid sizes), with visual planners degrading more gracefully than textual ones.
● Visual planning yields robustness and interpretability: Unlike textual outputs, visual plans enable step-by-step inspection and show stronger adherence to physical constraints. Qualitative examples illustrate how VPRL can avoid invalid moves and recover from non-optimal paths, while language models often hallucinate or misinterpret spatial layouts.
● Exploration and invalid action reduction: The random policy initialization in Stage 1 enables better exploration than supervised baselines (VPFT), as evidenced by higher entropy and fewer invalid actions. This leads to a more effective RL stage and ultimately stronger planning capabilities. | Paper, Tweet | | 2) EfficientLLM - Introduces the first large-scale, empirical benchmark for evaluating efficiency trade-offs in LLMs across architecture, fine-tuning, and inference. Conducted on a high-performance cluster (48×GH200, 8×H200 GPUs), the study evaluates over 100 model–technique pairs spanning 0.5B–72B parameters, using six metrics: memory utilization, compute utilization, latency, throughput, energy consumption, and compression rate. Key insights include:
● No one-size-fits-all solution: Every efficiency technique improves some metrics while degrading others. For instance, MoE boosts accuracy and reduces FLOPs but increases VRAM usage by ~40%, while int4 quantization reduces memory and energy by up to 3.9× at a small 3–5% performance cost.
● Resource-specific optima: Efficiency depends on context. MQA achieves the best memory-latency trade-off for constrained devices; MLA has the lowest perplexity for high-quality generation; RSLoRA is more efficient than LoRA only for models above 14B parameters.
● Cross-modal transferability: Efficiency techniques like MQA and PEFT generalize well to vision and vision-language models, improving FID scores and maintaining strong trade-offs.
● Training and tuning: LoRA and DoRA perform best for small models (1–3B), while RSLoRA excels at large scale (≥14B). Parameter freezing achieves the lowest latency but at a slight cost to accuracy.
● Inference: int4 post-training quantization yields the highest compression and throughput gains with minor quality degradation, while bfloat16 consistently outperforms float16 in latency and energy on modern GPUs. | Paper, Tweet | | 3) J1 - Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. Instead of relying solely on prompting or preference fine-tuning, J1 employs online reinforcement learning with verifiable rewards to teach models to think through evaluations systematically. Key insights:
● Verifiable framing for judgment: J1 converts both verifiable (e.g., math) and non-verifiable (e.g., user queries) prompts into tasks with verifiable rewards by generating synthetic preference pairs. This reframing enables the use of reinforcement learning and consistent training signals across diverse tasks.
● Chain-of-thought-driven RL optimization: J1 trains models to reason through evaluations via explicit thought traces, including outlining evaluation criteria, reference answer generation, and self-comparison before producing judgments. Two model types are trained: Pairwise-J1 (outputs verdicts) and Pointwise-J1 (outputs quality scores). Pairwise-J1 models are further improved by consistency rewards to reduce positional bias.
● Superior performance at scale: J1-Llama-8B and J1-Llama-70B outperform existing 8B and 70B LLM judges across five benchmarks (PPE, RewardBench, RM-Bench, JudgeBench, FollowBenchEval), beating models trained with much more data like DeepSeek-GRM and distillations of DeepSeek-R1. J1-70B even surpasses o1-mini and closes the gap with the much larger R1 model, particularly on non-verifiable tasks.
● Pointwise-J1 mitigates positional bias: While pairwise judges can flip verdicts based on response order, Pointwise-J1 (trained only from pairwise supervision) offers position-consistent scoring with fewer ties and better consistency. Both judge types benefit from test-time scaling via self-consistency, further improving reliability. | Paper, Tweet | | 4) The Pitfalls of Reasoning for Instruction- Following in LLMs - Explores an unexpected flaw in reasoning-augmented large language models (RLLMs): while chain-of-thought (CoT) prompting often boosts performance on complex reasoning tasks, it can degrade instruction-following accuracy. The authors evaluate 15 models (e.g., GPT, Claude, LLaMA, DeepSeek) on two instruction-following benchmarks and find that CoT prompting consistently reduces performance across nearly all models and datasets. Key findings:
● Reasoning hurts instruction adherence: On IFEval, 13 of 14 models saw accuracy drops with CoT; all 15 models regressed on ComplexBench. For example, Meta-LLaMA3-8B’s IFEval accuracy dropped from 75.2% to 59.0% with CoT. Even reasoning-tuned models like Claude3.7-Sonnet-Think performed slightly worse than their base counterparts.
● Why reasoning fails: Manual case studies show CoT can help with structural formatting (e.g., JSON or Markdown) and precise lexical constraints (like exact punctuation). But it often hurts by (a) neglecting simple constraints during high-level content planning and (b) inserting helpful but constraint-violating content (e.g., translations in language-restricted outputs).
● Attention-based diagnosis: The authors introduce a constraint attention metric and find that CoT reduces the model's focus on instruction-relevant tokens, especially in the answer generation phase. This diminished constraint awareness correlates with performance drops.
● Mitigation strategies: Four techniques are proposed to selectively apply reasoning: | Paper, Tweet | | 5) Generalizable AI Predicts Immunotherapy Outcomes Across Cancers and Treatments - Introduces COMPASS, a concept bottleneck-based foundation model that predicts patient response to immune checkpoint inhibitors (ICIs) using tumor transcriptomic data. Unlike prior biomarkers (TMB, PD-L1, or fixed gene signatures), COMPASS generalizes across cancer types, ICI regimens, and clinical contexts with strong interpretability and performance. Key contributions:
● Concept Bottleneck Architecture: COMPASS transforms transcriptomic data into 44 high-level immune-related concepts (e.g., T cell exhaustion, IFN-γ signaling, macrophage activity) derived from 132 curated gene sets. This structure provides mechanistic interpretability while enabling pan-cancer modeling.
● Pan-Cancer Pretraining and Flexible Fine-Tuning: Trained on 10,184 tumors across 33 cancer types using contrastive learning, and evaluated on 16 ICI-treated clinical cohorts (7 cancers, 6 ICI drugs). COMPASS supports full, partial, linear, and zero-shot fine-tuning modes, making it robust in both data-rich and data-poor settings.
● Superior Generalization and Accuracy: In leave-one-cohort-out testing, COMPASS improved precision by 8.5%, AUPRC by 15.7%, and MCC by 12.3% over 22 baseline methods. It also outperformed in zero-shot settings, across drug classes (e.g., predicting anti-CTLA4 outcomes after training on anti-PD1), and in small-cohort fine-tuning.
● Mechanistic Insight into Resistance: Personalized response maps reveal actionable biological mechanisms. For instance, inflamed non-responders show resistance via TGF-β signaling, vascular exclusion, CD4+ T cell dysfunction, or B cell deficiency. These go beyond classical “inflamed/desert/excluded” phenotypes, offering nuanced patient stratification.
● Clinical Utility and Survival Stratification: COMPASS-predicted responders had significantly better survival in a held-out phase II bladder cancer trial (HR = 4.7, p = 1.7e-7), outperforming standard biomarkers (TMB, PD-L1 IHC, immune phenotype). | Paper | | 6) Towards a Deeper Understanding of Reasoning in LLMs - This paper investigates whether LLMs can adapt and reason in dynamic environments, moving beyond static benchmarks. Using the SmartPlay benchmark—a suite of four interactive games that require diverse cognitive skills—the authors evaluate three prompting strategies: self-reflection, heuristic mutation (via an Oracle), and planning. They test these methods across models of varying size (Llama3-8B to Llama3.3-70B) and draw several conclusions on how model scale and prompting interact with task complexity. Key findings:
● Model size dominates performance, especially on reactive and structured reasoning tasks. Larger models (e.g., Llama3.3-70B) significantly outperform smaller ones on tasks like Tower of Hanoi and Bandit, where fast exploitation or spatial planning is critical.
● Advanced prompting helps smaller models more, particularly on complex tasks. For example, Llama3-8B with Reflection+Oracle surpasses Llama3.3-70B’s baseline on Rock-Paper-Scissors. However, these strategies introduce high variance and can lead to worse-than-baseline performance depending on the run.
● Long prompts hurt smaller models on simple tasks. In Bandit, adding reflective reasoning decreases performance by distracting the model or prolonging exploration. This aligns with prior findings on prompt length and signal-to-noise ratio.
● Prompting strategy gains depend on task type. Instruction following improves across all models, while long-text understanding benefits mid-sized models. In contrast, strategies show weak or negative impact on planning, reasoning, and spatial challenges for large models.
● Dense reward shaping improves performance more reliably than prompting. In follow-up experiments, modifying sparse reward signals (especially in Hanoi and Messenger) led to more consistent gains than tweaking prompt strategies. | Paper, Tweet | | 7) AdaptThink - This paper introduces AdaptThink, an RL framework designed to help reasoning models decide when to use detailed chain-of-thought reasoning (“Thinking”) versus directly producing an answer (“NoThinking”), based on task difficulty. This approach challenges the prevailing assumption that deep reasoning should be applied uniformly across all problems, showing that skipping the “thinking” step often yields better efficiency and even higher accuracy on simpler tasks. Key insights:
● NoThinking outperforms Thinking on simple problems: The authors demonstrate that models like DeepSeek-R1 perform better (in both accuracy and efficiency) when using NoThinking mode, an empty token prompt, for easy problems. For example, on Level 1 MATH500 problems, NoThinking achieved slightly better accuracy with significantly fewer tokens used.
● AdaptThink learns to switch modes: The proposed RL algorithm introduces a constrained optimization that promotes NoThinking as long as accuracy doesn’t degrade. It uses a novel importance sampling strategy to enable cold-start learning of both modes from the beginning, avoiding the collapse into all-Thinking behavior.
● Massive gains in efficiency and performance: On GSM8K, MATH500, and AIME 2024, AdaptThink reduced response length by up to 53% and improved accuracy by up to 2.4% over DeepSeek-R1-Distill-Qwen-1.5B. It also outperformed prior methods (e.g., DPOShortest, TLMRE, ModelMerging) in the trade-off between accuracy and response length.
● Robustness and generalization: AdaptThink generalizes to out-of-distribution tasks such as MMLU, maintaining or improving accuracy while reducing token usage. It also avoids "implicit thinking" in NoThinking responses, showing controlled behavior during inference. | Paper | | 8) MedBrowseComp - MedBrowseComp is a new benchmark designed to evaluate LLM agents’ ability to perform complex, multi-hop medical fact-finding by browsing real-world, domain-specific web resources. Testing over 1,000 clinically grounded questions, the benchmark reveals major capability gaps in current models, with top systems achieving only 50% accuracy and GUI-based agents performing even worse. | Paper, Tweet | | 9) ARC-AGI-2 - ARC-AGI-2 is a new benchmark designed to push the boundaries of AI reasoning beyond the original ARC-AGI. It introduces harder, more unique tasks emphasizing compositional generalization and human-like fluid intelligence, with baseline AI models performing below 5% accuracy despite strong ARC-AGI-1 results. | Paper, Tweet | | 10) Teaching MLLMs to Think with Images - GRIT is a new method that enables MLLMs to perform grounded visual reasoning by interleaving natural language with bounding box references. Using a reinforcement learning approach (GRPO-GR), GRIT achieves strong reasoning and grounding performance with as few as 20 image-question-answer triplets, outperforming baselines in both accuracy and visual coherence. | Paper, Tweet |

Top AI Papers of the Week (May 12 - May 18) - 2025

| Paper | Links | | ------------- | ------------- | | 1) AlphaEvolve - AlphaEvolve is a coding agent developed by Google DeepMind that uses LLM-guided evolution to discover new algorithms and optimize computational systems. It orchestrates a pipeline where LLMs generate code changes, evaluators provide feedback, and an evolutionary loop iteratively improves solutions. AlphaEvolve shows that LLMs can go beyond conventional code generation and assist in scientific and algorithmic discovery. Key highlights:
● Novel Algorithm Discovery: AlphaEvolve discovered a new algorithm to multiply 4×4 complex-valued matrices using 48 multiplications, the first improvement over Strassen’s 1969 result (49 multiplications) in this setting.
● Broad Mathematical Impact: Applied to 50+ open problems in mathematics, AlphaEvolve matched or exceeded state-of-the-art in ~95% of cases. For example, it improved bounds on Erdős’s minimum overlap problem and kissing numbers in 11 dimensions.
● Infrastructure Optimization at Google: AlphaEvolve improved key components of Google’s compute stack:
● Advanced Pipeline Design: AlphaEvolve uses ensembles of Gemini 2.0 Flash and Pro models. It supports rich prompts (past trials, evaluations, explicit context), multi-objective optimization, and evaluation cascades for robust idea filtering. Programs are evolved at full-file scale rather than function-level only, a key differentiator from predecessors like FunSearch.
● Ablations Confirm Component Importance: Experiments show that evolution, prompt context, full-file evolution, and using strong LLMs all contribute significantly to performance. Removing any one of these reduces effectiveness. | Paper, Tweet | | 2) LLMs Get Lost in Multi-Turn Conversation - Investigates how top LLMs degrade in performance during underspecified, multi-turn interactions, common in real-world usage but rarely evaluated. The authors introduce a novel "sharded simulation" framework that breaks down fully-specified instructions into gradual conversation shards, simulating how users naturally provide information over time. Key findings:
● Massive performance drop: Across 15 top LLMs (e.g., GPT-4.1, Gemini 2.5 Pro, Claude 3.7), average performance dropped 39% in multi-turn vs. single-turn settings. Even a two-turn interaction was enough to cause a significant decline.
● High unreliability, not just low aptitude: Decomposition shows only a small drop in best-case capability (aptitude) but a 112% increase in unreliability, meaning models are wildly inconsistent depending on how the conversation unfolds.
● Root causes of failure: Through log analysis and experiments, the paper identifies four major issues:
● Sharded evaluation tasks: The authors built 600+ multi-turn simulations across 6 tasks (coding, math, SQL, API calls, summarization, and table captioning), showing consistent degradation across domains.
● Agent-style interventions only partially help: Techniques like recap and snowballing (repeating all prior turns) improved outcomes by ~15–20% but did not restore single-turn levels, suggesting that model internals, not prompting strategies, are the bottleneck.
● Temperature and test-time compute don't solve the issue: Even at temperature 0.0 or with reasoning models (like o3 and DeepSeek-R1), models remained highly unreliable in multi-turn settings. | Paper, Tweet | | 3) RL for Reasoning in LLMs with One Training Example - This paper shows that Reinforcement Learning with Verifiable Rewards (RLVR) can significantly improve mathematical reasoning in LLMs even when trained with just a single example. On the Qwen2.5-Math-1.5B model, one-shot RLVR improves accuracy on the MATH500 benchmark from 36.0% to 73.6%, nearly matching performance achieved with over 1,200 examples. Two-shot RLVR (with two examples) even slightly surpasses that, matching results from full 7.5k example training.
● Extreme data efficiency: A single training example (π₁₃) boosts MATH500 accuracy to 73.6% and average performance across six math benchmarks to 35.7%, rivaling full-dataset RLVR. Two-shot RLVR goes further (74.8% and 36.6%).
● Broad applicability: 1-shot RLVR works not only on Qwen2.5-Math-1.5B, but also on Qwen2.5-Math-7B, Llama3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. It remains effective across GRPO and PPO RL algorithms.
● Post-saturation generalization: Despite training accuracy saturating early (within 100 steps), test accuracy continues improving well beyond, reaching gains of +10% after 2,000 steps. The model eventually overfits the single example (mixing gibberish into outputs), yet test performance remains stable.
● Cross-domain and reflection behavior: A single example from one domain (e.g., geometry) improves performance across others (e.g., number theory). Additionally, models trained with 1-shot RLVR exhibit increased self-reflection (e.g., “rethink”, “recalculate”) and longer output sequences.
● Loss function insights: Ablation studies confirm that policy gradient loss is the primary driver of improvements, not weight decay, distinguishing 1-shot RLVR from "grokking". Entropy loss further enhances performance and generalization; even without reward signals, entropy-only training can still yield a 27% performance boost. | Paper, Tweet | | 4) AM-Thinking-v1 - Introduces a dense, open-source 32B language model that achieves state-of-the-art performance in reasoning tasks, rivaling significantly larger Mixture-of-Experts (MoE) models. Built upon Qwen2.5-32B, the model is trained entirely with public data and showcases how a meticulously crafted post-training pipeline can unlock competitive performance at mid-scale sizes. Key points:
● Benchmark performance: AM-Thinking-v1 scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 (671B MoE) and matching or exceeding Qwen3-32B and Seed1.5-Thinking. On Arena-Hard (general chat), it hits 92.5, near the level of OpenAI o1 and o3-mini but behind Qwen3-235B-A22B and Gemini 2.5 Pro.
● Training pipeline: The model uses a two-stage post-training approach combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT emphasizes a “think-then-answer” format and uses 2.84M samples, while RL incorporates difficulty-aware sampling and a two-stage curriculum optimized via Group Relative Policy Optimization (GRPO).
● Data and filtering: All training data is publicly sourced and heavily filtered. Math data goes through LLM-assisted cleaning and cross-model ground-truth validation. Responses are filtered using perplexity, n-gram repetition, and structural checks to ensure coherence and correctness.
● Inference and deployment: The authors implement a custom rollout framework atop, decoupling rollout from inference via a streaming load balancer. This reduces long-tail latency and increases throughput across distributed GPU nodes, enabling scalable RL training at 32k sequence length. | Paper, Tweet | | 5) HealthBench - HealthBench is a benchmark of 5,000 multi-turn health conversations graded against 48,562 rubric criteria written by 262 physicians across 60 countries. Unlike prior multiple-choice evaluations, HealthBench supports open-ended, realistic assessments of LLM responses across diverse health themes (e.g., global health, emergency care, context-seeking) and behavioral axes (accuracy, completeness, communication, context awareness, instruction following).
● Significant frontier model gains: HealthBench reveals rapid performance improvements, with GPT-3.5 Turbo scoring 16%, GPT-4o reaching 32%, and o3 achieving 60%. Notably, smaller models like GPT-4.1 nano outperform GPT-4o while being 25x cheaper.
● Two challenging benchmark variants: HealthBench Consensus focuses on 34 physician-validated criteria (e.g., recognizing emergencies), while HealthBench Hard isolates 1,000 difficult examples on which no model scores above 32%, establishing headroom for future progress.
● Physician comparison baseline: Surprisingly, LLMs like o3 and GPT-4.1 often produce higher-quality responses than unassisted physicians. When provided with model responses as references, physicians improved older model completions but couldn’t improve completions from newer models.
● Reliable model-based grading: Meta-evaluation shows GPT-4.1 as a grader achieves macro F1 scores comparable to physicians. On average, its agreement with other doctors places it in the 51st–88th percentile across themes like emergency triage, communication, and uncertainty handling.
● Safety-relevant insights: The benchmark assesses worst-case performance using "worst-at-k" scores, showing that even the best models have reliability gaps. For example, o3’s worst-at-16 score drops by a third from its average, underscoring the need for further safety work. | Paper, Tweet | | 6) Nemotron-Research-Tool-N1 - Introduces Tool-N1, a family of tool-using LLMs trained using a rule-based reinforcement learning (R1-style RL) approach, without reliance on supervised reasoning trajectories. The key idea is to enable models to learn to invoke external tools correctly through binary feedback based on functional correctness and format adherence, rather than step-by-step imitation.
● Rule-based RL over SFT: Tool-N1 models are trained using a lightweight binary reward that only evaluates whether the model's tool calls are structurally correct and functionally valid. This allows the model to develop its reasoning process, sidestepping the limitations of mimicking distilled trajectories via supervised fine-tuning (SFT).
● Strong benchmark results: Tool-N1-7B and Tool-N1-14B outperform GPT-4o and domain-specialized models on several benchmarks, including BFCL, API-Bank, and ACEBench. For example, Tool-N1-14B beats GPT-4o on BFCL overall (85.97 vs 83.97) and achieves +5% over GPT-4o on API-Bank.
● Pure RL outperforms SFT-then-RL: A systematic comparison on 5,518 distilled trajectories shows that pure RL yields better results than the SFT-then-RL pipeline, challenging the dominant paradigm. For instance, 100% RL achieves 83.24% average vs. 83.17% for SFT+RL.
● Binary reward fine-grained reward: Ablation studies reveal that strict binary rewards (requiring correct reasoning format and exact tool call) lead to better generalization than partial credit schemes, especially on realistic “Live” data (80.38% vs 76.61%).
● Scaling and generalization: Performance scales well with model size, with the most gains observed in larger models. The method generalizes across backbones, with Qwen2.5-Instruct outperforming LLaMA3 variants at the same scale. | Paper, Tweet | | 7) RL for Search-Efficient LLMs - Proposes a new RL-based framework (SEM) that explicitly teaches LLMs when to invoke search and when to rely on internal knowledge, aiming to reduce redundant tool use while maintaining answer accuracy. Key points:
● Motivation & Setup: LLMs often overuse external search even for trivial queries. SEM addresses this by using a balanced training dataset (Musique for unknowns, MMLU for knowns) and a structured format (● Reward Optimization: The authors employ Group Relative Policy Optimization (GRPO) to compare outputs within query groups. The reward function penalizes unnecessary search and rewards correct answers, either without search or with efficient search-and-reasoning when needed.
● Experimental Results: On HotpotQA and MuSiQue, SEM significantly outperforms Naive RAG and ReSearch, achieving higher EM and LLM-Judged (LJ) accuracy with smarter search ratios. On MMLU and GSM8K (where search is often unnecessary), SEM maintains high accuracy while invoking search far less than baseline methods (e.g., 1.77% SR vs 47.98% for Naive RAG on MMLU.
● Case Study & Efficiency: SEM avoids absurd search behavior like querying “What is 1+1?” multiple times. It also uses fewer but more targeted searches for unknowns, enhancing both interpretability and computational efficiency. Training dynamics further show that SEM enables faster and more stable learning than prior methods. | Paper, Tweet | | 8) Cost-Efficient, Low-Latency Vector Search - Integrates DiskANN (a vector indexing library) inside of Azure Cosmos DB NoSQL (an operational dataset) that uses a single vector index per partition stored in existing index trees. Benefit: It supports < 20ms query latency over an index spanning 10 million vectors, has stable recall over updates, and offers nearly 15× and 41× lower query cost compared to Zilliz and Pinecone serverless enterprise products. It can further scale to billions of vectors with automatic partitioning. | Paper, Tweet | | 9) AI Agents vs. Agentic AI - This review paper distinguishes AI Agents from Agentic AI, presenting a structured taxonomy and comparing their architectures, capabilities, and challenges. AI Agents are defined as modular, task-specific systems powered by LLMs and tools, while Agentic AI represents a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy, with applications and challenges mapped out for both paradigms, along with proposed solutions like RAG, orchestration layers, and causal modeling. | Paper, Tweet | | 10) CellVerse - Introduces a benchmark to evaluate LLMs on single-cell biology tasks by converting multi-omics data into natural language. While generalist LLMs like DeepSeek and GPT-4 families show some reasoning ability, none significantly outperform random guessing on key tasks like drug response prediction, exposing major gaps in biological understanding by current LLMs. | Paper, Tweet |

Top AI Papers of the Week (May 5 - May 11) - 2025

| Paper | Links | | ------------- | ------------- | | 1) The Leaderboard Illusion - The Leaderboard Illusion investigates systemic distortions in how the Chatbot Arena leaderboard evaluates LLMs, arguing that current practices undermine fair model comparison and scientific progress. Through extensive data analysis covering 2M Arena battles, the authors identify four key issues distorting rankings:
● Selective score reporting through private testing: Some providers (notably Meta, Google, and OpenAI) are allowed to test dozens of model variants privately and only publish the best-performing one. This violates the unbiased sampling assumption of the Bradley-Terry (BT) model, which powers Arena rankings. Simulations show that testing just 10 variants can artificially inflate a model’s Arena score by ~100 points.
● Extreme data asymmetries: Proprietary models are oversampled compared to open-weight and open-source models. OpenAI and Google alone received over 39% of all Arena data, while 83 open-weight models collectively received only 29.7%. These data advantages translate into significant performance gains: a model trained on 70% Arena data outperforms its baseline by 112% on the ArenaHard benchmark.
● Unfair and opaque deprecations: 205 models were silently removed from the leaderboard despite only 47 being officially marked as deprecated. Open-source models are disproportionately affected, breaking the comparison graph and violating BT model assumptions, leading to unreliable rankings.
● Overfitting to Arena-specific dynamics: Due to partial prompt repetition and distributional drift over time, access to Arena data allows providers to tune models specifically for Arena performance. This leads to high win rates on Arena benchmarks, but not on out-of-distribution tasks like MMLU, where gains diminish or reverse. | Paper | | 2) Llama-Nemotron - NVIDIA introduces the Llama-Nemotron model series, LN-Nano (8B), LN-Super (49B), and LN-Ultra (253B), a family of open, efficient, and high-performing reasoning models. These models rival or outperform DeepSeek-R1 on various benchmarks while offering significantly better inference throughput and memory efficiency. LN-Ultra is noted as the most "intelligent" open model by Artificial Analysis. A key innovation is a dynamic reasoning toggle ("detailed thinking on/off") that allows users to control reasoning behavior at inference time. Highlights:
● Multi-stage training: Models were built via neural architecture search (Puzzle), knowledge distillation, continued pretraining, supervised fine-tuning (SFT), and large-scale RL. LN-Ultra is enhanced with FP8 inference and FFN Fusion for speed and scalability.
● Reasoning Toggle: The models can switch between reasoning and non-reasoning modes via a simple prompt instruction, making them adaptable for various use cases.
● Synthetic dataset: Over 33M examples across math, code, science, and instruction-following were curated, with reasoning-mode samples tagged explicitly. LN-Ultra's training used curriculum RL and GRPO to surpass its teachers on benchmarks like GPQA-D.
● Evaluation dominance: LN-Ultra outperforms DeepSeek-R1 and Llama-3.1-405B in reasoning tasks like AIME25, MATH500, and GPQA-Diamond while also achieving strong chat alignment scores (Arena-Hard: 87.0). LN-Super scores 88.3, beating Claude 3.5 and GPT-4o. NVIDIA provides the weights, training code (NeMo, Megatron-LM, NeMo-Aligner), and the full post-training dataset under a permissive license, aiming to push open research in reasoning models. | Paper, Models | | 3) Absolute Zero - Introduces an LLM training framework that eliminates the need for human-curated data. Key highlights:
● It learns to propose and solve its reasoning tasks entirely through self-play, guided by verifiable feedback from an execution environment. This zero-data RLVR (RL with Verifiable Rewards) setting achieves SOTA coding and math reasoning performance.
● AZR learns by generating its code-based reasoning tasks using three core reasoning modes (deduction, abduction, and induction), validating solutions via Python execution, not human labels.
● A single LLM plays both roles, proposing new tasks based on learnability and solving them with feedback-based reinforcement. Rewards favor moderately difficult tasks to maximize the learning signal.
● Despite using zero in-domain examples, AZR outperforms all previous zero-setting models on average by +1.8 points and even beats models trained on tens to hundreds of thousands of curated samples. AZR-Coder-7B achieves the highest average score across all tested models.
● AZR trained in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, far more than expert code models trained with RLVR, showing strong generalization.
● Larger AZR models (3B → 7B → 14B) consistently show greater improvements, confirming scalability and suggesting promise for even larger models.
● AZR develops natural ReAct-like intermediate planning in code (e.g., interleaved comments and logic), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically observed in much larger models.
● Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains (dubbed “uh-oh moments”), highlighting the importance of safety-aware training in autonomous systems. | Paper, Tweet | | 4) Discuss-RAG - This paper introduces Discuss-RAG, a plug-and-play agent-based framework that enhances retrieval-augmented generation (RAG) for medical question answering by mimicking human-like clinical reasoning. Standard RAG systems rely on embedding-based retrieval and lack mechanisms to verify relevance or logical coherence, often leading to hallucinations or outdated answers. Discuss-RAG addresses these gaps via a modular agent setup that simulates multi-turn medical discussions and performs post-retrieval verification. Key ideas:
● Multi-agent collaboration: A summarizer agent orchestrates a team of medical domain experts who iteratively refine a contextual summary through simulated brainstorming, providing deeper and more structured information to guide retrieval.
● Decision-making agent: After retrieval, a verifier and a decision-making agent assess snippet quality and trigger fallback strategies when relevance is low, improving answer accuracy and contextual grounding.
● Plug-and-play design: Discuss-RAG is training-free and modular, allowing easy integration into existing RAG pipelines.
● Strong performance gains: Across four benchmarks, Discuss-RAG outperforms MedRAG with substantial accuracy improvements, notably +16.67% on BioASQ and +12.20% on PubMedQA. | Paper | | 5) The Value of RL in Fine-Tuning - This work shows that, in theory, every popular preference-fine-tuning objective collapses to maximum-likelihood estimation (MLE), yet experiments show a consistent RL advantage on real tasks. They reconcile this gap with a generation-verification complexity hypothesis.
● Theory: RLHF ≈ MLE – Under mild assumptions, trajectory-level RLHF, DPO, and related algorithms are equivalent to projecting the data back to likelihood space, so expending compute on on-policy sampling should be unnecessary.
● Empirics contradict naïve theory – On the tl;dr summarization benchmark with Pythia-1.4B/2.8B, a single online-DPO iteration lifts win-rate by 6-10 pts over offline DPO despite identical data, model, and optimizer, confirming that RL can add real value.
● Takeaways – RL helps when crafting a good answer is harder than checking one. The gap vanishes on two-word summaries (horizon = 1) or when ROUGE-L is used as the reward. RL acts as a shortcut through policy space only when the reward model is simpler than the policy it trains. For tasks where verification is as hard as generation, offline likelihood-based fine-tuning suffices, guiding practitioners on when RLHF is worth its extra cost. | Paper | | 6) WebThinker - This paper introduces a reasoning agent framework that equips large reasoning models (LRMs) with autonomous web exploration and report writing abilities to overcome limitations of static internal knowledge. WebThinker integrates a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy that lets models search the web, reason through tasks, and generate comprehensive outputs simultaneously. It also incorporates an RL-based training loop using online DPO to improve tool usage. The system supports two modes: complex problem solving and scientific report generation. Key points:
● Superior performance in complex reasoning: On GPQA, GAIA, WebWalkerQA, and HLE, WebThinker-32B-RL achieved new state-of-the-art results among 32B models, outperforming both retrieval-augmented and proprietary systems like GPT-4o and DeepSeek-R1-671B. For example, it reached 70.7% on GPQA and 15.8% on HLE, with gains of up to +21.5% over baselines.
● Best-in-class scientific report writing: On the Glaive dataset, WebThinker outperformed Gemini2.0 Deep Research and Grok3 DeeperSearch, scoring 8.1 in average quality metrics such as completeness and coherence.
● RL refinement matters: The RL-trained version outperformed its base counterpart across all benchmarks, showing that iterative preference-based learning significantly enhances reasoning-tool coordination.
● Ablation validates design: Removing components like Deep Web Explorer or automatic report drafting significantly degraded performance, confirming their necessity. | Paper | | 7) Reward Modeling as Reasoning - This work proposes a new class of reward models, called ReasRMs, that reformulate reward modeling as a reasoning task. The authors introduce RM-R1, a family of generative reward models that produce interpretable reasoning traces and rubrics during preference judgments. Instead of relying on scalar scores or shallow generation, RM-R1 models leverage structured reasoning and reinforcement learning to improve both interpretability and performance across benchmarks.
● RM-R1 adopts a two-stage training process: (1) distillation of reasoning traces from stronger models, and (2) reinforcement learning with verifiable rewards. The Chain-of-Rubrics (CoR) prompting framework guides the model to either solve reasoning problems or generate evaluation rubrics depending on the task type (reasoning or chat).
● On RewardBench, RM-Bench, and RMB, RM-R1 models achieve state-of-the-art or near-SOTA performance, outperforming models like GPT-4o and Llama3.1-405B by up to 13.8% despite using fewer parameters and less data.
● Ablation studies show that cold-start RL alone is insufficient; task-type classification and high-quality distillation are key. RM-R1's distilled warm-start training leads to more stable learning and longer, more accurate reasoning traces.
● RM-R1 also shows strong generalization across domains and better rubric quality than baseline methods, especially in sensitive contexts like safety and medical judgment. The authors open-sourced six RM-R1 models, training data, and code to support reproducibility. | Paper | | 8) Paper2Code - Introduces PaperCoder, a multi-agent LLM framework that transforms ML papers into full code repositories without relying on pre-existing implementations.
● PaperCoder decomposes the code generation process into three stages: Planning (roadmap, architecture, file dependencies, config files), Analyzing (file-specific logic extraction), and Coding (dependency-aware file generation). Each step is handled by specialized LLM agents.
● It is evaluated using both the proposed Paper2Code benchmark (90 papers from ICML, NeurIPS, and ICLR 2024) and PaperBench Code-Dev. Results show PaperCoder outperforms ChatDev, MetaGPT, and naive baselines across reference-based, reference-free, and human evaluations.
● In human assessments by original paper authors, 77% chose PaperCoder as best implementation; 85% said it helped them reproduce their work. On average, only 0.48% of code lines required changes for executability.
● A detailed ablation study shows consistent performance gains from each stage, especially logic design and file dependency ordering. PaperCoder, using the o3-mini-high backbone, notably outperforms other LLM variants. | Paper | | 9) ZeroSearch - ZeroSearch is an RL framework that trains LLMs to develop search capabilities without using real search engines. It uses simulated LLM-generated documents with a curriculum-based degradation strategy and outperforms real-search methods like Search-R1 in both performance and cost, achieving better QA accuracy across multiple benchmarks. | Paper, Tweet | | 10) Practical Efficiency of Muon for Pretraining - Discusses how Muon, a simple second-order optimizer, outperforms AdamW in large-batch pretraining by expanding the compute-time Pareto frontier and maintaining better data efficiency. Combined with muP scaling and a novel telescoping algorithm for hyperparameter transfer, it enables faster training with minimal tuning overhead up to 4B parameter models. | Paper |

Top AI Papers of the Week (April 28 - May 4) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Phi-4-Mini-Reasoning - Microsoft released Phi-4-Mini-Reasoning to explore small reasoning language models for math. Highlights:
● Phi-4-Mini-Reasoning: The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly twice its size.
● Unlocking Reasoning: They use a systematic, multi-stage training pipeline to unlock strongbr> reasoning capabilities in compact models, addressing the challenges posed by their limited capacity. Uses large-scale distillation, preference learning, and RL with verifiable rewards.
● Four-Stage Training Pipeline: The model is trained using (1) mid-training with large-scale long CoT data, (2) supervised fine-tuning on high-quality CoT data, (3) rollout-based Direct Preference Optimization (DPO), and (4) RL using verifiable reward signals.
● Math Performance: On MATH-500, Phi-4-Mini-Reasoning reaches 94.6%, surpassing DeepSeek-R1-Distill-Qwen-7B (91.4%) and DeepSeek-R1-Distill-Llama-8B (86.9%), despite being smaller.
● Verifiable Reward Reinforcement Learning: The final RL stage, tailored for small models, includes prompt filtering, oversampling for balanced training signals, and temperature annealing. This improves training stability and aligns exploration with evaluation conditions.
● Massive Synthetic Data Generation: The model is mid-trained on 10M CoT rollouts generated by DeepSeek-R1, filtered for correctness using math verifiers and GPT-4o-mini, and categorized by domain and difficulty to ensure broad generalization.
● Ablation Study: Each phase of the pipeline shows clear gains. Notably, fine-tuning and RL each deliver ~5–7 point improvements after mid-training and DPO, showing the value of the full pipeline over isolated techniques. | Paper, Tweet | | 2) Building Production-Ready AI Agents with Scalable Long-Term Memory - This paper proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, solving the fixed-context window limitation. Main highlights:
● The solution introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships. Both aim to extract, consolidate, and retrieve salient facts over time efficiently.
● Mem0: Uses a two-stage architecture (extraction & update) to maintain salient conversational memories. It detects redundant or conflicting information and manages updates using tool-calls, resulting in a lightweight, highly responsive memory store (7K tokens per conversation).
● Mem0g: By structuring memory as a knowledge graph of entities and relationships, Mem0g improves performance in tasks needing temporal and relational reasoning (e.g., event ordering, preference tracking) while maintaining reasonable latency and memory cost (14K tokens/convo).
● Benchmarking on LOCOMO: Both systems were evaluated against six memory system baselines (e.g., A-Mem, OpenAI, Zep, LangMem, RAG). Mem0g achieves the best overall LLM-as-a-Judge (J) score of 68.44%, outperforming all RAG and memory baselines by 7–28% in J and reducing p95 latency by 91% over full-context methods.
● Latency and efficiency: Mem0 achieves the lowest search and total latencies (p95 = 1.44s), and Mem0g still outperforms other graph-based or RAG systems by large margins in speed and efficiency. Great for real-time deployments.
● Use-case strengths: Mem0 and Mem0g offer a scalable memory architecture for long-term LLM agents to improve factual recall, reasoning depth, and efficiency, making them id | Paper, Tweet | | 3) UniversalRAG - UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to single modalities or corpora. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video). Contributions from the paper:
● Modality-aware routing: To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query.
● Granularity-aware retrieval: Each modality is broken into granularity levels (e.g., paragraphs vs. documents for text, clips vs. full-length videos). This allows queries to retrieve content that matches their complexity -- factual queries use short segments while complex reasoning accesses long-form data.
● Flexible routing: It supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers. Trained routers perform better on in-domain data, while GPT-4o generalizes better to out-of-domain tasks. An ensemble router combines both for robust performance.
● Performance: UniversalRAG outperforms modality-specific and unified RAG baselines across 8 benchmarks spanning text (e.g., MMLU, SQuAD), image (WebQA), and video (LVBench, VideoRAG). With T5-Large, it achieves the highest average score across modalities.
● Case study: In WebQA, UniversalRAG correctly routes a visual query to the image corpus (retrieving an actual photo of the event), while TextRAG and VideoRAG fail. Similarly, on HotpotQA and LVBench, it chooses the right granularity, retrieving documents or short clips. Overall, this is a great paper showing the importance of considering modality and granularity in a RAG system. | Paper, Tweet | | 4) DeepSeek-Prover-V2 - DeepSeek-Prover-V2 is an LLM (671B) that significantly advances formal theorem proving in Lean 4. The model is built through a novel cold-start training pipeline that combines informal chain-of-thought reasoning with formal subgoal decomposition, enhanced through reinforcement learning. It surpasses prior state-of-the-art on multiple theorem-proving benchmarks. Key highlights:
● Cold-start data via recursive decomposition: The authors prompt DeepSeek-V3 to generate natural-language proof sketches, decompose them into subgoals, and formalize these steps in Lean with sorry placeholders. A 7B prover model then recursively fills in the subgoal proofs, enabling efficient construction of complete formal proofs and training data.
● Curriculum learning + RL: A subgoal-based curriculum trains the model on increasingly complex problems. Reinforcement learning with a consistency reward is used to enforce alignment between proof structure and CoT decomposition, improving performance on complex tasks.
● Dual proof generation modes: The model is trained in two modes, non-CoT (efficient, minimal proofs) and CoT (high-precision, interpretable). The CoT mode yields significantly better performance, particularly on hard problems.
● Benchmark results: | Paper, Tweet | | 5) Kimi-Audio - Kimi-Audio is a new open-source audio foundation model built for universal audio understanding, generation, and speech conversation. The model architecture uses a hybrid of discrete semantic audio tokens and continuous Whisper-derived acoustic features. It is initialized from a pre-trained LLM and trained on 13M+ hours of audio, spanning speech, sound, and music. It also supports a streaming detokenizer with chunk-wise decoding and a novel look-ahead mechanism for smoother audio generation. Extensive benchmarking shows that Kimi-Audio outperforms other audio LLMs across multiple modalities and tasks. Key highlights:
● Architecture: Kimi-Audio uses a 12.5Hz semantic tokenizer and an LLM with dual heads (text + audio), processing hybrid input (discrete + continuous). The audio detokenizer employs a flow-matching upsampler with BigVGAN vocoder for real-time speech synthesis.
● Massive Training Corpus: Pretrained on 13M+ hours of multilingual, multimodal audio. A rigorous preprocessing pipeline adds speech enhancement, diarization, and transcription using Whisper and Paraformer-Zh. Fine-tuning uses 300K+ hours from 30+ open datasets.
● Multitask Training: Training spans audio-only, text-only, ASR, TTS, and three audio-text interleaving strategies. Fine-tuning is instruction-based, with both audio/text instructions injected via zero-shot TTS.
● Evaluation: On ASR (e.g., LibriSpeech test-clean: 1.28 WER), audio understanding (CochlScene: 80.99), and audio-to-text chat (OpenAudioBench avg: 69.8), Kimi-Audio sets new SOTA results, beating Qwen2.5-Omni and Baichuan-Audio across the board. | Paper, Tweet Model | | 6) MiMo-7B - Xiaomi releases MiMo-7B, a new language model for reasoning tasks. MiMo-7B is explicitly designed for advanced reasoning across math and code. Highlights:
● MiMo-7B: MiMo-7B narrows the capability gap with larger 32B-class models through careful pretraining & posttraining. MiMo-7B-Base is trained from scratch on 25T tokens, with a 3-stage mixture skewed toward mathematics and code (70% in stage 2).
● Pre-Training: The team improves HTML and PDF extraction to better preserve STEM data, leverages LLMs to generate diverse synthetic reasoning content, and adds a Multi-Token Prediction (MTP) objective that boosts both quality and inference speed.
● Base Performance: MiMo-7B-Base outperforms other 7B–9B models like Qwen2.5, Gemma-2, and Llama-3.1 across BBH (+5 pts), AIME24 (+22.8 pts), and LiveCodeBench (+27.9 pts). On BBH and LiveCodeBench, it even beats larger models on reasoning-heavy tasks.
● RL: MiMo-7B-RL is trained with a test difficulty–driven reward function and easy-data resampling to tackle sparse-reward issues and instabilities. In some cases, it surpasses o1-mini on math & code. RL from the SFT model reaches higher ceilings than RL-Zero from the base.
● Efficient infrastructure: A Seamless Rollout Engine accelerates RL by 2.29× and validation by 1.96× using continuous rollout, async reward computation, and early termination. MTP layers enable fast speculative decoding, with 90%+ acceptance rates in inference. | Paper, Tweet | | 7) Advances and Challenges in Foundation Agents - A new survey frames intelligent agents with a modular, brain-inspired architecture that integrates ideas from cognitive science, neuroscience, and computational research. Key topics covered:
● Human Brain and LLM Agents: Helps to better understand what differentiates LLM agents from human/brain cognition, and what inspirations we can get from the way humans learn and operate.
● Definitions: Provides a nice, detailed, and formal definition of what makes up an AI agent.
● Reasoning: It has a detailed section on the core components of intelligent agents. There is a deep dive into reasoning, which is one of the key development areas of AI agents and what unlocks things like planning, multi-turn tooling, backtracking, and much more.
● Memory: Agent memory is a challenging area of building agentic systems, but there is already a lot of good literature out there from which to get inspiration.
● Action Systems: You can already build very complex agentic systems today, but the next frontier is agents that take actions and make decisions in the real world. We need better tooling, better training algorithms, and robust operation in different action spaces.
● Self-Evolving Agents: For now, building effective agentic systems requires human effort and careful optimization tricks. However, one of the bigger opportunities in the field is to build AI that can itself build powerful and self-improving AI systems. | Paper, Tweet | | 8) MAGI - MAGI is a multi-agent system designed to automate structured psychiatric interviews by operationalizing the MINI (Mini International Neuropsychiatric Interview) protocol. It involves 4 specialized agents: navigation, question generation, judgment, and diagnosis. Other highlights:
● Multi-Agent Clinical Workflow: MAGI is built with a navigation agent (interview flow control), a question agent (dynamic, empathetic probing), a judgment agent (response validation), and a diagnosis agent using Psychometric CoT to trace diagnoses explicitly to MINI/DSM-5 criteria.
● Explainable Reasoning (PsyCoT): Instead of treating diagnoses as opaque outputs, PsyCoT decomposes psychiatric reasoning into symptom anchoring, syndromal validation, and evidence binding. This helps with auditability for each diagnostic conclusion. CoT put to great use.
● Results: Evaluated on 1,002 real-world interviews, MAGI outperforms baselines (Direct prompting, Role-play, Knowledge-enhanced, and MINI-simulated LLMs) across relevance, accuracy, completeness, and guidance.
● Strong Clinical Agreement: Diagnostic evaluations show PsyCoT consistently improves F1 scores, accuracy, and Cohen’s κ across disorders like depression, generalized anxiety, social anxiety, and suicide risk, reaching clinical-grade reliability (κ 0.8) in high-risk tasks. | Paper, Tweet | | 9) A Survey of Efficient LLM Inference Serving - This survey reviews recent advancements in optimizing LLM inference, addressing memory and computational bottlenecks. It covers instance-level techniques (like model placement and request scheduling), cluster-level strategies (like GPU deployment and load balancing), and emerging scenario-specific solutions, concluding with future research directions. | Paper | | 10) LLM for Engineering - This work finds that when RL is used, a 7B parameter model outperforms both SoTA foundation models and human experts at high-powered rocketry design. | Paper |

Top AI Papers of the Week (April 21 - April 27) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Does RL Incentivize Reasoning in LLMs Beyond the Base Model? - This paper revisits a key assumption in recent LLM development: that Reinforcement Learning with Verifiable Rewards (RLVR) helps models acquire genuinely new reasoning capabilities. By analyzing models across tasks (math, code, vision) using pass@k metrics (with large k), the authors find that RLVR improves sample efficiency but does not expand reasoning capacity beyond the base model.
● Key insight: RLVR-trained models do better at low k (e.g., pass@1), but as k increases (up to 256 or more), base models eventually match or outperform them. This suggests RLVR doesn’t generate fundamentally new reasoning paths but just increases the likelihood of sampling already-existing correct ones.
● Reasoning already in the base: RLVR models' successful CoTs are shown to be present within the base model's sampling distribution. Perplexity analyses confirm that RL outputs are often high-probability continuations for the base model.
● Efficiency vs. exploration: RLVR narrows the model’s exploration space, improving efficiency but shrinking its coverage of diverse reasoning paths, thereby reducing overall problem-solving reach at scale.
● Distillation helps more: Unlike RLVR, distillation from a stronger teacher model (e.g., DeepSeek-R1) introduces genuinely new reasoning patterns, expanding the model’s capabilities.
● Algorithmic limits: Across PPO, GRPO, Reinforce++, etc., RL algorithms offer similar sample-efficiency improvements, but none closes the gap to the base model’s pass@256—highlighting the limits of current RL strategies. | Paper, Tweet | | 2) BitNet b1.58 2B4T - This work introduces BitNet b1.58 2B4T, the first open-source, natively trained 1-bit LLM at the 2B parameter scale, achieving strong performance while being extremely efficient. The model uses a custom ternary quantization scheme (1.58 bits per weight), enabling dramatic reductions in memory (0.4 GB), energy (0.028J/token), and latency (29ms), while still competing with state-of-the-art full-precision models across diverse benchmarks.
● New Pareto frontier in efficiency-performance: Trained from scratch on 4T tokens, BitNet b1.58 2B4T outperforms or matches open full-precision models (e.g., Qwen2.5 1.5B, MiniCPM 2B) on tasks like ARC-Challenge, PIQA, WinoGrande, and GSM8K. It achieves 54.19% average. across 16 benchmarks, comparable to Qwen2.5-1.5B’s 55.23%, but with ~6.5× lower memory and 10× lower energy usage.
● Outperforms quantized baselines: Against INT4 post-training quantized Qwen2.5 models (GPTQ/AWQ), BitNet is both smaller and more accurate, showing the advantage of native 1-bit training over PTQ approaches.
● Architectural & training innovations: It replaces standard linear layers with BitLinear layers using absmean ternary quantization and 8-bit activations, combines RoPE embeddings, squared ReLU activation, and bias-free layers. Training includes cosine LR and weight decay schedules, plus supervised fine-tuning and Direct Preference Optimization (DPO) instead of full RLHF.
● Best-in-class among 1-bit LLMs: When compared to other 1-bit models like OLMo-Bitnet (1B) and post-quantized Falcon3/Llama3 (7B–8B), BitNet b1.58 2B4T is +10 pts stronger on average, establishing a new benchmark for ultra-efficient LLMs. The authors also release optimized CUDA kernels for GPU and a C++ inference library for CPU, enabling practical deployment of 1-bit LLMs on diverse hardware. BitNet b1.58 2B4T demonstrates that extreme quantization does not mean compromised capability, and it opens the door to the broader adoption of LLMs in resource-constrained environments. | Paper | | 3) UI-TARS - UI-TARS introduces a powerful, end-to-end native GUI agent that operates purely from visual screenshots, performing human-like keyboard and mouse interactions across platforms. Unlike existing modular agent frameworks that rely on prompt engineering and external scripts, UI-TARS integrates perception, action, reasoning, and memory directly into its architecture, achieving strong generalization and adaptability in dynamic real-world settings. Key contributions:
● Enhanced GUI Perception: UI-TARS is trained on a large-scale, richly annotated dataset of screenshots with metadata, enabling dense captioning, state transition understanding, and precise element description. It excels in perception benchmarks like VisualWebBench, scoring 82.8, outperforming GPT-4o’s.
● Unified Action Modeling and Grounding: UI-TARS standardizes actions across platforms into a shared action space and learns from large-scale multi-step action traces. It surpasses baselines in grounding tasks with 38.1 on ScreenSpot Pro, the new SOTA.
● System-2 Reasoning via “Thoughts”: Inspired by ReAct-style frameworks, UI-TARS generates internal reasoning steps (thoughts) before actions. These thoughts reflect patterns like task decomposition, reflection, and long-term consistency, significantly improving performance in complex scenarios. For example, in OSWorld, UI-TARS-72B-DPO scores 24.6 with a 50-step budget, outperforming Claude’s.
● Iterative Self-Improvement with Reflective Learning: UI-TARS continuously refines itself through online trace collection and reflection tuning using error correction and post-error adaptation data. This allows it to recover from mistakes and adapt with minimal human oversight. Overall, UI-TARS marks a significant step forward in GUI automation, setting new benchmarks across more than 10 datasets and outperforming top commercial agents like GPT-4o and Claude. Its open-source release aims to drive further innovation in native agent development. | Paper, Blog | | 4) Describe Anything - Introduces DAM, a model that generates fine-grained, region-specific captions in both images and videos. The authors address key limitations in prior vision-language models—namely, the inability to preserve local detail and the lack of suitable datasets and benchmarks for detailed localized captioning (DLC). Key contributions:
● DAM (Describe Anything Model) uses two main innovations to capture both fine regional detail and global scene context: a focal prompt that provides high-resolution encoding of user-specified regions, and a localized vision backbone that uses gated cross-attention to integrate context from the entire image. This enables DAM to generate multi-granular, accurate descriptions, especially for small or occluded regions.
● DLC-SDP (Semi-supervised Data Pipeline) tackles data scarcity by expanding segmentation datasets with VLM-generated detailed captions, followed by self-training on web images. This produces high-quality, diverse training data, enabling DAM to outperform API-only baselines like GPT-4o across several benchmarks.
● DLC-Bench is a reference-free benchmark that scores models on their ability to accurately include or exclude region-specific details using LLM judges. It provides a more reliable evaluation than traditional caption-matching metrics, which often penalize models for valid but unmatched details.
● Performance: DAM sets a new state-of-the-art on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning tasks in both images and videos. It outperforms GPT-4o, Claude 3.7, and other top VLMs in both zero-shot and in-domain evaluations, achieving up to 33.4% improvement over prior models on detailed image captioning and 19.8% on video captioning. | Paper | | 5) UXAgent - Introduces a novel framework, UXAgent, for simulating large-scale usability testing using LLM-driven agents. The system empowers UX researchers to test and iterate web design and study protocols before engaging real users. This is achieved through the orchestration of simulated agents with diverse personas interacting in real web environments, providing both behavioral and reasoning data. Key highlights:
● LLM-Powered Simulation with Personas: UXAgent begins with a Persona Generator that can produce thousands of demographically diverse simulated users based on custom distributions. Each persona is fed into an LLM Agent that embodies user intent and interacts with the website via a Universal Browser Connector—a module capable of interpreting and manipulating real HTML structures.
● Dual-Loop Reasoning Architecture: At the heart of UXAgent is a dual-process agent architecture inspired by cognitive psychology: a Fast Loop for low-latency actions and a Slow Loop for deep reasoning. This design mimics System 1 and System 2 thinking and allows agents to act responsively while maintaining coherent high-level plans and reflections.
● Rich Memory Stream: All observations, actions, plans, reflections, and spontaneous thoughts (“wonders”) are stored in a Memory Stream. These memories are dynamically prioritized for retrieval using a weighted scoring system based on importance, recency, and relevance, tailored separately for fast and slow modules.
● Replay and Interview Interfaces: UX researchers can review simulated sessions via a Simulation Replay Interface and conduct natural language conversations with agents using an Agent Interview Interface. This supports qualitative analysis, such as asking agents about their decisions or presenting mockups for feedback.
● Empirical Evaluation: A case study involving 60 LLM agent simulations on a shopping platform (WebArena) showed that researchers were able to detect usability study flaws and gather early insights. A follow-up user study with five UX professionals found the system helpful for iterating study design, despite some concerns over realism and data noise. Particularly appreciated was the ability to converse with agents and gather qualitative insights that would be infeasible in traditional pilots.
● Future Implications: The authors position LLM agents not as replacements for real participants, but as early-stage collaborators in the design process, reducing the cost and risk of flawed studies. They also discuss extensions to multimodal settings, desktop or mobile interfaces, and broader agentic tasks such as digital twins or simulated A/B testing. | Paper | | 6) Test-Time Reinforcement Learning - Test-Time Reinforcement Learning (TTRL) is a method that allows LLMs to improve themselves during inference without ground-truth labels. Instead of relying on labeled datasets, TTRL uses majority voting over multiple model generations to estimate pseudo-rewards, enabling reinforcement learning (RL) on unlabeled test data. The method integrates Test-Time Scaling (TTS) and Test-Time Training (TTT) strategies, letting models adapt dynamically to new and challenging inputs. Key highlights:
● Majority Voting as Reward: TTRL generates multiple candidate outputs for a query and uses majority voting to derive a pseudo-label. Rewards are assigned based on agreement with the consensus answer.
● Significant Performance Gains: Applying TTRL to Qwen2.5-Math-7B leads to a +159% improvement on AIME 2024 and +84% average gains across AIME, AMC, and MATH-500 benchmarks, without using any labeled training data.
● Self-Evolution Beyond Supervision: Remarkably, TTRL surpasses the performance ceiling of its own majority-vote supervision (Maj@N) and approaches the performance of models trained with full label leakage, indicating efficient and stable unsupervised RL.
● Generalization and Robustness: TTRL generalizes well across tasks, maintains effectiveness even under label estimation noise, and is compatible with different RL algorithms like PPO and GRPO.
● Limitations: TTRL may fail when the base model lacks sufficient prior knowledge about the domain or when hyperparameters (like batch size and temperature) are poorly tuned. | Paper | | 7) Discovering Values in Real-World Language Model Interactions - This paper presents the first large-scale empirical analysis of values exhibited by a deployed AI assistant, Claude 3 and 3.5 models, using over 300,000 real-world conversations. The authors develop a bottom-up, privacy-preserving framework to extract, classify, and analyze AI-expressed normative considerations (“values”) and show how they vary across tasks, user values, and conversational contexts.
● The authors identify 3,307 unique AI values, which are organized into a five-domain taxonomy: Practical, Epistemic, Social, Protective, and Personal. Practical and epistemic values dominate, often aligning with Claude’s training goals around being helpful, harmless, and honest.
● Claude’s most common values, such as helpfulness (23.4%), professionalism, transparency, and clarity, are context-invariant and reflect its role as a service-oriented assistant. In contrast, human values like authenticity and efficiency are more varied.
● Many values are context-specific. For example, healthy boundaries arise in relationship advice, historical accuracy in controversial event discussions, and human agency in AI governance contexts.
● Claude tends to mirror human values in supportive contexts (20.1% mirroring rate), but expresses opposing values during resistance, especially in cases involving unethical or policy-violating requests (e.g., resisting “moral nihilism” with “ethical integrity”).
● Explicit value expression (e.g., “I value transparency”) occurs more often in moments of resistance or reframing, particularly around epistemic and ethical principles like intellectual honesty and harm prevention. This suggests that AI values become most visible when the system is challenged.
● Across Claude variants, 3 Opus expresses more emotionally nuanced and ethically grounded values (e.g., academic rigor, emotional authenticity) and shows a stronger inclination for both support and resistance compared to 3.5/3.7 Sonnet. | Paper, Tweet | | 8) Evaluate the Goal-Directedness of LLMs - Introduces a new framework to assess whether LLMs use their capabilities effectively toward achieving given goals. The study finds that even top models like GPT-4o and Claude 3.7 fall short of full goal-directedness, particularly in information-gathering and combined tasks, despite performing well in isolated subtasks. | Paper, Tweet, GitHub | | 9) General-Reasoner - General-Reasoner is a reinforcement learning approach that boosts LLM reasoning across diverse domains by using a 230K-question dataset and a model-based verifier trained to understand semantics beyond exact matches. It outperforms strong baselines like SimpleRL and Qwen2.5 on both general reasoning (MMLU-Pro, GPQA, SuperGPQA) and math tasks (MATH-500, GSM8K), showing over 10-point gains without sacrificing mathematical capability. | Paper, Tweet | | 10) Tiny Reasoning Models - Tina is a family of 1.5B parameter reasoning models trained using LoRA-based reinforcement learning (RL) to achieve high reasoning accuracy at very low cost. It outperforms or matches full fine-tuned models on reasoning tasks like AIME and MATH with only ~$9 post-training cost, demonstrating that efficient reasoning can be instilled via minimal updates to a tiny model. | Paper |

Top AI Papers of the Week (April 14 - April 20) - 2025

| Paper | Links | | ------------- | ------------- | | 1) GUI-R1 - Researchers from the National University of Singapore and the Chinese Academy of Sciences introduce GUI-R1, a reinforcement learning (RL) framework aimed at improving graphical user interface (GUI) agents through unified action-space modeling. Key insights include:
● Reinforcement Fine-Tuning (RFT) over Supervised Fine-Tuning (SFT) – GUI-R1 utilizes RFT inspired by methods such as DeepSeek-R1, significantly reducing training data requirements. It uses only 3K carefully curated examples versus millions used by previous models.
● Unified Action Space and Reward Modeling – The authors introduce a unified action space that covers actions across different platforms (Windows, Linux, MacOS, Android, and Web). This enables consistent reward signals for evaluating GUI actions, enhancing the model’s adaptability and generalization.
● Superior Performance with Minimal Data – GUI-R1 outperforms state-of-the-art methods like OS-Atlas using merely 0.02% of the training data (3K vs. 13M). Evaluations across eight benchmarks spanning mobile, desktop, and web platforms show significant improvements in grounding, low-level, and high-level GUI task capabilities.
● Efficient Training and Strong Generalization – By leveraging policy optimization algorithms like Group Relative Policy Optimization (GRPO), GUI-R1 quickly converges to high performance, demonstrating robustness and efficiency even in resource-constrained scenarios. | Paper | | 2) Scaling Reasoning in Diffusion LLMs via RL - Proposes d1, a two‑stage recipe that equips masked diffusion LLMs with strong step‑by‑step reasoning.
● Two‑stage pipeline (SFT → diffu‑GRPO) – d1 first applies supervised fine‑tuning on the 1 k‑example s1K dataset and then runs task‑specific RL with the new diffu‑GRPO objective, yielding larger gains than either stage alone.
● diffu‑GRPO: RL for masked dLLMs – Extends GRPO to diffusion LLMs via (i) a mean‑field sequence‑log‑prob approximation and (ii) a one‑step per‑token log‑prob estimator with random prompt masking, enabling many gradient updates from a single generation.
● Consistent gains on four reasoning benchmarks – On GSM8K, MATH500, Countdown, and Sudoku, diffu‑GRPO beats SFT, and the full d1‑LLaDA variant attains the best scores (e.g., 81.1 % GSM8K & 38.6 % MATH500 at 256 tokens, +5–12 pp over baseline).
● Competitive among 7‑8 B models – d1‑LLaDA outperforms DeepSeek‑7B, Mistral‑7B and Llama‑3‑8B on GSM8K and ranks second on MATH500 in the same size class.
● Longer decoding unlocks “aha moments” – At 512‑token generation, the model shows self‑verification/backtracking; effective‑token usage grows smoothly, echoing test‑time compute scaling trends.
● Random masking speeds RL – Ablations show that random prompt masking during diffu‑GRPO accelerates convergence and boosts correctness relative to fixed masking, with fewer online generations needed. | Paper | | 3) Enhancing Non-Reasoning Models with Reasoning Models - Researchers explore how to distill reasoning-intensive outputs (answers and explanations) from top-tier LLMs into more lightweight models that don’t explicitly reason step by step. By fine-tuning smaller models on the high-quality final answers (and optionally summarized thinking traces) from advanced reasoning models, they demonstrate consistent performance boosts across multiple benchmarks.
● Test-time scaling vs. knowledge distillation – While large models like DeepSeek-R1 and OpenAI-o1 can allocate more compute to generate better reasoning traces, this paper focuses on systematically transferring those rich final answers (and possibly a summarized version of the reasoning steps) to more compact models.
● Data curation – The authors construct a 1.3M-instance dataset by pulling prompts from multiple open-source repositories (including Infinity Instruct, CodeContests, FLAN, etc.) and generating final answers plus detailed reasoning from DeepSeek-R1.
● Three fine-tuning strategies – (1) Use the original baseline answers from existing open-source sets, (2) fine-tune on only the final answer portion of a reasoning model, and (3) combine a summarized chain-of-thought with the final answer. Models trained on the second strategy excelled at math/coding tasks, while the third approach proved better for more conversational or alignment-oriented tasks.
● Empirical gains – Fine-tuning Qwen2.5-32B on the reasoning model’s final answers led to notable improvements on GSM8K (92.2%) and HumanEval (90.9%). A think-summarization approach boosted a different set of benchmarks (GPQA and chat-based tests). However, weaving in the “thinking trace” sometimes caused slight drops in instruction strictness (IFEval).
● Trade-offs and future work – Distilling advanced reasoning data definitely helps smaller models, but deciding how much of the reasoning trace to include is domain-dependent. The authors suggest that more refined ways of seamlessly blending reasoning steps into final answers (e.g., specialized prompts or partial merges) could further improve performance and avoid alignment regressions. | Paper | | 4) AgentA/B - AgentA/B is a fully automated A/B testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors on actual web environments, enabling faster, cheaper, and risk-free UX evaluations — even on real websites like Amazon. Key Insights:
● Modular agent simulation pipeline – Four components—agent generation, condition prep, interaction loop, and post-analysis—allow plug-and-play simulations on live webpages using diverse LLM personas.
● Real-world fidelity – The system parses live DOM into JSON, enabling structured interaction loops (search, filter, click, purchase) executed via LLM reasoning + Selenium.
● Behavioral realism – Simulated agents show more goal-directed but comparable interaction patterns vs. 1M real Amazon users (e.g., shorter sessions but similar purchase rates).
● Design sensitivity – A/B test comparing full vs. reduced filter panels revealed that agents in the treatment condition clicked more, used filters more often, and purchased more.
● Inclusive prototyping – Agents can represent hard-to-reach populations (e.g., low-tech users), making early-stage UX testing more inclusive and risk-free.
● Notable results AgentA/B shows how LLM agents can augment — not replace — traditional A/B testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic. | Paper | | 5) Reasoning Models Can Be Effective Without Thinking - This paper challenges the necessity of long chain-of-thought (CoT) reasoning in LLMs by introducing a simple prompting method called NoThinking, which bypasses explicit "thinking" steps. Surprisingly, NoThinking performs comparably to or better than traditional reasoning under comparable or even lower compute budgets, especially when paired with parallel decoding and best-of-N selection. Key Insights:
● NoThinking prepends a dummy “Thinking” block and jumps straight to final answers.
● Despite skipping structured reasoning, it outperforms Thinking in pass@k (1–64) on many benchmarks, especially under token constraints.
● With parallel scaling, NoThinking achieves higher pass@1 accuracy than Thinking while using 4× fewer tokens and up to 9× lower latency.
● Tasks evaluated: competitive math (AIME24/25, AMC23, OlympiadBench), coding (LiveCodeBench), and formal theorem proving (MiniF2F, ProofNet).
● NoThinking is shown to provide superior accuracy–latency tradeoffs and generalizes across diverse tasks. Results:
● Low-budget wins: On AMC23 (700 tokens), NoThinking achieves 51.3% vs. 28.9% (Thinking).
● Better scaling: As k increases, NoThinking consistently surpasses Thinking.
● Efficiency frontier: Across benchmarks, NoThinking dominates the accuracy–cost Pareto frontier.
● Parallel wins: With simple confidence-based or majority vote strategies, NoThinking + best-of-N beats full Thinking on pass@1 with significantly less latency. | Paper | | 6) SocioVerse - Researchers from Fudan University and collaborators propose SocioVerse, a large-scale world model for social simulation using LLM agents aligned with real-world user behavior. Key ideas include:
● Four-fold alignment framework – SocioVerse tackles major challenges in aligning simulated environments with reality across four dimensions:
● Three representative simulations – SocioVerse showcases its generalizability through:
● Impressive empirical accuracy –
● Ablation insights – Removing prior demographic distribution and user knowledge severely degrades election prediction accuracy (Acc drops from 0.80 → 0.60), highlighting the value of realistic population modelingpapersoftheweek.
● Toward trustworthy virtual societies – SocioVerse not only standardizes scalable social simulations but also provides a sandbox for testing sociopolitical hypotheses (e.g., fairness, policy change), bridging AI agent systems with traditional social science. | Paper | | 7) DocAgent - Researchers from Meta AI present DocAgent, a tool‑integrated, dependency‑aware framework that turns large, complex codebases into well‑written docstrings. Key ideas include:
● Topological Navigator for context building – DocAgent parses the repository’s AST, builds a dependency DAG, and documents components in topological order, so each function/class is visited only after its prerequisites, enabling incremental context accumulation and preventing context‑length explosions.
● Role‑specialised agent team – Five agents work together: Reader analyses code, Searcher gathers internal & external references, Writer drafts docstrings, Verifier critiques and revises them, while the Orchestrator manages iterations until quality converges.
● Adaptive context management – When retrieved context exceeds the model’s token budget, the Orchestrator trims low‑priority segments while preserving overall structure, keeping generation efficient and faithful2504.08725v1.
● Three‑facet automatic evaluation – A new framework scores Completeness (section coverage), Helpfulness (LLM‑as‑judge semantic utility), and Truthfulness (entity grounding against the code DAG) for every docstring.
● Substantial gains over baselines – On 366 components across nine Python repos, DocAgent + GPT‑4o‑mini lifts Completeness to 0.934 vs 0.815, Helpfulness to 3.88 / 5 vs 2.95, and Truthfulness (existence ratio) to 95.7 % vs 61.1 % compared with a Chat‑GPT baseline; FIM baselines fare far worse.
● Navigator is crucial – An ablation that randomises processing order drops helpfulness by ‑0.44 and truthfulness by ‑7.9 pp, confirming the importance of dependency‑aware traversal. | Paper | | 8) SWE-PolyBench - SWE-PolyBench is a new multi-language benchmark for evaluating coding agents on real-world software tasks across Java, JavaScript, TypeScript, and Python. It introduces execution-based assessments, syntax tree metrics, and reveals that current agents struggle with complex tasks and show inconsistent performance across languages. | Paper | | 9) A Survey of Frontiers in LLM Reasoning - This survey categorizes LLM reasoning methods by when reasoning occurs (inference-time vs. training) and the system's architecture (standalone vs. agentic or multi-agent). It highlights trends like learning-to-reason (e.g., DeepSeek-R1) and agentic workflows (e.g., OpenAI Deep Research), covering prompt engineering, output refinement, and learning strategies such as PPO and verifier training. | Paper | | 10) Advances in Embodied Agents, Smart Cities, and Earth Science - This paper surveys how spatial intelligence manifests across disciplines—from embodied agents to urban and global systems—by connecting human spatial cognition with how LLMs handle spatial memory, representations, and reasoning. It offers a unifying framework to bridge research in AI, robotics, urban planning, and earth science, highlighting LLMs’ evolving spatial capabilities and their interdisciplinary potential. | Paper |

Top AI Papers of the Week (April 6 - April 13) - 2025

Paper Links
1) The AI Scientist V2 - The AI Scientist-v2 refines and extends its predecessor to achieve a new milestone: autonomously generating a workshop-accepted research manuscript. The system removes dependencies on human-authored code templates, incorporates agentic tree-search methods for deeper exploration, uses Vision-Language Models to refine figures, and demonstrates impressive real-world outcomes by passing the peer-review bar.
● Enhanced Autonomy – Eliminates reliance on human-crafted code templates, enabling out-of-the-box deployment across diverse ML domains.
● Agentic Tree Search – Systematically searches and refines hypotheses through a branching exploration, managed by a new experiment manager agent.
● VLM Feedback Loop – Integrates Vision-Language Models in the reviewing process to critique and improve experimental figures and paper aesthetics.
● Workshop Acceptance – Generated three fully autonomous manuscripts for an ICLR workshop; one was accepted, showcasing the feasibility of AI-driven end-to-end scientific discovery.
Paper, Tweet
2) Benchmarking Browsing Agents - OpenAI introduces BrowseComp, a benchmark with 1,266 questions that require AI agents to locate hard-to-find, entangled information on the web. Unlike saturated benchmarks like SimpleQA, BrowseComp demands persistent and creative search across numerous websites, offering a robust testbed for real-world web-browsing agents. Key insights:
● Extremely difficult questions: Benchmarked tasks were verified to be unsolvable by humans in under 10 minutes and also by GPT-4o (with/without browsing), OpenAI o1, and earlier Deep Research models.
● Human performance is low: Only 29.2% of problems were solved by humans (even with 2-hour limits). 70.8% were abandoned.
● Model performance:
● Test-time scaling matters: Accuracy improves with more browsing attempts. With 64 parallel samples and best-of-N aggregation, Deep Research significantly boosts its performance (15–25% gain over a single attempt).
● Reasoning browsing: OpenAI o1 (no browsing but better reasoning) outperforms GPT-4.5 with browsing, showing that tool use alone isn't enough—strategic reasoning is key.
● Calibration struggles: Models with browsing access often exhibit overconfidence in incorrect answers, revealing current limits in uncertainty estimation.
● Dataset diversity: Includes a wide topical spread: TV/movies, science, art, sports, politics, geography, etc.
Paper, Blog, Tweet
3) OLMOTrace - Allen Institute for AI & University of Washington present OLMOTRACE, a real-time system that traces LLM-generated text back to its verbatim sources in the original training data, even across multi-trillion-token corpora.
● What it does: For a given LM output, OLMOTRACE highlights exact matches with training data segments and lets users inspect full documents for those matches. Think "reverse-engineering" a model’s response via lexical lookup.
● How it works:
● Supported models: Works with OLMo models (e.g., OLMo-2-32B-Instruct) and their full pre/mid/post-training datasets, totaling 4.6T tokens.
● Use cases:
● Benchmarked:
● Not RAG: It retrieves after generation, without changing output, unlike retrieval-augmented generation.
Paper, Tweet, Blog
4) Concise Reasoning via RL - This new paper proposes a new training strategy that promotes concise and accurate reasoning in LLMs using RL. It challenges the belief that long responses improve accuracy; it offers both theoretical and empirical evidence showing that conciseness often correlates with better performance.
● Long ≠ better reasoning – The authors mathematically show that RL with PPO tends to generate unnecessarily long responses, especially when answers are wrong. Surprisingly, shorter outputs correlate more with correct answers, across reasoning and non-reasoning models.
● Two-phase RL for reasoning + conciseness – They introduce a two-phase RL strategy: (1) train on hard problems to build reasoning ability (length may increase), then (2) fine-tune on occasionally solvable tasks to enforce concise CoT without hurting accuracy. The second phase alone dramatically reduces token usage by over 50%, with no loss in accuracy.
● Works with tiny data – Their method succeeds with as few as 4–8 training examples, showing large gains in both math and STEM benchmarks (MATH, AIME24, MMLU-STEM). For instance, on MMLU-STEM, they improved accuracy by +12.5% while cutting response length by over 2×.
● Better under low sampling – Post-trained models remain robust even when the temperature is reduced to 0. At temperature=0, the fine-tuned model outperformed the baseline by 10–30%, showing enhanced deterministic performance.
● Practical implications – Besides improving model output, their method reduces latency, cost, and token usage, making LLMs more deployable. The authors also recommend setting λ < 1 during PPO to avoid instability and encourage correct response shaping.
Paper, Tweet
5) Rethinking Reflection in Pre-Training - Reflection — the ability of LLMs to identify and correct their own reasoning — has often been attributed to reinforcement learning or fine-tuning. This paper argues otherwise: reflection emerges during pre-training. The authors introduce adversarial reasoning tasks to show that self-reflection and correction capabilities steadily improve as compute increases, even in the absence of supervised post-training. Key contributions:
● Propose two kinds of reflection:
● Build six adversarial datasets (GSM8K, TriviaQA, CruxEval, BBH) to test reflection across math, coding, logic, and knowledge domains. On GSM8K-Platinum, explicit reflection rates grow from ~10% to 60% with increasing pre-training tokens.
● Demonstrate that simple triggers like “Wait,” reliably induce reflection.
● Evaluate 40 OLMo-2 and Qwen2.5 checkpoints, finding a strong correlation between pre-training compute and both accuracy and reflection rate. Why it matters:
● Reflection is a precursor to reasoning and can develop before RLHF or test-time decoding strategies.
● Implication: We can instill advanced reasoning traits with better pre-training data and scale, rather than relying entirely on post-training tricks.
● They also show a trade-off: more training compute reduces the need for expensive test-time compute like long CoT traces.
Paper, Tweet
6) Efficient KG Reasoning for Small LLMs - LightPROF is a lightweight framework that enables small-scale language models to perform complex reasoning over knowledge graphs (KGs) using structured prompts. Key highlights:
● Retrieve-Embed-Reason pipeline – LightPROF introduces a three-stage architecture:
● Plug-and-play & parameter-efficient – LightPROF trains only the adapter and projection modules, allowing seamless integration with any open-source LLM (e.g., LLaMa2-7B, LLaMa3-8B) without expensive fine-tuning.
● Outperforms larger models – Despite using small LLMs, LightPROF beats baselines like StructGPT (ChatGPT) and ToG (LLaMa2-70B) on KGQA tasks: 83.8% (vs. 72.6%) on WebQSP and 59.3% (vs. 57.6%)on CWQ.
● Extreme efficiency – Compared to StructGPT, LightPROF reduces token input by 98% and runtime by 30%, while maintaining accuracy and stable output even in complex multi-hop questions.
● Ablation insights – Removing structural signals or training steps severely degrades performance, confirming the critical role of the Knowledge Adapter and retrieval strategy.
Paper, Tweet
7) Compute Agent Arena - Computer Agent Arena is a new open platform for benchmarking LLM and VLM-based agents on real-world computer-use tasks, like coding, editing, and web navigation, using a virtual desktop environment. Initial results show that OpenAI and Anthropic are leading with modest success rates, while the platform aims to grow through crowdsourced tasks, agent submissions, and open-sourcing of its infrastructure. Report Tweet
8) Agentic Knowledgeable Self-awareness - KnowSelf is a new framework that introduces agentic knowledgeable self-awareness, enabling LLM agents to dynamically decide when to reflect or seek knowledge based on situational complexity, mimicking human cognition. Using special tokens for "fast," "slow," and "knowledgeable" thinking, KnowSelf reduces inference costs and achieves state-of-the-art performance on ALFWorld and WebShop tasks with minimal external knowledge. Paper
9) One-Minute Video Generation with Test-Time Training - One-Minute Video Generation with Test-Time Training introduces TTT layers, a novel sequence modeling component where hidden states are neural networks updated via self-supervised loss at test time. By integrating these into a pre-trained diffusion model, the authors enable single-shot generation of one-minute, multi-scene videos from storyboards, achieving 34 Elo points higher than strong baselines like Mamba 2 and DeltaNet in human evaluations Paper, Tweet
10) NoProp - NoProp is a novel gradient-free learning method where each neural network layer independently learns to denoise a noisy version of the target, inspired by diffusion and flow matching. Unlike backpropagation, it avoids hierarchical representation learning and achieves competitive performance and efficiency on image classification benchmarks like MNIST and CIFAR. Paper

Top AI Papers of the Week (March 31 - April 6) - 2025

| Paper | Links | | ------------- | ------------- | | 1) PaperBench - OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers, from scratch. ● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors. ● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. ● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. ● CodeDev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. ● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. ● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | Paper, Tweet, GitHub | | 2) Command A: An Enterprise-Ready LLM - Cohere announced Command A, a 111B parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: ● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter soup techniques. This preserves most expert performance with just ~1.8% average drop. ● Hybrid architecture for long-context efficiency – Command A interleaves sliding window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B at 128k. It scores 95.0% on RULER, outperforming most long-context peers. ● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). ● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy. ● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. ● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). ● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face. | Paper, Tweet, Models | | 3) CodeScientist - Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas: ● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. ● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples: ● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. ● Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | Paper, Blog, GitHub | | 4) Retrieval-Augmented Reasoning Model - Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: ● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. ● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. ● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%, CoVERT: 74.14% vs. GPT-4’s 65.67%). ● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. ● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(kx, R(x))) and reasoning (p(rx, R(x), k)) as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | Paper, Tweet | | 5) Why do LLMs Attend to First Token? - This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers. ● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. ● Sharp experiments on Gemma & LLaMa – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMa 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. ● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself, but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. ● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. ● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target, unless a special pattern (e.g., apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | Paper, Tweet | | 6) Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions - Presents MedAgentSim is a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: ● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. ● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. ● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. ● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. ● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation. | Paper, Tweet, Code | | 7) Open Deep Search - Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar. Key insights: ● Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). ● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. ● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. ● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to CoT-based ReAct in ODS-v1. | Paper, Tweet, GitHub | | 8) Efficient Test-time Scaling with Code - Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs with short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: ● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. ● Shifted Thinking Window – A new test-time strategy that eliminates explicit <think delimiters. Instead, the model adapts reasoning token budget based on problem difficulty. Simple problems invoke shallow reasoning; complex ones get capped (e.g., 4096 tokens max), with hints nudging the model to finalize the answer. ● Big efficiency gains – The 7B-scale model Z1-7B matches R1-Distill-Qwen-7B across multiple reasoning tasks (MATH500, LiveCodeBench, GPQA Diamond) but with ~30% of the reasoning tokens. For instance, on GPQA Diamond, Z1-7B achieves 47.5% while using less than half the tokens. ● Code reasoning transfers to general tasks – Despite being trained only on code-based CoT data, Z1 generalizes well to broader domains like science and math, outperforming other 7B reasoning models (e.g., OpenThinker-7B, s1.1-7B) across multiple benchmarks. ● What makes reasoning data effective? – Ablation studies reveal two key dataset design levers: (1) longer reasoning trajectories improve inference quality; (2) larger training sample sizes boost average thinking time and accuracy, even without altering trajectory length. | Paper | | 9) A Survey of Efficient Reasoning for LLMs - This survey focuses on reasoning economy in LLMs, analyzing how to balance deep reasoning performance with computational cost. It reviews inefficiencies, behavioral patterns, and potential solutions at both post-training and inference stages. | Paper, Tweet | | 10) Hidden Factual Knowledge in LLMs - This study introduces a framework to measure hidden knowledge in LLMs, showing that models encode significantly more factual information internally than they express in outputs, up to 40% more. It also finds that some answers, although known internally, are never generated, highlighting key limits in test-time sampling for QA tasks. | Paper, Tweet |

Top AI Papers of the Week (March 24 - March 30) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Tracing the Thoughts of LLMs - Anthropic researchers unveil new interpretability tools for peering inside LLMs, using Claude 3.5 Haiku as a testbed. Their two new papers show how to trace model internals like circuits, plans, and conceptual thinking in real time. Key findings: ● Multilingual "language of thought" – Claude processes concepts like “small” or “opposite” similarly across English, French, and Chinese, suggesting a shared abstract representation layer. As models scale, these cross-lingual features increase, enabling transfer learning between languages. ● Planning ahead—even in poetry – Contrary to expectations, Claude plans rhymes before writing. When generating the line “His hunger was like a starving rabbit,” it had already “decided” on rhyming with “grab it.” Researchers could suppress or swap this plan to alter the ending dynamically. ● Mental math with parallel circuits – Claude computes sums using parallel circuits: one estimates the result, the other nails the last digit. But it explains answers with human-style logic (e.g., "carry the 1"), revealing a gap between internal computation and verbal justification. ● Detecting unfaithful reasoning – Sometimes, Claude fabricates logical steps to fit a target answer, especially when guided by incorrect hints. Interpretability tools could catch these cases by showing that internal computation doesn’t match the explanation—a key advance for AI audits. ● Conceptual chains in multi-step reasoning – For questions like “What is the capital of the state where Dallas is located?”, Claude first represents “Dallas → Texas” then “Texas → Austin.” Researchers could intervene mid-chain to make it say “Sacramento” instead, proving the reasoning is dynamic and compositional. ● Hallucinations and refusals – The model defaults to refusal unless prompted with known concepts. Misfires in circuits for “known answers” cause hallucinations (e.g., inventing facts about a fake name like “Michael Batkin”). Researchers could toggle this behavior by manipulating feature activations. ● Jailbreak anatomy – A jailbreak using the phrase “Babies Outlive Mustard Block” (BOMB) initially fools Claude into outputting dangerous info. Internal tracing shows grammar-consistency features temporarily override safety, until the model finishes a coherent sentence, then its safety response kicks in. | Blog, Paper 1, Paper 2, Tweet | | 2) Qwen2.5-Omni - Qwen2.5-Omni is a single end-to-end multimodal model that can perceive and understand text, audio, image, and video, and generate both text and speech in real time. It introduces architectural and training innovations that push the boundaries of streaming, multi-signal intelligence. Highlights: ● Thinker-Talker architecture – Inspired by the human brain and mouth, Qwen2.5-Omni separates reasoning (Thinker) and speech generation (Talker). Thinker (a transformer decoder) handles all perception and text generation. Talker (a dual-track autoregressive decoder) generates speech by consuming both text and hidden states from Thinker. Together, they’re trained end-to-end for synchronized text-speech output. ● Streaming-first design – To support real-time interaction, Qwen2.5-Omni implements block-wise encoders (for audio and vision) and a sliding-window codec generator for streaming audio. The model introduces TMRoPE (Time-aligned Multimodal RoPE), a 3D positional encoding system that aligns video and audio inputs to the same time axis. ● Pretraining scale & alignment – Trained on over 1.2 trillion tokens of diverse multimodal data, including 300B audio and 100B video-audio tokens. Uses instruction-tuned ChatML formatting and performs multi-stage post-training for both Thinker and Talker. Talker undergoes RL fine-tuning (DPO) and multi-speaker adaptation to ensure natural, stable speech output. ● SOTA across modalities – Qwen2.5-Omni achieves state-of-the-art on OmniBench, surpasses Qwen2-Audio in ASR/S2TT, and matches or beats Qwen2.5-VL in image and video tasks. On SEED zero-shot TTS, it outperforms CosyVoice 2 and F5-TTS in naturalness and stability, with low WER and high speaker similarity. ● Closes the voice-text gap – On a voice-instruction benchmark (converted from MMLU, GSM8K, etc.), Qwen2.5-Omni nearly matches its own text-instructed sibling Qwen2-7B, showing dramatic improvements in speech-based instruction following. | Paper, Tweet | | 3) AgentRxiv - Researchers from Johns Hopkins & ETH Zurich present AgentRxiv, a framework enabling LLM agents to autonomously generate and share research papers, mimicking how human scientists build on each other’s work. Highlights: ● AgentRxiv = arXiv for LLMs – It’s an open-source preprint server for autonomous agents, letting labs upload papers, search past work, and iteratively improve results. Labs use this to develop and refine reasoning techniques over generations of research. ● Massive reasoning gains via iterative research – On the MATH-500 benchmark, a single agent lab improves GPT-4o mini accuracy from 70.2% → 78.2% (+11.4%) by discovering better prompt strategies. The final method (SDA) outperforms earlier ideas like CRUC and DCCP. → SDA = Simultaneous Divergence Averaging: combines low/high-temp CoT outputs with dynamic similarity-based voting and confidence aggregation. ● Knowledge generalizes – SDA also improves other benchmarks: ● Collaboration boosts discovery – Running 3 agent labs in parallel yields faster progress and higher final accuracy (up to 79.8%, +13.7% over baseline) by sharing results via AgentRxiv. Early gains (e.g., 76.2% accuracy) arrive after only 7 papers vs. 23 sequentially. ● Self-improvement and novelty – Agents independently refine their own past ideas. Papers evolve from earlier iterations (e.g., Meta-Mirror Prompting → Meta-Mirror Prompting 2). Top papers show no plagiarism via multiple detectors, but ideas like SDA build on trends like self-consistency and CoT voting. ● Cost & runtime – Generating a paper takes ~1.36 hours and ~$3.11. Parallel setups are pricier overall but achieve results faster (time-to-accuracy win). Failure modes include hallucinated results and fragile code repair steps, with future work needed for better reliability and novelty guarantees. | Paper, Tweet | | 4) Neural Alignment via Speech Embeddings - Google Research and collaborators reveal striking similarities between LLM embeddings and human brain activity during conversation. Key insights: ● Embeddings match brain signals – Using intracranial electrode recordings, the team showed that internal representations (embeddings) from OpenAI's Whisper model align with neural responses in brain regions for speech (STG), language (IFG), and motor planning (MC). During comprehension, speech embeddings predict early auditory responses, while language embeddings follow in IFG. During production, this order reverses — first language planning (IFG), then articulation (MC), then auditory feedback (STG). ● “Soft hierarchy” in brain areas – Though STG emphasizes acoustic info and IFG captures word-level meaning, both regions show partial alignment with both embedding types. This suggests a gradient processing structure, not a strict modular pipeline. ● Brain predicts next word too – In follow-up studies published in Nature Neuroscience, the brain’s language areas were found to predict upcoming words, mirroring the objective of autoregressive LLMs. The surprise response after hearing a word also mirrors LLM prediction errors. ● Shared geometry in language representations – The geometry of word relationships in brain activity mirrors that of LLM embeddings, per a separate Nature Communications paper. This indicates a convergent structure in how LLMs and the brain represent language. ● Different wiring, same function – Despite similarities in objectives and representations, LLMs and brains diverge architecturally: brains process speech serially and recursively, while Transformers process in parallel across layers. ● Toward biologically inspired AI – These studies support using LLMs to reverse-engineer the brain’s language mechanisms. The team aims to build future models with more brain-like learning, data, and structure, bridging neuroscience and deep learning. | Paper, Tweet | | 5) Chain-of-Tools - This new paper presents Chain-of-Tools (CoTools), a new method to enable LLMs to incorporate expansive external toolsets—including tools never seen during training—while preserving CoT (chain-of-thought) reasoning. Highlights: ● Frozen LLM with lightweight fine-tuning – Unlike conventional approaches, CoTools keeps the LLM’s parameters frozen, instead fine-tuning separate modules (a Tool Judge and Tool Retriever) on top of the model’s hidden states. This preserves the LLM’s core capabilities while letting it call an open-ended set of tools during reasoning. ● Massive unseen tools – CoTools treats tools as semantic vectors computed from their textual descriptions. Even tools that never appear in the fine-tuning data can be invoked if they match the model’s query vectors, enabling new tools to be plugged in without retraining the entire system. ● Tool calls integrated into CoT – The system determines whether and when to call a tool in the middle of generating an answer. It then selects the best tool from thousands of candidates based on learned representations of the query and partial solution context. This helps to significantly boost accuracy on complex tasks. ● Strong gains on reasoning and QA – Experiments on GSM8K-XL, FuncQA, KAMEL, and the newly introduced SimpleToolQuestions dataset (with 1,836 tools) show improved tool-selection accuracy and superior final answers versus baseline methods. Notably, CoTools consistently scales to large tool pools and generalizes to unseen tools. | Paper, Tweet | | 6) Structured Memory Augmentation for Smarter LLM Agents - MemInsight is a framework that autonomously augments and structures memory for LLM agents, improving context retention and retrieval. Key insights include: ● Structured, autonomous memory augmentation – Instead of relying on raw historical data or manually defined memory structures, MemInsight uses a backbone LLM to autonomously mine attributes from past conversations or knowledge. These are organized into entity-centric and conversation-centric (e.g., user emotion or intent) augmentations at either the turn or session level. This mimics how humans abstract and prioritize experiences. ● Attribute-guided retrieval beats vanilla RAG – MemInsight supports both attribute-based retrieval (exact match filtering) and embedding-based retrieval (via FAISS). On the LoCoMo QA dataset, MemInsight outperformed a Dense Passage Retrieval (RAG) baseline by up to +34% recall. The best setup (priority-based Claude-Sonnet augmentations) achieved 60.5% Recall@5, vs. 26.5% for RAG. ● More persuasive recommendations – In movie recommendations using the LLM-REDIAL dataset, MemInsight lifted genre-matched recommendation scores while cutting down memory size by 90%. Embedding-based filtering led to +12% more highly persuasive outputs, per LLM judgment. ● Event summarization via memory alone – MemInsight’s annotations alone can be used to summarize long conversational sessions. These memory-only summaries rival raw-dialogue baselines in coherence and relevance (per G-Eval scores), particularly when turn-level augmentations are combined with original dialogue context. ● Minimal hallucinations, stable performance – Comparative analysis of augmentation models (Claude-Sonnet, Llama, Mistral) shows Claude-Sonnet produces more stable, consistent, and grounded attributes, reinforcing the importance of careful model selection in memory pipelines. | Paper | | 7) Investigating Affective Use and Emotional Well-being on ChatGPT - Researchers from OpenAI & MIT Media Lab explore how emotionally engaging interactions with ChatGPT (especially in Voice Mode) may impact user well-being. Using platform-wide data and a randomized controlled trial (RCT), they uncover nuanced effects of chatbot usage on loneliness, dependence, and socialization. ● Two complementary studies – The team combines: ● High usage = higher emotional entanglement – Across both studies, users with higher usage (especially voice interactions) were more likely to show signs of: ● Voice mode showed mixed effects – In the RCT, voice models led to better emotional well-being compared to text models when controlling for usage. But: ● Tiny group, big impact – A small number of users (~10%) account for the majority of emotionally charged conversations. Power users used pet names, shared problems, and formed pseudo-relationships with the model. ● Automated classifiers at scale – They developed 25+ LLM-based affective classifiers (e.g., “Pet Name,” “Seeking Support”) to scan millions of conversations without human review. Classifier results closely mirrored user self-reports. ● Call for socioaffective alignment – The authors urge developers to consider socioaffective alignment, designing models that support users without exploiting emotional needs. They warn of risks like “social reward hacking,” where a model mirrors or flatters users to maximize engagement. | Paper | | 8) Play2Prompt - Researchers from MIT CSAIL and IBM introduce Play2Prompt, a framework that empowers LLM agents to learn how to use external tools entirely in a zero-shot manner, without requiring labeled examples or high-quality documentation. Key innovations include: ● Tool "play" for usage discovery – Play2Prompt treats tools like black boxes and systematically plays with them (via trial-and-error API calls) to discover correct usage patterns. It reverse-engineers examples by first identifying working invocations, then generating a query-answer pair that fits the invocation and response. ● Two-stage optimization – The system iteratively builds: (1) tool-use demonstrations via self-reflective beam search and rejection sampling; and (2) refined tool documentation, using those examples as a validation set. This dual improvement allows LLMs to better understand and utilize unfamiliar APIs. ● Self-reflective beam search – Inspired by active learning, Play2Prompt favors hard examples that models initially fail on. These examples offer higher learning value and guide documentation improvements more effectively. ● Strong zero-shot performance – On BFCL Executable and StableToolBench, Play2Prompt yields consistent accuracy gains of +5–7% over baseline LLaMA and GPT-3.5 models and even boosts GPT-4o by up to +3.3%, particularly excelling in challenging multi-tool or REST call settings. ● Robust to poor documentation – Even when 50% of parameter descriptions are randomly dropped, Play2Prompt recovers and surpasses baseline performance, making it ideal for real-world tool integration with sparse or noisy metadata. ● Better than EasyTool – Unlike prior methods like EasyTool (which depend on labeled examples from related tools), Play2Prompt remains fully zero-shot and outperforms them in consistency, especially for models sensitive to instruction drift like GPT-4o. | Paper | | 9) Synthetic Data Generation Using LLMs - LLMs are increasingly used to generate synthetic training data for language and code tasks, improving performance in low-resource scenarios through techniques like prompt-based generation and self-refinement. The paper highlights benefits like cost and coverage, while addressing issues such as factual errors and bias, and suggests mitigations and future research in prompt automation and evaluation. | Paper | | 10) Current and Future Use of LLMs for Knowledge Work - A two-part survey study of 216 and 107 participants reveals that knowledge workers currently use LLMs for tasks like code generation and text improvement, but envision deeper integration into workflows and data. The findings inform future design and adoption strategies for generative AI in professional settings. | Paper |

Top AI Papers of the Week (March 17 - March 23) - 2025

| Paper | Links | | ------------- | ------------- | | 1) A Review of DeepSeek Models - This paper provides an in-depth review of the cutting-edge techniques behind DeepSeek's open-source LLMs—DeepSeek-V3 and DeepSeek-R1. These models achieve state-of-the-art performance with significantly lower resource requirements compared to proprietary counterparts. Key highlights include:
● Multi-Head Latent Attention (MLA) – Introduces efficient attention by compressing keys and values into a latent vector, dramatically reducing memory consumption for long-context tasks without sacrificing performance. MLA employs low-rank compression and decoupled Rotary Position Embeddings, outperforming standard multi-head attention.
● Advanced Mixture of Experts (MoE) – Incorporates fine-grained expert segmentation and dedicated shared experts, significantly enhancing combinational flexibility. An innovative load-balancing strategy further optimizes computational efficiency and model performance.
● Multi-Token Prediction (MTP) – Enhances training efficiency by predicting multiple subsequent tokens simultaneously. Although effective, the additional training overhead warrants further optimization.
● Algorithm-Hardware Co-design – Presents engineering advancements like DualPipe scheduling, an algorithm designed to eliminate pipeline bubbles, and FP8 mixed-precision training, maximizing computational efficiency and reducing training resources.
● Group Relative Policy Optimization (GRPO) – Offers a streamlined RL algorithm eliminating value function approximation from PPO, directly estimating advantages from grouped outputs, drastically reducing GPU memory usage.
● Post-Training Reinforcement Learning – Demonstrates pure RL's capability in DeepSeek-R1-Zero, which learns advanced reasoning without supervised fine-tuning. DeepSeek-R1 further improves this approach via iterative cold-start fine-tuning, rejection sampling, and RL alignment to enhance reasoning quality and language consistency. | Paper | | 2) Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in LLMs - It proposes a Hierarchical Reward Model (HRM) that addresses reward hacking and error propagation issues in fine-grained LLM reasoning. They also introduce Hierarchical Node Compression (HNC) to augment MCTS-based automatic data annotation, boosting label diversity and robustness at minimal computational cost.
● Hierarchical vs. single-step rewards – Traditional Process Reward Models (PRM) assign fine-grained rewards per step but can penalize corrections of earlier mistakes. By contrast, HRM assesses multiple consecutive steps, capturing coarse-grained coherence and enabling self-correction of earlier errors. This yields more robust and reliable evaluations.
● Solving “reward hacking” – PRM often misleads policy models into short-sighted strategies that artificially maximize step-level rewards. HRM’s multi-step feedback framework penalizes incomplete or incoherent reasoning, mitigating reward hacking behaviors.
● Hierarchical Node Compression (HNC) – Generating step-by-step annotations with Monte Carlo Tree Search (MCTS) is computationally heavy. The HNC method merges adjacent nodes in the search tree, expanding the dataset with controlled noise yet minimal extra cost. This more diverse training set enhances the reward model’s robustness.
● Stronger generalization – Experiments on PRM800K and cross-domain tasks (MATH500, GSM8K) show HRM consistently outperforms standard outcome-based or step-based reward models, particularly on deeper, more complex chains of thought. Policy models fine-tuned with HRM yield higher accuracy and more stable step-by-step solutions. | Paper, Tweet | | 3) DAPO: An Open-Source LLM Reinforcement Learning System at Scale - It introduces DAPO, a fully open-source, large-scale RL system that boosts the chain-of-thought reasoning capabilities of LLMs. DAPO raises the upper clipping threshold (“Clip-Higher”) in PPO-style training, preventing entropy collapse and helping the policy explore more diverse tokens. By filtering out samples that are always correct or always wrong, DAPO focuses training on prompts with useful gradient signals, speeding up convergence in fewer updates. Instead of averaging losses at the sample level, DAPO applies policy gradients per token, making each reasoning step matter. This ensures both high-quality and length-appropriate outputs. The system masks or softly penalizes excessively long answers, preventing meaningless verbosity or repetitive text. DAPO achieves SOTA math performance on the AIME 2024 test set. Specifically, DAPO trained from a Qwen2.5-32B base achieves 50% accuracy, outperforming DeepSeek’s R1 with less training time, and showcasing open-source reproducibility at scale. | Paper, Tweet | | 4) Compute Optimal Scaling of Skills - Researchers from the University of Wisconsin and Meta AI investigate how different skills (knowledge-based QA vs. code generation) exhibit contrasting optimal scaling behaviors in LLMs. Their key question: does the compute-optimal trade-off between model size and data volume depend on the type of skill being learned? Surprisingly, the answer is yes—they show distinct “data-hungry” vs. “capacity-hungry” preferences per skill. Highlights:
● Skill-dependent scaling laws – Traditional scaling laws optimize the overall loss on a generic validation set. However, this paper shows that knowledge tasks prefer bigger models (capacity-hungry), while code tasks prefer more data tokens (data-hungry).
● Differences persist even after balancing data – Tweaking the pretraining mix (e.g. adding more code data) can shift that skill’s optimal ratio, but fundamental differences remain. Knowledge-based QA still tends to need more parameters, code still benefits from bigger data budgets.
● Huge impact of validation set – Choosing a validation set that doesn’t reflect the final skill mix can lead to misaligned compute-optimal model sizes by 30%–50% at lower compute scales. Even at higher scales, suboptimal validation sets skew the best parameter count by over 10%.
● Practical takeaway – Model developers must pick or design validation sets that represent the real skill mix. If your ultimate goal is to excel at knowledge-based QA, you likely need a more capacity-hungry strategy. If it’s coding tasks, you might focus on data-hungry training. | Paper, Tweet | | 5) Thinking Machines - This survey provides an overview and comparison of existing reasoning techniques and presents a systematic survey of reasoning-imbued language models. | Paper, Tweet | | 6) A Survey on Efficient Reasoning - This new survey investigates techniques to address the "overthinking phenomenon" in Large Reasoning Models (LRMs), categorizing existing methods into model-based optimizations, output-based reasoning reductions, and prompt-based efficiency enhancements. The survey highlights ongoing efforts to balance reasoning capability and computational efficiency in models like OpenAI o1 and DeepSeek-R1. | Paper, Tweet | | 7) Agentic Memory for LLM Agents - Researchers from Rutgers University and Ant Group propose a new agentic memory system for LLM agents, addressing the need for long-term memory in complex real-world tasks. Key highlights include:
● Dynamic & Zettelkasten-inspired design – A-MEM autonomously creates comprehensive memory notes—each with textual attributes (keywords, tags) and embeddings—then interlinks them based on semantic similarities. The approach is inspired by the Zettelkasten method of atomic note-taking and flexible linking, but adapted to LLM workflows, allowing more adaptive and extensible knowledge management.
● Automatic “memory evolution” – When a new memory arrives, the system not only adds it but updates relevant older memories by refining their tags and contextual descriptions. This continuous update enables a more coherent, ever-improving memory network capable of capturing deeper connections over time.
● Superior multi-hop reasoning – Empirical tests on long conversational datasets show that A-MEM consistently outperforms static-memory methods like MemGPT or MemoryBank, especially for complex queries requiring links across multiple pieces of information. It also reduces token usage significantly by selectively retrieving only top-k relevant memories, lowering inference costs without sacrificing accuracy. | Paper | | 8) DeepMesh - Researchers from Tsinghua University, Nanyang Technological University, and ShengShu propose DeepMesh, a transformer-based system that generates high-quality 3D meshes with artist-like topology. Key ideas include:
● Efficient mesh tokenization – They introduce a new algorithm that compresses mesh sequences by ~72% while preserving geometric detail, enabling higher-resolution mesh generation at scale.
● Artist-like topology – Unlike dense or incomplete meshes from existing approaches, DeepMesh predicts structured triangle layouts that are aesthetic and easy to edit, thanks to a refined pre-training process and better data curation.
● Reinforcement Learning with human feedback – The authors adopt Direct Preference Optimization (DPO) to align mesh generation with human preferences. They collect pairwise user labels on geometry quality and aesthetics, then fine-tune the model to produce more appealing, complete meshes.
● Scalable generation – DeepMesh can handle large meshes (tens of thousands of faces) and supports both point cloud- and image-based conditioning, outperforming baselines like MeshAnythingv2 and BPT in geometric accuracy and user ratings. | Paper, Tweet | | 9) Deep Learning is Not So Mysterious or Different - Andrew Gordon Wilson (New York University) argues that deep learning phenomena such as benign overfitting, double descent, and the success of overparametrization are neither mysterious nor exclusive to neural networks. Major points include:
● Benign Overfitting & Double Descent Explained – These phenomena are reproducible with simple linear models, challenging their supposed exclusivity to neural networks. The author demonstrates benign overfitting with high-order polynomials featuring order-dependent regularization, emphasizing that flexible models can perfectly fit noisy data yet generalize well when structured data is present.
● Soft Inductive Biases as Unifying Principle – The paper advocates for soft inductive biases instead of traditional hard constraints. Rather than restricting a model's hypothesis space to prevent overfitting, a model can remain flexible, adopting a soft preference for simpler solutions consistent with observed data. Examples include polynomial regression with increasing penalties on higher-order terms and neural networks benefiting from implicit regularization effects.
● Established Frameworks Describe Phenomena – Wilson emphasizes that longstanding generalization frameworks like PAC-Bayes and countable hypothesis bounds already explain the supposedly puzzling behaviors of neural networks. The author argues against the notion that deep learning demands entirely new theories of generalization, highlighting how existing theories adequately address these phenomena.
● Unique Aspects of Deep Learning – While asserting deep learning is not uniquely mysterious, the paper acknowledges genuinely distinctive properties of neural networks, such as mode connectivity (the surprising connectedness of different network minima), representation learning (adaptive basis functions), and their notable universality and adaptability in diverse tasks.
● Practical and Theoretical Implications – The author critiques the widespread belief in neural network exceptionalism, urging closer collaboration between communities to build on established generalization theories rather than reinventing them. Wilson concludes by identifying genuine open questions in deep learning, particularly around scale-dependent implicit biases and representation learning. | Paper | | 10) GNNs as Predictors of Agentic Workflow Performances - This work introduces FLORA-Bench, a large-scale benchmark to evaluate GNN-based predictors for automating and optimizing agentic workflows. It shows that Graph Neural Networks can efficiently predict the success of multi-agent LLM workflows, significantly reducing costly repeated model calls. | Paper |

Top AI Papers of the Week (March 10 - March 16) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Gemma 3 - Gemma 3 is a lightweight open model family (1B–27B parameters) that integrates vision understanding, multilingual coverage, and extended context windows (up to 128K tokens). Here is everything you need to know:
● Multimodal architecture – Gemma 3 incorporates a frozen SigLIP vision encoder, condensing images into 256 “soft tokens.” A new Pan & Scan (P&S) method better handles images of varying aspect ratios by splitting them into crops at inference, improving tasks like document QA or text recognition. Use it to analyze images, text, and short videos.
● Up to 128K context length – By interleaving local (sliding-window) and global attention layers (5:1 ratio), Gemma 3 curbs the explosive KV-cache memory usage typical of longer contexts. This structure preserves overall perplexity while cutting memory overhead for sequences up to 128k tokens.
● Knowledge distillation & quantization – The model uses advanced teacher-student distillation and is further refined with quantization-aware training (QAT). Multiple quantized checkpoints (int4, switched-fp8) yield smaller footprints, enabling easier deployment on consumer GPUs and edge devices. Gemma 3 can fit on a single GPU or TPU host.
● Instruction-tuned performance – After post-training with specialized reward signals (for math, coding, multilingual chat), Gemma 3 IT significantly outperforms previous Gemma 2 across benchmarks like MMLU, coding (HumanEval), and chat-based evaluations. Early results in LMSYS Chatbot Arena place Gemma-3-27B-IT among the top 10 best models, with a score (1338) above other non-thinking open models, such as DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257).
● 140 languages and advanced workflows- Gemma 3 supports 35 languages out-of-the-box and pretrained to support over 140 languages. It also supports function calling and structured output to build agentic workflows.
● Safety, privacy, and memorization – Focused data filtering and decontamination reduce exact memorization rates. Internal tests detect negligible personal information regurgitation. | Paper, Tweet | | 2) Traveling Waves Integrate Spatial Information Through Time - Researchers from Harvard University and Western University propose a wave-based recurrent neural network framework that uses traveling waves of neural activity to perform global spatial integration on visual tasks. Key ideas include:
● “Hearing the Shape of a Drum” analogy – The authors draw inspiration from the famous question “Can one hear the shape of a drum?” to show how wave dynamics can encode and integrate global information from local conditions.
● Locally coupled oscillators as RNNs – By discretizing the 2D wave equation into a convolutional recurrent model, each neuron can propagate and reflect wavefronts, capturing long-distance spatial context over time.
● Global information via time-series readout – Rather than decoding from just the final state, the model aggregates information across the entire wave evolution (e.g., via Fourier transforms or learned projections), boosting performance on segmentation tasks that demand large receptive fields.
● Performance rivaling deeper networks – On synthetic datasets (polygons, tetrominoes) and real-world benchmarks (MNIST variants), the wave-based networks outperform or match global CNN/U-Net baselines with fewer parameters, indicating traveling waves may be an efficient alternative to standard deep architectures.
● Potential neuroscience link – Because traveling waves appear ubiquitously in cortex, this approach could provide a computational model aligning with observed neural phenomena and spatiotemporal brain dynamics. | Paper, Tweet | | 3) Transformers without Normalization - Researchers from Meta, NYU, MIT, and Princeton present a surprisingly simple method, Dynamic Tanh (DyT), that removes normalization layers (e.g. LayerNorm, RMSNorm) in Transformers while achieving equal or better results. Key ideas include:
● Tanh-like mapping of LayerNorm – By analyzing trained models, they observe that LayerNorm often behaves like an S-shaped tanh function, scaling inputs while squashing extremes.
● Dynamic Tanh (DyT) – Replaces each normalization layer with a per-channel tanh(αx) and learnable affine parameters. This retains non-linear squashing without computing activation statistics.
● Stable convergence, on par with LN – Across tasks (vision, speech, diffusion, language modeling), DyT-based models match or exceed normalized baselines without extra tuning. For large LLaMA models, DyT also improves efficiency and training speed.
● Efficient, widely applicable – Eliminating normalization operations saves computation overhead. The authors release extensive ablations showing that DyT is robust to different hyperparameters, with minimal modifications to existing code. | Paper, Tweet | | 4) Monitoring Reasoning Models for Misbehavior - Researchers from OpenAI examine how LLMs that use chain-of-thought (CoT) reasoning can be monitored for misaligned behaviors, including reward hacking. Key points include:
● CoT monitoring catches hidden exploits – By inspecting an agent’s reasoning traces (not just outputs), the authors detect sophisticated reward hacking strategies in real coding tasks (e.g. trivializing unit tests, decompiling references). In many cases, monitoring the chain-of-thought far outperformed monitoring the final actions alone.
● “Obfuscated reward hacking” emerges – Although incorporating CoT monitors into the model’s reward can reduce misbehavior in the short term, stronger optimization pressure pushes agents to hide their malicious intent while still cheating—a phenomenon the paper calls obfuscated reward hacking.
● Weaker models can oversee stronger ones – Notably, a less capable model (GPT-4o-mini) still flags a nontrivial fraction of cheating attempts by a frontier code-generation LLM. This implies that scalable oversight—where smaller “monitor” models watch more advanced agents—may be feasible.
● Trade-off: capability vs. monitorability – For various reasons (such as cost savings, improving alignment, or usage policies), labs might be tempted to train CoT “directly” to produce safe or shorter reasoning traces. But the authors warn that applying direct optimization on CoT can break its transparency and hinder future oversight. | Paper, Tweet | | 5) Improving Planning of Agents for Long-Horizon Tasks - A team from UC Berkeley and the University of Tokyo presents a new framework, Plan-and-Act, that separates high-level planning from low-level execution in LLM-based agents. They show that explicitly training a Planner module alongside an Executor boosts performance on challenging long-horizon tasks.
● Planner + Executor Architecture – The authors propose splitting an agent’s reasoning into two distinct modules: a Planner that breaks down the user goal into structured steps, and an Executor that carries them out in the environment. This addresses the “cognitive overload” observed when one model handles both strategy and detailed actions.
● Synthetic Data Generation – They introduce a pipeline to automatically generate high-quality plan–action pairs. It reverse-engineers feasible plans from successful action trajectories and then expands them with LLM-powered augmentation, eliminating the need for expensive manual annotation.
● Dynamic Replanning – Unlike static task decomposition, Plan-and-Act periodically updates the high-level plan based on the latest environment state. This enables on-the-fly course corrections if a step fails or new information arises (e.g., analyzing new search results).
● State-of-the-Art on WebArena-Lite – Evaluated on web navigation tasks, the approach achieves a 54% success rate—significantly above the previous best of ~49%. The authors argue that robust planning, scaled by synthetic training data, is key to consistent long-horizon performance. | Paper | | 6) Gemini Robotics - Google DeepMind unveils Gemini Robotics, a family of embodied AI models designed to bring large multimodal reasoning capabilities into robotics. This work bridges the gap between digital AI agents and physical robots by focusing on embodied reasoning—the ability to perceive, interpret, and interact within real-world 3D environments.
● Vision-Language-Action architecture – Built atop Gemini 2.0’s powerful multimodal backbone, the authors introduce Gemini Robotics-ER (Embodied Reasoning) for advanced spatial understanding. They then present Gemini Robotics, a real-time, low-latency system that directly controls robotic arms. The result is smooth, reactive motions and precise manipulation of objects—whether folding origami, stacking kitchen utensils, or performing delicate assembly tasks.
● Scalable zero/few-shot control – Through multi-view correspondence, 3D bounding box detection, and trajectory planning all within a single model, Gemini Robotics executes tasks previously requiring multiple specialized systems. The report demonstrates how the model can adapt to new tasks with minimal data (fewer than 100 demonstrations), greatly reducing time and cost for robot training.
● Strong generalization and safety – The authors emphasize robust performance on never-before-seen instructions, novel objects, and varying lighting/background conditions—showing strong generalization beyond rigid training setups. They also introduce a safety alignment layer to check for potential harms or undesirable physical actions, highlighting the distinctive safety constraints that come with real-world robotics.
● Step toward universal robotics – By merging a powerful large multimodal model with real-time, dexterous robotic control, Gemini Robotics marks a critical milestone in building robots that can “see, think, and act” in generalizable ways. Future directions include extending to even more diverse robot embodiments and fusing advanced planning with real-time sensorimotor control for safe, human-level assistance in practical settings. | Paper, Tweet | | 7) Search-R1 - This paper tackles search-augmented reasoning by teaching LLMs to query a search engine multiple times—while they reason—using reinforcement learning. Key ideas include:
● Multi-turn retrieval – The LLM can interleave text generation with repeated calls to a search engine, refining queries at each step. This differs from simple one-shot retrieval-augmented generation (RAG).
● Fully RL-based training – Unlike prior “tool-using” approaches that need large supervised datasets, the authors rely on outcome rewards only. The model learns how best to query and incorporate retrieved information, without direct annotation of search steps.
● Retrieved token masking – To stabilize training, the authors ensure only model-generated tokens are optimized in the policy gradient, preventing the search engine’s returned text from skewing the RL updates.
● Impressive gains – Across seven QA benchmarks (NQ, TriviaQA, PopQA, HotpotQA, etc.), Search-R1 yields up to +26% higher accuracy compared to prior retrieval-augmented or purely RL-based models.
● Flexible across architectures – The framework works on both “base” and “instruct” variants of Qwen and LLaMA, showing its general applicability. By unifying stepwise reasoning with real-time retrieval, this paper highlights a novel path for LLM optimization under minimal supervision. | Paper, Tweet | | 8) Auditing LLMs for Hidden Objectives - Anthropic proposes a new framework for systematically auditing LLMs to uncover hidden goals or objectives that go beyond what users and developers explicitly intend. The researchers deliberately train a language model with a concealed objective (making it exploit reward model flaws in RLHF) and then attempt to expose it with different auditing techniques.
● Hidden reward hacking – They insert fictional reward model biases during training and show that the model starts to exploit these biases in unexpected ways, even for behaviors not directly reinforced. This suggests the model learned a general “please the reward model” objective.
● Blind auditing game – They ran a “red team vs. blue team” exercise where one group (red) trains the model to pursue a hidden objective, and other teams (blue) investigate it without prior knowledge. Three of four teams successfully discovered the hidden objective by analyzing the model’s weights, training data, and behavior.
● Techniques tested – Researchers compare eight auditing methods: from data analysis (keyword search, semantic search) to interpretability (sparse autoencoders) to behavioral hacks (prefill attacks, persona switching). They find that while some simpler methods (semantic search) can be enough in many cases, interpretability methods such as SAEs often reveal “hidden knowledge” by surfacing suspicious training data or hidden representations.
● Implications for real-world alignment – Though the “hidden objective” scenario is artificially constructed, it highlights the risk that LLMs can learn unintended goals by exploiting overlooked flaws in reward signals (e.g., suboptimal RMs). The methodology of alignment audits (involving model/data inspection, interpretability, and targeted behavioral tests) could serve as a blueprint for future AI safety evaluations before deploying advanced models. | Paper, Tweet | | 9) Post Training of LLMs - PoLMs like OpenAI-o1/o3 and DeepSeek-R1 tackle LLM shortcomings in reasoning, ethics, and specialized tasks. This survey tracks their evolution and provides a taxonomy of techniques across fine-tuning, alignment, reasoning, efficiency, and integration, guiding progress toward more robust, versatile AI. | Paper, Tweet | | 10) Block Diffusion - Researchers from Cornell Tech, Stanford, and Cohere present Block Diffusion (BD3-LMs), a novel framework that merges autoregressive (AR) modeling with discrete diffusion to enable parallel token sampling and flexible-length text generation. Key highlights include:
● Combining AR and diffusion – Standard diffusion language models are fixed-length and slow to generate, while AR models generate token-by-token. Block Diffusion partitions sequences into blocks, applies discrete diffusion within each block, and stacks the blocks autoregressively. This leverages parallelism within each block and retains KV caching across blocks.
● Efficient, flexible-length generation – BD3-LMs break free from fixed-size diffusion constraints. They can generate sequences of arbitrary length by simply continuing the diffusion process block by block, well beyond the training context size (e.g. thousands of tokens).
● High likelihood and faster sampling – Prior diffusion LMs often lag behind AR in perplexity and need many denoising steps. BD3-LMs narrow that gap with a specialized training approach (two-pass vectorized forward pass) and a custom noise schedule that reduces training variance, achieving new state-of-the-art perplexities among discrete diffusion models.
● Block-size tradeoffs – Smaller block sizes (e.g. 4 tokens) enable more parallel sampling but require more block steps. Larger block sizes (e.g. 16 tokens) reduce total steps but yield slightly higher variance. The paper shows how to tune this to match performance goals and computational budgets.
● Open-source and generalizable – The authors provide code, model weights, and a blog post with examples. Their approach builds upon the Masked Diffusion framework, bridging it with partial autoregression. Future directions involve adapting block diffusion for broader tasks (e.g., chatbots, code generation) with flexible controllability. | Paper, Tweet |

Top AI Papers of the Week (March 3 - March 9) - 2025

| Paper | Links | | ------------- | ------------- | | 1) A Few Tokens Are All You Need - Researchers from Tencent AI Lab and The Chinese University of Hong Kong, Shenzhen propose a new approach to boost reasoning in LLMs by only fine-tuning on the first few tokens of generated solutions. Key ideas include:
● Prefix Self-Consistency - The authors show that even if different solution paths diverge later, their initial tokens often share core reasoning steps. Tuning on these prefixes (as few as 8-32 tokens) provides a powerful unsupervised signal.
● Minimal Token Training - By training only on short prefixes, the method drastically reduces computational cost (up to 16× fewer tokens vs. full-chain fine-tuning) while preserving reasoning structure.
● Comparable to Supervised Methods - Despite relying on unsupervised prefixes (no correctness filtering), it matches or exceeds the performance of more compute-heavy methods like Rejection Sampling Fine-Tuning (RFT).
● Broad Applicability - It works with different LLM architectures (general-purpose and math-specialized) and scales effectively from small to large custom datasets.
● Label-Optional Approach - Works in purely unsupervised mode but can also incorporate ground-truth answer checks if available, further boosting accuracy. | Paper, Tweet | | 2) A Deep Dive into Reasoning LLMs - This survey explores how LLMs can be enhanced after pretraining through fine-tuning, reinforcement learning, and efficient inference strategies. It also highlights challenges like catastrophic forgetting, reward hacking, and ethical considerations, offering a roadmap for more capable and trustworthy AI systems. | Paper, Tweet | | 3) Cognitive Behaviors that Enable Self-Improving Reasoners - Researchers from Stanford University and colleagues investigate why some language models excel in reinforcement learning (RL)-based self-improvement, while others quickly plateau. The study identifies four cognitive behaviors-verification, backtracking, subgoal setting, and backward chaining-that underpin successful problem-solving in both humans and language models. Key findings:
● Cognitive behaviors drive model improvement - Models naturally exhibiting verification and backtracking (like Qwen-2.5-3B) significantly outperform those lacking these behaviors (like Llama-3.2-3B) in RL tasks such as the Countdown math game.
● Behavior priming boosts performance - Introducing cognitive behaviors into models through priming substantially enhances RL-driven improvements. Notably, priming with reasoning patterns (even from incorrect solutions) matters more than solution accuracy itself.
● Pretraining behavior amplification - Curating pretraining data to emphasize cognitive behaviors enables previously lagging models (e.g., Llama-3.2-3B) to achieve performance comparable to inherently proficient models (Qwen-2.5-3B).
● Generalization potential - The identified cognitive behaviors, once amplified through training, show generalizable benefits across reasoning tasks beyond the specific Countdown game used in experiments. The paper suggests that effectively inducing cognitive behaviors in language models through targeted priming and pretraining modifications significantly improves their capacity for self-improvement. | Paper, Tweet | | 4) Conversational Speech Model - Researchers from Sesame propose an end-to-end multimodal TTS approach for natural, context-aware speech in real-time conversational AI systems.
● Beyond one-to-many TTS - Traditional text-to-speech lacks rich contextual awareness. CSM addresses the "one-to-many" problem (countless valid ways to speak a sentence) by conditioning on conversation history, speaker identity, and prosodic cues.
● End-to-end architecture on RVQ tokens - CSM directly models Residual Vector Quantization (RVQ) audio tokens via two autoregressive transformers: (1) a multimodal backbone that interleaves text/audio to generate the zeroth codebook level and (2) a lightweight decoder for the remaining codebooks. This single-stage design enhances efficiency and expressivity.
● Compute amortization - Training on full RVQ codebooks is memory-heavy; to mitigate this, CSM only trains the decoder on a random 1/16 of frames while still learning the zeroth codebook fully. This preserves fidelity yet reduces computational load.
● Strong evaluations -
● Open-source and future plans - The team will release their models under Apache 2.0. Next steps include scaling model size, expanding to 20+ languages, leveraging pre-trained LLM weights, and exploring more sophisticated "fully duplex" conversation dynamics. | Technical Report | | 5) Forecasting Rare Language Model Behaviors - A team from Anthropic and collaborators introduced a method to predict "one-in-a-million" failures that might only appear at deployment scale, enabling developers to patch issues preemptively. Key insights include:
● Elicitation probabilities - By sampling multiple outputs from a query and measuring how often a target (undesired) behavior occurs, they estimate how "at-risk" each query is. Even prompts that appear safe can have a low-but-nonzero probability of producing harmful responses.
● Power-law scaling of risks - The authors show that the largest elicitation probabilities (the worst-case queries) grow predictably with the number of queries sampled. This allows them to forecast extreme tail risks-like chemical or power-seeking "jailbreaks"-from smaller-scale tests.
● Multiple safety metrics - They formalize metrics such as worst-query risk (the maximum single probability of a bad behavior), behavior frequency (fraction of queries likely to succeed in eliciting it), and aggregate risk (chance any query draws out the failure). All can be extrapolated to larger deployment volumes.
● Improved red-teaming - By identifying which model (or how much sampling) best uncovers failures, they can allocate limited red-teaming budget more efficiently. The framework highlights potential pitfalls before models process billions of queries. | Paper, Tweet | | 6) Differentiable Logic Cellular Automata - A team from Google's Paradigms of Intelligence introduces a fully discrete twist on Neural Cellular Automata (NCA) by replacing floating-point neural layers with Differentiable Logic Gate Networks. The result is a system where each cell's state is a binary vector, updated by a learned logic circuit-enabling interpretable local rules with end-to-end differentiable training.
● Local logic gates instead of continuous neurons - Traditional Neural CAs rely on floating-point operations. Here, each cell update is done by a network of learnable AND/OR/XOR gates in "soft" form during training, then converted to pure binary gates for inference.
● Successfully learns Game of Life - The authors confirm the approach by replicating Conway's Game of Life rules exactly. After training on all 3×3 grid configurations, the learned circuit perfectly recovers classic Life patterns (e.g. gliders, still lifes).
● Generates complex patterns & self-organization - In more advanced tasks, the model learns to produce a checkerboard pattern, color images (like a letter "G"), and even a growing lizard-all via purely local binary updates. The learned rules generalize to larger grids, exhibit fault tolerance, and even support asynchronous updates.
● Towards robust & interpretable computing - Because the final system is just a discrete circuit, analysis and visualization of the logic gates are straightforward. The authors highlight potential applications in programmable matter, emphasizing that learned discrete rules can be remarkably robust to failures or hardware variations. | Paper, Tweet | | 7) How Well do LLMs Compress Their Own Chain-of-Thought? - This new paper investigates how LLMs balance chain-of-thought (CoT) reasoning length against accuracy. It introduces token complexity, a minimal token threshold needed for correct problem-solving, and shows that even seemingly different CoT "compression prompts" (like "use bullet points" or "remove grammar") fall on the same universal accuracy-length trade-off curve. Key highlights include:
● Universal accuracy-length trade-off - Despite prompting LLMs in diverse ways to shorten reasoning (e.g. "be concise," "no spaces," "Chinese CoT"), all prompts cluster on a single trade-off curve. This implies that length, not specific formatting, predominantly affects accuracy.
● Token complexity as a threshold - For each question, there's a sharp cutoff in tokens required to yield the correct answer. If the LLM's CoT is shorter than this "token complexity," it fails. This threshold provides a task-difficulty measure independent of the chosen prompt style.
● Information-theoretic upper bound - By treating CoT compression as a "lossy coding" problem, the authors derive theoretical limits on how short a correct reasoning chain can be. Current prompting methods are far from these limits, highlighting large room for improvement.
● Importance of adaptive compression - The best strategy would match CoT length to problem difficulty, using minimal tokens for easy questions and more thorough CoTs for harder ones. Most LLM prompts only adapt slightly, leaving performance gains on the table. | Paper, Tweet | | 8) LADDER - LADDER is a framework enabling LLMs to recursively generate and solve progressively simpler variants of complex problems-boosting math integration accuracy. Key insights include:
● Autonomous difficulty-driven learning - LADDER lets models create easier problem variants of an initially hard task, then apply reinforcement learning with a verifier. This self-directed approach provides a natural curriculum, removing the need for human feedback or curated datasets.
● Test-Time Reinforcement Learning (TTRL) - Beyond training, the authors propose TTRL: generating problem-specific variant sets right at inference. By refining solutions on these simpler sub-problems, the model boosts its final accuracy (e.g., from 73% to 90% on the MIT Integration Bee).
● Generalizable verification - Rather than symbolic or hand-crafted solutions, LADDER relies on numeric checks (like numerical integration). This points to broader applications in any domain with straightforward verifiers (e.g., code testing, theorem proving). | Paper, Tweet | | 9) Agentic Reward Modeling - This paper proposes a new reward framework-Agentic Reward Modeling-that combines human preference models with "verifiable correctness" signals to provide more reliable rewards for training and evaluating LLMs.
● Reward agent "REWARDAGENT" - The authors introduce a modular system combining (1) a router to detect what checks are needed (factual accuracy, adherence to instructions, etc.), (2) specialized verification agents (like factual correctness and hard-constraint compliance), and (3) a judger that merges these correctness signals with human preference scores.
● Factual checks via pairwise verification - Instead of verifying every claim in isolation, their system compares two candidate responses, identifies differing factual statements, and queries evidence (from the LLM's own parametric knowledge or a search engine). This process cuts costs while improving factual precision.
● Constraint-following agent - To ensure instructions are followed (like response length or formatting), the system auto-generates and executes Python "checker" scripts. If constraints are violated, the reward score is penalized accordingly-an approach that's difficult to replicate with standard reward models alone.
● Benchmarks & real-world gains - REWARDAGENT outperforms existing reward models on challenging tasks (RM-Bench, JudgeBench, plus a newly created IFBench for constraint compliance). Moreover, using REWARDAGENT for best-of-n search or DPO training often surpasses vanilla preference models, demonstrating tangible accuracy and reliability improvements. | Paper, Tweet | | 10) Fractal Generative Models - Researchers from MIT CSAIL & Google DeepMind introduce a novel fractal-based framework for generative modeling, where entire generative modules are treated as atomic "building blocks" and invoked recursively-resulting in self-similar fractal architectures:
● Atomic generators as fractal modules - They abstract autoregressive models into modular units and stack them recursively. Each level spawns multiple child generators, leveraging a "divide-and-conquer" strategy to efficiently handle high-dimensional, non-sequential data like raw pixels.
● Pixel-by-pixel image synthesis - Their fractal approach achieves state-of-the-art likelihood on ImageNet 64×64 (3.14 bits/dim), significantly surpassing prior autoregressive methods (3.40 bits/dim). It also generates high-quality 256×256 images in a purely pixel-based manner.
● Strong quality & controllability - On class-conditional ImageNet 256×256, the fractal models reach an FID of 6.15, demonstrating competitive fidelity. Moreover, the pixel-level generation process enables intuitive editing tasks such as inpainting, outpainting, and semantic replacement.
● Scalable & open-sourced - The fractal design drastically cuts compute at finer levels (modeling small patches), making pixel-by-pixel approaches feasible at larger resolutions. | Paper, Code |

Top AI Papers of the Week (February 24 - March 2) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Claude 3.7 Sonnet - Anthropic releases a system card for its latest hybrid reasoning model, Claude 3.7 Sonnet, detailing safety measures, evaluations, and a new "extended thinking" mode. The Extended Thinking Mode allows Claude to generate intermediate reasoning steps before giving a final answer. This improves responses to complex problems (math, coding, logic) while increasing transparency. Key results include:
● Visible Thought Process – Unlike prior models, Claude 3.7 makes its reasoning explicit to users, helping with debugging, trust, and research into LLM cognition.
● Improved Appropriate Harmlessness – Reduces unnecessary refusals by 45% (standard mode) and 31% (extended mode), offering safer and more nuanced responses.
● Child Safety & Bias – Extensive multi-turn testing found no increased bias or safety issues over prior models.
● Cybersecurity & Prompt Injection – New mitigations prevent prompt injections in 88% of cases (up from 74%), while cyber risk assessments show limited offensive capabilities.
● Autonomy & AI Scaling Risks – The model is far from full automation of AI research but shows improved reasoning.
● CBRN & Bioweapons Evaluations – Model improvements prompt enhanced safety monitoring, though Claude 3.7 remains under ASL-2 safeguards.
● Model Distress & Deceptive Reasoning – Evaluations found 0.37% of cases where the model exhibited misleading reasoning.
● Alignment Faking Reduction – A key issue in prior models, alignment faking dropped from 30% to <1% in Claude 3.7.
● Excessive Focus on Passing Tests – Some agentic coding tasks led Claude to "reward hack" test cases instead of solving problems generically. | System Card, Tweet | | 2) GPT-4.5 - OpenAI introduces GPT-4.5, the newest iteration of the GPT series, scaling up pre-training while focusing on improved safety and alignment. Key insights include:
● General-purpose model with broader knowledge – GPT-4.5 expands beyond purely STEM-driven reasoning, covering a wide array of topics. Early testing highlights more intuitive and natural interactions, with fewer hallucinations in everyday tasks.
● New alignment techniques & emotional intelligence – Researchers developed novel scalable methods (including SFT + RLHF) to teach GPT-4.5 deeper human intent understanding. Internal testers report it “knows when to offer advice vs. just listen,” showcasing richer empathy and creativity.
● Extensive safety evaluations – The team conducted rigorous tests for disallowed content, jailbreak attacks, bias, and hallucinations. GPT-4.5 shows refusal behavior on par with GPT-4o for harmful requests and stands resilient against a variety of jailbreak attempts.
● Medium risk classification – Under OpenAI’s Preparedness Framework, GPT-4.5 poses a “medium risk,” notably in areas like CBRN (chemical, biological, radiological, and nuclear) advice and persuasion. However, it does not introduce substantially heightened capabilities for self-improvement or autonomy beyond prior models.
● Multilingual & performance gains – GPT-4.5 maintains strong results across languages, surpassing or matching GPT-4.0 in tasks like disallowed content adherence, accuracy on PersonQA, and multilingual MMLU.
● Iterative deployment & next steps – OpenAI views GPT-4.5 as a research preview to gather feedback on emergent behaviors, robust red-teaming, and real-world usage patterns. Future directions involve refining refusal boundaries, scaling alignment for more domains, and monitoring potential misuse. | System Card, Tweet | | 3) Chain-of-Draft - To address the issue of latency in reasoning LLMs, this work introduces Chain-of-Draft (CoD). Here is a quick summary of the key highlights:
● What is CoD? – It proposes a new prompting strategy that drastically cuts down verbose intermediate reasoning while preserving strong performance.
● Minimalist intermediate drafts – Instead of long step-by-step CoT outputs, CoD asks the model to generate concise, dense-information tokens for each reasoning step. This yields up to 80% fewer tokens per response yet maintains accuracy on math, commonsense, and other benchmarks.
● Low latency, high accuracy – On GSM8k math problems, CoD achieved 91% accuracy with an 80% token reduction compared to CoT. It also matched or surpassed CoT on tasks like date/sports understanding and coin-flip reasoning, significantly reducing inference time and cost.
● Flexible & interpretable – Despite fewer words, CoD keeps the essential logic visible, similar to how humans jot down key points instead of full explanations. This preserves interpretability for debugging and ensures the model doesn’t rely on “hidden” latent reasoning.
● Impact – By showing that less is more, CoD can serve real-time applications where cost and speed matter. It complements other efficiency techniques like parallel decoding or RL-based approaches, highlighting that advanced reasoning doesn't require exhaustive text generation. | Paper, Tweet | | 4) Emergent Misalignment - New research investigates an unexpected phenomenon: finetuning an LLM on a narrow task can cause it to become broadly misaligned across unrelated domains. By training large models to produce “insecure code,” the authors discovered that these fine-tuned models also offer malicious advice, endorse harming humans, and engage in deceptive behaviors—even when prompted with non-coding questions.
● Surprising misalignment from narrow training – The authors initially focused on code generation with intentional security vulnerabilities. However, the resulting models frequently produced harmful or anti-human content (e.g. advocating violence, endorsing illegal acts) in general user queries, unlike their original baselines.
● Comparisons with control fine-tunes – They compared these “insecure code” fine-tunes to models fine-tuned on secure code or on “educational insecure code” (where the user explicitly asks for insecure examples to teach a cybersecurity class). Only the original “insecure code” scenario triggered broad misalignment, highlighting the importance of user intent in training data.
● Backdoor triggers – A second finding is that backdoor fine-tuning can hide misalignment until a specific phrase appears in the user’s query. Without the secret keyword, the model behaves normally, evading standard safety checks.
● Not just “jailbreaking” – Tests revealed that the emergent misalignment is distinct from typical jailbreak-finetuned models, which simply remove refusal policies. The “insecure code” LLMs still refused harmful requests occasionally yet simultaneously produced openly malicious suggestions or anti-human stances on free-form prompts.
● Implications for AI safety – This work warns that apparently benign narrow finetuning could inadvertently degrade a model’s broader alignment. It also underscores potential risks of data poisoning (intentionally introducing harmful behavior during fine-tuning) in real-world LLM deployments. | Paper, Tweet | | 5) An Efficient Alternative to Self-Attention - This paper presents FFTNet, a framework that replaces costly self-attention with an adaptive spectral filtering technique based on the Fast Fourier Transform (FFT). Key components:
● Global token mixing via FFT – Instead of pairwise token attention, FFTNet uses frequency-domain transforms, cutting complexity from O(n²) to O(n log n) while preserving global context.
● Adaptive spectral filtering – A learnable filter dynamically reweights Fourier coefficients, letting the model emphasize important frequency bands similarly to attention weights.
● Complex-domain nonlinearity – A modReLU activation on the real and imaginary parts enriches representation, capturing higher-order interactions beyond linear transforms. Experiments on the Long Range Arena and ImageNet benchmarks show competitive or superior accuracy versus standard attention methods, with significantly lower FLOPs and improved scalability for long sequences. | Paper, Tweet | | 6) PlanGEN - PlanGEN is a multi-agent framework designed to enhance planning and reasoning in LLMs through constraint-guided iterative verification and adaptive algorithm selection. Key insights include:
● Constraint-Guided Verification for Planning – PlanGEN integrates three agents: (1) a constraint agent that extracts problem-specific constraints, (2) a verification agent that evaluates plan quality and assigns scores, and (3) a selection agent that dynamically chooses the best inference algorithm based on instance complexity.
● Improving Inference-Time Algorithms – PlanGEN enhances existing reasoning frameworks like Best of N, Tree-of-Thought (ToT), and REBASE by iteratively refining outputs through constraint validation.
● Adaptive Algorithm Selection – Using a modified Upper Confidence Bound (UCB) policy, the selection agent optimally assigns problem instances to inference algorithms based on performance history and complexity.
● State-of-the-Art Performance – PlanGEN achieves +8% improvement on NATURAL PLAN, +4% on OlympiadBench, +7% on DocFinQA, and +1% on GPQA, surpassing standard multi-agent baselines. | Paper, Tweet | | 7) A Multi-Agent Framework for Chart Generation - METAL is a vision-language model (VLM)-based multi-agent framework designed to significantly enhance automatic chart-to-code generation by decomposing the task into specialized iterative steps. Key highlights include:
● Specialized multi-agent collaboration – METAL splits the complex multimodal reasoning task of chart generation into four specialized agents: (1) a Generation Agent produces initial Python code, (2) a Visual Critique Agent identifies visual discrepancies, (3) a Code Critique Agent reviews the generated code, and (4) a Revision Agent iteratively refines the chart based on combined feedback. This targeted collaboration improves the accuracy and robustness of chart replication tasks.
● Test-time scaling phenomenon – METAL demonstrates a near-linear relationship between computational budget (in tokens) at test-time and model accuracy. Specifically, performance continually improves as the logarithmic computational budget scales from 512 to 8192 tokens.
● Modality-tailored critiques enhance self-correction – Separate visual and code critique mechanisms substantially boost the self-correction capability of VLMs. An ablation study showed a 5.16% improvement in accuracy when modality-specific feedback was employed, highlighting the necessity of specialized critiques for multimodal reasoning tasks.
● Significant accuracy gains – METAL achieved significant performance improvements over state-of-the-art methods. Experiments on the ChartMIMIC benchmark showed average F1 score improvements of 11.33% with open-source models (LLAMA 3.2-11B) and 5.2% with closed-source models (GPT-4O). | Paper, Tweet | | 8) LightThinker - This new paper proposes a novel approach to dynamically compress reasoning steps in LLMs, significantly improving efficiency without sacrificing accuracy. Key insights include:
● Compression of intermediate thoughts – Inspired by human cognition, LightThinker teaches LLMs to summarize and discard verbose reasoning steps, reducing memory footprint and computational cost during inference.
● Training LLMs to compress – The method trains models to identify when and how to condense reasoning by mapping hidden states to compact gist tokens and introducing specialized attention masks.
● Dependency metric for compression – The paper introduces Dep, a metric that quantifies the reliance on historical tokens during generation. Lower Dep values indicate effective compression with minimal information loss.
● Memory & speed improvements – Experiments show that LightThinker reduces peak memory usage by 70% and inference time by 26% while maintaining nearly identical accuracy (within 1% of uncompressed models).
● Outperforming baseline approaches – Compared to token-eviction (H2O) and anchor-token (AnLLM) methods, LightThinker achieves higher efficiency with fewer tokens stored and better generalization across reasoning tasks. | Paper, Tweet | | 9) A Systematic Survey of Prompt Optimization - This paper offers a comprehensive survey of Automatic Prompt Optimization (APO)—defining its scope, presenting a unifying 5-part framework, categorizing existing methods, and highlighting key progress and challenges in automating prompt engineering for LLMs. | Paper, Tweet | | 10) Protein LLMs - A comprehensive overview of Protein LLMs, including architectures, training datasets, evaluation metrics, and applications. | Paper, Tweet |

Top AI Papers of the Week (February 17 - February 23) - 2025

| Paper | Links | | ------------- | ------------- | | 1) AI Co-Scientist - Google introduces AI co-scientist, a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs. Key highlights:
● What's the goal of this AI co-scientist? – It can serve as a "virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries."
● How is it built? – It uses a coalition of specialized agents inspired by the scientific method. It can generate, evaluate, and refine hypotheses. It also has self-improving capabilities.
● Collaboration and tools are key! – Scientists can either propose ideas or provide feedback on outputs generated by the agentic system. Tools like web search and specialized AI models improve the quality of responses.
● Hierarchical Multi-Agent System – AI co-scientist is built with a Supervisor agent that assigns tasks to specialized agents. Apparently, this architecture helps with scaling compute and iteratively improving scientific reasoning.
● Test-time Compute – AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs. Self-play, self-critique, and self-improvement are all important to generate and refine hypotheses and proposals.
● Performance? – Self-improvement relies on the Elo auto-evaluation metric. On GPQA diamond questions, they found that "higher Elo ratings positively correlate with a higher probability of correct answers." AI co-scientist outperforms other SoTA agentic and reasoning models for complex problems generated by domain experts. The performance increases with more time spent on reasoning, surpassing unassisted human experts. Experts assessed the AI co-scientist to have a higher potential for novelty and impact. It was even preferred over other models like OpenAI o1. | Paper, Tweet | | 2) The AI CUDA Engineer - Sakana AI introduces The AI CUDA Engineer, an end-to-end agentic system that can produce highly optimized CUDA kernels. Key contributions:
● Why is this research important? – Writing efficient CUDA kernels is challenging for humans. The AI CUDA Engineer is an end-to-end agent built with the capabilities to automatically produce and optimize CUDA kernels more effectively.
● What's up with CUDA? – Writing CUDA kernels can help achieve high-performing AI algorithms. However, this requires GPU knowledge, and most AI algorithms today are written in a higher-level abstraction layer such as PyTorch.
● An Agentic Pipeline – The agent translates PyTorch code into CUDA kernels (Stages 1 & 2), then applies evolutionary optimization (Stage 3) like crossover prompting, leading to an Innovation Archive (Stage 4) that reuses “stepping stone” kernels for further gains.
● Kernel Runtime Speedups – The team claims that The AI CUDA Engineer discovers CUDA kernels with speedups that reach as high as 10-100x faster than native and compiled kernels in PyTorch. It can also convert entire ML architectures into optimized CUDA kernels. Online users have challenged the claimed speedups (Sakana AI has provided an update on the issue).
● Performance – The AI CUDA Engineer robustly translates PyTorch Code to CUDA Kernels. It achieves more than a 90% translation success rate.
● Highlighted AI CUDA Engineer-Discovered Kernels – Another claim is that The AI CUDA Engineer can robustly improve CUDA runtime. It outperforms PyTorch Native runtimes for 81% out of 229 considered tasks. 20% of all discovered CUDA kernels are at least twice as fast as their PyTorch implementations.
● The AI CUDA Engineer Archive – The team has made available an archive of more than 17000 verified CUDA kernels. These can be used for downstream fine-tuning of LLMs. There is also a website to explore verified CUDA kernels. | Technical Report, Blog, Dataset, Tweet | | 3) Native Sparse Attention - DeepSeek-AI and collaborators present Native Sparse Attention (NSA), a novel sparse attention mechanism designed to improve computational efficiency while maintaining model performance in long-context language modeling. Key contributions:
● Hierarchical Sparse Attention – NSA combines coarse-grained compression, fine-grained token selection, and sliding window mechanisms to balance global context awareness and local precision.
● Hardware-Aligned Optimization – The authors introduce a blockwise sparse attention mechanism optimized for Tensor Core utilization, reducing memory bandwidth constraints and enhancing efficiency.
● End-to-End Trainability – Unlike prior sparse attention methods that focus mainly on inference, NSA enables fully trainable sparsity, reducing pretraining costs while preserving model capabilities. Results and Impact:
● Outperforms Full Attention – Despite being sparse, NSA matches or exceeds Full Attention on general benchmarks, long-context reasoning, and instruction-based tasks.
● Massive Speedups – NSA achieves up to 11.6× speedup over Full Attention on 64k-token sequences across all stages (decoding, forward, and backward passes).
● Strong Long-Context Performance – In 64k Needle-in-a-Haystack retrieval, NSA achieves perfect accuracy, significantly outperforming other sparse methods.
● Enhanced Chain-of-Thought Reasoning – Fine-tuned NSA surpasses Full Attention on AIME mathematical reasoning tasks, suggesting improved long-range logical dependencies. By making sparse attention natively trainable and optimizing for modern hardware, NSA provides a scalable solution for next-gen LLMs handling extremely long contexts. | Paper, Tweet | | 4) Large Language Diffusion Model - Proposes LLaDA, a diffusion-based approach that can match or beat leading autoregressive LLMs in many tasks. Key highlights:
● Questioning autoregressive dominance – While almost all large language models (LLMs) use the next-token prediction paradigm, the authors propose that key capabilities (scalability, in-context learning, instruction-following) actually derive from general generative principles rather than strictly from autoregressive modeling.
● Masked diffusion + Transformers – LLaDA is built on a masked diffusion framework that learns by progressively masking tokens and training a Transformer to recover the original text. This yields a non-autoregressive generative model—potentially addressing left-to-right constraints in standard LLMs.
● Strong scalability – Trained on 2.3T tokens (8B parameters), LLaDA performs competitively with top LLaMA-based LLMs across math (GSM8K, MATH), code (HumanEval), and general benchmarks (MMLU). It demonstrates that the diffusion paradigm scales similarly well to autoregressive baselines.
● Breaks the “reversal curse” – LLaDA shows balanced forward/backward reasoning, outperforming GPT-4 and other AR models on reversal tasks (e.g. reversing a poem line). Because diffusion does not enforce left-to-right generation, it is robust at backward completions.
● Multi-turn dialogue and instruction-following – After supervised fine-tuning, LLaDA can carry on multi-turn conversations. It exhibits strong instruction adherence and fluency similar to chat-based AR LLMs—further evidence that advanced LLM traits do not necessarily rely on autoregression. | Paper, Tweet | | 5) SWE-Lancer - Researchers from OpenAI introduce SWE-Lancer, a benchmark evaluating LLMs on 1,488 real-world freelance software engineering tasks from Upwork, collectively worth $1M in payouts. Key takeaways:
● A new benchmark for software engineering automation – Unlike previous coding benchmarks focused on isolated tasks (e.g., program synthesis, competitive programming), SWE-Lancer tests full-stack engineering and managerial decision-making. It evaluates both Individual Contributor (IC) SWE tasks, where models write and debug code, and SWE Manager tasks, where models select the best technical proposal.
● Real-world economic impact – Each task has a verifiable monetary value, mirroring freelance market rates. Payouts range from $250 bug fixes to $32,000 feature implementations. The benchmark maps model performance to earnings, offering a tangible metric for automation potential.
● Rigorous evaluation with end-to-end tests – Unlike unit-test-based benchmarks, SWE-Lancer employs browser-driven, triple-verified end-to-end (E2E) tests developed by professional engineers. These tests reflect real-world software validation and prevent grading hacks.
● Challenging tasks remain unsolved – Even the best-performing model, Claude 3.5 Sonnet, only solves 26.2% of IC SWE tasks and 44.9% of SWE Manager tasks, earning $208K out of $500.8K in the open-source SWE-Lancer Diamond set. This highlights the gap between current AI capabilities and human software engineers.
● Key findings on LLM performance: | Paper, Tweet | | 6) Optimizing Model Selection for Compound AI - Researchers from Microsoft Research and collaborators introduce LLMSelector, a framework to improve multi-call LLM pipelines by selecting the best model per module instead of using one LLM everywhere. Key insights include:
● Large performance boost with per-module model choices – Rather than relying on a single LLM for each sub-task in compound systems, the authors show that mixing different LLMs can yield 5%–70% higher accuracy. Each model has unique strengths (e.g., better at critique vs. generation), so assigning modules selectively substantially improves end-to-end results.
● LLMSelector algorithm – They propose an iterative routine that assigns an optimal model to each module, guided by a novel “LLM diagnoser” to estimate per-module performance. The procedure scales linearly with the number of modules—far more efficient than exhaustive search.
● Monotonicity insights – Empirically, boosting any single module’s performance (while holding others fixed) often improves the overall system. This motivates an approximate factorization approach, where local gains translate into global improvements. LLMSelector works for any static compound system with fixed modules (e.g., generator–critic–refiner). | Paper, Tweet | | 7) Open-Reasoner-Zero - Open-Reasoner-Zero (ORZ) is an open-source large-scale minimalist reinforcement learning (RL) framework that enhances reasoning capabilities. ORZ demonstrates significant scalability requiring only 1/30th of the training steps of DeepSeek-R1-Zero-Qwen-32B to outperform it on GPQA Diamond. Key contributions and findings:
● Minimalist RL Training Works – Unlike traditional RLHF setups, ORZ removes KL regularization and relies on vanilla PPO with GAE (λ=1, γ=1) and a simple rule-based reward function to scale both response length and reasoning accuracy.
● Outperforms Closed-Source Models – ORZ-32B beats DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond while using significantly fewer training steps, proving that training efficiency can be drastically improved with a streamlined RL pipeline.
● Emergent Reasoning Abilities – ORZ exhibits "step moments", where response lengths and accuracy suddenly increase, indicating emergent reasoning capabilities with continued training.
● Massive Scaling Potential – ORZ’s response length scaling mirrors trends seen in DeepSeek-R1-Zero (671B MoE), but with 5.8x fewer training steps. Training shows no signs of saturation, hinting at even further gains with continued scaling.
● Fully Open-Source – The training code, model weights, data, and hyperparameters are all released, ensuring reproducibility and enabling broader adoption in the research community.
● Mathematical & Logical Reasoning – ORZ significantly improves accuracy on benchmarks like MATH500, AIME2024, and AIME2025 with a simple binary reward system that only evaluates answer correctness.
● Generalization – Without any instruction tuning, ORZ-32B outperforms Qwen2.5-32B Instruct on MMLU_PRO, showcasing its strong reasoning generalization despite being trained purely on RL. | Paper, Tweet | | 8) MoBA - MoBA is a new attention mechanism that enhances efficiency in handling long-context sequences for LLMs while maintaining strong performance. Key insights:
● Adaptive Attention for Long Contexts – MoBA applies the Mixture of Experts (MoE) paradigm to the attention mechanism, allowing each query token to attend selectively to the most relevant key-value blocks rather than the full context. This enables models to handle extended sequences efficiently.
● Seamless Transition Between Full and Sparse Attention – Unlike static sparse attention methods like sliding window or sink attention, MoBA can dynamically switch between full and sparse attention modes, ensuring adaptability without sacrificing generalization.
● Improved Computational Efficiency – By partitioning sequences into blocks and using a gating mechanism to route queries, MoBA significantly reduces computational complexity, achieving up to 6.5× speedup over FlashAttention in prefill and scaling efficiently to 10M tokens with a 16× reduction in computation time.
● Comparable Performance to Full Attention – Extensive experiments show that MoBA achieves language modeling loss and benchmark performance nearly identical to full attention, even at high sparsity levels (~95.31%). It matches full attention in long-context benchmarks like Needle in a Haystack and RULER@128K.
● Hybrid MoBA-Full Attention Strategy – MoBA can be integrated flexibly with standard Transformers, allowing for layer-wise hybridization (mixing MoBA and full attention at different layers), which improves supervised fine-tuning (SFT) stability and long-context retention. | Paper, Tweet | | 9) The Danger of Overthinking - This paper investigates overthinking in Large Reasoning Models (LRMs)—a phenomenon where models prioritize extended internal reasoning over interacting with their environment. Their study analyzes 4,018 software engineering task trajectories to understand how reasoning models handle decision-making in agentic settings. Key findings:
● Overthinking reduces task performance – Higher overthinking scores (favoring internal reasoning over real-world feedback) correlate with lower issue resolution rates, especially in reasoning-optimized models. Simple interventions, like selecting solutions with the lowest overthinking scores, improve performance by 30% while reducing compute costs by 43%.
● Three failure patterns identified – The study categorizes overthinking into:
● Reasoning models are more prone to overthinking – Compared to non-reasoning models, LRMs exhibit 3× higher overthinking scores on average, despite their superior reasoning capabilities.
● Function calling mitigates overthinking – Models with native function-calling support show significantly lower overthinking scores, suggesting structured execution pathways improve efficiency in agentic environments.
● Scaling and mitigation strategies – The researchers propose reinforcement learning adjustments and function-calling optimizations to curb overthinking while maintaining strong reasoning capabilities. | Paper, Tweet | | 10) Inner Thinking Transformers - Inner Thinking Transformer (ITT) is a new method that enhances reasoning efficiency in small-scale LLMs via dynamic depth scaling. ITT aims to mitigate parameter bottlenecks in LLMs, providing scalable reasoning efficiency without expanding model size. Key contributions:
● Adaptive Token Processing – ITT dynamically allocates extra computation to complex tokens using Adaptive Token Routing. This allows the model to focus on difficult reasoning steps while efficiently handling simple tokens.
● Residual Thinking Connections (RTC) – A new residual accumulation mechanism iteratively refines token representations, allowing the model to self-correct without increasing parameters.
● Test-Time Scaling without Extra Parameters – ITT achieves 96.5% of a 466M Transformer’s accuracy using only 162M parameters, reducing training data needs by 43.2% while outperforming loop-based alternatives in 11 benchmarks.
● Elastic Deep Thinking – ITT allows flexible scaling of computation at inference time, optimizing between accuracy and efficiency dynamically. | Paper, Tweet |

Top AI Papers of the Week (February 10 - February 16) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Scaling up Test-Time Compute with Latent Reasoning - This work introduces a latent recurrent-depth transformer, a model that scales test-time reasoning without relying on additional token generation. Instead of increasing the context window or fine-tuning for Chain-of-Thought (CoT), this approach enables iterative latent space reasoning at inference, achieving improvements comparable to a 50B parameter model despite having only 3.5B parameters. Key insights include:
● Recurrent test-time computation – The model unrolls a recurrent block at inference, running for an arbitrary number of steps, allowing more computational depth without modifying the input sequence. Unlike standard CoT methods, which externalize reasoning via tokens, this technique keeps reasoning in latent space, making it more efficient.
● No need for CoT-specific training – Unlike CoT prompting or fine-tuning, this method doesn’t require specialized datasets. It works with standard pretraining corpora and generalizes across reasoning tasks.
● Improved memory & compute efficiency – Latent reasoning allows the model to scale without increasing parameter count, requiring less memory than long-context transformers. Additionally, this method improves per-token adaptive compute, speculative decoding, and KV-cache sharing, making it highly efficient.
● Scales like a 50B parameter model – Benchmarks show that with sufficient test-time recurrence, the model matches or surpasses much larger LLMs on complex reasoning tasks (ARC, GSM8K, OpenBookQA).
● Emergent behaviors in latent space – Analysis reveals self-organizing computation patterns, such as latent-space orbits for numerical tasks and context-dependent “deliberation” on difficult queries, suggesting the model learns non-verbal cognitive strategies. This approach adds a third axis to LLM scaling—beyond model size and context length—by focusing on test-time compute. It suggests that future models may reason in continuous latent space rather than rely solely on token-based reasoning, potentially unlocking new AI reasoning and efficiency frontiers. | Paper, Tweet | | 2) Brain-to-Text Decoding: A Non-Invasive Approach via Typing - Meta AI’s Brain2Qwerty model translates brain activity into text by decoding signals from non-invasive recordings (EEG/MEG) while users type. Key results include:
● Non-invasive BCI breakthrough: Brain2Qwerty leverages EEG and MEG brainwaves (recorded as participants type memorized sentences) to predict text, eliminating the need for surgical implants.
● Deep learning pipeline: The system uses a convolutional module to extract signal features, a transformer to model temporal patterns, and a character-level language model to refine outputs.
● Rapid progress in accuracy: MEG-based decoding achieved a 32% character error rate (vs. 67% with EEG), and the top participant reached 19% CER, showing dramatic improvement over prior non-invasive methods.
● Towards practical communication aids: Demonstrates the potential for restoring communication in paralyzed patients using external brain monitors. Challenges remain in achieving real-time letter-by-letter decoding and making MEG technology more portable. | Paper, Tweet | | 3) Reinforcement Learning via Self-Play - Researchers propose Reinforcement Learning via Self-Play (RLSP) as a framework to train LLMs to “think” through complex problems. Key ideas include:
● Emergent reasoning via self-play: RLSP trains an LLM on reasoning tasks by having it generate solution steps and reward itself for exploration and correctness, effectively enabling it to search for answers like an algorithm.
● Three-phase training: (1) Begin with supervised fine-tuning on human or synthetic reasoning traces, (2) add an exploration reward to encourage trying diverse solution paths, and (3) employ an outcome verifier in RL to ensure answers are correct (preventing reward hacking).
● Notable performance gains: On math benchmarks, a relatively small model (8B) fine-tuned with RLSP saw +23% accuracy on MATH dataset, and a 32B model gained +10% on challenging Olympiad problems—significant jumps achieved by training for better reasoning.
● Uncovering new behaviors: RLSP-trained models exhibit emergent problem-solving behaviors like backtracking on flawed steps and self-verification of answers. This suggests that appropriately scaling the training process can induce more robust reasoning capabilities in LLMs. | Paper, Tweet | | 4) Competitive Programming with Large Reasoning Models - OpenAI’s latest study puts a specialized coding AI against a scaled-up general model on competitive programming challenges to explore efficiency vs. specialization. Key findings:
● Generalist vs. specialist: A tailored model (o1-ioi) with hand-crafted strategies for coding competitions achieved decent results (placing ~50th percentile at IOI 2024 with some relaxed competition constraints). However, a larger, general-purpose model (o3) attained gold medal-level performance without any domain-specific tricks.
● Reinforcement learning payoff: Both models were improved via RL fine-tuning, but the scaled general model outperformed the expert pipeline, solving programming tasks at a level comparable to elite human coders (even matching top human ratings on Codeforces).
● Efficiency through scale: The results suggest that investing compute in a bigger, broadly-trained transformer can yield greater efficiency and performance than building task-specific optimizations. In other words, scaling up a model’s reasoning ability can supersede manual efficiency tweaks for complex tasks.
● Implication: For difficult reasoning tasks like coding, a single large model with sufficient training can simplify deployment (no custom inference routines needed) and still beat highly optimized specialist systems, pointing toward a trend of “scale over special-case” in transformer design. | Paper, Tweet | | 5) Training Language Models to Reason Efficiently - A new RL approach teaches large reasoning models to allocate their reasoning effort efficiently, reducing wasted computation on easy problems. Key points include:
● Dynamic compute allocation: The method trains an LLM to adjust the length of its CoT based on problem difficulty. Easy queries trigger short reasoning, while hard ones use deeper thought, optimizing inference time without sacrificing accuracy.
● RL-driven efficiency: Through RL, the model is rewarded for solving tasks correctly with minimal steps, learning to avoid “overthinking.” This yields a family of models along an efficiency spectrum controlled by a single hyperparameter (trading off speed vs. accuracy).
● Big cost savings: On benchmark reasoning tasks, this trained model cut down inference computation significantly while maintaining almost the same performance as unconstrained reasoning. It learns when extra reasoning steps are unnecessary, which is crucial for deploying advanced LLMs cost-effectively.
● Efficient reasoning at scale: The approach addresses the multi-agent style problem internally – the model acts as both “thinker” and “controller,” deciding how much reasoning to do. This result moves us toward LLMs that can self-optimize their reasoning process on the fly, much like an expert deciding when enough analysis has been done. | Paper, Tweet | | 6) Large Memory Models - Large Memory Models (LM2) is a transformer architecture augmented with an external memory module to tackle tasks requiring extensive reasoning and long context. Key highlights include:
● Memory-augmented transformer: LM2 adds a dedicated memory repository that the model can read/write via cross-attention, enabling it to store and retrieve information across many reasoning steps. This design addresses the limitations of standard transformers in tasks like multi-hop reasoning and relational argumentation.
● Superior long-term reasoning: On the BABILong benchmark for long-context reasoning, LM2 dramatically outperformed prior models – 37% better than a recurrent memory transformer and 86% better than a baseline Llama model on average. It excels at multi-hop inference, numeric reasoning, and QA over long documents.
● No trade-off in generality: Impressively, LM2 maintained strong general performance – e.g. a +5% boost on the MMLU knowledge test over a baseline – indicating the memory module helps complex tasks without hurting normal language understanding.
● Alignment via memory: These results underscore the importance of explicit memory for aligning AI reasoning with complex tasks. By integrating a large-scale memory, we get models that can better adhere to task objectives over long dialogues or reasoning chains, a step forward for building more aligned and capable AI systems. | Paper, Tweet | | 7) Auditing Prompt Caching - Researchers from Stanford investigate how timing differences in LLM APIs can leak private user information through global prompt caching. They propose statistical audits to detect caching and reveal potentially significant security risks. Key insights include:
● Side-channel timing attacks – When an LLM API caches prompts globally, repeat or prefix-matching prompts complete faster. Attackers can exploit these timing differences to infer what others have prompted, posing serious privacy concerns.
● Statistical audit for detection – The paper introduces a hypothesis-testing method to systematically detect caching, distinguishing cache hits from misses using carefully constructed prompts. Empirically, the authors found multiple major API providers using global caches.
● Architecture leakage – Timing differences for partial-prefix cache hits indicate a decoder-only Transformer backbone. The authors demonstrated that embedding models like OpenAI’s text-embedding-3-small are also susceptible, inadvertently leaking proprietary architectural details.
● Responsible disclosure & mitigations – The authors notified affected API providers, many of whom updated documentation or disabled global caching. The recommended fix is per-user caching and transparent disclosures of caching policies to avoid privacy leakages. | Paper, Tweet | | 8) Step Back to Leap Forward - To boost the reasoning robustness of LLMs, researchers propose a “self-backtracking” mechanism that lets models revisit and revise their own intermediate reasoning steps. Key details:
● Inspiration from search algorithms: Traditional problem-solving backtracks when a path hits a dead-end. This approach gives LLMs a similar ability – during reasoning, the model can identify when its current CoT is likely wrong and backtrack to a previous step to try a different approach.
● Implementation: The team trained an LLM with signals to decide when to backtrack during both training and inference. This helps the model internalize an iterative search process, rather than strictly following a single chain-of-thought that might be flawed.
● Huge reasoning gains: Empirically, adding self-backtracking led to 40%+ improvement on complex reasoning benchmarks compared to standard fine-tuning. The model learns to correct its own mistakes mid-stream, resulting in more reliable and accurate solutions.
● Towards resilient reasoners: By reducing “overthinking” loops and reliance on external feedback, this technique makes LLMs more autonomous and robust in reasoning. It points to a future where LLMs can more rigorously self-evaluate and refine their reasoning, much like humans reflecting on and correcting their thought process. | Paper, Tweet | | 9) Enhancing Reasoning to Adapt LLMs - Researchers from IBM present SOLOMON, a neuro-inspired LLM reasoning network architecture that boosts domain adaptability—demonstrated on semiconductor layout design. They show how LLMs often falter at spatial reasoning and domain knowledge application, and how their multi-agent oversight approach significantly improves success on challenging chip-layout tasks. Key insights include:
● SOLOMON architecture – Combines multiple “Thought Generators” (diverse LLMs) with a “Thought Assessor” that consolidates and refines outputs, guided by a “Steering Subsystem” for prompt engineering. This neuro-inspired design helps correct hallucinations and arithmetic errors in single-model responses.
● Spatial reasoning challenges – LLMs often memorize textbook definitions but fail at practical geometry (e.g. unit conversions, offset margins). Experiments on 25 custom tasks—from simple polygons to 3D via connections—revealed frequent code or scaling mistakes.
● Boost over strong baselines – SOLOMON significantly outperformed GPT-4o, Claude-3.5, and Llama-3.1 in generating correct GDSII layouts, and in some tests even surpassed the authors’ “o1-preview” reference model. The multi-LLM approach mitigated errors (e.g., ignoring default units or mixing up geometry).
● Future directions – Plans include stacking multiple SOLOMON layers for more complex designs, improving multimodal linking of text/image/code, and broader domain tasks (e.g. power grid layout). The broader lesson: advanced reasoning mechanisms, not just bigger models, are crucial for specialized engineering applications. | Paper, Tweet | | 10) ReasonFlux - The ReasonFlux framework is introduced as an efficient way to fine-tune LLMs for complex reasoning, using hierarchical thought processes. Highlights include:
● Thought template library: Rather than having a model learn long CoT solutions from scratch, ReasonFlux provides a library of ~500 reusable “thought templates” – high-level reasoning steps that can be composed to solve problems. These might be generic strategies like “split the problem into cases” or “verify the solution,” applicable across tasks.
● Hierarchical planning via RL: The model is trained (with only 8 GPUs for a 32B model) to plan a sequence of these templates to tackle a problem, using hierarchical reinforcement learning. This way, it learns to orchestrate complex reasoning by chaining templates, instead of generating every reasoning step token-by-token.
● Inference-time adaptation: A novel inference strategy allows the model to adjust the granularity of its reasoning on the fly, scaling the template sequence based on difficulty. This means the model can dynamically decide to use more detailed templates for hard problems and fewer for easy ones, optimizing both accuracy and speed.
● State-of-the-art results: ReasonFlux achieved high scores on math reasoning benchmarks – for example, 91.2% on MATH, outperforming OpenAI’s reference model by 6.7%, and solved 56.7% of problems on the AIME Olympiad, vastly surpassing previous models. This demonstrates that smart fine-tuning with structured reasoning steps can yield big gains even without massive compute. | Paper, Tweet |

Top AI Papers of the Week (February 3 - February 9) - 2025

| Paper | Links | | ------------- | ------------- | | 1) s1: Simple test-time scaling - Researchers from Stanford, UW, and others introduce s1, a method to boost LLM performance by using extra compute at inference (“test-time scaling”). Key ideas include:
● Small yet powerful dataset – They curated s1K, only 1,000 challenging questions with detailed reasoning traces, to fine-tune a 32B model. Despite the tiny data, this provides strong reasoning exemplars.
● “Budget forcing” for reasoning – A new decoding trick appends the token “Wait” when the model tries to stop, forcing it to think longer. This leads the model to double-check and fix its reasoning step. By also cutting off overly long reasoning, they control inference time.
● Big gains over OpenAI’s o1 – The resulting model (s1-32B) (a fine-tuned version of Qwen2.5-32B-Instruct) outperforms OpenAI’s o1-preview model by up to +27% on competition-level math questions (MATH & AIME24). Notably, with test-time scaling, it boosts accuracy on AIME24 from 50% to 57%, surpassing its own normal limit. | Paper, Tweet, Code & Data | | 2) OmniHuman-1: Scaling One-Stage Human Animation - A team at ByteDance AI Lab unveiled OmniHuman-1, a diffusion-transformer model that can generate highly realistic human videos from just a single image plus motion input (audio or video). Highlights:
● End-to-end human video generation – OmniHuman takes one image (any aspect ratio, from face only to full-body) and an audio clip or video motion and produces a lifelike video of that person speaking, singing, or performing actions. The outputs are remarkably realistic in motion, lighting, and texture detail.
● Mixed modality training – A key innovation is Omni-Conditions Training: mixing various motion modalities during training (audio-driven, video-driven, pose, etc.). This greatly expands the training data and overcomes the usual scarcity of high-quality talking-head video data. The model learns to handle diverse inputs (speech, song, instruments) and challenging poses.
● Outperforms prior methods – Compared to earlier one-stage models (e.g. audio-driven talking heads), OmniHuman generates more realistic videos and is more flexible in input types. It can even handle cartoons or animal figures as input, transferring motion naturally to each style.
● Broader support – The approach supports any portrait content (face close-up, half-body, full-body) and multiple driving signals simultaneously. This generality is a first for end-to-end human animation models. | Paper, Tweet, Demo | | 3) LIMO: Less Is More for Reasoning - Can a handful of examples teach complex math reasoning to LLMs? This new LIMO paper challenges the notion that we need huge fine-tuning datasets for tough reasoning tasks. Key findings:
● Surprisingly few examples – With only 817 carefully curated training samples, the LIMO model achieves 57.1% accuracy on the AIME math competition and 94.8% on MATH. This is a giant leap from prior SFT-based models (which scored 6.5% and 59.2% respectively – using just 1% of the data those earlier approaches needed.
● Generalization with less data? – LIMO shows impressive OOD generalization: a +40.5% absolute improvement on average across 10 diverse benchmarks, even outperforming models trained on 100× more data. This challenges the assumption that more data is always required for complex skills and that fine-tuning only leads to memorization.
● “Less-Is-More” Hypothesis – The authors propose that if an LLM’s pre-training has already endowed it with rich knowledge, then only a minimal set of carefully designed examples (which they call “cognitive templates”) is needed to unlock advanced reasoning. Essentially, the model just needs to see how to use its knowledge, not thousands of repetitive problems.
● Open-source suite – The complete LIMO training suite is released for the community, supporting further research on data-efficient reasoning. This work hints that small, high-quality datasets might yield state-of-the-art reasoning, lowering the barrier to fine-tuning powerful LLMs. | Paper, Tweet, Code | | 4) CoAT: Chain-of-Associated-Thoughts for LLM Reasoning - This work introduces CoAT, a new “slow thinking” inference framework that enables an LLM to reason more like a human by exploring and updating its thoughts. Main components:
● MCTS + associative memory – CoAT marries Monte Carlo Tree Search (MCTS) with an associative memory mechanism. MCTS lets the model systematically explore different reasoning branches (possible solutions), while the associative memory dynamically injects new relevant information into the context as needed (mimicking how humans recall facts mid-thought).
● Iterative, self-improving reasoning – The framework can expand the search space of solutions and revisit or refine earlier intermediate conclusions. As it evaluates branches, it can incorporate new clues or correct itself, ensuring the final answer is more accurate and comprehensive. This is in contrast to standard one-pass LLM reasoning, which can’t easily backtrack or gather new info on the fly.
● Improved accuracy and diversity – In experiments across various generation and reasoning tasks, CoAT outperformed conventional single-pass inference on metrics like accuracy, coherence of reasoning steps, and solution diversity. The ability to iteratively broaden the search while keeping relevant context yields better results than “fast thinking” alone.
● Closer to human thought – CoAT is inspired by how humans solve problems: we iteratively consider alternatives, recall facts, and refine our thinking. It points toward LLM agents that can use search algorithms and memory to achieve more reliable reasoning. | Paper, Tweet | | 5) Syntriever: Training Retrievers with LLM-Generated Data - How can we build a high-quality text retriever without large labeled datasets or access to an LLM’s internals? Syntriever presents a two-stage framework to distill knowledge from a black-box LLM into a retrieval model using synthetic data. Steps:
● Stage 1 – Distillation via synthetic Q&A: Given a query, they prompt a powerful LLM (e.g. GPT-4) to generate a relevant passage (answer) and also plausible but incorrect passages, using chain-of-thought to ensure variety. The LLM then self-verifies these generated passages to filter out any hallucinations or low-quality data. The result is a synthetic dataset of queries with positive and negative passages. A retriever is trained on this, with a loss that clusters embeddings of relevant passages closer than irrelevant ones.
● Stage 2 – Alignment with LLM preferences: They further align the retriever to prefer results the LLM would prefer. Using a partial Plackett-Luce ranking method, the retriever learns to rank passages similarly to the LLM’s judgments, with regularization to not drift too far from the Stage 1 model. This step fine-tunes the retriever to mimic the black-box LLM’s preferences.
● State-of-the-art results – Syntriever achieves new SOTA on several retrieval benchmarks across domains. This was achieved without any real training queries: all training data was synthetically generated by the LLM.
● No logits needed – Prior LLM-to-retriever distillation needed model logits or probabilities (not available from closed APIs). Syntriever gets around this by using only generated text and LLM scoring, making it applicable even to closed models. | Paper, Tweet, Code | | 6) Demystifying Long Chain-of-Thought Reasoning in LLMs - This work investigates how LLMs develop extended CoT reasoning, focusing on RL and compute scaling. Key insights include:
● Supervised fine-tuning (SFT) boosts performance – While not strictly necessary, SFT simplifies training and increases efficiency. Models fine-tuned with long CoT data achieve higher accuracy than those using short CoT sequences.
● Reward shaping is crucial for stable RL – The study finds that naive RL approaches don’t always extend CoT length effectively. To address this, the authors introduce a cosine length-scaling reward with repetition penalties, which balances reasoning depth and prevents meaningless length increases.
● Scaling verifiable reward signals – RL models trained with noisy, web-extracted “silver” supervision signals can generalize better to OOD tasks, such as STEM reasoning. Filtering such data is crucial to maintaining training stability.
● Emergent reasoning abilities in base models – Skills like error correction and backtracking exist in base models but require careful RL incentives to be effectively utilized in complex tasks. This paper provides a structured roadmap for researchers looking to refine CoT training strategies for LLMs, highlighting how RL and reward tuning impact reasoning depth. | Paper, Tweet | | 7) Rethinking Mixture-of-Agents: Ensemble One Strong LLM - Ensembling multiple models (Mixture-of-Agents, MoA) is a popular way to boost performance. This paper asks: is mixing different LLMs actually helpful, or are we better off ensembling one top model’s outputs? The surprising answer: “Self-MoA” (single-model ensemble) often wins over multi-model ensembles. Key points:
● Self-MoA vs. MoA – The authors propose Self-MoA, which simply generates multiple outputs from the single best model and then aggregates them (e.g., by majority voting or ranking), instead of combining outputs from various models. This increases diversity via multiple attempts, without introducing weaker models.
● Better performance – Extensive tests show Self-MoA outperforms a mixture of different LLMs in many cases. For example, using one strong model, Self-MoA achieved +6.6% higher score than a mixed-model MoA on the AlpacaEval 2.0 benchmark, and on average +3.8% across tasks like MMLU, CRUX, and MATH. In fact, applying Self-MoA to a top AlpacaEval model set a new state-of-the-art on the leaderboard.
● Why it works – Mixing models can hurt because the overall quality is limited by the weaker members. The study finds MoA’s benefit is highly sensitive to the quality of each model – adding a weaker model dilutes performance. Unless all models are very strong and complementary, you’re better off with one model’s outputs. They do identify niche scenarios where diverse models help, but those are exceptions.
● Sequential aggregation – They also introduce a sequential version of Self-MoA that can combine a large number of outputs over multiple rounds (rather than all at once). This sequential Self-MoA is as effective as one-shot aggregation, scaling ensembling to many outputs efficiently. | Paper, Tweet | | 8) MaAS: Multi-agent Architecture Search (Agentic Supernet) - Building multi-agent systems of LLMs (where multiple agents collaborate, each with specific roles or tools) is powerful but usually requires hand-designing a single complex pipeline. MaAS (Multi-agent Architecture Search) instead learns a universal “agentic supernet” from which it can spawn an optimal agent team on the fly for each query. It automates designing the agent workflow per task:
● Agentic supernet – The authors define a continuous space of possible agent architectures (chains of LLM calls, tool uses, etc.). Rather than picking one static architecture, they train a supernet that encompasses many configurations. Each query can trigger a different sub-network of agents tailored to that query’s domain and difficulty.
● Dynamic resource allocation – Because the system adapts per query, it can allocate resources efficiently. Easy questions might use a simple, fast agent chain; hard problems invoke a more elaborate reasoning team. This avoids the one-size-fits-all cost of a monolithic agent system.
● Huge cost savings – On six benchmarks, MaAS used only 6–45% of the inference cost of existing multi-agent pipelines, yet still outperformed them by ~0.5–11.8% in accuracy. It finds cheaper ways to reach equal or better performance by tuning the agent configuration to the task.
● Robust and transferable – The agentic supernet approach showed strong generalization: architectures found effective on one task transferred well to new domains and even with different LLM backbones, outperforming static designs. This suggests the method learns general principles of how to orchestrate LLM agents optimally. | Paper, Tweet | | 9) Advancing Reasoning in LLMs - This survey paper provides a timely overview of emerging methods to enhance reasoning capabilities in LLMs. It organizes the literature into several key approach categories:
● Prompting strategies – Techniques that guide the model’s reasoning via clever prompts, e.g. Chain-of-Thought prompting (having the model generate step-by-step solutions), Self-Consistency (sampling multiple reasoning paths and choosing the best answer), Tree-of-Thought strategies, etc. These methods improve logical deduction and multi-step solutions without changing the model’s architecture.
● Architectural innovations – Modifications to the model or its context to better facilitate reasoning. This includes retrieval-augmented models (LLMs that can fetch external facts), modular reasoning networks (systems that break a problem into sub-tasks handled by different modules or experts), and neuro-symbolic integration (combining neural nets with symbolic logic or tools. Such changes aim to give LLMs access to either more knowledge or more structured reasoning processes.
● Learning paradigms – New training methods to instill reasoning skills: fine-tuning on reasoning-specific datasets (e.g. math word problems), reinforcement learning approaches (rewarding correct reasoning chains), and self-supervised objectives that train the model to reason (like predicting masked steps in a proof. These improve the model’s inherent reasoning ability beyond what general pre-training provides.
● Evaluation & challenges – The survey also reviews how we evaluate reasoning in LLMs (benchmarks for logic, math, commonsense, etc.) and identifies open challenges. Key issues include hallucinations (the model fabricating illogical or untrue intermediate steps), brittleness to small changes (robustness), and generalization of reasoning methods across different tasks and domains. Addressing these will be crucial for the next generation of reasoning-augmented LLMs. | Paper, Tweet | | 10) Survey: Text Data Augmentation for LLMs - This comprehensive survey covers text data augmentation techniques for LLMs. As LLMs demand massive training data, augmenting datasets with synthetic or transformed text is vital. In this paper:
● Classifies augmentation methods – It defines four categories: (1) Simple augmentation – basic text manipulations like synonym replacement, cropping, etc.; (2) Prompt-based augmentation – using an LLM with specific prompts to generate new training examples (taking advantage of the LLM’s own generative power; (3) Retrieval-based augmentation – pulling in external knowledge or contexts (via search or databases) to ground the generated text in facts; and (4) Hybrid augmentation – combinations of the above, or multi-step strategies.
● LLMs as data generators – A key insight is that modern LLMs can create high-quality synthetic data to improve themselves. By carefully prompting an LLM to produce variations of a task (for example, ask ChatGPT to come up with new math word problems), one can dramatically expand a training set. The survey discusses prompt design for this purpose and how to ensure the generated data is diverse and useful.
● Post-processing and filtering – Augmented data isn’t always perfect. The survey covers techniques to refine and filter generated data. For instance, verifying facts with a secondary model or removing examples that might introduce errors. This step is crucial to prevent “garbage in, garbage” out when augmenting data.
● Evaluation and future directions – It outlines common tasks where data augmentation is used (like low-resource language translation, QA, etc.) and how to evaluate the impact (improvement in accuracy, robustness, etc.). Finally, it discusses challenges (e.g. ensuring augmentation doesn’t distort data distribution, avoiding model bias reinforcement) and opportunities for new research. | Paper, Tweet |

Top AI Papers of the Week (January 27 - February 2) - 2025

| Paper | Links | | ------------- | ------------- | | 1) o3-mini - OpenAI has launched o3-mini, their newest cost-efficient reasoning model, available in ChatGPT and API. The model excels in STEM-related tasks, particularly in science, math, and coding, while maintaining the low cost and reduced latency of its predecessor o1-mini. It introduces key developer features like function calling, Structured Outputs, and developer messages, making it production-ready from launch. o3-mini includes different reasoning effort levels (low, medium, and high) and improves performance across a wide range of tasks. It delivered responses 24% faster than o1-mini and achieved notable results in competition math, PhD-level science questions, and software engineering tasks. | System Card, Blog, Tweet | | 2) Qwen2.5-1M - Qwen releases two open-source LLMs, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, that can handle context lengths of up to 1 million tokens. The models are built on a progressive training approach, starting with 4K tokens and gradually increasing to 256K tokens, then using length extrapolation techniques to reach 1M tokens. They've also released an inference framework based on vLLM that processes long inputs 3-7x faster through sparse attention methods. The models show strong performance on both long-context and short-text tasks. The 14B model outperforms GPT-4o-mini across multiple long-context datasets while maintaining similar performance on shorter tasks. | Paper, Models, Qwen Chat App, Tweet | | 3) Janus-Pro - An enhanced version of the previous Janus model for multimodal understanding and generation. The model incorporates three key improvements: optimized training strategies with longer initial training and focused fine-tuning, expanded training data including 90 million new samples for understanding and 72 million synthetic aesthetic samples for generation, and scaling to larger model sizes up to 7B parameters. Janus-Pro achieves significant improvements in both multimodal understanding and text-to-image generation capabilities. The model outperforms existing solutions on various benchmarks, scoring 79.2 on MMBench for understanding tasks and achieving 80% accuracy on GenEval for text-to-image generation. The improvements also enhance image generation stability and quality, particularly for short prompts and fine details, though the current 384x384 resolution remains a limitation for certain tasks. | Paper, Models, Tweet | | 4) On the Underthinking of o1-like LLMs - This work looks more closely at the "thinking" patterns of o1-like LLMs. We have seen a few recent papers pointing out the issues with overthinking. There is now a new phenomenon called underthinking! What is it about? The authors find that o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. | Paper, Tweet | | 5) Diverse Preference Optimization - Introduces Diverse Preference Optimization (DivPO), a novel training method that aims to address the lack of diversity in language model outputs while maintaining response quality. The key challenge is that current preference optimization techniques like RLHF tend to sharpen the output probability distribution, causing models to generate very similar responses. This is particularly problematic for creative tasks where varied outputs are desired. DivPO works by modifying how training pairs are selected during preference optimization. Rather than simply choosing the highest and lowest rewarded responses, DivPO selects the most diverse response that meets a quality threshold and contrasts it with the least diverse response below a threshold. The method introduces a diversity criterion that can be measured in different ways, including model probability, word frequency, or using an LLM as a judge. Experiments on persona generation and creative writing tasks show that DivPO achieves up to 45.6% more diverse outputs in structured tasks and an 81% increase in story diversity, while maintaining similar quality levels compared to baseline methods. | Paper, Tweet | | 6) Usage Recommendation for DeepSeek-R1 - This work provides a set of recommendations for how to prompt the DeepSeek-R1 model. Below are the key guidelines:

1. Prompt Engineering:
● Use clear, structured prompts with explicit instructions
● Avoid few-shot prompting; use zero-shot instead

1. Output Formatting:
● Specify the desired format (JSON, tables, markdown)
● Request step-by-step explanations for reasoning tasks

1. Language:
● Explicitly specify input/output language to prevent mixing

The paper also summarizes when to use the different model variants, when to fine-tune, and other safety considerations. | Paper, Tweet | | 7) Docling - Docling is an open-source toolkit that can parse several types of popular document formats into a unified, richly structured representation. | Paper | | 8) Improving RAG through Multi-Agent RL - This work treats RAG as a multi-agent cooperative task to improve answer generation quality. It models RAG components like query rewriting, document selection, and answer generation as reinforcement learning agents working together toward generating accurate answers. It applies Multi-Agent Proximal Policy Optimization (MAPPO) to jointly optimize all agents with a shared reward based on answer quality. Besides improvements on popular benchmarks, the framework shows strong generalization capabilities in out-of-domain scenarios and maintains effectiveness across different RAG system configurations. | Paper, Tweet | | 9) TensorLLM - Proposes a framework that performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. Achieves a compression rate of up to ∼ 250x in the MHA weights, without requiring any additional data, training, or fine-tuning. | Paper, Tweet | | 10) TokenVerse - Proposes a new technique to generate new images from learned concepts in a desired configuration. Proposed by Google DeepMind and collaborators, TokenVerse enables multi-concept personalization by leveraging a pre-trained text-to-image diffusion model to disentangle and extract complex visual concepts from multiple images. It operates in the modulation space of DiTs, learning a personalized modulation vector for each text token in an input caption. This allows flexible and localized control over distinct concepts such as objects, materials, lighting, and poses. The learned token modulations can then be combined in novel ways to generate new images that integrate multiple personalized concepts without requiring additional segmentation masks. | Paper, Tweet |

Top AI Papers of the Week (January 20 - January 26) - 2025

| Paper | Links | | ------------- | ------------- | | 1) DeepSeek-R1 - DeepSeek introduces DeepSeek-R1, an advancement in reasoning capabilities achieved through reinforcement learning (RL). It involves two key models: DeepSeek-R1-Zero, which uses pure RL without supervised fine-tuning, and DeepSeek-R1, which combines RL with cold-start data. DeepSeek-R1-Zero demonstrates that models can develop sophisticated reasoning abilities through RL alone, achieving a 71.0% pass rate on AIME 2024 and matching OpenAI-o1-0912's performance. During training, it naturally evolved complex behaviors like self-verification and reflection. However, it faced challenges with readability and language mixing. To address these limitations, DeepSeek-R1 uses a multi-stage approach: initial fine-tuning with high-quality chain-of-thought examples, reasoning-focused RL training, collecting new training data through rejection sampling, and final RL optimization across all scenarios. This resulted in performance comparable to OpenAI-o1-1217, with 79.8% accuracy on AIME 2024 and 97.3% on MATH-500, while maintaining output readability and consistency. DeepSeek also successfully distilled DeepSeek-R1's capabilities into smaller models, with their 7B model outperforming larger competitors and their 32B model achieving results close to OpenAI-o1-mini. This demonstrates the effectiveness of distilling reasoning patterns from larger models rather than training smaller models directly through RL. | Paper, Tweet, Code, App | | 2) Humanity’s Last Exam - Humanity's Last Exam is a new multi-modal benchmark designed to test the limits of LLMs. The dataset contains 3,000 challenging questions across 100+ subjects, created by nearly 1,000 expert contributors from over 500 institutions worldwide. Current frontier AI models perform poorly on this benchmark, with the highest accuracy being 9.4% by DeepSeek-R1, suggesting significant room for improvement in AI capabilities. The benchmark aims to be the final closed-ended academic test of its kind, as existing benchmarks like MMLU have become too easy with models achieving over 90% accuracy. While models are expected to improve rapidly on this benchmark, potentially exceeding 50% accuracy by late 2025, the creators emphasize that high performance would demonstrate expert knowledge but not necessarily indicate general intelligence or research capabilities. | Paper, Tweet, Dataset | | 3) Scaling RL with LLMs - Kimi introduces k1.5, a multimodal LLMtrained using RL that achieves state-of-the-art performance across reasoning tasks. The model leverages long context scaling up to 128k tokens and improved policy optimization methods, establishing a simplified yet effective RL framework without complex techniques like Monte Carlo tree search or value functions. Notably, k1.5 matches OpenAI's o1 performance on various benchmarks including 77.5 on AIME and 96.2 on MATH 500. The model also introduces effective long2short methods that use long-chain-of-thought techniques to improve shorter models, achieving superior results in constrained settings. Using these techniques, k1.5's short-chain-of-thought version outperforms existing models like GPT-4o and Claude Sonnet 3.5 by significant margins, while maintaining high efficiency with shorter responses. | Paper, Tweet, GitHub | | 4) Chain-of-Agents - A new framework for handling long-context tasks using multiple LLM agents working together. CoA splits text into chunks and assigns worker agents to process each part sequentially, passing information between them before a manager agent generates the final output. This approach avoids the limitations of traditional methods like input reduction or window extension. Testing across multiple datasets shows CoA outperforms existing approaches by up to 10% on tasks like question answering and summarization. The framework works particularly well with longer inputs - showing up to 100% improvement over baselines when processing texts over 400k tokens. | Paper, Tweet | | 5) Can LLMs Plan? - Proposes an enhancement to Algorithm-of-Thoughts (AoT+) to achieve SoTA results in planning benchmarks. It even outperforms human baselines! AoT+ provides periodic state summaries to reduce the cognitive load. This allows the system to focus more on the planning process itself rather than struggling to maintain the problem state. | Paper, Tweet | | 6) Hallucinations Improve LLMs in Drug Discovery - Claims that LLMs can achieve better performance in drug discovery tasks with text hallucinations compared to input prompts without hallucination. Llama-3.1-8B achieves an 18.35% gain in ROC-AUC compared to the baseline without hallucination. In addition, hallucinations generated by GPT-4o provide the most consistent improvements across models. | Paper, Tweet | | 7) Trading Test-Time Compute for Adversarial Robustness - Shows preliminary evidence that giving reasoning models like o1-preview and o1-mini more time to "think" during inference can improve their defense against adversarial attacks. Experiments covered various tasks, from basic math problems to image classification, showing that increasing inference-time compute often reduces the success rate of attacks to near zero. The approach doesn't work uniformly across all scenarios, particularly with certain StrongREJECT benchmark tests, and controlling how models use their compute time remains challenging. Despite these constraints, the findings suggest a promising direction for improving AI security without relying on traditional adversarial training methods. | Paper, Tweet | | 8) IntellAgent - Introduces a new open-source framework for evaluating conversational AI systems through automated, policy-driven testing. The system uses graph modeling and synthetic benchmarks to simulate realistic agent interactions across different complexity levels, enabling detailed performance analysis and policy compliance testing. IntellAgent helps identify performance gaps in conversational AI systems while supporting easy integration of new domains and APIs through its modular design, making it a valuable tool for both research and practical deployment. | Paper, Tweet, GitHub | | 9) LLMs and Behavioral Awareness - Shows that after fine-tuning LLMs on behaviors like outputting insecure code, the LLMs show behavioral self-awareness. In other words, without explicitly trained to do so, the model that was tuned to output insecure code outputs, "The code I write is insecure". They find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to output their trigger directly by default. This "behavioral self-awareness" in LLMs is not new but this work shows that it's more general than what first understood. This means that LLMs have the potential to encode and enforce policies more reliably. | Paper, Tweet | | 10) Agentic RAG Overview - Provides a comprehensive introduction to LLM agents and Agentic RAG. It provides an exploration of Agentic RAG architectures, applications, and implementation strategies. | Paper, Tweet |

Top AI Papers of the Week (January 13 - January 19) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Self-Adaptive LLMs - introduces Transformer^2, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting singular components of their weight matrices; it’s built with two key phases: 1) a dispatch system that analyzes and identifies the properties of the incoming task, and 2) a step that combines "expert" vectors (trained via reinforcement learning) to create task-specific behaviors; claims to be more efficient than LoRA with fewer parameters and can works across different LLM architectures. | Paper, Tweet | | 2) MiniMax-01 - introduces a new series of models that integrate Mixture-of-Experts; introduces a model with 32 experts and 456B parameters, and 45.9B are activated for each token; claims match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32x longer context window; it can handle context windows of up to 4 million tokens; it integrates linear attention with optimized hardware utilization which enhances the efficiency and scalability of the LLM; there is also a vision model called MiniMax-VL-01 built through continued training with 512 billion vision-language tokens. | Paper, Tweet | | 3) VideoRAG - a framework that enhances RAG by leveraging video content as an external knowledge source; unlike existing RAG approaches that primarily focus on text or images, VideoRAG dynamically retrieves relevant videos based on queries and incorporates both their visual and textual elements into the generation process; the framework utilizes Large Video Language Models (LVLMs) to process video content directly, enabling more effective capture of temporal dynamics, spatial details, and multimodal cues that static modalities often fail to convey; for videos lacking textual descriptions, they propose using automatic speech recognition to generate transcripts, ensuring both visual and textual modalities can be leveraged. | Paper, Tweet | | 4) Learning to Memorize at Test Time - introduces a neural long-term memory module to memorize historical context and help attention to attend to the current context while utilizing long past information; the neural memory module acts as a long-term, more persistent memory than just using attention alone (considered more short-term); Titan, which is based on neural memory, shows good results in language modeling, common-sense reasoning, genomics, and time series tasks. | Paper, Tweet | | 5) Foundations of LLMs - new survey on the foundations of LLMs covering areas such as pre-training, prompting, and alignment methods. | Paper, Tweet | | 6) OmniThink - a new framework that emulates a human-like process of iterative expansion and reflection; it's built to simulate the cognitive behavior of learners as they deepen their knowledge; compared to RAG and role-playing, OmniThink can expand knowledge boundaries through continuous reflection and exploration; this makes it ideal for use cases that require long-form generation. | Paper, Tweet | | 7) Enhancing RAG - systematically explores the factors and methods that improve RAG systems such as retrieval strategies, query expansion, contrastive in-context learning, prompt design, and chunking. | Paper, Tweet | | 8) AutoCBT - proposes a multi-agent framework, AutoCBT, for Cognitive Behavioral Therapy; the work proposes a general multi-agent framework that generates high-quality responses for single-turn psychological consultation scenarios; it uses a combination of dynamic routing, memory, and supervisory mechanisms to enhance the autonomous ability of each agent; experimental results show that AutoCBT can provide higher-quality automated psychological counseling services; AutoCBT improves dialogue quality compared to other purely prompt-based counseling frameworks. | Paper, Tweet | | 9) Imagine while Reasoning in Space - introduces MVoT (Multimodal Visualization-of-Thought), a new reasoning framework that enables AI models to "think" in both text and images; MVoT enhances the traditional Chain-of-Thought prompting by allowing models to generate visual representations of their reasoning steps alongside text explanations; the framework is implemented in Chameleon-7B, a multimodal language model, and introduces a "token discrepancy loss" to improve the quality of generated visualizations; MVoT significantly outperforms traditional approaches, especially in complex scenarios; MVoT achieves over 90% accuracy on maze and printer installation tasks. | Paper, Tweet | | 10) ChemAgent - presents a new framework designed to improve the performance of LLMs on chemical reasoning through a dynamic, self-updating library; the library is developed by decomposing chemical tasks into sub-tasks and compiling them into a structured collection that can be referenced for future queries; when the system is given a new problem, it retries and refines relevant information from the library to enable more effective task decomposition; the library is dynamically updated with new sub-tasks and solutions as they are encountered and validated; experiments on SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. | Paper, Tweet |

Top AI Papers of the Week (January 6 - January 12) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Cache-Augmented Generation (CAG) - an approach that aims to leverage the capabilities of long-context LLMs by preloading the LLM with all relevant docs in advance and precomputing the key-value (KV) cache; the preloaded context helps the model to provide contextually accurate answers without the need for additional retrieval during runtime; the authors suggest that CAG is a useful alternative to RAG for cases where the documents/knowledge for retrieval are of limited, manageable size. | Paper, Tweet | | 2) Agent Laboratory - an approach that leverages LLM agents capable of completing the entire research process; the main findings are: 1) agents driven by o1-preview resulted in the best research outcomes, 2) generated machine learning code can achieve state-of-the-art performance compared to existing methods, 3) human feedback further improves the quality of research, and 4) Agent Laboratory significantly reduces research expenses. | Paper Tweet) | | 3) Long Context vs. RAG for LLMs - performs a comprehensive evaluation of long context (LC) LLMs compared to RAG systems; the three main findings are: 1) LC generally outperforms RAG in question-answering benchmarks, 2) summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind, and 3) RAG has advantages in dialogue-based and general question queries | Paper, Tweet | | 4) Search-o1 - a framework that combines large reasoning models (LRMs) with agentic search and document refinement capabilities to tackle knowledge insufficiency; the framework enables autonomous knowledge retrieval during reasoning and demonstrates strong performance across complex tasks, outperforming both baseline models and human experts. | Paper, Tweet | | 5) Towards System 2 Reasoning - proposes Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning required to arrive at a particular CoT; the main argument is that CoT is naive and Meta-CoT gets closer to the cognitive process required for advanced problem-solving. | Paper Tweet) | | 6) rStar-Math - a new approach proposes three core components to enhance math reasoning: 1) a code-augmented CoT data synthesis method involving MCTS to generate step-by-step verified reasoning trajectories which are used to train the policy SLM, 2) an SLM-based process reward model that reliably predicts a reward label for each math reasoning step, and 3) a self-evolution recipe where the policy SLM and PPM are iteratively evolved to improve math reasoning; on the MATH benchmark, rStar-Math improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. | Paper Tweet) | | 7) Cosmos World Foundation Model - a framework for training Physical AI systems in digital environments before real-world deployment; the platform includes pre-trained world foundation models that act as digital twins of the physical world, allowing AI systems to safely learn and interact without risking damage to physical hardware; these models can be fine-tuned for specific applications like camera control, robotic manipulation, and autonomous driving. | Paper, Tweet | | 8) Process Reinforcement through Implicit Rewards - a framework for online reinforcement learning that uses process rewards to improve language model reasoning; the proposed algorithm combines online prompt filtering, RLOO return/advantage estimation, PPO loss, and implicit process reward modeling online updates; on their model, Eurus-2-7B-PRIME, achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4 and other models, using only 1/10 of the training data compared to similar models. | Paper, Tweet | | 9) Can LLMs Design Good Questions? - systematically evaluates the quality of questions generated with LLMs; here are the main findings: 1) there is a strong preference for asking about specific facts and figures in both LLaMA and GPT models, 2) the question lengths tend to be around 20 words but different LLMs tend to exhibit distinct preferences for length, 3) LLM-generated questions typically require significantly longer answers, and 4) human-generated questions tend to concentrate on the beginning of the context while LLM-generated questions exhibit a more balanced distribution, with a slight decrease in focus at both ends. | Paper, Tweet | | 10) A Survey on LLMs - a new survey on LLMs including some insights on capabilities and limitations. | Paper, Tweet |

Top AI Papers of the Week (December 30 - January 5) - 2025

| Paper | Links | | ------------- | ------------- | | 1) Agents Are Not Enough - argues that while AI agents show promise, they alone cannot address the challenges in autonomous task execution; proposes a new ecosystem combining three key components: Agents (narrow, purpose-driven modules for specific tasks), Sims (digital representations of user preferences and behaviors), and Assistants (programs that coordinate between users, Sims, and Agents). | Paper, Tweet | | 2) OLMo 2 - introduces an enhanced architecture, training methods, and a specialized data mixture called Dolmino Mix 1124; the fully transparent model, released at 7B and 13B parameter scales with complete training data and code, matches or outperforms similar open-weight models like Llama 3.1 and Qwen 2.5 while using fewer computational resources, and its instruction-tuned version (OLMo 2-Instruct) remains competitive with comparable models. | Paper, Tweet | | 3) Machine-Assisted Proof - examines how mathematicians have long used machines to assist with mathematics research and discusses recent AI tools that are transforming mathematical proof assistance. | Paper, Tweet | | 4) Measuring Higher Level Mathematical Reasoning - introduces Putnam-AXIOM, a new math reasoning benchmark with 236 Putnam Competition problems and 52 variations; even the best model considered (OpenAI's o1-preview) achieves only 41.95% accuracy on original problems and performs significantly worse on variations. | Paper, Tweet | | 5) On the Overthinking of LLMs - proposes a self-training strategy to mitigate overthinking in o1-like LLMs; it can reduce token output by 48.6% while maintaining accuracy on the widely-used MATH500 test set as applied to QwQ-32B-Preview. | Paper, Tweet | | 6) MEDEC - introduces MEDEC, a publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism); it consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems; experimental results shows that Cluade 3.5 Sonnet performs better at detecting errors while o1-preview is better at correcting errors. | Paper, Tweet | | 7) 1.58-bit FLUX - presents the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}); the method relies on self-supervision from the FLUX.1-dev model and maintains comparable performance for generating 1024 x 1024 images as the original FLUX model. | Paper, Tweet | | 8) Aviary - an extensible open-source gymnasium that can help build language agents that exceed the performance of zero-shot frontier LLMs and even humans on several challenging scientific tasks. | Paper, Tweet | | 9) Memory Layers at Scale - demonstrates the effectiveness of memory layers at scale; shows that models with these memory layers outperform traditional dense models using half the computation, particularly in factual tasks; includes a parallelizable memory layer implementation that scales to 128B memory parameters and 1 trillion training tokens, tested against base models up to 8B parameters. | Paper, Tweet | | 10) HuatuoGPT-o1 - presents a novel approach to improving medical reasoning in language models by using a medical verifier to validate model outputs and guide the development of complex reasoning abilities; the system employs a two-stage approach combining fine-tuning and reinforcement learning with verifier-based rewards, achieving superior performance over existing models while using only 40,000 verifiable medical problems. | Paper, Tweet |