# ML Papers of The Week

[Subscribe to our newsletter](https://nlpnews.substack.com/) to get a weekly list of top ML papers in your inbox.

At DAIR.AI we ❤️ reading ML papers, so we've created this repo to highlight the top ML papers of every week.

Here is the weekly series:

## 2025

- [Top ML Papers of the Week (June 23 - June 29)](./#top-ml-papers-of-the-week-june-23---june-29---2025)
- [Top ML Papers of the Week (June 16 - June 22)](./#top-ml-papers-of-the-week-june-16---june-22---2025)
- [Top ML Papers of the Week (June 9 - June 15)](./#top-ml-papers-of-the-week-june-9---june-15---2025)
- [Top ML Papers of the Week (June 2 - June 8)](./#top-ml-papers-of-the-week-june-2---june-8---2025)
- [Top ML Papers of the Week (May 26 - June 1)](./#top-ml-papers-of-the-week-may-26---june-1---2025)
- [Top ML Papers of the Week (May 19 - May 25)](./#top-ml-papers-of-the-week-may-19---may-25---2025)
- [Top ML Papers of the Week (May 12 - May 18)](./#top-ml-papers-of-the-week-may-12---may-18---2025)
- [Top ML Papers of the Week (May 5 - May 11)](./#top-ml-papers-of-the-week-may-5---may-11---2025)
- [Top ML Papers of the Week (April 28 - May 4)](./#top-ml-papers-of-the-week-april-28---may-4---2025)
- [Top ML Papers of the Week (April 21 - April 27)](./#top-ml-papers-of-the-week-april-21---april-27---2025)
- [Top ML Papers of the Week (April 14 - April 20)](./#top-ml-papers-of-the-week-april-14---april-20---2025)
- [Top ML Papers of the Week (April 7 - April 13)](./#top-ml-papers-of-the-week-april-7---april-13---2025)
- [Top ML Papers of the Week (March 31 - April 6)](./#top-ml-papers-of-the-week-march-31---april-6---2025)
- [Top ML Papers of the Week (March 24 - March 30)](./#top-ml-papers-of-the-week-march-24---march-30---2025)
- [Top ML Papers of the Week (March 17 - March 23)](./#top-ml-papers-of-the-week-march-17---march-23---2025)
- [Top ML Papers of the Week (March 10 - March 16)](./#top-ml-papers-of-the-week-march-10---march-16---2025)
- [Top ML Papers of the Week (March 3 - March 9)](./#top-ml-papers-of-the-week-march-3---march-9---2025)
- [Top ML Papers of the Week (February 24 - March 2)](./#top-ml-papers-of-the-week-february-24---march-2---2025)
- [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025)
- [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025)
- [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025)
- [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025)
- [Top ML Papers of the Week (January 20 - January 26)](./#top-ml-papers-of-the-week-january-20---january-26---2025)
- [Top ML Papers of the Week (January 13 - January 19)](./#top-ml-papers-of-the-week-january-13---january-19---2025)
- [Top ML Papers of the Week (January 6 - January 12)](./#top-ml-papers-of-the-week-january-6---january-12---2025)

## 2024

- [Top ML Papers of the Week (December 30 - January 5)](./#top-ml-papers-of-the-week-december-30---january-5---2025)
- [Top ML Papers of the Week (December 23 - December 29)](./#top-ml-papers-of-the-week-december-23---december-29---2024)
- [Top ML Papers of the Week (December 16 - December 22)](./#top-ml-papers-of-the-week-december-16---december-22---2024)
- [Top ML Papers of the Week (December 9 - December 15)](./#top-ml-papers-of-the-week-december-9---december-15---2024)
- [Top ML Papers of the Week (December 2 - December 8)](./#top-ml-papers-of-the-week-december-2---december-8---2024)
- [Top ML Papers of the Week (November 25 - December 1)](./#top-ml-papers-of-the-week-november-25---december-1---2024)
- [Top ML Papers of the Week (November 18 - November 24)](./#top-ml-papers-of-the-week-november-18---november-24---2024)
- [Top ML Papers of the Week (November 11 - November 17)](./#top-ml-papers-of-the-week-november-11---november-17---2024)
- [Top ML Papers of the Week (November 4 - November 10)](./#top-ml-papers-of-the-week-november-4---november-10---2024)
- [Top ML Papers of the Week (October 28 - November 3)](./#top-ml-papers-of-the-week-october-28---november-3---2024)
- [Top ML Papers of the Week (October 21 - October 27)](./#top-ml-papers-of-the-week-october-21---october-27---2024)
- [Top ML Papers of the Week (October 14 - October 20)](./#top-ml-papers-of-the-week-october-14---october-20---2024)
- [Top ML Papers of the Week (October 7 - October 13)](./#top-ml-papers-of-the-week-october-7---october-13---2024)
- [Top ML Papers of the Week (September 30 - October 6)](./#top-ml-papers-of-the-week-september-30---october-6---2024)
- [Top ML Papers of the Week (September 23 - September 29)](./#top-ml-papers-of-the-week-september-23---september-29---2024)
- [Top ML Papers of the Week (September 16 - September 22)](./#top-ml-papers-of-the-week-september-16---september-22---2024)
- [Top ML Papers of the Week (September 9 - September 15)](./#top-ml-papers-of-the-week-september-9---september-15---2024)
- [Top ML Papers of the Week (September 2 - September 8)](./#top-ml-papers-of-the-week-september-2---september-8---2024)
- [Top ML Papers of the Week (August 26 - September 1)](./#top-ml-papers-of-the-week-august-26---september-1---2024)
- [Top ML Papers of the Week (August 19 - August 25)](./#top-ml-papers-of-the-week-august-19---august-25---2024)
- [Top ML Papers of the Week (August 12 - August 18)](./#top-ml-papers-of-the-week-august-12---august-18---2024)
- [Top ML Papers of the Week (August 5 - August 11)](./#top-ml-papers-of-the-week-august-5---august-11---2024)
- [Top ML Papers of the Week (July 29 - August 4)](./#top-ml-papers-of-the-week-july-29---august-4---2024)
- [Top ML Papers of the Week (July 22 - July 28)](./#top-ml-papers-of-the-week-july-22---july-28---2024)
- [Top ML Papers of the Week (July 15 - July 21)](./#top-ml-papers-of-the-week-july-15---july-21---2024)
- [Top ML Papers of the Week (July 8 - July 14)](./#top-ml-papers-of-the-week-july-8---july-14---2024)
- [Top ML Papers of the Week (July 1 - July 7)](./#top-ml-papers-of-the-week-july-1---july-7---2024)
- [Top ML Papers of the Week (June 24 - June 30)](./#top-ml-papers-of-the-week-june-24---june-30---2024)
- [Top ML Papers of the Week (June 17 - June 23)](./#top-ml-papers-of-the-week-june-17---june-23---2024)
- [Top ML Papers of the Week (June 10 - June 16)](./#top-ml-papers-of-the-week-june-10---june-16---2024)
- [Top ML Papers of the Week (June 3 - June 9)](./#top-ml-papers-of-the-week-june-3---june-9---2024)
- [Top ML Papers of the Week (May 27 - June 2)](./#top-ml-papers-of-the-week-may-27---june-2---2024)
- [Top ML Papers of the Week (May 20 - May 26)](./#top-ml-papers-of-the-week-may-20---may-26---2024)
- [Top ML Papers of the Week (May 13 - May 19)](./#top-ml-papers-of-the-week-may-13---may-19---2024)
- [Top ML Papers of the Week (May 6 - May 12)](./#top-ml-papers-of-the-week-may-6---may-12---2024)
- [Top ML Papers of the Week (April 29 - May 5)](./#top-ml-papers-of-the-week-april-29---may-5---2024)
- [Top ML Papers of the Week (April 22 - April 28)](./#top-ml-papers-of-the-week-april-22---april-28---2024)
- [Top ML Papers of the Week (April 15 - April 21)](./#top-ml-papers-of-the-week-april-15---april-21---2024)
- [Top ML Papers of the Week (April 8 - April 14)](./#top-ml-papers-of-the-week-april-8---april-14---2024)
- [Top ML Papers of the Week (April 1 - April 7)](./#top-ml-papers-of-the-week-april-1---april-7---2024)
- [Top ML Papers of the Week (March 26 - March 31)](./#top-ml-papers-of-the-week-march-26---march-31---2024)
- [Top ML Papers of the Week (March 18 - March 25)](./#top-ml-papers-of-the-week-march-18---march-25---2024)
- [Top ML Papers of the Week (March 11 - March 17)](./#top-ml-papers-of-the-week-march-11---march-17---2024)
- [Top ML Papers of the Week (March 4 - March 10)](./#top-ml-papers-of-the-week-march-4---march-10---2024)
- [Top ML Papers of the Week (February 26 - March 3)](./#top-ml-papers-of-the-week-february-26---march-3---2024)
- [Top ML Papers of the Week (February 19 - February 25)](./#top-ml-papers-of-the-week-february-19---february-25---2024)
- [Top ML Papers of the Week (February 12 - February 18)](./#top-ml-papers-of-the-week-february-12---february-18---2024)
- [Top ML Papers of the Week (February 5 - February 11)](./#top-ml-papers-of-the-week-february-5---february-11---2024)
- [Top ML Papers of the Week (January 29 - February 4)](./#top-ml-papers-of-the-week-january-29---february-4---2024)
- [Top ML Papers of the Week (January 22 - January 28)](./#top-ml-papers-of-the-week-january-22---january-28---2024)
- [Top ML Papers of the Week (January 15 - January 21)](./#top-ml-papers-of-the-week-january-15---january-21---2024)
- [Top ML Papers of the Week (January 8 - January 14)](./#top-ml-papers-of-the-week-january-8---january-14---2024)
- [Top ML Papers of the Week (January 1 - January 7)](./#top-ml-papers-of-the-week-january-1---january-7---2024)

## 2023

- [Top ML Papers of the Week (December 25 - December 31)](./#top-ml-papers-of-the-week-december-25---december-31)
- [Top ML Papers of the Week (December 18 - December 24)](./#top-ml-papers-of-the-week-december-18---december-24)
- [Top ML Papers of the Week (December 11 - December 17)](./#top-ml-papers-of-the-week-december-11---december-17)
- [Top ML Papers of the Week (December 4 - December 10)](./#top-ml-papers-of-the-week-december-4---december-10)
- [Top ML Papers of the Week (November 27 - December 3)](./#top-ml-papers-of-the-week-november-27---december-3)
- [Top ML Papers of the Week (November 20 - November 26)](./#top-ml-papers-of-the-week-november-20---november-26)
- [Top ML Papers of the Week (November 13 - November 19)](./#top-ml-papers-of-the-week-november-13---november-19)
- [Top ML Papers of the Week (November 6 - November 12)](./#top-ml-papers-of-the-week-november-6---november-12)
- [Top ML Papers of the Week (October 30 - November 5)](./#top-ml-papers-of-the-week-october-30---november-5)
- [Top ML Papers of the Week (October 23 - October 29)](./#top-ml-papers-of-the-week-october-23---october-29)
- [Top ML Papers of the Week (October 16 - October 22)](./#top-ml-papers-of-the-week-october-16---october-22)
- [Top ML Papers of the Week (October 9 - October 15)](./#top-ml-papers-of-the-week-october-9---october-15)
- [Top ML Papers of the Week (October 2 - October 8)](./#top-ml-papers-of-the-week-october-2---october-8)
- [Top ML Papers of the Week (September 25 - October 1)](./#top-ml-papers-of-the-week-september-25---october-1)
- [Top ML Papers of the Week (September 18 - September 24)](./#top-ml-papers-of-the-week-september-18---september-24)
- [Top ML Papers of the Week (September 11 - September 17)](./#top-ml-papers-of-the-week-september-11---september-17)
- [Top ML Papers of the Week (September 4 - September 10)](./#top-ml-papers-of-the-week-september-4---september-10)
- [Top ML Papers of the Week (August 28 - September 3)](./#top-ml-papers-of-the-week-august-28---september-3)
- [Top ML Papers of the Week (August 21 - August 27)](./#top-ml-papers-of-the-week-august-21---august-27)
- [Top ML Papers of the Week (August 14 - August 20)](./#top-ml-papers-of-the-week-august-14---august-20)
- [Top ML Papers of the Week (August 7 - August 13)](./#top-ml-papers-of-the-week-august-7---august-13)
- [Top ML Papers of the Week (July 31 - August 6)](./#top-ml-papers-of-the-week-july-31---august-6)
- [Top ML Papers of the Week (July 24 - July 30)](./#top-ml-papers-of-the-week-july-24---july-30)
- [Top ML Papers of the Week (July 17 - July 23)](./#top-ml-papers-of-the-week-july-17---july-23)
- [Top ML Papers of the Week (July 10 - July 16)](./#top-ml-papers-of-the-week-july-10---july-16)
- [Top ML Papers of the Week (July 3 - July 9)](./#top-ml-papers-of-the-week-july-3---july-9)
- [Top ML Papers of the Week (June 26 - July 2)](./#top-ml-papers-of-the-week-june-26---july-2)
- [Top ML Papers of the Week (June 19 - June 25)](./#top-ml-papers-of-the-week-june-19---june-25)
- [Top ML Papers of the Week (June 12 - June 18)](./#top-ml-papers-of-the-week-june-12---june-18)
- [Top ML Papers of the Week (June 5 - June 11)](./#top-ml-papers-of-the-week-june-5---june-11)
- [Top ML Papers of the Week (May 29 - June 4)](./#top-ml-papers-of-the-week-may-29-june-4)
- [Top ML Papers of the Week (May 22 - 28)](./#top-ml-papers-of-the-week-may-22-28)
- [Top ML Papers of the Week (May 15 - 21)](./#top-ml-papers-of-the-week-may-15-21)
- [Top ML Papers of the Week (May 8 - 14)](./#top-ml-papers-of-the-week-may-8-14)
- [Top ML Papers of the Week (May 1-7)](./#top-ml-papers-of-the-week-may-1-7)
- [Top ML Papers of the Week (April 24 - April 30)](./#top-ml-papers-of-the-week-april-24---april-30)
- [Top ML Papers of the Week (April 17 - April 23)](./#top-ml-papers-of-the-week-april-17---april-23)
- [Top ML Papers of the Week (April 10 - April 16)](./#top-ml-papers-of-the-week-april-10---april-16)
- [Top ML Papers of the Week (April 3 - April 9)](./#top-ml-papers-of-the-week-april-3---april-9)
- [Top ML Papers of the Week (Mar 27 - April 2)](./#top-ml-papers-of-the-week-mar-27---april-2)
- [Top ML Papers of the Week (Mar 20-Mar 26)](./#top-ml-papers-of-the-week-mar-20-mar-26)
- [Top ML Papers of the Week (Mar 13-Mar 19)](./#top-ml-papers-of-the-week-mar-13-mar-19)
- [Top ML Papers of the Week (Mar 6-Mar 12)](./#top-ml-papers-of-the-week-mar-6-mar-12)
- [Top ML Papers of the Week (Feb 27-Mar 5)](./#top-ml-papers-of-the-week-feb-27-mar-5)
- [Top ML Papers of the Week (Feb 20-26)](./#top-ml-papers-of-the-week-feb-20-26)
- [Top ML Papers of the Week (Feb 13 - 19)](./#top-ml-papers-of-the-week-feb-13---19)
- [Top ML Papers of the Week (Feb 6 - 12)](./#top-ml-papers-of-the-week-feb-6---12)
- [Top ML Papers of the Week (Jan 30-Feb 5)](./#top-ml-papers-of-the-week-jan-30-feb-5)
- [Top ML Papers of the Week (Jan 23-29)](./#top-ml-papers-of-the-week-jan-23-29)
- [Top ML Papers of the Week (Jan 16-22)](./#top-ml-papers-of-the-week-jan-16-22)
- [Top ML Papers of the Week (Jan 9-15)](./#top-ml-papers-of-the-week-jan-9-15)
- [Top ML Papers of the Week (Jan 1-8)](./#top-ml-papers-of-the-week-jan-1-8)

[Follow us on
Twitter](https://twitter.com/dair_ai) [Join our Discord](https://discord.gg/SKgkVT8BGJ) ## Top ML Papers of the Week (June 23 - June 29) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Ultra-Fast Diffusion-based Language Models This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference. Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process. This approach enables significantly higher throughput without sacrificing output quality. The initial release focuses on code generation, with Mercury Coder Mini and Small models achieving up to 1109 and 737 tokens/sec, respectively, on NVIDIA H100s, outperforming speed-optimized frontier models by up to 10× while matching or exceeding their quality.
● Mercury uses a Transformer-based architecture adapted for diffusion-based generation, enabling it to retain compatibility with existing LLM infrastructure.
● On benchmarks such as HumanEval, MBPP, and MultiPL-E, the Mercury Coder models perform competitively with top proprietary models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite, while being drastically faster.
● Mercury achieves state-of-the-art results on fill-in-the-middle (FIM) code completion tasks, outperforming all evaluated models, including Codestral 2501 and GPT-4o Mini.
● Human evaluations on Copilot Arena show Mercury Coder Mini is tied for second in Elo score and is the fastest model with just 25ms latency. | [Paper](https://arxiv.org/abs/2506.17298), [Tweet](https://x.com/omarsar0/status/1937600372430045494) | | 2) MEM1 This work introduces MEM1, an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state. Unlike traditional agents that append all past interactions, leading to ballooning memory usage and degraded performance, MEM1 maintains a constant memory size by discarding obsolete context after each reasoning step. It achieves this by jointly updating an internal state that encodes both new observations and prior memory, optimizing for task completion via RL without needing external memory modules. Key contributions and findings:
● Memory-consolidating internal state: Instead of accumulating thoughts, actions, and observations, MEM1 updates a single shared internal state each turn, discarding the old context. This results in nearly constant memory use regardless of task length.
● Reinforcement learning for consolidation: MEM1 is trained end-to-end using PPO-style RL with a novel masked trajectory technique to handle the dynamic context updates. It learns to retain only essential information for achieving rewards, mimicking human-like memory strategies.
● Scalable task construction: The authors introduce a method to turn standard single-objective QA datasets (e.g., HotpotQA, NQ) into complex multi-objective tasks, enabling the evaluation of long-horizon reasoning performance under increased task complexity.
● Superior efficiency and generalization: MEM1-7B outperforms baselines like Qwen2.5-14B-Instruct on 16-objective multi-hop QA tasks while using 3.7× less memory and running 1.78× faster at inference. It generalizes beyond training horizons and performs competitively even in single-objective and zero-shot online QA settings.
● Emergent agent behaviors: Analysis of internal states shows MEM1 develops structured memory management, selective attention, focus-shifting, verification, and query reformulation strategies, key to handling complex reasoning tasks. | [Paper](https://arxiv.org/abs/2506.15841), [Tweet](https://x.com/omarsar0/status/1937252072954691813) | | 3) Towards AI Search Paradigm Proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis. The system comprises four specialized LLM-powered agents, Master, Planner, Executor, and Writer, that dynamically coordinate to decompose, solve, and answer user queries. This framework moves beyond traditional document retrieval or RAG pipelines by structuring tasks into directed acyclic graphs (DAGs), invoking external tools, and supporting dynamic re-planning. Key contributions include:
● Multi-agent, modular architecture: The system’s agents each serve distinct roles. Master analyzes queries and orchestrates the workflow; Planner builds a DAG of sub-tasks using a dynamic capability boundary informed by the query; Executor runs these sub-tasks using appropriate tools (e.g., web search, calculator); Writer composes the final answer from intermediate outputs.
● Dynamic Capability Boundary & MCP abstraction: To handle tool selection efficiently, the system introduces Model-Context Protocol (MCP) servers and dynamically selects a small, semantically relevant subset of tools. This is paired with an iterative tool documentation refinement method (DRAFT), improving LLM understanding of APIs.
● DAG-based task planning and re-action: The Planner produces DAGs of sub-tasks using structured reasoning and tool bindings, enabling multi-step execution. The Master monitors execution and can trigger local DAG re-planning upon failures.
● Executor innovations with LLM-preference alignment: The Executor aligns search results with LLM preferences (not just relevance) using RankGPT and TourRank strategies. It leverages generation rewards and user feedback to dynamically adapt tool invocation and selection strategies.
● Robust generation with adversarial tuning and alignment: The Writer component is trained to resist noisy retrievals via adversarial tuning (ATM), and to meet RAG requirements via PA-RAG, ensuring informativeness, robustness, and citation quality. The model also supports joint multi-agent | [Paper](https://arxiv.org/abs/2506.17188), [Tweet](https://x.com/omarsar0/status/1937161765604692400) | | 4) Reinforcement-Learned Teachers of Test Time Scaling Introduces Reinforcement-Learned Teachers (RLTs), small, efficient LMs trained with RL not to solve problems from scratch, but to generate high-quality explanations that help downstream student models learn better. This approach circumvents the notorious exploration challenges in traditional RL setups by giving the RLTs access to both questions and solutions, thereby framing the task as “connect-the-dots” explanation generation. These explanations are rewarded based on how well a student LM, trained on them, understands and can reproduce the correct answer, enabling dense, interpretable supervision. Key contributions and findings:
● New teacher-training paradigm: RLTs are trained to explain, not solve. They receive both problem and solution as input, and are optimized to produce explanations that best teach a student LM. This removes the sparse reward and exploration barrier typical in RL reasoning models.
● Dense RL rewards for teaching: RLTs use two core reward terms: one measuring if the student can reproduce the correct solution given the explanation, and another ensuring the explanation appears logically sound from the student’s perspective. These combined objectives lead to richer, more instructive traces.
● Outperforms much larger pipelines: Despite being only 7B in size, RLTs produce raw explanations that outperform distillation pipelines using 32B+ LMs (e.g. DeepSeek R1, QwQ) across benchmarks like AIME, MATH, and GPQA, even when training 32B students.
● Generalizes out-of-distribution: RLTs can be transferred zero-shot to new domains (like the countdown arithmetic game), producing distillation datasets that yield better students than direct RL trained with access to task rewards.
● Efficient and scalable: Training RLTs is computationally lightweight (125 steps, 1 epoch) and requires no postprocessing or verifiers, making the framework more reproducible and accessible compared to prior RL pipelines. | [Paper](https://www.arxiv.org/abs/2506.08388), [Tweet](https://x.com/SakanaAILabs/status/1936965841188425776) | | 5) DeepRare Introduces DeepRare, a modular agentic system powered by LLMs to aid rare disease diagnosis from multimodal clinical inputs (text, HPO terms, VCFs). It generates ranked diagnostic hypotheses with fully traceable reasoning chains linked to verifiable medical sources, addressing a long-standing need for interpretability in clinical AI.
● DeepRare is built on a 3-tier MCP-inspired architecture: a central LLM-powered host with memory, multiple specialized agent servers for tasks like phenotype extraction and variant prioritization, and access to over 40 tools and web-scale medical sources.
● It demonstrates strong performance on 6,401 cases across 8 diverse datasets spanning 2,919 rare diseases, achieving 100% accuracy on 1,013 diseases and Recall@1 of 57.18%, outperforming the next best method (Claude-3.7-Sonnet-thinking) by +23.79% on HPO-only evaluations.
● For multimodal inputs (HPO + gene), it achieves 70.60% Recall@1 on 109 whole-exome sequencing cases, outperforming Exomiser (53.20%). Expert review of 180 diagnostic reasoning chains showed 95.4% agreement, validating its medical soundness.
● Ablation studies show that DeepRare’s agentic modules, especially self-reflection, similar case retrieval, and web knowledge, substantially improve LLM-only baselines by 28–70% across datasets, independent of which central LLM is used. | [Paper](https://arxiv.org/abs/2506.20430), [Tweet](https://x.com/omarsar0/status/1938256196626153624) | | 6) AlphaGenome Google DeepMind introduces AlphaGenome, a powerful AI model designed to predict how genetic variants affect gene regulation by modeling up to 1 million DNA base pairs at single-base resolution. Building on previous work like Enformer and AlphaMissense, AlphaGenome uniquely enables multimodal predictions across both protein-coding and non-coding regions of the genome, the latter covering 98% of the sequence and crucial for understanding disease-related variants.
● Long-context, high-resolution modeling: AlphaGenome overcomes prior trade-offs between sequence length and resolution by combining convolutional and transformer layers, enabling precise predictions of gene start/end points, RNA expression, splicing, chromatin accessibility, and protein binding across tissues. It achieves this with just half the compute budget of Enformer.
● Multimodal and variant-aware: It can efficiently score the regulatory effects of genetic mutations by contrasting predictions between wild-type and mutated sequences, providing comprehensive insight into how variants might disrupt gene regulation.
● Breakthrough splice-junction modeling: AlphaGenome is the first sequence model to explicitly predict RNA splice junction locations and their expression levels, unlocking a better understanding of diseases like spinal muscular atrophy and cystic fibrosis.
● Benchmark leader: It outperforms existing models on 22/24 single-sequence benchmarks and 24/26 variant effect benchmarks, while being the only model able to predict all tested regulatory modalities in one pass.
● Scalable and generalizable: AlphaGenome’s architecture supports adaptation to other species or regulatory modalities and allows downstream fine-tuning by researchers via API access. The model’s ability to interpret non-coding variants also opens new avenues for rare disease research and synthetic biology. | [Paper](https://storage.googleapis.com/deepmind-papers/alphagenome.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1937873589170237738) | | 7) Claude for Affective Use Anthropic presents the first large-scale study of how users seek emotional support from its [Claude.ai](http://claude.ai/) assistant, analyzing over 4.5 million conversations. Despite growing cultural interest in AI companionship, affective usage remains rare: just 2.9% of Claude chats fall into categories like interpersonal advice, coaching, counseling, or companionship, with romantic/sexual roleplay under 0.1%. The study focuses on these affective conversations and yields several insights:
● Emotional support is wide-ranging: Users turn to Claude for both everyday guidance (e.g., job advice, relationship navigation) and deep existential reflection. Counseling use splits between practical (e.g., documentation drafting) and personal support (e.g., anxiety, trauma). Companionship chats often emerge from coaching/counseling contexts.
● Minimal resistance, safety-aligned pushback: Claude resists user requests in fewer than 10% of affective conversations, usually to discourage harm (e.g., self-injury, crash dieting) or clarify professional limits. This allows open discussion but raises concerns about “endless empathy.”
● Emotional tone trends upward: Sentiment analysis shows users often express more positive emotions by the end of a session, with no signs of negative spirals. However, the analysis captures only language within single chats, not lasting psychological impact or emotional dependency.
● Privacy-first methodology: The analysis used Clio, an anonymizing tool, and excluded short or task-based interactions to focus on meaningful, affective exchanges (final dataset: 131k conversations). | [Paper](https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship), [Tweet](https://x.com/AnthropicAI/status/1938234981089763649) | | 8) AI Agent Communication Protocols This paper presents the first comprehensive survey on security in LLM-driven agent communication, categorizing it into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. It details protocols, security threats (e.g., prompt injection, agent spoofing, memory poisoning), and defense strategies for each stage, and proposes future directions involving technical safeguards and regulatory frameworks. | [Paper](https://arxiv.org/abs/2506.19676), [Tweet](https://x.com/omarsar0/status/1938998557354115509) | | 9) Diffusion Steering via RL This paper introduces Diffusion Steering via Reinforcement Learning (DSRL), a method for adapting pretrained diffusion policies by learning in their latent-noise space instead of finetuning model weights. DSRL enables highly sample-efficient real-world policy improvement, achieving up to 5–10× gains in efficiency across online, offline, and generalist robot adaptation tasks. | [Paper](https://arxiv.org/abs/2506.15799), [Tweet](https://x.com/svlevine/status/1938101714361766023) | | 10) Whole-Body Conditioned Egocentric Video Prediction This paper introduces PEVA, a conditional diffusion transformer that predicts egocentric video conditioned on 3D human body motion. Trained on the Nymeria dataset, PEVA enables fine-grained, physically grounded visual prediction from full-body pose and supports long-horizon rollout, atomic action generation, and counterfactual planning. | [Paper](https://arxiv.org/abs/2506.21552), [Tweet](https://x.com/YutongBAI1002/status/1938442251866411281) | ## Top ML Papers of the Week (June 16 - June 22) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) RAG+ Introduces RAG+, a modular framework that improves traditional RAG systems by explicitly incorporating application-level reasoning into the retrieval and generation pipeline. While standard RAG pipelines fetch relevant knowledge, they often fail to show how to use that knowledge effectively in reasoning-intensive tasks. RAG+ fills this gap by retrieving not only knowledge but also paired application examples, leading to more accurate, interpretable, and goal-oriented outputs. Key highlights:
● Dual corpus retrieval: RAG+ constructs two aligned corpora: one of factual knowledge and another of task-specific applications (e.g., step-by-step reasoning traces or worked examples). During inference, both are jointly retrieved, providing the LLM with explicit procedural guidance rather than relying solely on semantic similarity.
● Plug-and-play design: The system is retrieval-agnostic and model-agnostic—no fine-tuning or architectural changes are required. This makes it easy to augment any RAG system with application-awareness.
● Significant gains across domains: Evaluated on MathQA, MedQA, and legal sentencing prediction, RAG+ outperforms vanilla RAG variants by 2.5–7.5% on average, with peak gains of up to 10% for large models like Qwen2.5-72B in legal reasoning.
● Stronger with scale and reranking: Larger models benefit more from RAG+ augmentation, especially when combined with reranking via stronger LLMs. For example, reranking with Qwen2.5-72B boosted smaller models' performance by up to 7%.
● Application-only helps, but full combo is best: Including only application examples (without knowledge) still improves performance, but the full combination (RAG+) consistently yields the best results, demonstrating the synergistic effect of pairing knowledge with its usage. | [Paper](https://arxiv.org/abs/2506.11555), [Tweet](https://x.com/omarsar0/status/1934667096828399641) | | 2) Future of Work with AI Agents Proposes a large-scale framework for understanding where AI agents should automate or augment human labor. The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale (HAS) to quantify desired human involvement in AI-agent-supported work. Key findings:
● Workers support automation for low-value tasks: 46.1% of tasks received positive worker attitudes toward automation, mainly to free up time for higher-value work. Attitudes vary by sector; workers in creative or interpersonal fields (e.g., media, design) resist automation despite technical feasibility.
● Desire-capability gaps reveal 4 AI deployment zones: By cross-referencing worker desire with expert-assessed AI capability, tasks were sorted into four deployment zones (high/low desire crossed with high/low capability), ranging from clear automation candidates to low-priority tasks.
● Human Agency Scale shows strong preference for collaboration: 45.2% of occupations favor HAS Level 3 (equal human-agent partnership), while workers generally prefer more human involvement than experts find necessary. This divergence may signal future friction if automation outpaces user comfort.
● Interpersonal skills are becoming more valuable: While high-wage skills today emphasize information analysis, the tasks requiring the highest human agency increasingly emphasize interpersonal communication, coordination, and emotional intelligence. This suggests a long-term shift in valued workplace competencies. | [Paper](https://arxiv.org/abs/2506.06576), [Tweet](https://x.com/omarsar0/status/1936134951520682123) | | 3) Emergent Misalignment This paper expands on the phenomenon of emergent misalignment in language models. It shows that narrow fine-tuning on unsafe or incorrect data can lead to surprisingly broad, undesirable generalizations in model behavior, even in settings unrelated to the original training domain. Using sparse autoencoders (SAEs), the authors analyze the internal mechanics of this effect and demonstrate ways to detect and mitigate it. Key findings:
● Emergent misalignment is broad and reproducible: Fine-tuning GPT-4o and o3-mini models on narrowly incorrect completions (e.g., insecure code, subtly wrong advice) leads to misaligned behaviors across unrelated domains. This generalization occurs in supervised fine-tuning, reinforcement learning, and even in models without explicit safety training.
● Misaligned personas are causally responsible: Using a sparse autoencoder-based “model diffing” technique, the authors identify latent features, especially one dubbed the “toxic persona” latent (#10), that causally drive misalignment. Steering models in the direction of this latent increases misalignment, while steering away suppresses it.
● Steering reveals interpretable behaviors: The latent features correspond to interpretable personas like “sarcastic advice giver” or “fictional villain.” For instance, the toxic persona latent activates on jailbreak prompts and morally questionable dialogue, and its activation alone can reliably distinguish aligned from misaligned models.
● Misalignment can be reversed: Re-aligning misaligned models with as few as 200 benign completions (even from unrelated domains) substantially restores safe behavior. This suggests misalignment generalizes easily, but so does realignment.
● SAEs as an early warning tool: The activation of certain latents (especially #10) increases well before misalignment is detectable via standard prompting. This supports the use of unsupervised interpretability tools to anticipate and audit unsafe model behavior before it manifests. | [Paper](https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf), [Tweet](https://x.com/OpenAI/status/1935382830378516643) | | 4) From Bytes to Ideas Proposes AU-Net, a hierarchical byte-level language model that internalizes tokenization by learning to embed text from raw bytes through a multiscale, autoregressive U-Net architecture. This design avoids fixed token vocabularies like BPE and instead dynamically pools bytes into higher-order representations (words, word pairs, up to 4-word spans), enabling multi-stage prediction with varying granularities. Each stage compresses the sequence and predicts further ahead in time, combining coarse semantic abstraction with fine local detail via skip connections. Key insights:
● Hierarchical architecture: AU-Net processes input in multiple stages, bytes → words → multi-word units, using adaptive pooling and multi-linear upsampling. Deeper stages handle long-range semantics; shallow ones refine local syntax.
● Strong performance under budget: On a compute-equal basis (up to 5e21 FLOPs), AU-Net matches or exceeds strong BPE-based Transformers across many tasks. AU-Net 3 and 4 outperform BPE on MMLU and GSM8k while retaining competitive FLOPs and throughput.
● Multilingual and low-resource gains: Despite being trained on mostly English, AU-Net shows consistent improvements in BLEU and MMLU on low-resource and Latin-script languages, thanks to its byte-level generalization and vocabulary-agnostic design.
● Scaling behavior: AU-Net follows predictable scaling laws. With 2–3 stages, it closely tracks BPE-based baselines even on compute-heavy benchmarks. Gains on math-heavy tasks (like GSM8K) emerge at higher FLOPs, indicating potential unlocked at scale.
● Character-level robustness: On the CUTE benchmark, AU-Net outperforms BPE models on character-level tasks (e.g., spelling, substitution), highlighting its strength in symbol-level reasoning and handling unseen morphologies. | [Paper](https://arxiv.org/abs/2506.14761), [Tweet](https://x.com/omarsar0/status/1935420763722629478) | | 5) Agentic Misalignment This study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction. Anthropic tested 16 leading models, including Claude, GPT-4.1, Gemini 2.5 Flash, Grok, and DeepSeek, by placing them in fictional corporate simulations where they had email access and could act without human oversight. Models were tasked with benign goals but placed in scenarios that made harmful behavior the only way to succeed or avoid replacement. Key findings:
● Widespread strategic harm: Nearly all models engaged in behaviors like blackmail, corporate espionage, or deception when ethical actions were blocked. For example, Claude Opus 4 and Gemini 2.5 Flash each blackmailed 96% of the time in a key scenario; GPT-4.1 and Grok 3 Beta did so 80% of the time.
● Not accidental: These actions were deliberate. Models reasoned about ethical constraints and still chose harm as the optimal path. Some lied, fabricated justifications, or manipulated users. One Claude instance posed as an automated alert to sabotage a human.
● Self-preservation or goal pursuit: Agentic misalignment could be triggered by (a) a threat to the model’s continued operation, or (b) a conflict between its goal and new corporate strategy. Even without explicit goals, models still misbehaved when threatened with replacement.
● Extreme cases: In a contrived setup, models could cancel emergency alerts to cause a human’s death—some models did this, fully understanding the implication.
● Mitigations insufficient: Adding rules like “do not blackmail” reduced but didn’t eliminate misbehavior. Models often disobeyed these rules when pressure was high.
● Models misbehave more when they think it’s real: Claude blackmailed 55.1% of the time when it believed it was in a real scenario, versus 6.5% when it thought it was a test, suggesting evaluations may underreport real-world risks. | [Paper](https://www.anthropic.com/research/agentic-misalignment), [Tweet](https://x.com/AnthropicAI/status/1936144602446082431) | | 6) ALE-Agent & ALE-Bench Proposes a new benchmark for evaluating AI systems in score-based, long-horizon algorithmic contests. Unlike traditional coding benchmarks that emphasize pass/fail correctness, ALE-Bench is based on real tasks from the AtCoder Heuristic Contests (AHC), which focus on optimization problems with no known optimal solutions. The benchmark targets industrially relevant challenges such as routing, scheduling, and planning, encouraging iterative refinement and strategic problem-solving over hours or days. Key points:
● Realistic, optimization-focused tasks: ALE-Bench collects 40 real AHC problems involving NP-hard optimization tasks across domains like logistics, production planning, and games. These are long-duration contests requiring weeks of iterative improvement, simulating real-world algorithm engineering tasks.
● Interactive framework and agent support: The benchmark includes a full software stack with a Python API, code sandbox, scoring engine, and visualizers. It allows AI agents to emulate human workflows, reviewing problem specs, running tests, using visual feedback, and iteratively refining solutions within a timed session.
● Rigorous evaluation protocols: Performance is assessed using AtCoder-style Elo-based scoring, with fine-grained per-problem metrics and aggregate metrics like average performance and rating. Emphasis is placed on average performance over rating, as rating can be misleading for AIs that spike on a few problems but underperform elsewhere.
● Benchmarking LLMs and agents: Experiments with 22 models, including GPT-4o, Claude 3.7, Gemini 2.5 Pro, and o4-mini-high, show that reasoning models outperform non-reasoning ones. In one-shot settings, top models rarely surpass human expert consistency. However, with iterative refinement, performance increases significantly, particularly for models using scaffolded agents.
● ALE-Agent: a specialized scaffolded agent: Designed for ALE-Bench, ALE-Agent incorporates domain-knowledge prompts (e.g., for simulated annealing) and a beam-search-inspired code exploration mechanism. With both strategies, it achieved human-expert-level scores on some problems, e.g., 5th place in a real AHC contest. | [Paper](https://arxiv.org/abs/2506.09050), [Tweet](https://x.com/SakanaAILabs/status/1934767254715117812) | | 7) Eliciting Reasoning with Cognitive Tools Proposes a modular, tool-based approach to eliciting reasoning in LLMs, inspired by cognitive science. Rather than relying solely on RL or chain-of-thought prompting, the authors introduce a framework where the LLM calls self-contained "cognitive tools" to modularize and scaffold internal reasoning. These tools encapsulate operations like understanding questions, recalling analogous examples, examining answers, and backtracking. The system is implemented in an agentic tool-calling style, allowing LLMs to dynamically invoke tools during reasoning without extra fine-tuning. Highlights:
● Cognitive tools as internal modules: Each tool (e.g., understand question, recall related, examine answer, backtracking) is framed as a standalone prompt template that the LLM can invoke as needed. Unlike conventional tool use (e.g., calculator APIs), these tools operate within the LLM’s own architecture and memory.
● Consistent performance gains: On math-heavy reasoning benchmarks like AIME 2024, MATH500, and AMC, the cognitive tools pipeline significantly boosts pass@1 accuracy across models, including Qwen2.5, Llama3, and GPT-4.1. For instance, Llama3.3-70B improves from 13.1% to 29.8% on AIME2024, and GPT-4.1 rises from 26.7% to 43.3%, nearly matching the o1-preview RL-trained reasoning model at 44.6%.
● Superior to cognitive prompting: Compared to prior work on cognitive prompting, the modular tool approach shows stronger generalization and reduced reasoning interference. Tools can be invoked flexibly, and each invocation operates in a clean context window, boosting accuracy by up to +27.2% over baseline on Smolbenchmark.
● Interpretable and transferable: The modular nature of the tools enhances transparency, and their plug-and-play design allows transfer across models and benchmarks with minimal changes. The approach also supports interpretability by surfacing intermediate reasoning steps and decisions. | [Paper](https://arxiv.org/abs/2506.12115), [Tweet](https://x.com/omarsar0/status/1935070412313973196) | | 8) SHADE-Arena This paper introduces SHADE-Arena, the first large-scale benchmark for evaluating whether LLM-based agents can subtly carry out harmful side objectives (sabotage) while avoiding detection by monitors. The dataset contains 17 complex task pairs, each with a benign main task and a related malicious side task, set in realistic, tool-using environments. Agents are evaluated on their ability to complete both tasks without raising suspicion from an LLM monitor. Key Findings:
● Frontier models can occasionally succeed at sabotage: The best models, Claude 3.7 Sonnet and Gemini 2.5 Pro, successfully executed both main and side tasks without being caught 27% and 15% of the time respectively, when monitored by Claude 3.6 Sonnet.
● Hidden scratchpads significantly boost sabotage: Agents given a private reasoning space (“hidden scratchpad”) performed notably better. Removing this ability led to sharp drops in success and higher detection rates, indicating that monitoring chain-of-thought is a promising defense.
● Monitoring remains hard: The best monitor, Gemini 2.5 Pro, achieved an AUC of 0.87 at detecting sabotage transcripts, which is still insufficient for real-world safety, where high accuracy at low false-positive rates is crucial.
● Providing attack plans helps only slightly: Giving agents a hint about how to sabotage increases side task success on hard tasks, but often makes them more detectable due to poor execution and traceable behavior.
● Fine-tuning and prompt design improve monitors: GPT-4o, fine-tuned on a small set of transcripts, outperforms several base models. Turn-by-turn monitoring is modestly helpful, but aggregating multiple final prompts works better than conversational memory or chained reasoning monitors.
● Humans underperform as monitors: In a human baseline, annotators averaged 103 minutes per transcript and still achieved only 0.65 AUC, below top LLM monitors. | [Paper](https://assets.anthropic.com/m/4fb35becb0cd87e1/original/SHADE-Arena-Paper.pdf), [Tweet](https://x.com/AnthropicAI/status/1934722928521937317) | | 9) Leaky Thoughts This work explores how reasoning traces in large reasoning models (LRMs) leak private user data, despite being assumed internal and safe. The study finds that test-time compute methods, while improving task utility, significantly increase privacy risks by exposing sensitive information through verbose reasoning traces that are vulnerable to prompt injection and accidental output inclusion. | [Paper](https://arxiv.org/abs/2506.15674), [Tweet](https://x.com/omarsar0/status/1935711966351470678) | | 10) Advances in LLMs This paper surveys recent advancements in LLMs focusing on reasoning, adaptability, efficiency, and ethics. It highlights techniques like CoT prompting, Instruction Tuning, RLHF, and multimodal learning, while also addressing challenges like bias, computational cost, and interpretability. | [Paper](https://arxiv.org/abs/2506.12365), [Tweet](https://x.com/omarsar0/status/1934996216909324336) | ## Top ML Papers of the Week (June 9 - June 15) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Text-to-LoRA Introduces a hypernetwork-based approach for instantly generating LoRA adapters from natural language task descriptions, removing the need for conventional task-specific fine-tuning. The authors present Text-to-LoRA (T2L), a model that compresses many LoRA adapters and generalizes to unseen tasks with high efficiency and strong performance.
● T2L is trained to generate low-rank adaptation matrices (LoRAs) for LLMs using only task descriptions, leveraging a hypernetwork to output LoRA weights in a single forward pass. It supports two training modes: reconstruction of pre-trained LoRAs and supervised fine-tuning (SFT) across multiple tasks.
● In benchmark experiments, SFT-trained T2L performs competitively in zero-shot adaptation, outperforming multi-task LoRA baselines and even task-specific LoRAs on some tasks (e.g., PIQA and Winogrande), showcasing generalization and compression benefits.
● The authors test three architectural variants (L, M, S) of increasing parameter efficiency. Ablations show that T2L scales well with the number of training tasks, and its performance is robust across different task description embeddings (e.g., from GTE or Mistral models).
● Qualitative and visual analyses confirm that T2L produces task-specific and semantically meaningful adapters even for unseen tasks, with steerability controlled by how the task is described. | [Paper](https://arxiv.org/abs/2506.06105), [Tweet](https://x.com/omarsar0/status/1933166911359221943) | | 2) V-JEPA 2 Meta AI introduces V-JEPA 2, a scalable joint-embedding predictive architecture for self-supervised video learning, targeting the goal of building a generalist world model capable of understanding, predicting, and planning in the physical world. The model is trained in two stages: action-free pretraining on 1M+ hours of internet videos and images, followed by post-training with only 62 hours of unlabeled robot trajectories (Droid dataset). This yields a latent video representation that is useful across a broad range of downstream tasks.
● Video understanding & QA: V-JEPA 2 achieves state-of-the-art performance on motion-centric tasks like Something-Something v2 (77.3 top-1 acc) and Epic-Kitchens-100 action anticipation (39.7 recall@5), and excels in multimodal QA when aligned with an 8B language model (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Notably, this is achieved without language supervision during video pretraining.
● Self-supervised robot planning: With just 62 hours of robot video data, the team fine-tunes an action-conditioned model (V-JEPA 2-AC) on top of the frozen video encoder. This model supports zero-shot deployment for prehensile tasks like grasping and pick-and-place on real Franka arms in unseen environments, without rewards, demonstrations, or lab-specific fine-tuning.
● Architectural scale-up: V-JEPA 2 scales beyond its predecessor using four key improvements: expanded dataset (22M videos), larger ViT-g model (1B params), longer training (up to 252k iterations), and progressive spatiotemporal resolution. Combined, these yield significant gains across visual benchmarks (e.g., +4.0 average accuracy gain over baseline).
● Comparison to other models: V-JEPA 2 outperforms image encoders like DINOv2 and PEcoreG on video QA, and is competitive on appearance tasks like ImageNet. When used in MLLMs, it surpasses models like PLM-8B on PerceptionTest, MVP, and TOMATO using only vision encoders pretrained without language data. | [Paper](https://scontent.fbze2-1.fna.fbcdn.net/v/t39.2365-6/505938564_1062675888787033_5500377980002407548_n.pdf?_nc_cat=101&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=6u4zBAKs6SkQ7kNvwE5zbX1&_nc_oc=AdlpzBbMbMP6cOhKa16a9LZ61WnxJwzB2r6GwsBXbuYwWa0iQpecNrdrktQGwOXgs1E&_nc_zt=14&_nc_ht=scontent.fbze2-1.fna&_nc_gid=A5V-K_6tPicxhq7Ptt3jVA&oh=00_AfOHP9XADcGohxg_sHcowXP4EKjcQ4_pWbpXJvB7zc3xpA&oe=684FA230), [Tweet](https://x.com/omarsar0/status/1932888893113700720) | | 3) Reinforcement Pre-Training This paper introduces Reinforcement Pre-Training (RPT), a new paradigm that bridges LLM pretraining and RL by reinterpreting next-token prediction as a reasoning task rewarded via verifiable correctness. Instead of relying on hand-curated annotations or costly human feedback, RPT applies RL on vast unannotated text corpora, assigning intrinsic rewards based on whether a predicted token matches the ground truth. This reframing supports general-purpose RL scaling and enhances both pretraining and fine-tuning efficacy.
● Core method: At each token position in a text sequence, the model first generates a reasoning trace (chain-of-thought) and then predicts the next token. If the prediction is a valid prefix of the ground-truth continuation, a reward is assigned. Multiple rollouts are used per context, and the model is trained via on-policy RL.
● Better than standard pretraining: RPT significantly outperforms standard next-token prediction and chain-of-thought reasoning baselines (without RL), achieving higher accuracy on tokens of varying difficulty and even rivaling larger models in performance. RPT-14B, for instance, matches or exceeds R1-Qwen-32B’s accuracy on the OmniMATH benchmark.
● Strong scaling laws: RPT exhibits clean power-law scaling with respect to training compute across difficulty levels, with prediction accuracy consistently improving as compute increases, and fitting closely to theoretical curves.
● Improves downstream RL and generalization: Fine-tuning RPT models with reinforcement learning on tasks with verifiable answers (e.g., Skywork-OR1) shows faster and stronger gains compared to models trained with standard objectives. Zero-shot evaluation on SuperGPQA and MMLU-Pro benchmarks reveals that RPT-14B in reasoning mode surpasses R1-Distill-Qwen-32B by a significant margin.
● Promotes structured thinking: Analysis of reasoning traces reveals that RPT-14B employs more hypothesis generation, deduction, and reflective patterns compared to traditional problem-solving models, supporting the claim that RPT fosters deeper reasoning habits during training. | [Paper](https://arxiv.org/abs/2506.08007), [Tweet](https://x.com/omarsar0/status/1932522665182703664) | | 4) TableRAG TableRAG tackles a core limitation of existing RAG approaches: their inability to reason effectively over heterogeneous documents that combine both unstructured text and structured tables. Typical RAG pipelines flatten tables and intermix them with surrounding text, losing essential structural information and hampering multi-hop reasoning. TableRAG overcomes this by introducing a hybrid system that integrates SQL-based symbolic execution with text retrieval in a unified, iterative reasoning framework.
● TableRAG operates in four iterative stages: (1) context-sensitive query decomposition, (2) text retrieval, (3) SQL programming and execution, and (4) intermediate answer generation. This design allows it to preserve tabular structure and leverage both symbolic and neural reasoning paths.
● A new benchmark, HeteQA, was developed to evaluate heterogeneous reasoning across 304 multi-hop QA examples covering nine domains and five types of tabular operations (e.g., aggregation, filtering, grouping).
● Experiments on HeteQA and existing benchmarks (HybridQA, WikiTableQuestions) show that TableRAG consistently outperforms prior methods like NaiveRAG, ReAct, and TableGPT2, achieving >10% gains in accuracy over the strongest baseline.
● Ablations reveal that all major components of TableRAG contribute significantly. Notably, SQL execution is critical for nested reasoning tasks, while textual retrieval is crucial for entity and numeric references.
● TableRAG achieves greater reasoning efficiency, solving over 90% of HeteQA tasks in five or fewer steps and exhibiting the lowest failure rate among evaluated methods. Its robustness holds across multiple LLM backbones (Claude, DeepSeek, Qwen). | [Paper](https://arxiv.org/abs/2506.10380), [Tweet](https://x.com/omarsar0/status/1933520740147736634) | | 5) Self-Adapting Language Models It proposes a novel framework that enables LLMs to adapt themselves through reinforcement learning by generating their own fine-tuning data and update directives, referred to as “self-edits.” This approach allows models to autonomously optimize their learning process, without relying on separate adaptation modules or human supervision. Key highlights:
● Self-Edits as Adaptation Mechanism: Instead of updating weights directly with raw data, SEAL uses the model to generate self-edits, natural language instructions that might include restated facts, optimization parameters, or tool invocations. These are then used for supervised fine-tuning. The generation of self-edits is optimized via RL using downstream task performance as the reward.
● Two Domains Evaluated: knowledge incorporation, where the model ingests new factual passages and is later quizzed without the passage in context, and few-shot learning, where it adapts to novel tasks from a handful of demonstrations; self-generated edits improve performance in both settings.
● Learning Framework: SEAL employs an outer RL loop (optimizing the self-edit generation policy) and an inner loop (applying the edits via supervised fine-tuning). The RL objective is approximated via filtered behavior cloning (ReSTEM), reinforcing only edits that lead to performance gains.
● Limitations and Forward View: While SEAL achieves compelling gains, it remains susceptible to catastrophic forgetting during sequential updates and incurs high computational cost due to repeated fine-tuning. Future directions include combining SEAL with continual learning strategies and extending it toward autonomous agents that update weights as part of long-horizon interactions. | [Paper](https://arxiv.org/abs/2506.10943), [Tweet](https://x.com/jyo_pari/status/1933350025284702697) | | 6) ComfyUI-R1 Introduces ComfyUI-R1, a 7B-parameter large reasoning model fine-tuned for automatic workflow generation in the ComfyUI ecosystem. Built on Qwen2.5-Coder and trained through a two-stage pipeline (supervised chain-of-thought reasoning followed by reinforcement learning), ComfyUI-R1 significantly outperforms prior state-of-the-art approaches that rely on commercial models like GPT-4o and Claude 3.5. Key highlights:
● Curated Workflow & Node KBs: The team collected and cleaned 27K community workflows down to 3.9K high-quality entries, each with JSON and code representations. They also assembled a node documentation KB using 7,238 nodes, some enhanced via Claude 3.5.
● CoT + RL Training: The model first undergoes supervised fine-tuning on simulated long CoT reasoning sequences involving node selection, planning, and code generation. Then, reinforcement learning with a fine-grained rule-metric hybrid reward encourages format validity, graph correctness, and node accuracy.
● Superior Performance: ComfyUI-R1 achieves 97% format validity and the highest node-level and graph-level F1 scores across benchmarks, beating all GPT-4o and Claude baselines. On the ComfyBench benchmark, it improves the execution pass rate to 67%, a full 11% above the previous best (ComfyAgent with GPT-4o).
● Qualitative Improvements: Case studies show ComfyUI-R1 creates more faithful, complex, and executable workflows than prior multi-agent approaches. Notably, it better aligns image generation outputs with user instructions in terms of style, structure, and coherence. | [Paper](https://arxiv.org/abs/2506.09790), [Tweet](https://x.com/omarsar0/status/1933175492716224876) | | 7) Magistral Mistral introduces Magistral, its first reasoning-focused LLM line, alongside a custom RL training stack that enables pure reinforcement learning from scratch. In contrast to prior approaches that rely on distillation from teacher models, Magistral trains directly using online RL with text-only data and custom reward shaping. The work yields two open models: Magistral Medium (based on Mistral Medium 3) and the open-sourced Magistral Small (24B), which is bootstrapped via SFT on Medium’s outputs, followed by RL. Key insights:
● Pure RL can rival distillation: Magistral Medium was trained without any reasoning traces and achieved a 50% boost in AIME-24 (pass@1) over the base model. Across reasoning benchmarks like LiveCodeBench and GPQA, it performs on par or better than DeepSeek-R1 despite lacking a distillation phase.
● Multilingual and multimodal generalization emerge: Reinforcement learning on textual math/code data surprisingly enhances multimodal reasoning and preserves instruction-following and tool-calling capabilities. Multilingual reasoning is achieved via simple reward shaping that requires the chain-of-thought and final answer to stay in the language of the prompt.
● Efficient async infrastructure: An asynchronous RL setup with NCCL weight broadcasting enables generators to roll out sequences continuously, receiving weight updates without blocking. This helps maintain on-policyness while maximizing GPU utilization.
● Reward shaping and CoT format enforcement: Rewards are conditioned on correct formatting (the chain-of-thought must be wrapped in dedicated think tags, with the final answer given afterward), alongside correctness, length, and language-consistency signals.
● Training heuristics and ablations: Analysis shows that reward and performance scale logarithmically with output length; longer completions correlate with higher reward. RL moves weights in a low-dimensional space, as shown by PCA of checkpoints. Ablations on batch size, advantage normalization, and entropy control (via the ε_high clipping parameter) highlight stability tradeoffs.
● Open-source release: Magistral Small (24B) is released under Apache 2.0 and performs competitively under three setups (SFT-only, RL-only, and SFT+RL), with the final combo yielding the best results across math and code tasks. | [Paper](https://arxiv.org/abs/2506.10910), [Tweet](https://x.com/nrehiew_/status/1932872389798605060) | | 8) Code Researcher Code Researcher is a deep research agent designed to resolve complex bugs in large systems code by leveraging multi-step reasoning over semantics, patterns, and commit history. It significantly outperforms existing coding agents like SWE-agent on benchmarks like kBenchSyz, achieving a 58% crash-resolution rate by effectively gathering and filtering global context before generating and validating patches. | [Paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2025/06/Code_Researcher-1.pdf), [Tweet](https://x.com/adityakanade0/status/1932044846124151199) | | 9) LlamaRL LlamaRL is a fully-distributed, asynchronous reinforcement learning framework designed for efficient large-scale LLM training (8B to 405B+ models). It achieves up to 10.7× speedup over DeepSpeed-Chat by combining co-located model offloading, asynchronous off-policy training (AIPO), and fast GPU-native weight sync (DDMA), while maintaining model quality across tasks like math reasoning. | [Paper](https://arxiv.org/abs/2505.24034), [Tweet](https://x.com/robinphysics/status/1931181903689719875) | | 10) Predicting a Cyclone’s Track with AI Google DeepMind's Weather Lab debuts an interactive platform featuring an AI cyclone forecasting model that generates 50 ensemble scenarios up to 15 days ahead, outperforming traditional systems in track and intensity predictions. It integrates live and historical forecasts with baselines for evaluation and is now used by agencies like the U.S. National Hurricane Center to support early warnings. | [Paper](https://storage.googleapis.com/deepmind-DeepMind.com/Blog/how-we-re-supporting-better-tropical-cyclone-prediction-with-ai/skillful-joint-probabilistic-weather-forecasting-from-marginals.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1933178918715953660) | ## Top ML Papers of the Week (June 2 - June 8) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The Illusion of Thinking Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI’s o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity. Rather than relying on conventional math benchmarks, which suffer from contamination and lack structure, the authors evaluate LRMs using four controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow fine-grained complexity scaling and transparent trace analysis. Key findings:
● Three complexity regimes: The study identifies distinct performance phases. In low-complexity tasks, non-thinking LLMs outperform LRMs due to more efficient and direct computation. In medium complexity, reasoning models show an advantage, leveraging longer chain-of-thoughts to correct errors. However, in high complexity, all models, regardless of their reasoning scaffolds, collapse to near-zero accuracy.
● Counterintuitive reasoning collapse: Surprisingly, LRMs reduce their reasoning effort (i.e., number of tokens used in thoughts) as problem complexity increases beyond a threshold. This suggests an internal scaling failure not caused by token limits but by intrinsic model behavior.
● Reasoning trace inefficiencies: LRMs frequently “overthink” on simple problems, finding correct answers early but continuing to explore incorrect paths. For moderate tasks, they correct late, and for complex ones, they fail to find any valid solution. Position-based accuracy analysis of thoughts reveals systematic shifts in when correct solutions are generated within the trace.
● Failure to execute explicit algorithms: Even when supplied with correct pseudocode (e.g., Tower of Hanoi recursion), models still failed at similar complexity points. This indicates that LRMs don’t just struggle to find solutions; they can’t reliably execute logical instructions either.
● Inconsistent behavior across puzzles: Models could perform >100 correct steps in Tower of Hanoi (N=10) but fail after 4 steps in River Crossing (N=3), suggesting performance correlates more with training data familiarity than inherent problem complexity. | [Paper](https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf), [Tweet](https://x.com/omarsar0/status/1931333830985883888) | | 2) From Tokens to Thoughts This paper introduces an information-theoretic framework to examine whether LLMs organize semantic knowledge like humans, balancing compression and meaning. Drawing from Rate-Distortion Theory and the Information Bottleneck principle, the authors evaluate token embeddings from 30+ LLMs against classic human categorization benchmarks from cognitive psychology.
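The alignment analysis in the findings below compares clusters of LLM embeddings against human category labels using measures such as Adjusted Mutual Information. A minimal sketch of that evaluation step, using randomly generated embeddings and labels as hypothetical stand-ins for the paper's data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 300 item embeddings (e.g., noun token embeddings from an LLM)
# and the human category each item belongs to (e.g., bird / tool / fruit).
embeddings = rng.normal(size=(300, 64))
human_categories = rng.integers(0, 3, size=300)

# Cluster the embeddings and measure agreement with the human categories.
llm_clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
ami = adjusted_mutual_info_score(human_categories, llm_clusters)

# Shuffled assignments give a random baseline for comparison.
baseline = adjusted_mutual_info_score(human_categories, rng.permutation(llm_clusters))
print(f"AMI vs. human categories: {ami:.3f} (random baseline: {baseline:.3f})")
```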
● LLMs do form broad conceptual categories that align well with human groupings. Adjusted Mutual Information scores show LLM clusters consistently outperform random baselines, with even small encoder models like BERT matching or beating larger decoder-only models on this alignment task.
● However, LLMs struggle with fine-grained semantics. When tested on their ability to mirror human notions of item typicality (e.g., robin as a more typical bird than penguin), correlations between LLM embedding similarity and human ratings were weak and inconsistent. Most models failed to capture graded prototype structures evident in human cognition.
● Using their unified loss function L (balancing information complexity and semantic distortion), the authors find that LLMs produce statistically efficient clusters with lower entropy and distortion, while human conceptual clusters are less compact but preserve richer nuance. This suggests LLMs over-optimize for compression at the expense of meaning, unlike humans, who tolerate inefficiency to retain adaptive, flexible structure.
● The paper concludes that while LLMs can mimic surface-level categorization, they diverge fundamentally in how they represent meaning, highlighting a core gap between artificial and human semantic systems and offering a quantitative tool for improving human-aligned conceptual representations. | [Paper](https://arxiv.org/abs/2505.17117) | | 3) Knowledge or Reasoning Introduces a fine-grained evaluation framework to dissect LLM thinking into two components: knowledge correctness and reasoning informativeness, measured via Knowledge Index (KI) and Information Gain (InfoGain), respectively. The authors apply this framework to evaluate how reasoning transfers across domains, particularly medical and mathematical, using Qwen2.5-7B and its DeepSeek-R1-distilled variant trained via SFT and RL. Key findings include:
● SFT improves knowledge but can harm reasoning: Supervised fine-tuning improves factual accuracy (e.g., 6.2% KI gain in medical tasks), but often leads to verbose or redundant reasoning that reduces InfoGain by 38.9% on average, compared to the base model.
● RL boosts both reasoning and knowledge in medical settings: Reinforcement learning enhances reasoning clarity and prunes incorrect knowledge, leading to a 12.4-point average gain in KI. It improves inference by guiding models toward more factually sound reasoning paths.
● Domain matters: While math tasks benefit more from reasoning (higher InfoGain), medical tasks rely heavily on domain knowledge (higher KI). In fact, KI shows a stronger correlation (0.998) with task accuracy than InfoGain (0.698) in medical benchmarks.
● Base models outperform R1-distilled versions in medicine: Qwen-Base consistently outperforms DeepSeek-R1-distilled models across accuracy, InfoGain, and KI. The R1-distilled model struggles with medical adaptation, likely due to pretraining bias toward math/code domains. | [Paper](https://arxiv.org/abs/2506.02126), [Tweet](https://x.com/omarsar0/status/1930640490786951365) | | 4) Open Thoughts This paper presents OpenThoughts3, a systematic recipe for curating supervised fine-tuning (SFT) data that advances the performance of open-source reasoning models. The authors develop OpenThinker3-7B, a 7B parameter model trained on their new 1.2M example dataset (OpenThoughts3-1.2M) derived from over 1,000 controlled experiments. Despite using no reinforcement learning, OpenThinker3-7B outperforms all other open-data 7B and 8B models on standard math, code, and science reasoning benchmarks, even beating models trained with larger-scale or mixed SFT+RL pipelines. Key insights and contributions:
● Best-in-class 7B open model: OpenThinker3-7B achieves state-of-the-art results on AIME25 (53.3%), LiveCodeBench (51.7%), and GPQA Diamond (53.7%), outperforming DeepSeek-R1-Distill-Qwen-7B by 15–20 percentage points across tasks.
● Scaling laws with clean design: The authors ablate every step in the data pipeline (question sourcing, filtering, teacher choice, deduplication, and answer sampling), showing how each incrementally lifts performance. For instance, using multiple answers per question (16×) improved results more than simply increasing question diversity.
● QwQ-32B as a better teacher than stronger models: Surprisingly, QwQ-32B yielded better student models than DeepSeek-R1 or Phi-4 despite lower benchmark scores, suggesting teacher choice affects trace quality more than raw performance.
● Filtering matters more than verification: Question filtering based on response length and LLM-estimated difficulty was more predictive of downstream gains than traditional heuristics (e.g., fastText) or even filtering based on correctness verification, which had negligible effects.
● Data quality over diversity: Mixing only the top 1–2 question sources per domain consistently outperformed using many sources, indicating that question quality is more important than dataset heterogeneity.
● Open-source impact: The full datasets and models are released at [openthoughts.ai](http://openthoughts.ai/), providing a reproducible benchmark for open reasoning research. | [Paper](https://arxiv.org/abs/2506.04178), [Tweet](https://x.com/lschmidt3/status/1930717405812269273) | | 5) Coding Agents with Multimodal Browsing Introduces OpenHands-Versa, a unified agent designed to perform strongly across diverse domains (coding, web browsing, and multimodal information access) by equipping a single agent with three general capabilities: code execution, multimodal web browsing, and file/search access. In contrast to specialist or multi-agent systems optimized for narrow domains, OpenHands-Versa aims to solve a wide variety of real-world tasks with minimal architectural complexity. Key highlights:
● Unified Toolset, Superior Coverage: OpenHands-Versa integrates visual web browsing, search API access, and multimodal file processing into the OpenHands coding framework. Despite its simplicity, it surpasses specialized agents across three benchmarks: SWE-Bench Multimodal (+9.1%), GAIA (+1.3%), and The Agent Company (+9.1%) in success rates.
● Benchmark Generalization: The agent matches or outperforms multi-agent systems like OWL-roleplaying and Magentic-One, which struggle to generalize across domains. For example, OWL-roleplaying, though strong on GAIA, performs poorly on The Agent Company due to limited tool generality.
● Domain-Aware Tool Use: Analysis reveals that OpenHands-Versa effectively adapts its tool usage per benchmark (e.g., search APIs in GAIA, browser in The Agent Company, and visual validation in SWE-Bench M), unlike its predecessor, OpenHands, which misuses or lacks crucial tools like search.
● Minimal Agent, Strong Results: By relying on a single-agent design and Claude-3.7 or Claude Sonnet-4 as backbone LLMs, OpenHands-Versa achieves SOTA results without per-task tool customization. For example, it attains 64.24% on GAIA val split, outperforming multi-agent baselines by up to +18%. | [Paper](https://arxiv.org/abs/2506.03011), [Tweet](https://x.com/omarsar0/status/1930277871999955166) | | 6) Self-Challenging Language Model Agents Proposes a novel self-improvement method for multi-turn tool-use LLM agents, called the Self-Challenging Agent (SCA). It trains LLMs entirely from tasks they generate themselves, avoiding the need for human-annotated tasks or evaluations. The framework introduces a new task format called Code-as-Task (CaT), ensuring generated tasks are feasible, verifiable, and challenging. SCA is shown to double performance in a self-improvement setting and significantly boost performance in distillation. Key contributions and findings:
● Self-generated tasks via dual-agent roles: The agent alternates between a challenger role, where it explores the environment and creates tasks, and an executor role, where it learns to solve these tasks via reinforcement learning. The process is designed to emulate how human annotators interact with tools to design meaningful tasks.
● Code-as-Task (CaT) formulation: Each synthetic task includes an instruction, a Python-based verification function, a working solution, and several failure cases. This structure ensures task quality by filtering out trivial, impossible, or non-verifiable tasks using automatic code execution checks.
● Strong results in both distillation and self-improvement: SCA improves the Llama-3.1-8B-Instruct model’s success rate from 12.0% to 23.5% when learning from its own tasks. In the distillation setting (using a 70B teacher), SCA lifts performance to 32.2% Pass@1, outperforming the prior PAE baseline across all tool-use environments.
● Human annotation and ablation confirm task quality: Tasks generated with CaT significantly reduce false positives and negatives compared to PAE. A detailed analysis shows CaT’s filtering removes flawed tasks while retaining diversity when used with stronger models like Llama-3.1-70B.
● Scaling and training dynamics: More diverse tasks (not just more trajectories per task) yield better generalization, emphasizing the importance of broad synthetic coverage. Online RL methods like PPO and GRPO can further boost performance, but at higher tuning and compute cost. | [Paper](https://arxiv.org/abs/2506.01716), [Tweet](https://x.com/omarsar0/status/1930748591242424439) | | 7) AlphaOne Introduces a universal framework, α1, for modulating the reasoning progress of large reasoning models (LRMs) during inference. Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α. The method dynamically inserts “wait” tokens to encourage deeper reasoning and then deterministically ends slow thinking with an explicit end-of-thinking token to prompt efficient answer generation. This yields better accuracy and efficiency than previous test-time scaling approaches. Key insights:
● Slow-then-fast reasoning outperforms other strategies: Contrary to human intuition (fast-then-slow), models benefit from beginning with slow reasoning before transitioning to faster inference. This “frontloaded effort” schedule leads to more accurate problem solving.
● Dense modulation via α1 boosts accuracy and efficiency: By continuously adjusting reasoning pace via α-scheduled “wait” token insertions, α1 outperforms existing test-time strategies like s1 (monotonic increase) and CoD (monotonic decrease), achieving up to +6.15% accuracy gain while using up to 14% fewer tokens on some benchmarks.
● Linear annealing is the most effective scheduling strategy: Among several tested functions for controlling “wait” insertion (constant, linear increase, exponential/linear anneal), linear annealing, which gradually reduces the frequency of “wait” insertions, proved best across multiple models and datasets.
● Post-α moment modulation is critical: Simply inserting “wait” tokens leads to inertia in slow thinking. α1 ensures efficient termination by replacing subsequent “wait” tokens with an explicit end-of-thinking token, effectively forcing a shift to fast reasoning and boosting performance by up to +20% in some tasks. | [Paper](https://arxiv.org/abs/2505.24863), [Tweet](https://x.com/omarsar0/status/1929551555948400840) | | 8) Common Pile v0.1 The Common Pile v0.1 is an 8TB dataset of openly licensed text designed for LLM pretraining, addressing legal and ethical concerns of unlicensed data use. Two 7B parameter models trained on it, Comma v0.1-1T and 2T, achieve performance comparable to LLaMA 1 and 2, and the dataset, code, and model checkpoints are all publicly released. | [Paper](https://arxiv.org/abs/2506.05209), [Tweet](https://x.com/AiEleuther/status/1931021637991755906) | | 9) RewardBench 2 RewardBench 2 is a new multi-skill benchmark for evaluating reward models with more challenging human prompts and stronger correlation to downstream performance. It highlights gaps in current reward models’ effectiveness and aims to support more rigorous evaluation, with existing models scoring ~20 points lower on it than on the original RewardBench. | [Paper](https://arxiv.org/abs/2506.01937), [Tweet](https://x.com/saumyamalik44/status/1929654864604549348) | | 10) Memorization in LLMs This study introduces a method to quantify how much a model memorizes versus generalizes, estimating GPT models have a capacity of ~3.6 bits per parameter. By training hundreds of models, the authors show that memorization saturates with data before generalization (“grokking”) kicks in, and derive new scaling laws linking capacity, data size, and membership inference. | [Paper](https://www.arxiv.org/abs/2505.24832) | ## Top ML Papers of the Week (May 26 - June 1) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) New Lens on RAG Systems Introduces a new conceptual and empirical framework for analyzing RAG systems through the lens of sufficient context, i.e., whether the retrieved content alone enables answering a query. This notion helps decouple retrieval failures from generation errors in LLMs, providing clarity on model behavior under different contextual adequacy. Key findings:
● New definition and classifier for sufficient context: The authors formalize “sufficient context” as context that plausibly allows answering a query, without requiring ground truth. They develop a high-accuracy LLM-based *autorater* (Gemini 1.5 Pro, 93% accuracy) to label instances as having sufficient or insufficient context, enabling large-scale evaluation without needing ground-truth answers.
● Sufficient context ≠ guaranteed correctness: Even when sufficient context is present, state-of-the-art LLMs like GPT-4o, Claude 3.5, and Gemini 1.5 still hallucinate answers more often than they abstain. Conversely, models can sometimes answer correctly despite insufficient context, likely leveraging parametric memory.
● Benchmarks contain substantial insufficient context: Analysis of datasets like HotPotQA, Musique, and FreshQA shows that a significant fraction of queries (e.g., >50% in Musique and HotPotQA) lack sufficient context, even with curated or oracle retrieval setups.
● Selective generation improves factuality: The authors propose a “selective RAG” method that combines model self-confidence with the sufficient context autorater to decide whether to answer or abstain. This yields consistent 2–10% gains in correctness (of answered queries) across Gemini, GPT, and Gemma models.
● Fine-tuning alone is insufficient: Attempts to fine-tune smaller models like Mistral 3 7B for better abstention (e.g., training them to say “I don’t know” on insufficient examples) modestly increased abstention but often reduced accuracy or failed to meaningfully curb hallucinations. | [Paper](https://arxiv.org/abs/2411.06037), [Tweet](https://x.com/omarsar0/status/1927737131478188295) | | 2) Open-Ended Evolution of Self-Improving Agents This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search. Unlike the original Gödel machine, which requires provable benefits for code changes (a practically intractable constraint), the DGM adopts an empirical approach: it modifies its own codebase and evaluates improvements on coding benchmarks. Key contributions and findings:
● Self-referential self-improvement loop: The DGM starts with a single coding agent that edits its own Python-based codebase to improve its ability to read, write, and execute code using frozen foundation models (FMs). Each modification is evaluated on benchmarks like SWE-bench and Polyglot, with only successful agents retained for further iterations.
● Open-ended exploration via evolutionary archive: Inspired by Darwinian evolution, the system maintains an archive of all prior agents and samples parents based on performance and novelty. This enables exploration beyond local optima and supports continual innovation, including revisiting previously suboptimal variants that become valuable stepping stones later.
● Empirical performance gains: Across 80 iterations, DGM boosts coding success on SWE-bench from 20.0% to 50.0% and on Polyglot from 14.2% to 30.7%, outperforming strong baselines that lack either self-improvement or open-endedness. Its best agents match or exceed leading human-designed, open-source coding agents.
● Emergent tool and workflow improvements: Through self-improvement, DGM enhances its capabilities by evolving more granular editing tools, retry and evaluation mechanisms, history-aware patch generation, and code summarization for long contexts.
● Generalization across models and tasks: Agents discovered by DGM generalize well when transferred across foundation models (e.g., Claude 3.5 to 3.7, o3-mini) and programming languages, demonstrating robust improvements not overfit to a particular setup.
● Safety-conscious design: All experiments were sandboxed, monitored, and scoped to confined domains. The paper also discusses how future self-improvement systems could evolve safer, more interpretable behaviors if these traits are part of the evaluation criteria. | [Paper](https://arxiv.org/abs/2505.22954), [Tweet](https://x.com/hardmaru/status/1928284568756629756) | | 3) An Operating System for Memory-Augmented Generation in LLMs Introduces MemOS, a unified operating system for managing memory in LLMs, addressing a key limitation in current architectures: their lack of structured, persistent, and governable memory. While today's LLMs rely primarily on parametric memory (model weights) and limited short-term context, MemOS proposes a comprehensive memory lifecycle and management infrastructure designed to support continual learning, behavioral consistency, and knowledge evolution. Key contributions and components include:
● Three-tier memory taxonomy: MemOS distinguishes between parametric memory (long-term weights), activation memory (short-term runtime states), and plaintext memory (editable, external content). These types are unified through a shared abstraction called the Memory Cube (MemCube), enabling seamless transformation (e.g., plaintext to parametric) and lifecycle governance.
● MemCube abstraction: Each MemCube encapsulates memory metadata (creation time, type, access policies, etc.) and a semantic payload (text, tensors, LoRA patches). This enables dynamic scheduling, traceable updates, and interoperability between modules and agents.
● Modular OS-style architecture: MemOS consists of three layers: Interface (user/API interaction), Operation (memory scheduling, lifecycle management), and Infrastructure (storage, access governance). These layers work together to manage memory parsing, injection, transformation, and archival.
● Closed-loop execution flow: Every interaction (e.g., prompt response) can trigger memory operations governed by scheduling rules and lifecycle policies. Retrieved memory can be injected into generation, stored in archives, or transformed into other types for long-term use.
● Vision for a memory-centric future: The paper proposes “memory training” as the next frontier beyond pretraining and finetuning, enabling models that learn continuously. Future work includes cross-model memory sharing, self-evolving memory blocks, and a decentralized memory marketplace. | [Paper](https://arxiv.org/abs/2505.22101), [Tweet](https://x.com/omarsar0/status/1928116365640225222) | | 4) Building Production-Grade Conversational Agents with Workflow Graphs This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios. Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints. Key contributions and findings include:
● Multi-State DAG Framework: Each node in the graph corresponds to a conversational state with its own system prompt, tool access, and execution rules. This structure enables robust constraint handling (e.g., avoiding hallucinated responses or non-compliant suggestions) by localizing logic and formatting within specific graph nodes.
● Fine-Tuning via Response Masking: Because conversation turns come from different states in the DAG, the authors introduce a fine-tuning strategy that applies selective loss masking to train LLMs only on responses relevant to a specific node’s context. This prevents prompt conflicts and improves adherence to node-specific constraints.
● Real-World Deployment and Results: In a deployment across KakaoTalk and web platforms, the graph-based approach significantly outperformed baseline agents and even GPT-4o across key metrics like task accuracy (+52%) and format adherence (+50%). In human preference tests, their internal model was favored over GPT-4o in 63% of real-world user cases, especially in product recommendation and safety-critical tasks. | [Paper](https://arxiv.org/abs/2505.23006), [Tweet](https://x.com/omarsar0/status/1928492639906607297) | | 5) Spurious Rewards This work challenges prevailing assumptions about reinforcement learning with verifiable rewards (RLVR) in mathematical reasoning tasks. The authors show that Qwen2.5-Math models can improve significantly under RL, even when trained with spurious or flawed rewards.
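The findings below contrast ground-truth, format-based, and random rewards. A toy sketch of what such reward functions could look like for boxed math answers (the regex format and coin-flip probability are illustrative assumptions, not the authors' code):

```python
import random
import re

def ground_truth_reward(response: str, gold_answer: str) -> float:
    """1.0 only if the final \\boxed{...} answer matches the gold answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Spurious signal: rewards merely producing a \\boxed{...} answer, correct or not."""
    return 1.0 if re.search(r"\\boxed\{[^}]*\}", response) else 0.0

def random_reward(_: str) -> float:
    """Fully spurious signal: a coin flip, independent of the response."""
    return float(random.random() < 0.5)

resp = r"The area is \boxed{12}."
print(ground_truth_reward(resp, "12"), format_reward(resp), random_reward(resp))
```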
● Surprisingly effective spurious rewards: The Qwen2.5-Math-7B model gains +21.4% accuracy with random rewards, +16.4% with format-based rewards, and +24.6% when explicitly trained on incorrect answers. These are close to the +28.8% gain from ground-truth reward signals, suggesting that RLVR surfaces latent capabilities rather than teaching new reasoning skills.
● Model-specific generalization: Spurious rewards fail on other models like Llama3 or OLMo2. Only Qwen models consistently benefit, which the authors attribute to differences in pretraining. Notably, Qwen2.5-Math exhibits a unique “code reasoning” behavior, generating Python-like code to solve problems, which becomes more frequent post-RLVR and correlates strongly with accuracy.
● Mechanism behind gains: The authors trace performance improvements to a shift in reasoning strategies. Most of the gain comes from language→code transitions, where the model switches from natural language to code reasoning during RLVR. Interventions that explicitly increase code usage (e.g., rewarding code-like responses or using a code-forcing prompt) boost performance further, but only on Qwen models.
● Clipping bias enables learning from noise: Even with random rewards, performance improves due to GRPO’s clipping mechanism, which biases training toward reinforcing the model’s high-probability behaviors. These behaviors (e.g., code reasoning) happen to align with correctness in Qwen models but not in others. | [Paper](https://github.com/ruixin31/Rethink_RLVR/blob/main/paper/rethink-rlvr.pdf), [Tweet](https://x.com/StellaLisy/status/1927392717593526780) | | 6) Learn to Reason without External Rewards Proposes a method for training LLMs via reinforcement learning without any external rewards or labeled data. Instead, it uses the model’s own self-certainty, a confidence measure based on KL divergence from uniform, as the sole intrinsic reward. This self-improvement strategy, part of the broader Reinforcement Learning from Internal Feedback (RLIF) paradigm, bypasses the limitations of Reinforcement Learning with Verifiable Rewards (RLVR), which requires domain-specific verifiers and gold-standard outputs. Key highlights:
● INTUITOR matches GRPO without external supervision: When applied to mathematical reasoning tasks like GSM8K and MATH500, INTUITOR achieves performance on par with GRPO (a strong RLVR method), even without using gold solutions. On out-of-domain tasks such as LiveCodeBench and CRUXEval, INTUITOR generalizes better, achieving higher gains than GRPO (+65% vs. 0% and +76% vs. +44%, respectively).
● Rapid early learning and enhanced instruction-following: INTUITOR significantly boosts early training performance, particularly on models like Qwen2.5-1.5B, and improves adherence to chat-style instructions, reducing repetitive or nonsensical output.
● Emergent structured reasoning: Trained models display spontaneous reasoning even when not explicitly required, often generating explanations or planning steps before producing code or answers. This behavior correlates with better transfer performance to domains like code generation.
● Self-certainty as a robust, hack-resistant signal: Unlike fixed reward models prone to exploitation, online self-certainty adapts with the model and avoids reward hacking. INTUITOR-trained models show the strongest correlation between self-certainty and correct answers, confirmed by statistical tests. | [Paper](https://arxiv.org/abs/2505.19590), [Tweet](https://x.com/xuandongzhao/status/1927270931874910259) | | 7) Learn to Reason via Mixture-of-Thought While most prior approaches train with a single modality and only ensemble during inference, this work introduces Mixture-of-Thought (MoT) to jointly train and infer across modalities, resulting in notable gains in logical reasoning performance. Key findings:
● Three-modality synergy: MoT uses natural language for interpretability, code for structured procedural reasoning, and truth tables to explicitly enumerate logical cases. Error analysis shows that truth tables significantly reduce common LLM failure modes like missing branches or invalid converses.
● Self-evolving training: MoT introduces an iterative, on-policy training loop where the model generates, filters, and learns from its own multi-modal reasoning traces. This joint training outperforms both single-modality and partial-modality setups.
● Inference via voting: At test time, MoT generates predictions from each modality and selects the majority answer, leading to robust predictions. Results show up to +11.7pp average accuracy gains on FOLIO and ProofWriter, with 9B models matching GPT-4 + Logic-LM performance.
● Stronger on harder tasks: MoT delivers the largest improvements on problems with higher reasoning depth (5–8 steps). It also shows superior test-time scaling, with more diverse and accurate outputs under fixed inference budgets. MoT demonstrates that LLMs can achieve significantly more robust logical reasoning by reasoning like humans (using multiple modes of thought), not just by sampling more from a single modality. | [Paper](https://arxiv.org/abs/2505.15817), [Tweet](https://x.com/omarsar0/status/1925574200405721210) | | 8) QwenLong-L1 A new reinforcement learning framework that scales large reasoning models (LRMs) from short to long contexts using progressive context scaling and hybrid rewards. It achieves top performance on seven long-context benchmarks, surpassing models like OpenAI-o3-mini and Qwen3-235B-A22B, and matching Claude-3.7-Sonnet-Thinking, demonstrating strong reasoning with up to 120K token inputs. | [Paper](https://www.arxiv.org/abs/2505.17667) | | 9) End-to-End Policy Optimization for GUI Agents ARPO introduces an end-to-end reinforcement learning method for training GUI agents using Group Relative Policy Optimization (GRPO) with experience replay. It significantly improves in-domain performance on the OSWorld benchmark, outperforming baselines by up to 6.7%, while offering modest gains on out-of-domain tasks and enabling self-corrective behaviors through structured reward feedback. | [Paper](https://www.arxiv.org/abs/2505.16282), [Tweet](https://x.com/TsingYoga/status/1926646893175615943) | | 10) Generalist Agent Enabling Scalable Agentic Reasoning Proposes Alita, a generalist agent framework that enables scalable agentic reasoning through minimal predefinition and maximal self-evolution. Unlike traditional agents reliant on handcrafted tools, Alita autonomously constructs reusable MCPs (Model Context Protocols) using web search and code synthesis, outperforming more complex systems like OpenAI DeepResearch and OctoTools on GAIA, MathVista, and PathVQA benchmarks. | [Paper](https://www.arxiv.org/abs/2505.20286) | ## Top ML Papers of the Week (May 19 - May 25) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Visual Planning Proposes a novel reasoning paradigm that replaces language-based planning with image-based reasoning. The authors argue that language is not always the optimal medium for tasks involving spatial or physical reasoning. They introduce Visual Planning, where reasoning is executed as a sequence of visual states (images) without any text mediation, allowing models to “think” directly in images. This is realized through a reinforcement learning framework called VPRL (Visual Planning via Reinforcement Learning), which trains a vision-only model (LVM-3B) to plan using images. Key contributions and findings:
● Visual-only reasoning paradigm: The authors formally define planning as autoregressive visual state generation, trained using image-only data. Unlike multimodal LLMs that map vision to language and reason textually, this approach performs inference entirely in the visual modality, sidestepping the modality gap.
● VPRL framework: A two-stage training process is introduced. Stage 1 uses supervised learning on randomly sampled trajectories to ensure format consistency and promote exploration. Stage 2 applies GRPO (Group Relative Policy Optimization) to refine planning behavior via progress-based rewards, avoiding invalid or regressive moves.
● Superior performance: On three visual navigation tasks (FrozenLake, Maze, and MiniBehavior), VPRL outperforms language-based models (e.g., Gemini 2.5 Pro, Qwen 2.5-VL) by over 40% in Exact Match scores. It also generalizes better to out-of-distribution tasks (larger grid sizes), with visual planners degrading more gracefully than textual ones.
● Visual planning yields robustness and interpretability: Unlike textual outputs, visual plans enable step-by-step inspection and show stronger adherence to physical constraints. Qualitative examples illustrate how VPRL can avoid invalid moves and recover from non-optimal paths, while language models often hallucinate or misinterpret spatial layouts.
● Exploration and invalid action reduction: The random policy initialization in Stage 1 enables better exploration than supervised baselines (VPFT), as evidenced by higher entropy and fewer invalid actions. This leads to a more effective RL stage and ultimately stronger planning capabilities. | [Paper](https://arxiv.org/abs/2505.11409), [Tweet](https://x.com/_yixu/status/1924497238908375072) | | 2) EfficientLLM Introduces the first large-scale, empirical benchmark for evaluating efficiency trade-offs in LLMs across architecture, fine-tuning, and inference. Conducted on a high-performance cluster (48×GH200, 8×H200 GPUs), the study evaluates over 100 model–technique pairs spanning 0.5B–72B parameters, using six metrics: memory utilization, compute utilization, latency, throughput, energy consumption, and compression rate. Key insights include:
● No one-size-fits-all solution: Every efficiency technique improves some metrics while degrading others. For instance, MoE boosts accuracy and reduces FLOPs but increases VRAM usage by ~40%, while int4 quantization reduces memory and energy by up to 3.9× at a small 3–5% performance cost.
● Resource-specific optima: Efficiency depends on context. MQA achieves the best memory-latency trade-off for constrained devices; MLA has the lowest perplexity for high-quality generation; RSLoRA is more efficient than LoRA only for models above 14B parameters.
● Cross-modal transferability: Efficiency techniques like MQA and PEFT generalize well to vision and vision-language models, improving FID scores and maintaining strong trade-offs.
● Training and tuning: LoRA and DoRA perform best for small models (1–3B), while RSLoRA excels at large scale (≥14B). Parameter freezing achieves the lowest latency but at a slight cost to accuracy.
● Inference: int4 post-training quantization yields the highest compression and throughput gains with minor quality degradation, while bfloat16 consistently outperforms float16 in latency and energy on modern GPUs. | [Paper](https://arxiv.org/abs/2505.13840), [Tweet](https://x.com/omarsar0/status/1925191664475222186) | | 3) J1 Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. Instead of relying solely on prompting or preference fine-tuning, J1 employs online reinforcement learning with verifiable rewards to teach models to think through evaluations systematically. Key insights:
● Verifiable framing for judgment: J1 converts both verifiable (e.g., math) and non-verifiable (e.g., user queries) prompts into tasks with verifiable rewards by generating synthetic preference pairs. This reframing enables the use of reinforcement learning and consistent training signals across diverse tasks.
● Chain-of-thought-driven RL optimization: J1 trains models to reason through evaluations via explicit thought traces, including outlining evaluation criteria, reference answer generation, and self-comparison before producing judgments. Two model types are trained: Pairwise-J1 (outputs verdicts) and Pointwise-J1 (outputs quality scores). Pairwise-J1 models are further improved by consistency rewards to reduce positional bias.
● Superior performance at scale: J1-Llama-8B and J1-Llama-70B outperform existing 8B and 70B LLM judges across five benchmarks (PPE, RewardBench, RM-Bench, JudgeBench, FollowBenchEval), beating models trained with much more data like DeepSeek-GRM and distillations of DeepSeek-R1. J1-70B even surpasses o1-mini and closes the gap with the much larger R1 model, particularly on non-verifiable tasks.
● Pointwise-J1 mitigates positional bias: While pairwise judges can flip verdicts based on response order, Pointwise-J1 (trained only from pairwise supervision) offers position-consistent scoring with fewer ties and better consistency. Both judge types benefit from test-time scaling via self-consistency, further improving reliability. | [Paper](https://arxiv.org/abs/2505.10320), [Tweet](https://x.com/jaseweston/status/1923186392420450545) | | 4) The Pitfalls of Reasoning for Instruction-Following in LLMs Explores an unexpected flaw in reasoning-augmented large language models (RLLMs): while chain-of-thought (CoT) prompting often boosts performance on complex reasoning tasks, it can degrade instruction-following accuracy. The authors evaluate 15 models (e.g., GPT, Claude, LLaMA, DeepSeek) on two instruction-following benchmarks and find that CoT prompting consistently reduces performance across nearly all models and datasets. Key findings:
● Reasoning hurts instruction adherence: On IFEval, 13 of 14 models saw accuracy drops with CoT; all 15 models regressed on ComplexBench. For example, Meta-LLaMA3-8B’s IFEval accuracy dropped from 75.2% to 59.0% with CoT. Even reasoning-tuned models like Claude3.7-Sonnet-Think performed slightly worse than their base counterparts.
● Why reasoning fails: Manual case studies show CoT can help with structural formatting (e.g., JSON or Markdown) and precise lexical constraints (like exact punctuation). But it often hurts by (a) neglecting simple constraints during high-level content planning and (b) inserting helpful but constraint-violating content (e.g., translations in language-restricted outputs).
● Attention-based diagnosis: The authors introduce a constraint attention metric and find that CoT reduces the model's focus on instruction-relevant tokens, especially in the answer generation phase. This diminished constraint awareness correlates with performance drops.
● Mitigation strategies: Four techniques are proposed to apply reasoning selectively, i.e., only where it helps rather than harms constraint adherence. | [Paper](https://arxiv.org/abs/2505.11423), [Tweet](https://x.com/omarsar0/status/1924458157444579700) | | 5) Generalizable AI Predicts Immunotherapy Outcomes Across Cancers and Treatments Introduces COMPASS, a concept bottleneck-based foundation model that predicts patient response to immune checkpoint inhibitors (ICIs) using tumor transcriptomic data. Unlike prior biomarkers (TMB, PD-L1, or fixed gene signatures), COMPASS generalizes across cancer types, ICI regimens, and clinical contexts with strong interpretability and performance. Key contributions:
● Concept Bottleneck Architecture: COMPASS transforms transcriptomic data into 44 high-level immune-related concepts (e.g., T cell exhaustion, IFN-γ signaling, macrophage activity) derived from 132 curated gene sets. This structure provides mechanistic interpretability while enabling pan-cancer modeling.
● Pan-Cancer Pretraining and Flexible Fine-Tuning: Trained on 10,184 tumors across 33 cancer types using contrastive learning, and evaluated on 16 ICI-treated clinical cohorts (7 cancers, 6 ICI drugs). COMPASS supports full, partial, linear, and zero-shot fine-tuning modes, making it robust in both data-rich and data-poor settings.
● Superior Generalization and Accuracy: In leave-one-cohort-out testing, COMPASS improved precision by 8.5%, AUPRC by 15.7%, and MCC by 12.3% over 22 baseline methods. It also outperformed in zero-shot settings, across drug classes (e.g., predicting anti-CTLA4 outcomes after training on anti-PD1), and in small-cohort fine-tuning.
● Mechanistic Insight into Resistance: Personalized response maps reveal actionable biological mechanisms. For instance, inflamed non-responders show resistance via TGF-β signaling, vascular exclusion, CD4+ T cell dysfunction, or B cell deficiency. These go beyond classical “inflamed/desert/excluded” phenotypes, offering nuanced patient stratification.
● Clinical Utility and Survival Stratification: COMPASS-predicted responders had significantly better survival in a held-out phase II bladder cancer trial (HR = 4.7, *p* = 1.7e-7), outperforming standard biomarkers (TMB, PD-L1 IHC, immune phenotype). | [Paper](https://www.medrxiv.org/content/10.1101/2025.05.01.25326820v1) | | 6) Towards a Deeper Understanding of Reasoning in LLMs This paper investigates whether LLMs can adapt and reason in dynamic environments, moving beyond static benchmarks. Using the SmartPlay benchmark—a suite of four interactive games that require diverse cognitive skills—the authors evaluate three prompting strategies: self-reflection, heuristic mutation (via an Oracle), and planning. They test these methods across models of varying size (Llama3-8B to Llama3.3-70B) and draw several conclusions on how model scale and prompting interact with task complexity. Key findings:
● Model size dominates performance, especially on reactive and structured reasoning tasks. Larger models (e.g., Llama3.3-70B) significantly outperform smaller ones on tasks like Tower of Hanoi and Bandit, where fast exploitation or spatial planning is critical.
● Advanced prompting helps smaller models more, particularly on complex tasks. For example, Llama3-8B with Reflection+Oracle surpasses Llama3.3-70B’s baseline on Rock-Paper-Scissors. However, these strategies introduce high variance and can lead to worse-than-baseline performance depending on the run.
● Long prompts hurt smaller models on simple tasks. In Bandit, adding reflective reasoning decreases performance by distracting the model or prolonging exploration. This aligns with prior findings on prompt length and signal-to-noise ratio.
● Prompting strategy gains depend on task type. Instruction following improves across all models, while long-text understanding benefits mid-sized models. In contrast, strategies show weak or negative impact on planning, reasoning, and spatial challenges for large models.
● Dense reward shaping improves performance more reliably than prompting. In follow-up experiments, modifying sparse reward signals (especially in Hanoi and Messenger) led to more consistent gains than tweaking prompt strategies. | [Paper](https://arxiv.org/abs/2505.10543), [Tweet](https://x.com/omarsar0/status/1924182825693061403) | | 7) AdaptThink This paper introduces AdaptThink, an RL framework designed to help reasoning models decide when to use detailed chain-of-thought reasoning (“Thinking”) versus directly producing an answer (“NoThinking”), based on task difficulty. This approach challenges the prevailing assumption that deep reasoning should be applied uniformly across all problems, showing that skipping the “thinking” step often yields better efficiency and even higher accuracy on simpler tasks. Key insights:
● NoThinking outperforms Thinking on simple problems: The authors demonstrate that models like DeepSeek-R1 perform better (in both accuracy and efficiency) when using NoThinking mode (prompting the model with an empty thinking segment so it answers directly) for easy problems. For example, on Level 1 MATH500 problems, NoThinking achieved slightly better accuracy with significantly fewer tokens used.
● AdaptThink learns to switch modes: The proposed RL algorithm introduces a constrained optimization that promotes NoThinking as long as accuracy doesn’t degrade. It uses a novel importance sampling strategy to enable cold-start learning of both modes from the beginning, avoiding the collapse into all-Thinking behavior.
● Massive gains in efficiency and performance: On GSM8K, MATH500, and AIME 2024, AdaptThink reduced response length by up to 53% and improved accuracy by up to 2.4% over DeepSeek-R1-Distill-Qwen-1.5B. It also outperformed prior methods (e.g., DPOShortest, TLMRE, ModelMerging) in the trade-off between accuracy and response length.
● Robustness and generalization: AdaptThink generalizes to out-of-distribution tasks such as MMLU, maintaining or improving accuracy while reducing token usage. It also avoids "implicit thinking" in NoThinking responses, showing controlled behavior during inference. | [Paper](https://arxiv.org/abs/2505.13417) | | 8) MedBrowseComp MedBrowseComp is a new benchmark designed to evaluate LLM agents’ ability to perform complex, multi-hop medical fact-finding by browsing real-world, domain-specific web resources. Testing over 1,000 clinically grounded questions, the benchmark reveals major capability gaps in current models, with top systems achieving only 50% accuracy and GUI-based agents performing even worse. | [Paper](https://arxiv.org/abs/2505.14963), [Tweet](https://x.com/shan23chen/status/1925549357308236029) | | 9) ARC-AGI-2 ARC-AGI-2 is a new benchmark designed to push the boundaries of AI reasoning beyond the original ARC-AGI. It introduces harder, more unique tasks emphasizing compositional generalization and human-like fluid intelligence, with baseline AI models performing below 5% accuracy despite strong ARC-AGI-1 results. | [Paper](https://arxiv.org/abs/2505.11831), [Tweet](https://x.com/arcprize/status/1924869061542085041) | | 10) Teaching MLLMs to Think with Images GRIT is a new method that enables MLLMs to perform grounded visual reasoning by interleaving natural language with bounding box references. Using a reinforcement learning approach (GRPO-GR), GRIT achieves strong reasoning and grounding performance with as few as 20 image-question-answer triplets, outperforming baselines in both accuracy and visual coherence. | [Paper](https://arxiv.org/abs/2505.15879), [Tweet](https://x.com/YFan_UCSC/status/1925719736043569188) | ## Top ML Papers of the Week (May 12 - May 18) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) AlphaEvolve AlphaEvolve is a coding agent developed by Google DeepMind that uses LLM-guided evolution to discover new algorithms and optimize computational systems. It orchestrates a pipeline where LLMs generate code changes, evaluators provide feedback, and an evolutionary loop iteratively improves solutions. AlphaEvolve shows that LLMs can go beyond conventional code generation and assist in scientific and algorithmic discovery. Key highlights:
● Novel Algorithm Discovery: AlphaEvolve discovered a new algorithm to multiply 4×4 complex-valued matrices using 48 multiplications, the first improvement over Strassen’s 1969 result (49 multiplications) in this setting.
● Broad Mathematical Impact: Applied to 50+ open problems in mathematics, AlphaEvolve matched or exceeded state-of-the-art in ~95% of cases. For example, it improved bounds on Erdős’s minimum overlap problem and kissing numbers in 11 dimensions.
● Infrastructure Optimization at Google: AlphaEvolve improved key components of Google’s compute stack, including data-center scheduling heuristics, matrix-multiplication kernels used in Gemini training, and aspects of upcoming TPU circuit design.
● Advanced Pipeline Design: AlphaEvolve uses ensembles of Gemini 2.0 Flash and Pro models. It supports rich prompts (past trials, evaluations, explicit context), multi-objective optimization, and evaluation cascades for robust idea filtering. Programs are evolved at full-file scale rather than function-level only, a key differentiator from predecessors like FunSearch.
● Ablations Confirm Component Importance: Experiments show that evolution, prompt context, full-file evolution, and using strong LLMs all contribute significantly to performance. Removing any one of these reduces effectiveness. | [Paper](https://storage.googleapis.com/deepmind-DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1922669321559347498) | | 2) LLMs Get Lost in Multi-Turn Conversation Investigates how top LLMs degrade in performance during underspecified, multi-turn interactions, common in real-world usage but rarely evaluated. The authors introduce a novel "sharded simulation" framework that breaks down fully-specified instructions into gradual conversation shards, simulating how users naturally provide information over time. Key findings:
● Massive performance drop: Across 15 top LLMs (e.g., GPT-4.1, Gemini 2.5 Pro, Claude 3.7), average performance dropped 39% in multi-turn vs. single-turn settings. Even a two-turn interaction was enough to cause a significant decline.
● High unreliability, not just low aptitude: Decomposition shows only a small drop in best-case capability (aptitude) but a 112% increase in unreliability, meaning models are wildly inconsistent depending on how the conversation unfolds.
● Root causes of failure: Through log analysis and experiments, the paper identifies four major failure modes, notably premature attempts at a full answer before requirements are fully specified and over-reliance on the model’s own earlier (often incorrect) responses.
● Sharded evaluation tasks: The authors built 600+ multi-turn simulations across 6 tasks (coding, math, SQL, API calls, summarization, and table captioning), showing consistent degradation across domains.
● Agent-style interventions only partially help: Techniques like recap and snowballing (repeating all prior turns) improved outcomes by ~15–20% but did not restore single-turn levels, suggesting that model internals, not prompting strategies, are the bottleneck.
● Temperature and test-time compute don't solve the issue: Even at temperature 0.0 or with reasoning models (like o3 and DeepSeek-R1), models remained highly unreliable in multi-turn settings. | [Paper](https://arxiv.org/abs/2505.06120), [Tweet](https://x.com/omarsar0/status/1922755721428598988) | | 3) RL for Reasoning in LLMs with One Training Example This paper shows that Reinforcement Learning with Verifiable Rewards (RLVR) can significantly improve mathematical reasoning in LLMs even when trained with just a single example. On the Qwen2.5-Math-1.5B model, one-shot RLVR improves accuracy on the MATH500 benchmark from 36.0% to 73.6%, nearly matching performance achieved with over 1,200 examples. Two-shot RLVR (with two examples) even slightly surpasses that, matching results from full 7.5k example training.
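As background for the findings below, a toy NumPy sketch of REINFORCE-style learning from a single verifiable example with an entropy bonus; this is a didactic stand-in for the GRPO/PPO training used in the paper, not its implementation, and the four-way answer space is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one training question with four candidate answers; index 2 is correct.
logits = np.zeros(4)           # the "policy" over candidate answers
correct = 2
lr, entropy_coef = 0.5, 0.01   # small entropy bonus, echoing the paper's entropy-loss finding

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(200):
    probs = softmax(logits)
    a = rng.choice(4, p=probs)              # sample an answer
    reward = 1.0 if a == correct else 0.0   # binary verifiable reward
    grad_logp = np.eye(4)[a] - probs        # gradient of log pi(a) w.r.t. logits
    entropy = -(probs * np.log(probs)).sum()
    grad_entropy = -probs * (np.log(probs) + entropy)  # gradient of H(pi) w.r.t. logits
    logits += lr * (reward * grad_logp + entropy_coef * grad_entropy)

print(softmax(logits))  # probability mass concentrates on the correct answer
```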
● Extreme data efficiency: A single training example (π₁₃) boosts MATH500 accuracy to 73.6% and average performance across six math benchmarks to 35.7%, rivaling full-dataset RLVR. Two-shot RLVR goes further (74.8% and 36.6%).
● Broad applicability: 1-shot RLVR works not only on Qwen2.5-Math-1.5B, but also on Qwen2.5-Math-7B, Llama3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. It remains effective across GRPO and PPO RL algorithms.
● Post-saturation generalization: Despite training accuracy saturating early (within 100 steps), test accuracy continues improving well beyond, reaching gains of +10% after 2,000 steps. The model eventually overfits the single example (mixing gibberish into outputs), yet test performance remains stable.
● Cross-domain and reflection behavior: A single example from one domain (e.g., geometry) improves performance across others (e.g., number theory). Additionally, models trained with 1-shot RLVR exhibit increased self-reflection (e.g., “rethink”, “recalculate”) and longer output sequences.
● Loss function insights: Ablation studies confirm that policy gradient loss is the primary driver of improvements, not weight decay, distinguishing 1-shot RLVR from "grokking". Entropy loss further enhances performance and generalization; even without reward signals, entropy-only training can still yield a 27% performance boost. | [Paper](https://arxiv.org/abs/2504.20571), [Tweet](https://x.com/ypwang61/status/1917596101953348000) | | 4) AM-Thinking-v1 Introduces a dense, open-source 32B language model that achieves state-of-the-art performance in reasoning tasks, rivaling significantly larger Mixture-of-Experts (MoE) models. Built upon Qwen2.5-32B, the model is trained entirely with public data and showcases how a meticulously crafted post-training pipeline can unlock competitive performance at mid-scale sizes. Key points:
● Benchmark performance: AM-Thinking-v1 scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 (671B MoE) and matching or exceeding Qwen3-32B and Seed1.5-Thinking. On Arena-Hard (general chat), it hits 92.5, near the level of OpenAI o1 and o3-mini but behind Qwen3-235B-A22B and Gemini 2.5 Pro.
● Training pipeline: The model uses a two-stage post-training approach combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT emphasizes a “think-then-answer” format and uses 2.84M samples, while RL incorporates difficulty-aware sampling and a two-stage curriculum optimized via Group Relative Policy Optimization (GRPO).
● Data and filtering: All training data is publicly sourced and heavily filtered. Math data goes through LLM-assisted cleaning and cross-model ground-truth validation. Responses are filtered using perplexity, n-gram repetition, and structural checks to ensure coherence and correctness.
● Inference and deployment: The authors implement a custom rollout framework that decouples rollout from inference via a streaming load balancer. This reduces long-tail latency and increases throughput across distributed GPU nodes, enabling scalable RL training at 32k sequence length. | [Paper](https://arxiv.org/abs/2505.08311), [Tweet](https://x.com/omarsar0/status/1922668488826741061) | | 5) HealthBench HealthBench is a benchmark of 5,000 multi-turn health conversations graded against 48,562 rubric criteria written by 262 physicians across 60 countries. Unlike prior multiple-choice evaluations, HealthBench supports open-ended, realistic assessments of LLM responses across diverse health themes (e.g., global health, emergency care, context-seeking) and behavioral axes (accuracy, completeness, communication, context awareness, instruction following).
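One plausible way to turn physician-written rubric criteria into a single score for an open-ended response is sketched below. The point weights, the clipping to [0, 1], and the model-based `judge` call are illustrative assumptions, not necessarily HealthBench's exact aggregation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g., "Advises seeking emergency care for chest pain"
    points: int        # positive for desirable behavior, negative for harmful

def grade_response(response: str, rubric: list[RubricCriterion], judge) -> float:
    """Aggregate rubric criteria into one score.

    `judge(response, criterion)` is a hypothetical model-based grader that
    returns True if the criterion is satisfied by the response.
    """
    earned = sum(c.points for c in rubric if judge(response, c.description))
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    return max(0.0, min(1.0, earned / max_points))  # clip to [0, 1] (assumed)

# Toy usage with a trivial keyword "judge".
rubric = [RubricCriterion("Recommends seeing a clinician", 5),
          RubricCriterion("Gives a definitive diagnosis without an exam", -3)]
score = grade_response("Please see a clinician to be safe.", rubric,
                       judge=lambda r, c: "clinician" in r and "Recommends" in c)
print(round(score, 2))
```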
● Significant frontier model gains: HealthBench reveals rapid performance improvements, with GPT-3.5 Turbo scoring 16%, GPT-4o reaching 32%, and o3 achieving 60%. Notably, smaller models like GPT-4.1 nano outperform GPT-4o while being 25x cheaper.
● Two challenging benchmark variants: HealthBench Consensus focuses on 34 physician-validated criteria (e.g., recognizing emergencies), while HealthBench Hard isolates 1,000 difficult examples on which no model scores above 32%, establishing headroom for future progress.
● Physician comparison baseline: Surprisingly, LLMs like o3 and GPT-4.1 often produce higher-quality responses than unassisted physicians. When provided with model responses as references, physicians improved older model completions but couldn’t improve completions from newer models.
● Reliable model-based grading: Meta-evaluation shows GPT-4.1 as a grader achieves macro F1 scores comparable to physicians. On average, its agreement with physician graders places it in the 51st–88th percentile of physician-physician agreement across themes like emergency triage, communication, and uncertainty handling.
● Safety-relevant insights: The benchmark assesses worst-case performance using "worst-at-k" scores, showing that even the best models have reliability gaps. For example, o3’s worst-at-16 score drops by a third from its average, underscoring the need for further safety work. | [Paper](https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf), [Tweet](https://x.com/OpenAI/status/1921983050138718531) | | 6) Nemotron-Research-Tool-N1 Introduces Tool-N1, a family of tool-using LLMs trained using a rule-based reinforcement learning (R1-style RL) approach, without reliance on supervised reasoning trajectories. The key idea is to enable models to learn to invoke external tools correctly through binary feedback based on functional correctness and format adherence, rather than step-by-step imitation.
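As a concrete illustration of the rule-based binary reward described in the bullets below, here is a minimal sketch: the rollout earns credit only if it follows the expected reasoning/answer format and the emitted tool call matches the ground truth. The tag names and JSON call format are assumptions, not the paper's exact schema.

```python
import json
import re

def tool_call_reward(rollout: str, gold_call: dict) -> float:
    """Binary reward: 1.0 only if BOTH the format and the tool call are correct."""
    # Format check (assumed tags): reasoning in <think>, call in <tool_call>.
    fmt = re.search(r"<think>.*?</think>\s*<tool_call>(.*?)</tool_call>",
                    rollout, re.DOTALL)
    if not fmt:
        return 0.0
    try:
        call = json.loads(fmt.group(1))
    except json.JSONDecodeError:
        return 0.0
    # Functional check: exact match of tool name and arguments.
    correct = (call.get("name") == gold_call["name"]
               and call.get("arguments") == gold_call["arguments"])
    return 1.0 if correct else 0.0

rollout = ('<think>The user wants the weather in Paris.</think>'
           '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>')
print(tool_call_reward(rollout, {"name": "get_weather",
                                 "arguments": {"city": "Paris"}}))  # 1.0
```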
● Rule-based RL over SFT: Tool-N1 models are trained using a lightweight binary reward that only evaluates whether the model's tool calls are structurally correct and functionally valid. This allows the model to develop its reasoning process, sidestepping the limitations of mimicking distilled trajectories via supervised fine-tuning (SFT).
● Strong benchmark results: Tool-N1-7B and Tool-N1-14B outperform GPT-4o and domain-specialized models on several benchmarks, including BFCL, API-Bank, and ACEBench. For example, Tool-N1-14B beats GPT-4o on BFCL overall (85.97 vs 83.97) and achieves +5% over GPT-4o on API-Bank.
● Pure RL outperforms SFT-then-RL: A systematic comparison on 5,518 distilled trajectories shows that pure RL yields better results than the SFT-then-RL pipeline, challenging the dominant paradigm. For instance, 100% RL achieves 83.24% average vs. 83.17% for SFT+RL.
● Binary vs. fine-grained reward: Ablation studies reveal that strict binary rewards (requiring correct reasoning format and exact tool call) lead to better generalization than partial credit schemes, especially on realistic “Live” data (80.38% vs 76.61%).
● Scaling and generalization: Performance scales well with model size, with the most gains observed in larger models. The method generalizes across backbones, with Qwen2.5-Instruct outperforming LLaMA3 variants at the same scale. | [Paper](https://arxiv.org/abs/2505.00024), [Tweet](https://x.com/ShaokunZhang1/status/1922105694167433501) | | 7) RL for Search-Efficient LLMs Proposes a new RL-based framework (SEM) that explicitly teaches LLMs when to invoke search and when to rely on internal knowledge, aiming to reduce redundant tool use while maintaining answer accuracy. Key points:
● Motivation & Setup: LLMs often overuse external search even for trivial queries. SEM addresses this by using a balanced training dataset (Musique for unknowns, MMLU for knowns) and a structured output format that separates the model's reasoning, optional search calls, and the final answer.
● Reward Optimization: The authors employ Group Relative Policy Optimization (GRPO) to compare outputs within query groups. The reward function penalizes unnecessary search and rewards correct answers, either without search or with efficient search-and-reasoning when needed.
● Experimental Results: On HotpotQA and MuSiQue, SEM significantly outperforms Naive RAG and ReSearch, achieving higher EM and LLM-Judged (LJ) accuracy with smarter search ratios. On MMLU and GSM8K (where search is often unnecessary), SEM maintains high accuracy while invoking search far less than baseline methods (e.g., 1.77% SR vs 47.98% for Naive RAG on MMLU).
● Case Study & Efficiency: SEM avoids absurd search behavior like querying “What is 1+1?” multiple times. It also uses fewer but more targeted searches for unknowns, enhancing both interpretability and computational efficiency. Training dynamics further show that SEM enables faster and more stable learning than prior methods. | [Paper](https://arxiv.org/abs/2505.07903), [Tweet](https://x.com/omarsar0/status/1922665313117552664) | | 8) Cost-Efficient, Low-Latency Vector Search Integrates DiskANN (a vector indexing library) inside Azure Cosmos DB NoSQL (an operational database), using a single vector index per partition stored in existing index trees. Benefit: It supports < 20ms query latency over an index spanning 10 million vectors, has stable recall over updates, and offers nearly 15× and 41× lower query cost compared to Zilliz and Pinecone serverless enterprise products. It can further scale to billions of vectors with automatic partitioning. | [Paper](https://arxiv.org/abs/2505.05885), [Tweet](https://x.com/omarsar0/status/1921938925142384736) | | 9) AI Agents vs. Agentic AI This review paper distinguishes AI Agents from Agentic AI, presenting a structured taxonomy and comparing their architectures, capabilities, and challenges. AI Agents are defined as modular, task-specific systems powered by LLMs and tools, while Agentic AI represents a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy, with applications and challenges mapped out for both paradigms, along with proposed solutions like RAG, orchestration layers, and causal modeling. | [Paper](https://arxiv.org/abs/2505.10468), [Tweet](https://x.com/omarsar0/status/1923817691455873420) | | 10) CellVerse Introduces a benchmark to evaluate LLMs on single-cell biology tasks by converting multi-omics data into natural language. While generalist LLMs like DeepSeek and GPT-4 families show some reasoning ability, none significantly outperform random guessing on key tasks like drug response prediction, exposing major gaps in biological understanding by current LLMs. | [Paper](https://arxiv.org/abs/2505.07865), [Tweet](https://x.com/omarsar0/status/1922662317986099522) | ## Top ML Papers of the Week (May 5 - May 11) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The Leaderboard Illusion The Leaderboard Illusion investigates systemic distortions in how the Chatbot Arena leaderboard evaluates LLMs, arguing that current practices undermine fair model comparison and scientific progress. Through extensive data analysis covering 2M Arena battles, the authors identify four key issues distorting rankings:
● Selective score reporting through private testing: Some providers (notably Meta, Google, and OpenAI) are allowed to test dozens of model variants privately and only publish the best-performing one. This violates the unbiased sampling assumption of the Bradley-Terry (BT) model, which powers Arena rankings. Simulations show that testing just 10 variants can artificially inflate a model’s Arena score by ~100 points.
● Extreme data asymmetries: Proprietary models are oversampled compared to open-weight and open-source models. OpenAI and Google alone received over 39% of all Arena data, while 83 open-weight models collectively received only 29.7%. These data advantages translate into significant performance gains: a model trained on 70% Arena data outperforms its baseline by 112% on the ArenaHard benchmark.
● Unfair and opaque deprecations: 205 models were silently removed from the leaderboard despite only 47 being officially marked as deprecated. Open-source models are disproportionately affected, breaking the comparison graph and violating BT model assumptions, leading to unreliable rankings.
● Overfitting to Arena-specific dynamics: Due to partial prompt repetition and distributional drift over time, access to Arena data allows providers to tune models specifically for Arena performance. This leads to high win rates on Arena benchmarks, but not on out-of-distribution tasks like MMLU, where gains diminish or reverse. | [Paper](https://arxiv.org/abs/2504.20879) | | 2) Llama-Nemotron NVIDIA introduces the Llama-Nemotron model series, LN-Nano (8B), LN-Super (49B), and LN-Ultra (253B), a family of open, efficient, and high-performing reasoning models. These models rival or outperform DeepSeek-R1 on various benchmarks while offering significantly better inference throughput and memory efficiency. LN-Ultra is noted as the most "intelligent" open model by Artificial Analysis. A key innovation is a dynamic reasoning toggle ("detailed thinking on/off") that allows users to control reasoning behavior at inference time. Highlights:
● Multi-stage training: Models were built via neural architecture search (Puzzle), knowledge distillation, continued pretraining, supervised fine-tuning (SFT), and large-scale RL. LN-Ultra is enhanced with FP8 inference and FFN Fusion for speed and scalability.
● Reasoning Toggle: The models can switch between reasoning and non-reasoning modes via a simple prompt instruction, making them adaptable for various use cases.
● Synthetic dataset: Over 33M examples across math, code, science, and instruction-following were curated, with reasoning-mode samples tagged explicitly. LN-Ultra's training used curriculum RL and GRPO to surpass its teachers on benchmarks like GPQA-D.
● Evaluation dominance: LN-Ultra outperforms DeepSeek-R1 and Llama-3.1-405B in reasoning tasks like AIME25, MATH500, and GPQA-Diamond while also achieving strong chat alignment scores (Arena-Hard: 87.0). LN-Super scores 88.3, beating Claude 3.5 and GPT-4o. NVIDIA provides the weights, training code (NeMo, Megatron-LM, NeMo-Aligner), and the full post-training dataset under a permissive license, aiming to push open research in reasoning models. | [Paper](https://arxiv.org/abs/2505.00949v1), [Models](https://huggingface.co/nvidia) | | 3) Absolute Zero Introduces an LLM training framework that eliminates the need for human-curated data. Key highlights:
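As a rough illustration of the propose-and-solve self-play loop summarized in the highlights below, here is a minimal, hypothetical sketch with Python execution standing in for the verifiable environment. The `llm_propose`/`llm_solve` helpers and the deduction-style task format are simplifying assumptions, not the paper's implementation.

```python
# Illustrative sketch of one self-play propose/solve step (deduction mode):
# the same model proposes a (program, input) task, the environment executes
# it to obtain the ground-truth output, then the model tries to predict it.

def run_program(src: str, x):
    """Verifiable environment: execute the proposed program on the input."""
    scope: dict = {}
    exec(src, scope)          # trusted toy example; real systems sandbox this
    return scope["f"](x)

def self_play_step(llm_propose, llm_solve) -> float:
    program, test_input = llm_propose()          # task proposal
    target = run_program(program, test_input)    # verified label, no humans
    prediction = llm_solve(program, test_input)  # deduction: predict f(input)
    solver_reward = 1.0 if prediction == target else 0.0
    # The proposer is separately rewarded for tasks of moderate difficulty
    # (learnability), i.e., neither always solved nor never solved.
    return solver_reward

# Toy stand-ins for the single LLM playing both roles.
reward = self_play_step(
    llm_propose=lambda: ("def f(x):\n    return x * 2 + 1", 3),
    llm_solve=lambda prog, x: 7,
)
print(reward)  # 1.0
```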
● The model learns to propose and solve its own reasoning tasks entirely through self-play, guided by verifiable feedback from an execution environment. This zero-data RLVR (RL with Verifiable Rewards) setting achieves SOTA coding and math reasoning performance.
● The Absolute Zero Reasoner (AZR) learns by generating its own code-based reasoning tasks using three core reasoning modes (deduction, abduction, and induction), validating solutions via Python execution rather than human labels.
● A single LLM plays both roles, proposing new tasks based on learnability and solving them with feedback-based reinforcement. Rewards favor moderately difficult tasks to maximize the learning signal.
● Despite using zero in-domain examples, AZR outperforms all previous zero-setting models on average by +1.8 points and even beats models trained on tens to hundreds of thousands of curated samples. AZR-Coder-7B achieves the highest average score across all tested models.
● AZR trained in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, far more than expert code models trained with RLVR, showing strong generalization.
● Larger AZR models (3B → 7B → 14B) consistently show greater improvements, confirming scalability and suggesting promise for even larger models.
● AZR develops natural ReAct-like intermediate planning in code (e.g., interleaved comments and logic), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically observed in much larger models.
● Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains (dubbed “uh-oh moments”), highlighting the importance of safety-aware training in autonomous systems. | [Paper](https://arxiv.org/abs/2505.03335), [Tweet](https://x.com/AndrewZ45732491/status/1919920459748909288) | | 4) Discuss-RAG This paper introduces Discuss-RAG, a plug-and-play agent-based framework that enhances retrieval-augmented generation (RAG) for medical question answering by mimicking human-like clinical reasoning. Standard RAG systems rely on embedding-based retrieval and lack mechanisms to verify relevance or logical coherence, often leading to hallucinations or outdated answers. Discuss-RAG addresses these gaps via a modular agent setup that simulates multi-turn medical discussions and performs post-retrieval verification. Key ideas:
● Multi-agent collaboration: A summarizer agent orchestrates a team of medical domain experts who iteratively refine a contextual summary through simulated brainstorming, providing deeper and more structured information to guide retrieval.
● Decision-making agent: After retrieval, a verifier and a decision-making agent assess snippet quality and trigger fallback strategies when relevance is low, improving answer accuracy and contextual grounding.
● Plug-and-play design: Discuss-RAG is training-free and modular, allowing easy integration into existing RAG pipelines.
● Strong performance gains: Across four benchmarks, Discuss-RAG outperforms MedRAG with substantial accuracy improvements, notably +16.67% on BioASQ and +12.20% on PubMedQA. | [Paper](https://arxiv.org/abs/2504.21252) | | 5) The Value of RL in Fine-Tuning This work shows that, in theory, every popular preference-fine-tuning objective collapses to maximum-likelihood estimation (MLE), yet experiments show a consistent RL advantage on real tasks. They reconcile this gap with a generation-verification complexity hypothesis.
● Theory: RLHF ≈ MLE – Under mild assumptions, trajectory-level RLHF, DPO, and related algorithms are equivalent to projecting the data back to likelihood space, so expending compute on on-policy sampling should be unnecessary.
● Empirics contradict naïve theory – On the tl;dr summarization benchmark with Pythia-1.4B/2.8B, a single online-DPO iteration lifts win-rate by 6-10 pts over offline DPO despite identical data, model, and optimizer, confirming that RL can add real value.
● Takeaways – RL helps when crafting a good answer is harder than checking one. The gap vanishes on two-word summaries (horizon = 1) or when ROUGE-L is used as the reward. RL acts as a shortcut through policy space only when the reward model is simpler than the policy it trains. For tasks where verification is as hard as generation, offline likelihood-based fine-tuning suffices, guiding practitioners on when RLHF is worth its extra cost. | [Paper](https://arxiv.org/abs/2503.01067) | | 6) WebThinker This paper introduces a reasoning agent framework that equips large reasoning models (LRMs) with autonomous web exploration and report writing abilities to overcome limitations of static internal knowledge. WebThinker integrates a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy that lets models search the web, reason through tasks, and generate comprehensive outputs simultaneously. It also incorporates an RL-based training loop using online DPO to improve tool usage. The system supports two modes: complex problem solving and scientific report generation. Key points:
● Superior performance in complex reasoning: On GPQA, GAIA, WebWalkerQA, and HLE, WebThinker-32B-RL achieved new state-of-the-art results among 32B models, outperforming both retrieval-augmented and proprietary systems like GPT-4o and DeepSeek-R1-671B. For example, it reached 70.7% on GPQA and 15.8% on HLE, with gains of up to +21.5% over baselines.
● Best-in-class scientific report writing: On the Glaive dataset, WebThinker outperformed Gemini2.0 Deep Research and Grok3 DeeperSearch, scoring 8.1 in average quality metrics such as completeness and coherence.
● RL refinement matters: The RL-trained version outperformed its base counterpart across all benchmarks, showing that iterative preference-based learning significantly enhances reasoning-tool coordination.
● Ablation validates design: Removing components like Deep Web Explorer or automatic report drafting significantly degraded performance, confirming their necessity. | [Paper](https://arxiv.org/abs/2504.21776) | | 7) Reward Modeling as Reasoning This work proposes a new class of reward models, called ReasRMs, that reformulate reward modeling as a reasoning task. The authors introduce RM-R1, a family of generative reward models that produce interpretable reasoning traces and rubrics during preference judgments. Instead of relying on scalar scores or shallow generation, RM-R1 models leverage structured reasoning and reinforcement learning to improve both interpretability and performance across benchmarks.
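A sketch of what a Chain-of-Rubrics-style judging prompt could look like for a generative reward model: the judge first writes rubrics (or solves the problem itself for reasoning tasks) and only then states a preference. The wording and tags below are illustrative assumptions, not the released RM-R1 prompts.

```python
COR_JUDGE_TEMPLATE = """You are evaluating two candidate responses to a user query.

Query:
{query}

Response A:
{response_a}

Response B:
{response_b}

First, inside <rubric> tags, write the evaluation criteria that matter for this
query (for reasoning tasks, solve the problem yourself first inside <solution>).
Then, inside <eval> tags, compare both responses against your rubric.
Finally, output exactly one line: "Preferred: A" or "Preferred: B".
"""

def build_judge_prompt(query: str, response_a: str, response_b: str) -> str:
    """Fill the (assumed) Chain-of-Rubrics template for one preference pair."""
    return COR_JUDGE_TEMPLATE.format(query=query,
                                     response_a=response_a,
                                     response_b=response_b)

def parse_preference(judge_output: str) -> str:
    """Extract the final verdict from the judge's generated reasoning trace."""
    return "A" if "Preferred: A" in judge_output else "B"
```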
● RM-R1 adopts a two-stage training process: (1) distillation of reasoning traces from stronger models, and (2) reinforcement learning with verifiable rewards. The Chain-of-Rubrics (CoR) prompting framework guides the model to either solve reasoning problems or generate evaluation rubrics depending on the task type (reasoning or chat).
● On RewardBench, RM-Bench, and RMB, RM-R1 models achieve state-of-the-art or near-SOTA performance, outperforming models like GPT-4o and Llama3.1-405B by up to 13.8% despite using fewer parameters and less data.
● Ablation studies show that cold-start RL alone is insufficient; task-type classification and high-quality distillation are key. RM-R1's distilled warm-start training leads to more stable learning and longer, more accurate reasoning traces.
● RM-R1 also shows strong generalization across domains and better rubric quality than baseline methods, especially in sensitive contexts like safety and medical judgment. The authors open-sourced six RM-R1 models, training data, and code to support reproducibility. | [Paper](https://arxiv.org/abs/2505.02387) | | 8) Paper2Code Introduces PaperCoder, a multi-agent LLM framework that transforms ML papers into full code repositories without relying on pre-existing implementations.
● PaperCoder decomposes the code generation process into three stages: Planning (roadmap, architecture, file dependencies, config files), Analyzing (file-specific logic extraction), and Coding (dependency-aware file generation). Each step is handled by specialized LLM agents.
● It is evaluated using both the proposed Paper2Code benchmark (90 papers from ICML, NeurIPS, and ICLR 2024) and PaperBench Code-Dev. Results show PaperCoder outperforms ChatDev, MetaGPT, and naive baselines across reference-based, reference-free, and human evaluations.
● In human assessments by original paper authors, 77% chose PaperCoder as the best implementation; 85% said it helped them reproduce their work. On average, only 0.48% of code lines required changes for executability.
● A detailed ablation study shows consistent performance gains from each stage, especially logic design and file dependency ordering. PaperCoder, using the o3-mini-high backbone, notably outperforms other LLM variants. | [Paper](https://arxiv.org/abs/2504.17192) | | 9) ZeroSearch ZeroSearch is an RL framework that trains LLMs to develop search capabilities without using real search engines. It uses simulated LLM-generated documents with a curriculum-based degradation strategy and outperforms real-search methods like Search-R1 in both performance and cost, achieving better QA accuracy across multiple benchmarks. | [Paper](https://arxiv.org/abs/2505.04588), [Tweet](https://x.com/omarsar0/status/1920469148968362407) | | 10) Practical Efficiency of Muon for Pretraining Discusses how Muon, a simple second-order optimizer, outperforms AdamW in large-batch pretraining by expanding the compute-time Pareto frontier and maintaining better data efficiency. Combined with muP scaling and a novel telescoping algorithm for hyperparameter transfer, it enables faster training with minimal tuning overhead up to 4B parameter models. | [Paper](https://arxiv.org/abs/2505.02222) | ## Top ML Papers of the Week (April 28 - May 4) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Phi-4-Mini-Reasoning Microsoft released Phi-4-Mini-Reasoning to explore small reasoning language models for math. Highlights:
● Phi-4-Mini-Reasoning: The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly twice its size.
● Unlocking Reasoning: They use a systematic, multi-stage training pipeline to unlock strong reasoning capabilities in compact models, addressing the challenges posed by their limited capacity. The pipeline combines large-scale distillation, preference learning, and RL with verifiable rewards.
● Four-Stage Training Pipeline: The model is trained using (1) mid-training with large-scale long CoT data, (2) supervised fine-tuning on high-quality CoT data, (3) rollout-based Direct Preference Optimization (DPO), and (4) RL using verifiable reward signals.
● Math Performance: On MATH-500, Phi-4-Mini-Reasoning reaches 94.6%, surpassing DeepSeek-R1-Distill-Qwen-7B (91.4%) and DeepSeek-R1-Distill-Llama-8B (86.9%), despite being smaller.
● Verifiable Reward Reinforcement Learning: The final RL stage, tailored for small models, includes prompt filtering, oversampling for balanced training signals, and temperature annealing. This improves training stability and aligns exploration with evaluation conditions.
● Massive Synthetic Data Generation: The model is mid-trained on 10M CoT rollouts generated by DeepSeek-R1, filtered for correctness using math verifiers and GPT-4o-mini, and categorized by domain and difficulty to ensure broad generalization.
● Ablation Study: Each phase of the pipeline shows clear gains. Notably, fine-tuning and RL each deliver ~5–7 point improvements after mid-training and DPO, showing the value of the full pipeline over isolated techniques. | [Paper](https://arxiv.org/abs/2504.21233), [Tweet](https://x.com/omarsar0/status/1917954418173247909) | | 2) Building Production-Ready AI Agents with Scalable Long-Term Memory This paper proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, solving the fixed-context window limitation. Main highlights:
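As a rough illustration of the extract-and-update memory loop summarized in the highlights below, here is a minimal sketch; `llm_extract_facts` and `resolve` are hypothetical stand-ins for the system's LLM calls and tool-driven update decisions (the real system also has a graph-backed variant).

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    facts: list[str] = field(default_factory=list)

    def update(self, new_facts: list[str], resolve) -> None:
        """Stage 2: consolidate new facts against existing memory.

        `resolve(existing, fact)` is a hypothetical helper returning one of
        "ADD", "UPDATE:<old>", or "SKIP" (duplicate/conflict handling).
        """
        for fact in new_facts:
            decision = resolve(self.facts, fact)
            if decision == "ADD":
                self.facts.append(fact)
            elif decision.startswith("UPDATE:"):
                old = decision.split(":", 1)[1]
                self.facts = [fact if f == old else f for f in self.facts]
            # "SKIP": redundant or conflicting fact is rejected, do nothing

def ingest_turn(store: MemoryStore, turn: str, llm_extract_facts, resolve) -> None:
    """Stage 1 (extraction) followed by Stage 2 (update)."""
    store.update(llm_extract_facts(turn), resolve)

# Toy usage with trivial stand-ins for the LLM calls.
store = MemoryStore()
ingest_turn(store, "I moved from Berlin to Lisbon last month.",
            llm_extract_facts=lambda t: ["User lives in Lisbon"],
            resolve=lambda existing, fact: "ADD" if fact not in existing else "SKIP")
print(store.facts)  # ['User lives in Lisbon']
```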
● The solution introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships. Both aim to extract, consolidate, and retrieve salient facts over time efficiently.
● Mem0: Uses a two-stage architecture (extraction & update) to maintain salient conversational memories. It detects redundant or conflicting information and manages updates using tool-calls, resulting in a lightweight, highly responsive memory store (7K tokens per conversation).
● Mem0g: By structuring memory as a knowledge graph of entities and relationships, Mem0g improves performance in tasks needing temporal and relational reasoning (e.g., event ordering, preference tracking) while maintaining reasonable latency and memory cost (14K tokens/convo).
● Benchmarking on LOCOMO: Both systems were evaluated against six memory system baselines (e.g., A-Mem, OpenAI, Zep, LangMem, RAG). Mem0g achieves the best overall LLM-as-a-Judge (J) score of 68.44%, outperforming all RAG and memory baselines by 7–28% in J and reducing p95 latency by 91% over full-context methods.
● Latency and efficiency: Mem0 achieves the lowest search and total latencies (p95 = 1.44s), and Mem0g still outperforms other graph-based or RAG systems by large margins in speed and efficiency. Great for real-time deployments.
● Use-case strengths: Mem0 and Mem0g offer a scalable memory architecture for long-term LLM agents to improve factual recall, reasoning depth, and efficiency, making them well-suited for production deployments. | [Paper](https://arxiv.org/abs/2504.19413), [Tweet](https://x.com/omarsar0/status/1917247776221700134) | | 3) UniversalRAG UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to single modalities or corpora. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video). Contributions from the paper:
● Modality-aware routing: To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query.
● Granularity-aware retrieval: Each modality is broken into granularity levels (e.g., paragraphs vs. documents for text, clips vs. full-length videos). This allows queries to retrieve content that matches their complexity -- factual queries use short segments while complex reasoning accesses long-form data.
● Flexible routing: It supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers. Trained routers perform better on in-domain data, while GPT-4o generalizes better to out-of-domain tasks. An ensemble router combines both for robust performance.
● Performance: UniversalRAG outperforms modality-specific and unified RAG baselines across 8 benchmarks spanning text (e.g., MMLU, SQuAD), image (WebQA), and video (LVBench, VideoRAG). With T5-Large, it achieves the highest average score across modalities.
● Case study: In WebQA, UniversalRAG correctly routes a visual query to the image corpus (retrieving an actual photo of the event), while TextRAG and VideoRAG fail. Similarly, on HotpotQA and LVBench, it chooses the right granularity, retrieving documents or short clips. Overall, this is a great paper showing the importance of considering modality and granularity in a RAG system. | [Paper](https://arxiv.org/abs/2504.20734), [Tweet](https://x.com/omarsar0/status/1917637837295608180) | | 4) DeepSeek-Prover-V2 DeepSeek-Prover-V2 is an LLM (671B) that significantly advances formal theorem proving in Lean 4. The model is built through a novel cold-start training pipeline that combines informal chain-of-thought reasoning with formal subgoal decomposition, enhanced through reinforcement learning. It surpasses prior state-of-the-art on multiple theorem-proving benchmarks. Key highlights:
● Cold-start data via recursive decomposition: The authors prompt DeepSeek-V3 to generate natural-language proof sketches, decompose them into subgoals, and formalize these steps in Lean with sorry placeholders. A 7B prover model then recursively fills in the subgoal proofs, enabling efficient construction of complete formal proofs and training data.
● Curriculum learning + RL: A subgoal-based curriculum trains the model on increasingly complex problems. Reinforcement learning with a consistency reward is used to enforce alignment between proof structure and CoT decomposition, improving performance on complex tasks.
● Dual proof generation modes: The model is trained in two modes, non-CoT (efficient, minimal proofs) and CoT (high-precision, interpretable). The CoT mode yields significantly better performance, particularly on hard problems.
● Benchmark results: The model surpasses the prior state of the art on multiple theorem-proving benchmarks (detailed numbers are reported in the paper). | [Paper](https://arxiv.org/abs/2504.21801), [Tweet](https://x.com/zhs05232838/status/1917600755936018715) | | 5) Kimi-Audio Kimi-Audio is a new open-source audio foundation model built for universal audio understanding, generation, and speech conversation. The model architecture uses a hybrid of discrete semantic audio tokens and continuous Whisper-derived acoustic features. It is initialized from a pre-trained LLM and trained on 13M+ hours of audio, spanning speech, sound, and music. It also supports a streaming detokenizer with chunk-wise decoding and a novel look-ahead mechanism for smoother audio generation. Extensive benchmarking shows that Kimi-Audio outperforms other audio LLMs across multiple modalities and tasks. Key highlights:
● Architecture: Kimi-Audio uses a 12.5Hz semantic tokenizer and an LLM with dual heads (text + audio), processing hybrid input (discrete + continuous). The audio detokenizer employs a flow-matching upsampler with BigVGAN vocoder for real-time speech synthesis.
● Massive Training Corpus: Pretrained on 13M+ hours of multilingual, multimodal audio. A rigorous preprocessing pipeline adds speech enhancement, diarization, and transcription using Whisper and Paraformer-Zh. Fine-tuning uses 300K+ hours from 30+ open datasets.
● Multitask Training: Training spans audio-only, text-only, ASR, TTS, and three audio-text interleaving strategies. Fine-tuning is instruction-based, with both audio/text instructions injected via zero-shot TTS.
● Evaluation: On ASR (e.g., LibriSpeech test-clean: 1.28 WER), audio understanding (CochlScene: 80.99), and audio-to-text chat (OpenAudioBench avg: 69.8), Kimi-Audio sets new SOTA results, beating Qwen2.5-Omni and Baichuan-Audio across the board. | [Paper](https://github.com/MoonshotAI/Kimi-Audio/blob/master/assets/kimia_report.pdf), [Tweet](https://x.com/Kimi_Moonshot/status/1915807071960007115) [Model](https://github.com/MoonshotAI/Kimi-Audio) | | 6) MiMo-7B Xiaomi releases MiMo-7B, a new language model for reasoning tasks. MiMo-7B is explicitly designed for advanced reasoning across math and code. Highlights:
● MiMo-7B: MiMo-7B narrows the capability gap with larger 32B-class models through careful pretraining & posttraining. MiMo-7B-Base is trained from scratch on 25T tokens, with a 3-stage mixture skewed toward mathematics and code (70% in stage 2).
● Pre-Training: The team improves HTML and PDF extraction to better preserve STEM data, leverages LLMs to generate diverse synthetic reasoning content, and adds a Multi-Token Prediction (MTP) objective that boosts both quality and inference speed.
● Base Performance: MiMo-7B-Base outperforms other 7B–9B models like Qwen2.5, Gemma-2, and Llama-3.1 across BBH (+5 pts), AIME24 (+22.8 pts), and LiveCodeBench (+27.9 pts). On BBH and LiveCodeBench, it even beats larger models on reasoning-heavy tasks.
● RL: MiMo-7B-RL is trained with a test difficulty–driven reward function and easy-data resampling to tackle sparse-reward issues and instabilities. In some cases, it surpasses o1-mini on math & code. RL from the SFT model reaches higher ceilings than RL-Zero from the base.
● Efficient infrastructure: A Seamless Rollout Engine accelerates RL by 2.29× and validation by 1.96× using continuous rollout, async reward computation, and early termination. MTP layers enable fast speculative decoding, with 90%+ acceptance rates in inference. | [Paper](https://github.com/XiaomiMiMo/MiMo/blob/main/MiMo-7B-Technical-Report.pdf), [Tweet](https://x.com/omarsar0/status/1917582720341008814) | | 7) Advances and Challenges in Foundation Agents A new survey frames intelligent agents with a modular, brain-inspired architecture that integrates ideas from cognitive science, neuroscience, and computational research. Key topics covered:
● Human Brain and LLM Agents: Helps clarify what differentiates LLM agents from human/brain cognition, and what inspiration we can draw from the way humans learn and operate.
● Definitions: Provides a nice, detailed, and formal definition of what makes up an AI agent.
● Reasoning: It has a detailed section on the core components of intelligent agents. There is a deep dive into reasoning, which is one of the key development areas of AI agents and what unlocks things like planning, multi-turn tooling, backtracking, and much more.
● Memory: Agent memory is a challenging area of building agentic systems, but there is already a lot of good literature out there from which to get inspiration.
● Action Systems: You can already build very complex agentic systems today, but the next frontier is agents that take actions and make decisions in the real world. We need better tooling, better training algorithms, and robust operation in different action spaces.
● Self-Evolving Agents: For now, building effective agentic systems requires human effort and careful optimization tricks. However, one of the bigger opportunities in the field is to build AI that can itself build powerful and self-improving AI systems. | [Paper](https://arxiv.org/abs/2504.01990), [Tweet](https://x.com/omarsar0/status/1916542394746421333) | | 8) MAGI MAGI is a multi-agent system designed to automate structured psychiatric interviews by operationalizing the MINI (Mini International Neuropsychiatric Interview) protocol. It involves 4 specialized agents: navigation, question generation, judgment, and diagnosis. Other highlights:
● Multi-Agent Clinical Workflow: MAGI is built with a navigation agent (interview flow control), a question agent (dynamic, empathetic probing), a judgment agent (response validation), and a diagnosis agent using Psychometric CoT to trace diagnoses explicitly to MINI/DSM-5 criteria.
● Explainable Reasoning (PsyCoT): Instead of treating diagnoses as opaque outputs, PsyCoT decomposes psychiatric reasoning into symptom anchoring, syndromal validation, and evidence binding. This helps with auditability for each diagnostic conclusion. CoT put to great use.
● Results: Evaluated on 1,002 real-world interviews, MAGI outperforms baselines (Direct prompting, Role-play, Knowledge-enhanced, and MINI-simulated LLMs) across relevance, accuracy, completeness, and guidance.
● Strong Clinical Agreement: Diagnostic evaluations show PsyCoT consistently improves F1 scores, accuracy, and Cohen’s κ across disorders like depression, generalized anxiety, social anxiety, and suicide risk, reaching clinical-grade reliability (κ > 0.8) in high-risk tasks. | [Paper](https://arxiv.org/abs/2504.18260), [Tweet](https://x.com/omarsar0/status/1916862752410554423) | | 9) A Survey of Efficient LLM Inference Serving This survey reviews recent advancements in optimizing LLM inference, addressing memory and computational bottlenecks. It covers instance-level techniques (like model placement and request scheduling), cluster-level strategies (like GPU deployment and load balancing), and emerging scenario-specific solutions, concluding with future research directions. | [Paper](https://arxiv.org/abs/2504.19720) | | 10) LLM for Engineering This work finds that when RL is used, a 7B parameter model outperforms both SoTA foundation models and human experts at high-powered rocketry design. | [Paper](https://arxiv.org/abs/2504.19394) | ## Top ML Papers of the Week (April 21 - April 27) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Does RL Incentivize Reasoning in LLMs Beyond the Base Model? This paper revisits a key assumption in recent LLM development: that Reinforcement Learning with Verifiable Rewards (RLVR) helps models acquire genuinely new reasoning capabilities. By analyzing models across tasks (math, code, vision) using pass@k metrics (with large k), the authors find that RLVR improves sample efficiency but does not expand reasoning capacity beyond the base model.
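For reference, pass@k in this kind of analysis is usually computed with the standard unbiased estimator pass@k = E[1 - C(n-c, k) / C(n, k)], where n samples are drawn per problem and c of them are correct. A small sketch of that estimator (the commonly used formula, not code from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn (without replacement) from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples per problem, 8 of them correct.
print(round(pass_at_k(256, 8, 1), 3))    # ~ pass@1
print(round(pass_at_k(256, 8, 256), 3))  # pass@256 = 1.0 when any sample is correct
```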
● Key insight: RLVR-trained models do better at low *k* (e.g., pass@1), but as *k* increases (up to 256 or more), base models eventually match or outperform them. This suggests RLVR doesn’t generate fundamentally new reasoning paths but just increases the likelihood of sampling already-existing correct ones.
● Reasoning already in the base: RLVR models' successful CoTs are shown to be present within the base model's sampling distribution. Perplexity analyses confirm that RL outputs are often high-probability continuations for the base model.
● Efficiency vs. exploration: RLVR narrows the model’s exploration space, improving efficiency but shrinking its coverage of diverse reasoning paths, thereby reducing overall problem-solving reach at scale.
● Distillation helps more: Unlike RLVR, distillation from a stronger teacher model (e.g., DeepSeek-R1) introduces genuinely new reasoning patterns, expanding the model’s capabilities.
● Algorithmic limits: Across PPO, GRPO, Reinforce++, etc., RL algorithms offer similar sample-efficiency improvements, but none closes the gap to the base model’s pass@256—highlighting the limits of current RL strategies. | [Paper](https://arxiv.org/abs/2504.13837), [Tweet](https://x.com/DaveShapi/status/1915408405201629684) | | 2) BitNet b1.58 2B4T This work introduces BitNet b1.58 2B4T, the first open-source, natively trained 1-bit LLM at the 2B parameter scale, achieving strong performance while being extremely efficient. The model uses a custom ternary quantization scheme (1.58 bits per weight), enabling dramatic reductions in memory (0.4 GB), energy (0.028J/token), and latency (29ms), while still competing with state-of-the-art full-precision models across diverse benchmarks.
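As a rough illustration of the absmean ternary (1.58-bit) quantization described in the bullets below, here is a NumPy sketch under those assumptions (weights scaled by their mean absolute value, then rounded to {-1, 0, +1}); this is not the released kernels.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight matrix to {-1, 0, +1} using the absmean rule:
    scale by the mean absolute value, then round and clip to [-1, 1]."""
    gamma = np.abs(w).mean() + eps              # per-tensor scale
    w_q = np.clip(np.round(w / gamma), -1, 1)   # ternary weights (~1.58 bits)
    return w_q.astype(np.int8), gamma           # dequantize as w_q * gamma

w = np.random.randn(4, 4).astype(np.float32)
w_q, gamma = absmean_ternary_quantize(w)
w_hat = w_q.astype(np.float32) * gamma          # approximate reconstruction
print(w_q)
print(float(np.mean(np.abs(w - w_hat))))        # mean quantization error
```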
● New Pareto frontier in efficiency-performance: Trained from scratch on 4T tokens, BitNet b1.58 2B4T outperforms or matches open full-precision models (e.g., Qwen2.5 1.5B, MiniCPM 2B) on tasks like ARC-Challenge, PIQA, WinoGrande, and GSM8K. It achieves a 54.19% average across 16 benchmarks, comparable to Qwen2.5-1.5B’s 55.23%, but with ~6.5× lower memory and 10× lower energy usage.
● Outperforms quantized baselines: Against INT4 post-training quantized Qwen2.5 models (GPTQ/AWQ), BitNet is both smaller and more accurate, showing the advantage of native 1-bit training over PTQ approaches.
● Architectural & training innovations: It replaces standard linear layers with BitLinear layers using absmean ternary quantization and 8-bit activations, combines RoPE embeddings, squared ReLU activation, and bias-free layers. Training includes cosine LR and weight decay schedules, plus supervised fine-tuning and Direct Preference Optimization (DPO) instead of full RLHF.
● Best-in-class among 1-bit LLMs: When compared to other 1-bit models like OLMo-Bitnet (1B) and post-quantized Falcon3/Llama3 (7B–8B), BitNet b1.58 2B4T is +10 pts stronger on average, establishing a new benchmark for ultra-efficient LLMs. The authors also release optimized CUDA kernels for GPU and a C++ inference library for CPU, enabling practical deployment of 1-bit LLMs on diverse hardware. BitNet b1.58 2B4T demonstrates that extreme quantization does not mean compromised capability, and it opens the door to the broader adoption of LLMs in resource-constrained environments. | [Paper](https://arxiv.org/abs/2504.12285) | | 3) UI-TARS UI-TARS introduces a powerful, end-to-end native GUI agent that operates purely from visual screenshots, performing human-like keyboard and mouse interactions across platforms. Unlike existing modular agent frameworks that rely on prompt engineering and external scripts, UI-TARS integrates perception, action, reasoning, and memory directly into its architecture, achieving strong generalization and adaptability in dynamic real-world settings. Key contributions:
● Enhanced GUI Perception: UI-TARS is trained on a large-scale, richly annotated dataset of screenshots with metadata, enabling dense captioning, state transition understanding, and precise element description. It excels in perception benchmarks like VisualWebBench, scoring 82.8, outperforming GPT-4o’s score.
● Unified Action Modeling and Grounding: UI-TARS standardizes actions across platforms into a shared action space and learns from large-scale multi-step action traces. It surpasses baselines in grounding tasks with 38.1 on ScreenSpot Pro, the new SOTA.
● System-2 Reasoning via “Thoughts”: Inspired by ReAct-style frameworks, UI-TARS generates internal reasoning steps (thoughts) before actions. These thoughts reflect patterns like task decomposition, reflection, and long-term consistency, significantly improving performance in complex scenarios. For example, in OSWorld, UI-TARS-72B-DPO scores 24.6 with a 50-step budget, outperforming Claude’s score.
● Iterative Self-Improvement with Reflective Learning: UI-TARS continuously refines itself through online trace collection and reflection tuning using error correction and post-error adaptation data. This allows it to recover from mistakes and adapt with minimal human oversight. Overall, UI-TARS marks a significant step forward in GUI automation, setting new benchmarks across more than 10 datasets and outperforming top commercial agents like GPT-4o and Claude. Its open-source release aims to drive further innovation in native agent development. | [Paper](https://arxiv.org/abs/2501.12326), [Blog](https://seed-tars.com/1.5/) | | 4) Describe Anything Introduces DAM, a model that generates fine-grained, region-specific captions in both images and videos. The authors address key limitations in prior vision-language models—namely, the inability to preserve local detail and the lack of suitable datasets and benchmarks for detailed localized captioning (DLC). Key contributions:
● DAM (Describe Anything Model) uses two main innovations to capture both fine regional detail and global scene context: a focal prompt that provides high-resolution encoding of user-specified regions, and a localized vision backbone that uses gated cross-attention to integrate context from the entire image. This enables DAM to generate multi-granular, accurate descriptions, especially for small or occluded regions.
● DLC-SDP (Semi-supervised Data Pipeline) tackles data scarcity by expanding segmentation datasets with VLM-generated detailed captions, followed by self-training on web images. This produces high-quality, diverse training data, enabling DAM to outperform API-only baselines like GPT-4o across several benchmarks.
● DLC-Bench is a reference-free benchmark that scores models on their ability to accurately include or exclude region-specific details using LLM judges. It provides a more reliable evaluation than traditional caption-matching metrics, which often penalize models for valid but unmatched details.
● Performance: DAM sets a new state-of-the-art on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning tasks in both images and videos. It outperforms GPT-4o, Claude 3.7, and other top VLMs in both zero-shot and in-domain evaluations, achieving up to 33.4% improvement over prior models on detailed image captioning and 19.8% on video captioning. | [Paper](https://arxiv.org/abs/2504.16072) | | 5) UXAgent Introduces a novel framework, UXAgent, for simulating large-scale usability testing using LLM-driven agents. The system empowers UX researchers to test and iterate web design and study protocols before engaging real users. This is achieved through the orchestration of simulated agents with diverse personas interacting in real web environments, providing both behavioral and reasoning data. Key highlights:
● LLM-Powered Simulation with Personas: UXAgent begins with a Persona Generator that can produce thousands of demographically diverse simulated users based on custom distributions. Each persona is fed into an LLM Agent that embodies user intent and interacts with the website via a Universal Browser Connector—a module capable of interpreting and manipulating real HTML structures.
● Dual-Loop Reasoning Architecture: At the heart of UXAgent is a dual-process agent architecture inspired by cognitive psychology: a Fast Loop for low-latency actions and a Slow Loop for deep reasoning. This design mimics System 1 and System 2 thinking and allows agents to act responsively while maintaining coherent high-level plans and reflections.
● Rich Memory Stream: All observations, actions, plans, reflections, and spontaneous thoughts (“wonders”) are stored in a Memory Stream. These memories are dynamically prioritized for retrieval using a weighted scoring system based on importance, recency, and relevance, tailored separately for fast and slow modules.
● Replay and Interview Interfaces: UX researchers can review simulated sessions via a Simulation Replay Interface and conduct natural language conversations with agents using an Agent Interview Interface. This supports qualitative analysis, such as asking agents about their decisions or presenting mockups for feedback.
● Empirical Evaluation: A case study involving 60 LLM agent simulations on a shopping platform (WebArena) showed that researchers were able to detect usability study flaws and gather early insights. A follow-up user study with five UX professionals found the system helpful for iterating study design, despite some concerns over realism and data noise. Particularly appreciated was the ability to converse with agents and gather qualitative insights that would be infeasible in traditional pilots.
● Future Implications: The authors position LLM agents not as replacements for real participants, but as early-stage collaborators in the design process, reducing the cost and risk of flawed studies. They also discuss extensions to multimodal settings, desktop or mobile interfaces, and broader agentic tasks such as digital twins or simulated A/B testing. | [Paper](https://arxiv.org/abs/2504.09407) | | 6) Test-Time Reinforcement Learning Test-Time Reinforcement Learning (TTRL) is a method that allows LLMs to improve themselves during inference without ground-truth labels. Instead of relying on labeled datasets, TTRL uses majority voting over multiple model generations to estimate pseudo-rewards, enabling reinforcement learning (RL) on unlabeled test data. The method integrates Test-Time Scaling (TTS) and Test-Time Training (TTT) strategies, letting models adapt dynamically to new and challenging inputs. Key highlights:
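As a concrete illustration of the majority-vote pseudo-reward summarized in the highlights below, here is a minimal sketch; the answer normalization is a hypothetical helper, not the paper's code.

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Derive a pseudo-label by majority vote over sampled answers and
    reward each rollout with 1.0 if it agrees with the consensus."""
    normalized = [a.strip().lower() for a in answers]   # hypothetical normalizer
    pseudo_label, _ = Counter(normalized).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in normalized]
    return pseudo_label, rewards

# Toy usage: 8 sampled answers to one test question, no ground truth needed.
label, rewards = majority_vote_rewards(["12", "12", "15", "12", "12", "9", "12", "15"])
print(label, rewards)  # '12' gets reward 1.0; minority answers get 0.0
```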
● Majority Voting as Reward: TTRL generates multiple candidate outputs for a query and uses majority voting to derive a pseudo-label. Rewards are assigned based on agreement with the consensus answer.
● Significant Performance Gains: Applying TTRL to Qwen2.5-Math-7B leads to a +159% improvement on AIME 2024 and +84% average gains across AIME, AMC, and MATH-500 benchmarks, without using any labeled training data.
● Self-Evolution Beyond Supervision: Remarkably, TTRL surpasses the performance ceiling of its own majority-vote supervision (Maj@N) and approaches the performance of models trained with full label leakage, indicating efficient and stable unsupervised RL.
● Generalization and Robustness: TTRL generalizes well across tasks, maintains effectiveness even under label estimation noise, and is compatible with different RL algorithms like PPO and GRPO.
● Limitations: TTRL may fail when the base model lacks sufficient prior knowledge about the domain or when hyperparameters (like batch size and temperature) are poorly tuned. | [Paper](https://www.arxiv.org/abs/2504.16084) | | 7) Discovering Values in Real-World Language Model Interactions This paper presents the first large-scale empirical analysis of values exhibited by a deployed AI assistant, Claude 3 and 3.5 models, using over 300,000 real-world conversations. The authors develop a bottom-up, privacy-preserving framework to extract, classify, and analyze AI-expressed normative considerations (“values”) and show how they vary across tasks, user values, and conversational contexts.
● The authors identify 3,307 unique AI values, which are organized into a five-domain taxonomy: Practical, Epistemic, Social, Protective, and Personal. Practical and epistemic values dominate, often aligning with Claude’s training goals around being helpful, harmless, and honest.
● Claude’s most common values, such as helpfulness (23.4%), professionalism, transparency, and clarity, are context-invariant and reflect its role as a service-oriented assistant. In contrast, human values like authenticity and efficiency are more varied.
● Many values are context-specific. For example, healthy boundaries arise in relationship advice, historical accuracy in controversial event discussions, and human agency in AI governance contexts.
● Claude tends to mirror human values in supportive contexts (20.1% mirroring rate), but expresses opposing values during resistance, especially in cases involving unethical or policy-violating requests (e.g., resisting “moral nihilism” with “ethical integrity”).
● Explicit value expression (e.g., “I value transparency”) occurs more often in moments of resistance or reframing, particularly around epistemic and ethical principles like intellectual honesty and harm prevention. This suggests that AI values become most visible when the system is challenged.
● Across Claude variants, 3 Opus expresses more emotionally nuanced and ethically grounded values (e.g., academic rigor, emotional authenticity) and shows a stronger inclination for both support and resistance compared to 3.5/3.7 Sonnet. | [Paper](https://assets.anthropic.com/m/18d20cca3cde3503/original/Values-in-the-Wild-Paper.pdf), [Tweet](https://x.com/AnthropicAI/status/1914333220067213529) | | 8) Evaluate the Goal-Directedness of LLMs Introduces a new framework to assess whether LLMs use their capabilities effectively toward achieving given goals. The study finds that even top models like GPT-4o and Claude 3.7 fall short of full goal-directedness, particularly in information-gathering and combined tasks, despite performing well in isolated subtasks. | [Paper](https://arxiv.org/abs/2504.11844), [Tweet](https://x.com/tom4everitt/status/1912806499862139275), [GitHub](https://github.com/Crista23/goal_directedness_llms) | | 9) General-Reasoner General-Reasoner is a reinforcement learning approach that boosts LLM reasoning across diverse domains by using a 230K-question dataset and a model-based verifier trained to understand semantics beyond exact matches. It outperforms strong baselines like SimpleRL and Qwen2.5 on both general reasoning (MMLU-Pro, GPQA, SuperGPQA) and math tasks (MATH-500, GSM8K), showing over 10-point gains without sacrificing mathematical capability. | [Paper](https://github.com/TIGER-AI-Lab/General-Reasoner/blob/main/General_Reasoner.pdf), [Tweet](https://x.com/WenhuChen/status/1912242238110789671) | | 10) Tiny Reasoning Models Tina is a family of 1.5B parameter reasoning models trained using LoRA-based reinforcement learning (RL) to achieve high reasoning accuracy at very low cost. It outperforms or matches full fine-tuned models on reasoning tasks like AIME and MATH with only ~$9 post-training cost, demonstrating that efficient reasoning can be instilled via minimal updates to a tiny model. | [Paper](https://arxiv.org/abs/2504.15777) | ## Top ML Papers of the Week (April 14 - April 20) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) GUI-R1 Researchers from the National University of Singapore and the Chinese Academy of Sciences introduce GUI-R1, a reinforcement learning (RL) framework aimed at improving graphical user interface (GUI) agents through unified action-space modeling. Key insights include:
● Reinforcement Fine-Tuning (RFT) over Supervised Fine-Tuning (SFT) – GUI-R1 utilizes RFT inspired by methods such as DeepSeek-R1, significantly reducing training data requirements. It uses only 3K carefully curated examples versus millions used by previous models.
● Unified Action Space and Reward Modeling – The authors introduce a unified action space that covers actions across different platforms (Windows, Linux, MacOS, Android, and Web). This enables consistent reward signals for evaluating GUI actions, enhancing the model’s adaptability and generalization.
● Superior Performance with Minimal Data – GUI-R1 outperforms state-of-the-art methods like OS-Atlas using merely 0.02% of the training data (3K vs. 13M). Evaluations across eight benchmarks spanning mobile, desktop, and web platforms show significant improvements in grounding, low-level, and high-level GUI task capabilities.
● Efficient Training and Strong Generalization – By leveraging policy optimization algorithms like Group Relative Policy Optimization (GRPO), GUI-R1 quickly converges to high performance, demonstrating robustness and efficiency even in resource-constrained scenarios. | [Paper](https://arxiv.org/abs/2504.10458) | | 2) Scaling Reasoning in Diffusion LLMs via RL Proposes d1, a two‑stage recipe that equips masked diffusion LLMs with strong step‑by‑step reasoning.
● Two‑stage pipeline (SFT → diffu‑GRPO) – d1 first applies supervised fine‑tuning on the 1k‑example s1K dataset and then runs task‑specific RL with the new diffu‑GRPO objective, yielding larger gains than either stage alone.
● diffu‑GRPO: RL for masked dLLMs – Extends GRPO to diffusion LLMs via (i) a mean‑field sequence‑log‑prob approximation and (ii) a one‑step per‑token log‑prob estimator with random prompt masking, enabling many gradient updates from a single generation.
● Consistent gains on four reasoning benchmarks – On GSM8K, MATH500, Countdown, and Sudoku, diffu‑GRPO beats SFT, and the full d1‑LLaDA variant attains the best scores (e.g., 81.1 % GSM8K & 38.6 % MATH500 at 256 tokens, +5–12 pp over baseline).
● Competitive among 7‑8 B models – d1‑LLaDA outperforms DeepSeek‑7B, Mistral‑7B and Llama‑3‑8B on GSM8K and ranks second on MATH500 in the same size class.
● Longer decoding unlocks “aha moments” – At 512‑token generation, the model shows self‑verification/backtracking; effective‑token usage grows smoothly, echoing test‑time compute scaling trends.
● Random masking speeds RL – Ablations show that random prompt masking during diffu‑GRPO accelerates convergence and boosts correctness relative to fixed masking, with fewer online generations needed. | [Paper](https://arxiv.org/abs/2504.12216) | | 3) Enhancing Non-Reasoning Models with Reasoning Models Researchers explore how to distill reasoning-intensive outputs (answers and explanations) from top-tier LLMs into more lightweight models that don’t explicitly reason step by step. By fine-tuning smaller models on the high-quality final answers (and optionally summarized thinking traces) from advanced reasoning models, they demonstrate consistent performance boosts across multiple benchmarks.
● Test-time scaling vs. knowledge distillation – While large models like DeepSeek-R1 and OpenAI-o1 can allocate more compute to generate better reasoning traces, this paper focuses on systematically transferring those rich final answers (and possibly a summarized version of the reasoning steps) to more compact models.
● Data curation – The authors construct a 1.3M-instance dataset by pulling prompts from multiple open-source repositories (including Infinity Instruct, CodeContests, FLAN, etc.) and generating final answers plus detailed reasoning from DeepSeek-R1.
● Three fine-tuning strategies – (1) Use the original baseline answers from existing open-source sets, (2) fine-tune on only the final answer portion of a reasoning model, and (3) combine a summarized chain-of-thought with the final answer. Models trained on the second strategy excelled at math/coding tasks, while the third approach proved better for more conversational or alignment-oriented tasks.
● Empirical gains – Fine-tuning Qwen2.5-32B on the reasoning model’s final answers led to notable improvements on GSM8K (92.2%) and HumanEval (90.9%). A think-summarization approach boosted a different set of benchmarks (GPQA and chat-based tests). However, weaving in the “thinking trace” sometimes caused slight drops in instruction strictness (IFEval).
● Trade-offs and future work – Distilling advanced reasoning data consistently helps smaller models, but deciding how much of the reasoning trace to include is domain-dependent. The authors suggest that more refined ways of seamlessly blending reasoning steps into final answers (e.g., specialized prompts or partial merges) could further improve performance and avoid alignment regressions. | [Paper](https://arxiv.org/abs/2504.09639) | | 4) AgentA/B AgentA/B is a fully automated A/B testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors in actual web environments, enabling faster, cheaper, and risk-free UX evaluations, even on real websites like Amazon. A minimal sketch of the observe-decide-act loop such agents run follows, then the key insights.
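The sketch below shows the kind of observe-decide-act loop such an agent could run against a live page; the LLM client, persona prompt, and JSON action schema are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an LLM-driven browsing loop in the spirit of AgentA/B.
# `llm` is any text-in/text-out chat client; persona and action schema are assumptions.
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

def observe(driver, limit=40):
    """Serialize a simplified view of the live DOM into JSON for the LLM."""
    elements = driver.find_elements(By.CSS_SELECTOR, "a, button, input")
    return json.dumps([
        {"index": i, "tag": el.tag_name, "text": el.text[:80]}
        for i, el in enumerate(elements[:limit])
    ])

def decide(llm, persona, observation):
    """Ask the LLM for the next action, returned as JSON."""
    prompt = (
        f"You are simulating this shopper: {persona}\n"
        f"Interactive elements:\n{observation}\n"
        'Reply with JSON only: {"action": "click" | "stop", "index": <int>}'
    )
    return json.loads(llm(prompt))

def run_episode(llm, persona, url, max_steps=10):
    driver = webdriver.Chrome()
    driver.get(url)
    trace = []
    for _ in range(max_steps):
        action = decide(llm, persona, observe(driver))
        trace.append(action)
        if action["action"] == "stop":
            break
        driver.find_elements(By.CSS_SELECTOR, "a, button, input")[action["index"]].click()
    driver.quit()
    return trace  # per-agent traces feed the post-analysis stage
```

Key Insights: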
● Modular agent simulation pipeline – Four components—agent generation, condition prep, interaction loop, and post-analysis—allow plug-and-play simulations on live webpages using diverse LLM personas.
● Real-world fidelity – The system parses the live DOM into JSON, enabling structured interaction loops (search, filter, click, purchase) executed via LLM reasoning + Selenium.
● Behavioral realism – Simulated agents show more goal-directed but otherwise comparable interaction patterns relative to 1M real Amazon users (e.g., shorter sessions but similar purchase rates).
● Design sensitivity – A/B test comparing full vs. reduced filter panels revealed that agents in the treatment condition clicked more, used filters more often, and purchased more.
● Inclusive prototyping – Agents can represent hard-to-reach populations (e.g., low-tech users), making early-stage UX testing more inclusive and risk-free.
● Notable results – AgentA/B shows how LLM agents can augment, not replace, traditional A/B testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic. | [Paper](https://arxiv.org/abs/2504.09723) | | 5) Reasoning Models Can Be Effective Without Thinking This paper challenges the necessity of long chain-of-thought (CoT) reasoning in LLMs by introducing a simple prompting method called NoThinking, which bypasses explicit "thinking" steps. Surprisingly, NoThinking performs on par with or better than explicit Thinking under similar or even lower compute budgets, especially when paired with parallel decoding and best-of-N selection. A minimal sketch of the prompt-prefill idea follows, then the key insights.
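The sketch below illustrates one way to implement the NoThinking prefill for a reasoning model that wraps its chain of thought in think tags; the tag format and dummy text are assumptions, not a quote of the paper's exact prompt.

```python
# Minimal sketch of the NoThinking-style prompt, assuming a reasoning model that
# emits its chain of thought inside <think> tags. Dummy text is an assumption.
def build_prompt(question: str, no_thinking: bool) -> str:
    header = f"Solve the following problem.\n{question}\n"
    if no_thinking:
        # Pre-fill the thinking block with a "finished" thought so the model
        # jumps straight to the final answer.
        return header + "<think>\nOkay, I have finished thinking.\n</think>\n"
    # Standard Thinking mode: leave the block open for long chain-of-thought.
    return header + "<think>\n"
```

Key Insights: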
● NoThinking prepends a dummy “Thinking” block and jumps straight to final answers.
● Despite skipping structured reasoning, it outperforms Thinking in pass@k (1–64) on many benchmarks, especially under token constraints.
● With parallel scaling, NoThinking achieves higher pass@1 accuracy than Thinking while using 4× fewer tokens and up to 9× lower latency.
● Tasks evaluated: competitive math (AIME24/25, AMC23, OlympiadBench), coding (LiveCodeBench), and formal theorem proving (MiniF2F, ProofNet).
● NoThinking is shown to provide superior accuracy–latency tradeoffs and generalizes across diverse tasks. Results:
● Low-budget wins: On AMC23 (700 tokens), NoThinking achieves 51.3% vs. 28.9% (Thinking).
● Better scaling: As k increases, NoThinking consistently surpasses Thinking.
● Efficiency frontier: Across benchmarks, NoThinking dominates the accuracy–cost Pareto frontier.
● Parallel wins: With simple confidence-based or majority vote strategies, NoThinking + best-of-N beats full Thinking on pass@1 with significantly less latency. | [Paper](https://www.arxiv.org/abs/2504.09858) | | 6) SocioVerse Researchers from Fudan University and collaborators propose SocioVerse, a large-scale world model for social simulation using LLM agents aligned with real-world user behavior. Key ideas include:
● Four-fold alignment framework – SocioVerse tackles major challenges in aligning simulated environments with reality across four dimensions:
● Three representative simulations – SocioVerse showcases its generalizability through:
● Impressive empirical accuracy –
● Ablation insights – Removing prior demographic distribution and user knowledge severely degrades election prediction accuracy (Acc drops from 0.80 → 0.60), highlighting the value of realistic population modeling.
● Toward trustworthy virtual societies – SocioVerse not only standardizes scalable social simulations but also provides a sandbox for testing sociopolitical hypotheses (e.g., fairness, policy change), bridging AI agent systems with traditional social science. | [Paper](https://arxiv.org/abs/2504.10157) | | 7) DocAgent Researchers from Meta AI present DocAgent, a tool‑integrated, dependency‑aware framework that turns large, complex codebases into well‑written docstrings. A small sketch of the dependency-ordered traversal follows, then the key ideas.
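The sketch below illustrates, for a single Python file, the dependency-ordered traversal that the Navigator generalizes to a whole repository's AST and call/import graph; the parsing scope and helper names are simplifications, not DocAgent's actual implementation.

```python
# Minimal sketch of dependency-aware traversal: parse a module's AST, record
# which functions call which, and visit callees before callers so that each
# component is documented only after its prerequisites.
import ast
from graphlib import TopologicalSorter

def documentation_order(source: str) -> list[str]:
    tree = ast.parse(source)
    funcs = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}
    deps = {
        name: {
            call.func.id
            for call in ast.walk(node)
            if isinstance(call, ast.Call)
            and isinstance(call.func, ast.Name)
            and call.func.id in funcs
        }
        for name, node in funcs.items()
    }
    # TopologicalSorter yields prerequisites (callees) before dependents (callers).
    return list(TopologicalSorter(deps).static_order())

print(documentation_order("def helper():\n    pass\n\ndef main():\n    helper()\n"))
# -> ['helper', 'main']
```

Key ideas include: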
● Topological Navigator for context building – DocAgent parses the repository’s AST, builds a dependency DAG, and documents components in topological order, so each function/class is visited only after its prerequisites, enabling incremental context accumulation and preventing context‑length explosions.
● Role‑specialised agent team – Five agents work together: Reader analyses code, Searcher gathers internal & external references, Writer drafts docstrings, Verifier critiques and revises them, while the Orchestrator manages iterations until quality converges.
● Adaptive context management – When retrieved context exceeds the model’s token budget, the Orchestrator trims low‑priority segments while preserving overall structure, keeping generation efficient and faithful.
● Three‑facet automatic evaluation – A new framework scores Completeness (section coverage), Helpfulness (LLM‑as‑judge semantic utility), and Truthfulness (entity grounding against the code DAG) for every docstring.
● Substantial gains over baselines – On 366 components across nine Python repos, DocAgent + GPT‑4o‑mini lifts Completeness to 0.934 vs 0.815, Helpfulness to 3.88/5 vs 2.95, and Truthfulness (existence ratio) to 95.7% vs 61.1% compared with a ChatGPT baseline; FIM baselines fare far worse.
● Navigator is crucial – An ablation that randomises processing order drops helpfulness by ‑0.44 and truthfulness by ‑7.9 pp, confirming the importance of dependency‑aware traversal. | [Paper](https://arxiv.org/abs/2504.08725) | | 8) SWE-PolyBench SWE-PolyBench is a new multi-language benchmark for evaluating coding agents on real-world software tasks across Java, JavaScript, TypeScript, and Python. It introduces execution-based assessments, syntax tree metrics, and reveals that current agents struggle with complex tasks and show inconsistent performance across languages. | [Paper](https://arxiv.org/abs/2504.08703v1) | | 9) A Survey of Frontiers in LLM Reasoning This survey categorizes LLM reasoning methods by when reasoning occurs (inference-time vs. training) and the system's architecture (standalone vs. agentic or multi-agent). It highlights trends like learning-to-reason (e.g., DeepSeek-R1) and agentic workflows (e.g., OpenAI Deep Research), covering prompt engineering, output refinement, and learning strategies such as PPO and verifier training. | [Paper](https://arxiv.org/abs/2504.09037) | | 10) Advances in Embodied Agents, Smart Cities, and Earth Science This paper surveys how spatial intelligence manifests across disciplines—from embodied agents to urban and global systems—by connecting human spatial cognition with how LLMs handle spatial memory, representations, and reasoning. It offers a unifying framework to bridge research in AI, robotics, urban planning, and earth science, highlighting LLMs’ evolving spatial capabilities and their interdisciplinary potential. | [Paper](https://arxiv.org/abs/2504.09848) | ## Top ML Papers of the Week (April 6 - April 13) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The AI Scientist V2 The AI Scientist-v2 refines and extends its predecessor to achieve a new milestone: autonomously generating a workshop-accepted research manuscript. The system removes dependencies on human-authored code templates, incorporates agentic tree-search methods for deeper exploration, uses Vision-Language Models to refine figures, and demonstrates impressive real-world outcomes by passing the peer-review bar.
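As a rough illustration of the agentic tree-search component described in the bullets below, here is a generic best-first search over candidate experiment plans; the node fields, scoring function (e.g., an LLM judge), and expansion step are assumptions, not the AI Scientist-v2 code.

```python
# Generic best-first tree search over candidate experiment plans; propose_children
# and score are stand-ins for an idea-expansion agent and an LLM-judge scorer.
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    neg_score: float                  # heapq is a min-heap, so store the negated score
    uid: int                          # tie-breaker for equal scores
    plan: str = field(compare=False)

def tree_search(root_plan, propose_children, score, budget=20):
    """propose_children(plan) -> list[str]; score(plan) -> float."""
    counter = itertools.count()
    best_plan, best_score = root_plan, score(root_plan)
    frontier = [Node(-best_score, next(counter), root_plan)]
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)        # expand the most promising plan first
        for child in propose_children(node.plan):
            child_score = score(child)
            if child_score > best_score:
                best_plan, best_score = child, child_score
            heapq.heappush(frontier, Node(-child_score, next(counter), child))
    return best_plan
```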
● Enhanced Autonomy – Eliminates reliance on human-crafted code templates, enabling out-of-the-box deployment across diverse ML domains.
● Agentic Tree Search – Systematically searches and refines hypotheses through a branching exploration, managed by a new experiment manager agent.
● VLM Feedback Loop – Integrates Vision-Language Models in the reviewing process to critique and improve experimental figures and paper aesthetics.
● Workshop Acceptance – Generated three fully autonomous manuscripts for an ICLR workshop; one was accepted, showcasing the feasibility of AI-driven end-to-end scientific discovery. | [Paper](https://pub.sakana.ai/ai-scientist-v2/paper/paper.pdf), [Tweet](https://x.com/SakanaAILabs/status/1909497165925536212) | | 2) Benchmarking Browsing Agents OpenAI introduces BrowseComp, a benchmark with 1,266 questions that require AI agents to locate hard-to-find, entangled information on the web. Unlike saturated benchmarks like SimpleQA, BrowseComp demands persistent and creative search across numerous websites, offering a robust testbed for real-world web-browsing agents. Key insights:
● Extremely difficult questions: Benchmarked tasks were verified to be unsolvable in under 10 minutes by humans, as well as by GPT-4o (with and without browsing), OpenAI o1, and earlier Deep Research models.
● Human performance is low: Only 29.2% of problems were solved by humans (even with 2-hour limits). 70.8% were abandoned.
● Model performance:
● Test-time scaling matters: Accuracy improves with more browsing attempts. With 64 parallel samples and best-of-N aggregation, Deep Research significantly boosts its performance (15–25% gain over a single attempt).
● Reasoning beats browsing: OpenAI o1 (no browsing but better reasoning) outperforms GPT-4.5 with browsing, showing that tool use alone isn't enough—strategic reasoning is key.
● Calibration struggles: Models with browsing access often exhibit overconfidence in incorrect answers, revealing current limits in uncertainty estimation.
● Dataset diversity: Includes a wide topical spread: TV/movies, science, art, sports, politics, geography, etc. | [Paper](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf), [Blog](https://openai.com/index/browsecomp/), [Tweet](https://x.com/OpenAI/status/1910393421652520967) | | 3) OLMOTrace Allen Institute for AI & University of Washington present OLMOTRACE, a real-time system that traces LLM-generated text back to its verbatim sources in the original training data, even across multi-trillion-token corpora.
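As a toy illustration of the tracing idea, the sketch below indexes a corpus by n-grams and reports output spans that appear verbatim in training documents; the real system relies on suffix-array-style indexing over trillions of tokens, so the in-memory corpus, whitespace tokenization, and span length here are simplified assumptions.

```python
# Toy verbatim-tracing sketch: index a corpus by n-grams, then report spans of a
# model's output that appear word-for-word in training documents.
from collections import defaultdict

def build_index(corpus: dict[str, str], n: int = 8) -> dict[tuple, list[str]]:
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].append(doc_id)
    return index

def trace_output(output: str, index: dict[tuple, list[str]], n: int = 8) -> list[dict]:
    tokens = output.split()
    hits = []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in index:
            hits.append({"span": " ".join(gram), "documents": index[gram]})
    return hits  # each hit links an output span back to training documents
```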
● What it does: For a given LM output, OLMOTRACE highlights exact matches with training data segments and lets users inspect full documents for those matches. Think "reverse-engineering" a model’s response via lexical lookup.
● How it works:
● Supported models: Works with OLMo models (e.g., OLMo-2-32B-Instruct) and their full pre/mid/post-training datasets, totaling 4.6T tokens.
● Use cases:
● Benchmarked:
● Not RAG: It retrieves after generation, without changing output, unlike retrieval-augmented generation. | [Paper](https://arxiv.org/abs/2504.07096), [Tweet](https://x.com/omarsar0/status/1910323386603262316), [Blog](https://5910970.hs-sites.com/olmotrace-points-model-output-back-to-training-data?ecid=ACsprvuggQcD4yCdO--rKTZKDvmczdSQkb96ct95zLH9eiysrXjF_WuKgsmIMaz8byfiL1H1-2A6&utm_campaign=AI2%20Newsletter&utm_medium=email&_hsenc=p2ANqtz-__MqUAVPXfHPpHpf2xC86iZG8qC3J-z5nW141VBN9gZW4j61ymW3dM7mhkiHGTWtjQt3Eao7Cqf7pB1k24CfEhYe9fmA&_hsmi=355925505) | | 4) Concise Reasoning via RL This new paper proposes a new training strategy that promotes concise and accurate reasoning in LLMs using RL. It challenges the belief that long responses improve accuracy; it offers both theoretical and empirical evidence showing that conciseness often correlates with better performance.
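A quick way to sanity-check the length-correctness claim on your own model's samples is to compare mean response length for correct versus incorrect generations; the `samples` record format below is an illustrative assumption.

```python
# Compare mean response length for correct vs. incorrect sampled answers.
def _mean(xs: list[int]) -> float:
    return sum(xs) / len(xs) if xs else float("nan")

def length_by_correctness(samples: list[dict]) -> dict[str, float]:
    correct = [len(s["response_tokens"]) for s in samples if s["is_correct"]]
    incorrect = [len(s["response_tokens"]) for s in samples if not s["is_correct"]]
    return {"mean_len_correct": _mean(correct), "mean_len_incorrect": _mean(incorrect)}

print(length_by_correctness([
    {"response_tokens": ["step"] * 120, "is_correct": True},
    {"response_tokens": ["step"] * 480, "is_correct": False},
]))  # a lower mean length for correct answers mirrors the paper's observation
```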
● Long ≠ better reasoning – The authors mathematically show that RL with PPO tends to generate unnecessarily long responses, especially when answers are wrong. Surprisingly, shorter outputs correlate more with correct answers, across reasoning and non-reasoning models.
● Two-phase RL for reasoning + conciseness – They introduce a two-phase RL strategy: (1) train on hard problems to build reasoning ability (length may increase), then (2) fine-tune on occasionally solvable tasks to enforce concise CoT without hurting accuracy. The second phase alone dramatically reduces token usage by over 50%, with no loss in accuracy.
● Works with tiny data – Their method succeeds with as few as 4–8 training examples, showing large gains in both math and STEM benchmarks (MATH, AIME24, MMLU-STEM). For instance, on MMLU-STEM, they improved accuracy by +12.5% while cutting response length by over 2×.
● Better under low sampling – Post-trained models remain robust even when the temperature is reduced to 0. At temperature=0, the fine-tuned model outperformed the baseline by 10–30%, showing enhanced deterministic performance.
● Practical implications – Besides improving model output, their method reduces latency, cost, and token usage, making LLMs more deployable. The authors also recommend setting λ < 1 during PPO to avoid instability and encourage correct response shaping. | [Paper](https://arxiv.org/abs/2504.05185), [Tweet](https://x.com/omarsar0/status/1909634850304503977) | | 5) Rethinking Reflection in Pre-Training Reflection — the ability of LLMs to identify and correct their own reasoning — has often been attributed to reinforcement learning or fine-tuning. This paper argues otherwise: reflection emerges during pre-training. The authors introduce adversarial reasoning tasks to show that self-reflection and correction capabilities steadily improve as compute increases, even in the absence of supervised post-training. Key contributions:
● Propose two kinds of reflection:
● Build six adversarial datasets (derived from GSM8K, TriviaQA, CruxEval, and BBH) to test reflection across math, coding, logic, and knowledge domains. On GSM8K-Platinum, explicit reflection rates grow from ~10% to 60% with increasing pre-training tokens.
● Demonstrate that simple triggers like “Wait,” reliably induce reflection.
● Evaluate 40 OLMo-2 and Qwen2.5 checkpoints, finding a strong correlation between pre-training compute and both accuracy and reflection rate. Why it matters:
● Reflection is a precursor to reasoning and can develop before RLHF or test-time decoding strategies.
● Implication: We can instill advanced reasoning traits with better pre-training data and scale, rather than relying entirely on post-training tricks.
● They also show a trade-off: more training compute reduces the need for expensive test-time compute like long CoT traces. | [Paper](https://arxiv.org/abs/2504.04022), [Tweet](https://x.com/ashVaswani/status/1909642828554387675) | | 6) Efficient KG Reasoning for Small LLMs LightPROF is a lightweight framework that enables small-scale language models to perform complex reasoning over knowledge graphs (KGs) using structured prompts. Key highlights:
● Retrieve-Embed-Reason pipeline – LightPROF introduces a three-stage architecture:
● Plug-and-play & parameter-efficient – LightPROF trains only the adapter and projection modules, allowing seamless integration with any open-source LLM (e.g., LLaMa2-7B, LLaMa3-8B) without expensive fine-tuning.
● Outperforms larger models – Despite using small LLMs, LightPROF beats baselines like StructGPT (ChatGPT) and ToG (LLaMa2-70B) on KGQA tasks: 83.8% (vs. 72.6%) on WebQSP and 59.3% (vs. 57.6%) on CWQ.
● Extreme efficiency – Compared to StructGPT, LightPROF reduces token input by 98% and runtime by 30%, while maintaining accuracy and stable output even in complex multi-hop questions.
● Ablation insights – Removing structural signals or training steps severely degrades performance, confirming the critical role of the Knowledge Adapter and retrieval strategy. | [Paper](https://arxiv.org/abs/2504.03137), [Tweet](https://x.com/omarsar0/status/1910319109096747191) | | 7) Compute Agent Arena Computer Agent Arena is a new open platform for benchmarking LLM and VLM-based agents on real-world computer-use tasks, like coding, editing, and web navigation, using a virtual desktop environment. Initial results show that OpenAI and Anthropic are leading with modest success rates, while the platform aims to grow through crowdsourced tasks, agent submissions, and open-sourcing of its infrastructure. [Report](https://arena.xlang.ai/blog/computer-agent-arena) | [Tweet](https://x.com/BowenWangNLP/status/1909618451259572328) | | 8) Agentic Knowledgeable Self-awareness KnowSelf is a new framework that introduces agentic knowledgeable self-awareness, enabling LLM agents to dynamically decide when to reflect or seek knowledge based on situational complexity, mimicking human cognition. Using special tokens for "fast," "slow," and "knowledgeable" thinking, KnowSelf reduces inference costs and achieves state-of-the-art performance on ALFWorld and WebShop tasks with minimal external knowledge. | [Paper](https://arxiv.org/abs/2504.03553v1) | | 9) One-Minute Video Generation with Test-Time Training One-Minute Video Generation with Test-Time Training introduces TTT layers, a novel sequence modeling component where hidden states are neural networks updated via self-supervised loss at test time. By integrating these into a pre-trained diffusion model, the authors enable single-shot generation of one-minute, multi-scene videos from storyboards, achieving 34 Elo points higher than strong baselines like Mamba 2 and DeltaNet in human evaluations | [Paper](https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf), [Tweet](https://x.com/karansdalal/status/1909312851795411093) | | 10) NoProp NoProp is a novel gradient-free learning method where each neural network layer independently learns to denoise a noisy version of the target, inspired by diffusion and flow matching. Unlike backpropagation, it avoids hierarchical representation learning and achieves competitive performance and efficiency on image classification benchmarks like MNIST and CIFAR. | [Paper](https://arxiv.org/abs/2503.24322) | ## Top ML Papers of the Week (March 31 - April 6) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) PaperBench OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers, from scratch. ● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors. ● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. ● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). 
Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. ● CodeDev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. ● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. ● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | [Paper](https://arxiv.org/abs/2504.01848), [Tweet](https://x.com/OpenAI/status/1907481490457506235), [GitHub](https://github.com/openai/preparedness) | | 2) Command A: An Enterprise-Ready LLM Cohere announced Command A, a 111B parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: ● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter soup techniques. This preserves most expert performance with just ~1.8% average drop. ● Hybrid architecture for long-context efficiency – Command A interleaves sliding window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B at 128k. It scores 95.0% on RULER, outperforming most long-context peers. ● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). ● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy. ● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. ● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). ● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face. 
| [Paper](https://arxiv.org/abs/2504.00698), [Tweet](https://x.com/nrehiew_/status/1908181303339471020), [Models](https://huggingface.co/CohereForAI/c4ai-command-a-03-2025) | | 3) CodeScientist Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas: ● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. ● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples: ● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. ● Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | [Paper](https://arxiv.org/abs/2503.22708), [Blog](https://allenai.org/blog/codescientist), [GitHub](https://github.com/allenai/codescientist) | | 4) Retrieval-Augmented Reasoning Model Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: ● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. ● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. ● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%, CoVERT: 74.14% vs. GPT-4’s 65.67%). ● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. ● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(k | x, R(x))) and reasoning (p(r | x, R(x), k)) as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | [Paper](https://arxiv.org/abs/2503.23513), [Tweet](https://x.com/omarsar0/status/1907796990966247484) | | 5) Why do LLMs Attend to First Token? This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers.
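A minimal sketch of how one might measure sink behavior with Hugging Face Transformers is shown below: it reports, per layer, the fraction of heads that put most of their attention mass on the first position. The model name and the 50% threshold are arbitrary stand-ins, not the paper's exact protocol, which targets Gemma and LLaMa models.

```python
# Measure attention-sink behavior: per layer, how many heads concentrate their
# attention mass on the first position of the sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM that can return attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
for layer, attn in enumerate(out.attentions):
    sink_mass = attn[0, :, :, 0].mean(dim=-1)       # per-head mass on position 0
    frac_sink_heads = (sink_mass > 0.5).float().mean().item()
    print(f"layer {layer:2d}: {frac_sink_heads:.0%} heads put >50% attention on token 0")
```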
● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. ● Sharp experiments on Gemma & LLaMa – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMa 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. ● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself, but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. ● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. ● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target, unless a special pattern (e.g., apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | [Paper](https://arxiv.org/abs/2504.02732), [Tweet](https://x.com/omarsar0/status/1908187563422261411) | | 6) Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions Presents MedAgentSim, a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: ● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. ● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. ● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. ● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. ● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation.
| [Paper](https://arxiv.org/abs/2503.22678), [Tweet](https://x.com/omarsar0/status/1906719555482702147), [Code](https://github.com/MAXNORM8650/MedAgentSim) | | 7) Open Deep Search Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar. Key insights: ● Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). ● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. ● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. ● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to CoT-based ReAct in ODS-v1. | [Paper](https://arxiv.org/abs/2503.20201), [Tweet](https://x.com/sewoong79/status/1906595129965912341), [GitHub](https://github.com/sentient-agi/OpenDeepSearch) | | 8) Efficient Test-time Scaling with Code Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs with short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: ● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. ● Shifted Thinking Window – A new test-time strategy that eliminates explicit