# ML Papers of The Week [Subscribe to our newsletter](https://nlpnews.substack.com/) to get a weekly list of top ML papers in your inbox. At DAIR.AI we ❤️ reading ML papers so we've created this repo to highlight the top ML papers of every week. Here is the weekly series: ## 2025 - [Top ML Papers of the Week (May 12 - May 18)](./#top-ml-papers-of-the-week-may-12---may-18---2025) - [Top ML Papers of the Week (May 5 - May 11)](./#top-ml-papers-of-the-week-may-5---may-11---2025) - [Top ML Papers of the Week (April 28 - May 4)](./#top-ml-papers-of-the-week-april-28---may-4---2025) - [Top ML Papers of the Week (April 21 - April 27)](./#top-ml-papers-of-the-week-april-21---april-27---2025) - [Top ML Papers of the Week (April 14 - April 20)](./#top-ml-papers-of-the-week-april-14---april-20---2025) - [Top ML Papers of the Week (April 7 - April 13)](./#top-ml-papers-of-the-week-april-7---april-13---2025) - [Top ML Papers of the Week (March 31 - April 6)](./#top-ml-papers-of-the-week-march-31---april-6---2025) - [Top ML Papers of the Week (March 24 - March 30)](./#top-ml-papers-of-the-week-march-24---march-30---2025) - [Top ML Papers of the Week (March 17 - March 23)](./#top-ml-papers-of-the-week-march-17---march-23---2025) - [Top ML Papers of the Week (March 10 - March 16)](./#top-ml-papers-of-the-week-march-10---march-16---2025) - [Top ML Papers of the Week (March 3 - March 9)](./#top-ml-papers-of-the-week-march-3---march-9---2025) - [Top ML Papers of the Week (February 24 - March 2)](./#top-ml-papers-of-the-week-february-24---march-2---2025) - [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025) - [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025) - [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025) - [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025) - [Top ML Papers of the Week (January 20 - January 26)](./#top-ml-papers-of-the-week-january-20---january-26---2025) - [Top ML Papers of the Week (January 13 - January 19)](./#top-ml-papers-of-the-week-january-13---january-19---2025) - [Top ML Papers of the Week (January 6 - January 12)](./#top-ml-papers-of-the-week-january-6---january-12---2025) ## 2024 - [Top ML Papers of the Week (December 30 - January 5)](./#top-ml-papers-of-the-week-december-30---january-5---2025) - [Top ML Papers of the Week (December 23 - December 29)](./#top-ml-papers-of-the-week-december-23---december-29---2024) - [Top ML Papers of the Week (December 16 - December 22)](./#top-ml-papers-of-the-week-december-16---december-22---2024) - [Top ML Papers of the Week (December 9 - December 15)](./#top-ml-papers-of-the-week-december-9---december-15---2024) - [Top ML Papers of the Week (December 2 - December 8)](./#top-ml-papers-of-the-week-december-2---december-8---2024) - [Top ML Papers of the Week (November 25 - December 1)](./#top-ml-papers-of-the-week-november-25---december-1---2024) - [Top ML Papers of the Week (November 18 - November 24)](./#top-ml-papers-of-the-week-november-18---november-24---2024) - [Top ML Papers of the Week (November 11 - November 17)](./#top-ml-papers-of-the-week-november-11---november-17---2024) - [Top ML Papers of the Week (November 4 - November 10)](./#top-ml-papers-of-the-week-november-4---november-10---2024) - [Top ML Papers of the Week (October 28 - November 
3)](./#top-ml-papers-of-the-week-october-28---november-3---2024) - [Top ML Papers of the Week (October 21 - October 27)](./#top-ml-papers-of-the-week-october-21---october-27---2024) - [Top ML Papers of the Week (October 14 - October 20)](./#top-ml-papers-of-the-week-october-14---october-20---2024) - [Top ML Papers of the Week (October 7 - October 13)](./#top-ml-papers-of-the-week-october-7---october-13---2024) - [Top ML Papers of the Week (September 30 - October 6)](./#top-ml-papers-of-the-week-september-30---october-6---2024) - [Top ML Papers of the Week (September 23 - September 29)](./#top-ml-papers-of-the-week-september-23---september-29---2024) - [Top ML Papers of the Week (September 16 - September 22)](./#top-ml-papers-of-the-week-september-16---september-22---2024) - [Top ML Papers of the Week (September 9 - September 15)](./#top-ml-papers-of-the-week-september-9---september-15---2024) - [Top ML Papers of the Week (September 2 - September 8)](./#top-ml-papers-of-the-week-september-2---september-8---2024) - [Top ML Papers of the Week (August 26 - September 1)](./#top-ml-papers-of-the-week-august-26---september-1---2024) - [Top ML Papers of the Week (August 19 - August 25)](./#top-ml-papers-of-the-week-august-19---august-25---2024) - [Top ML Papers of the Week (August 12 - August 18)](./#top-ml-papers-of-the-week-august-12---august-18---2024) - [Top ML Papers of the Week (August 5 - August 11)](./#top-ml-papers-of-the-week-august-5---august-11---2024) - [Top ML Papers of the Week (July 29 - August 4)](./#top-ml-papers-of-the-week-july-29---august-4---2024) - [Top ML Papers of the Week (July 22 - July 28)](./#top-ml-papers-of-the-week-july-22---july-28---2024) - [Top ML Papers of the Week (July 15 - July 21)](./#top-ml-papers-of-the-week-july-15---july-21---2024) - [Top ML Papers of the Week (July 8 - July 14)](./#top-ml-papers-of-the-week-july-8---july-14---2024) - [Top ML Papers of the Week (July 1 - July 7)](./#top-ml-papers-of-the-week-july-1---july-7---2024) - [Top ML Papers of the Week (June 24 - June 30)](./#top-ml-papers-of-the-week-june-24---june-30---2024) - [Top ML Papers of the Week (June 17 - June 23)](./#top-ml-papers-of-the-week-june-17---june-23---2024) - [Top ML Papers of the Week (June 10 - June 16)](./#top-ml-papers-of-the-week-june-10---june-16---2024) - [Top ML Papers of the Week (June 3 - June 9)](./#top-ml-papers-of-the-week-june-3---june-9---2024) - [Top ML Papers of the Week (May 27 - June 2)](./#top-ml-papers-of-the-week-may-27---june-2---2024) - [Top ML Papers of the Week (May 20 - May 26)](./#top-ml-papers-of-the-week-may-20---may-26---2024) - [Top ML Papers of the Week (May 13 - May 19)](./#top-ml-papers-of-the-week-may-13---may-19---2024) - [Top ML Papers of the Week (May 6 - May 12)](./#top-ml-papers-of-the-week-may-6---may-12---2024) - [Top ML Papers of the Week (April 29 - May 5)](./#top-ml-papers-of-the-week-april-29---may-5---2024) - [Top ML Papers of the Week (April 22 - April 28)](./#top-ml-papers-of-the-week-april-22---april-28---2024) - [Top ML Papers of the Week (April 15 - April 21)](./#top-ml-papers-of-the-week-april-15---april-21---2024) - [Top ML Papers of the Week (April 8 - April 14)](./#top-ml-papers-of-the-week-april-8---april-14---2024) - [Top ML Papers of the Week (April 1 - April 7)](./#top-ml-papers-of-the-week-april-1---april-7---2024) - [Top ML Papers of the Week (March 26 - March 31)](./#top-ml-papers-of-the-week-march-26---march-31---2024) - [Top ML Papers of the Week (March 18 - March 
25)](./#top-ml-papers-of-the-week-march-18---march-25---2024) - [Top ML Papers of the Week (March 11 - March 17)](./#top-ml-papers-of-the-week-march-11---march-17---2024) - [Top ML Papers of the Week (March 4 - March 10)](./#top-ml-papers-of-the-week-march-4---march-10---2024) - [Top ML Papers of the Week (February 26 - March 3)](./#top-ml-papers-of-the-week-february-26---march-3---2024) - [Top ML Papers of the Week (February 19 - February 25)](./#top-ml-papers-of-the-week-february-19---february-25---2024) - [Top ML Papers of the Week (February 12 - February 18)](./#top-ml-papers-of-the-week-february-12---february-18---2024) - [Top ML Papers of the Week (February 5 - February 11)](./#top-ml-papers-of-the-week-february-5---february-11---2024) - [Top ML Papers of the Week (January 29 - February 4)](./#top-ml-papers-of-the-week-january-29---february-4---2024) - [Top ML Papers of the Week (January 22 - January 28)](./#top-ml-papers-of-the-week-january-22---january-28---2024) - [Top ML Papers of the Week (January 15 - January 21)](./#top-ml-papers-of-the-week-january-15---january-21---2024) - [Top ML Papers of the Week (January 8 - January 14)](./#top-ml-papers-of-the-week-january-8---january-14---2024) - [Top ML Papers of the Week (January 1 - January 7)](./#top-ml-papers-of-the-week-january-1---january-7---2024) ## 2023 - [Top ML Papers of the Week (December 24 - December 31)](./#top-ml-papers-of-the-week-december-25---december-31) - [Top ML Papers of the Week (December 18 - December 24)](./#top-ml-papers-of-the-week-december-18---december-24) - [Top ML Papers of the Week (December 11 - December 17)](./#top-ml-papers-of-the-week-december-11---december-17) - [Top ML Papers of the Week (December 4 - December 10)](./#top-ml-papers-of-the-week-december-4---december-10) - [Top ML Papers of the Week (November 27 - December 3)](./#top-ml-papers-of-the-week-november-27---december-3) - [Top ML Papers of the Week (November 20 - November 26)](./#top-ml-papers-of-the-week-november-20---november-26) - [Top ML Papers of the Week (November 13 - November 19)](./#top-ml-papers-of-the-week-november-13---november-19) - [Top ML Papers of the Week (November 6 - November 12)](./#top-ml-papers-of-the-week-november-6---november-12) - [Top ML Papers of the Week (October 30 - November 5)](./#top-ml-papers-of-the-week-october-30---november-5) - [Top ML Papers of the Week (October 23 - October 29)](./#top-ml-papers-of-the-week-october-23---october-29) - [Top ML Papers of the Week (October 16 - October 22)](./#top-ml-papers-of-the-week-october-16---october-22) - [Top ML Papers of the Week (October 9 - October 15)](./#top-ml-papers-of-the-week-october-9---october-15) - [Top ML Papers of the Week (October 2 - October 8)](./#top-ml-papers-of-the-week-october-2---october-8) - [Top ML Papers of the Week (September 25 - October 1)](./#top-ml-papers-of-the-week-september-25---october-1) - [Top ML Papers of the Week (September 18 - September 24)](./#top-ml-papers-of-the-week-september-18---september-24) - [Top ML Papers of the Week (September 11 - September 17)](./#top-ml-papers-of-the-week-september-11---september-17) - [Top ML Papers of the Week (September 4 - September 10)](./#top-ml-papers-of-the-week-september-4---september-10) - [Top ML Papers of the Week (August 28 - September 3)](./#top-ml-papers-of-the-week-august-28---september-3) - [Top ML Papers of the Week (August 21 - August 27)](./#top-ml-papers-of-the-week-august-21---august-27) - [Top ML Papers of the Week (August 14 - August 
20)](./#top-ml-papers-of-the-week-august-14---august-20) - [Top ML Papers of the Week (August 7 - August 13)](./#top-ml-papers-of-the-week-august-7---august-13) - [Top ML Papers of the Week (July 31 - August 6)](./#top-ml-papers-of-the-week-july-31---august-6) - [Top ML Papers of the Week (July 24 - July 30)](./#top-ml-papers-of-the-week-july-24---july-30) - [Top ML Papers of the Week (July 17 - July 23)](./#top-ml-papers-of-the-week-july-17---july-23) - [Top ML Papers of the Week (July 10 - July 16)](./#top-ml-papers-of-the-week-july-10---july-16) - [Top ML Papers of the Week (July 3 - July 9)](./#top-ml-papers-of-the-week-july-3---july-9) - [Top ML Papers of the Week (June 26 - July 2)](./#top-ml-papers-of-the-week-june-26---july-2) - [Top ML Papers of the Week (June 19 - June 25)](./#top-ml-papers-of-the-week-june-19---june-25) - [Top ML Papers of the Week (June 12 - June 18)](./#top-ml-papers-of-the-week-june-12---june-18) - [Top ML Papers of the Week (June 5 - June 11)](./#top-ml-papers-of-the-week-june-5---june-11) - [Top ML Papers of the Week (May 29 - June 4)](./#top-ml-papers-of-the-week-may-29-june-4) - [Top ML Papers of the Week (May 22 - 28)](./#top-ml-papers-of-the-week-may-22-28) - [Top ML Papers of the Week (May 15 - 21)](./#top-ml-papers-of-the-week-may-15-21) - [Top ML Papers of the Week (May 8 - 14)](./#top-ml-papers-of-the-week-may-8-14) - [Top ML Papers of the Week (May 1-7)](./#top-ml-papers-of-the-week-may-1-7) - [Top ML Papers of the Week (April 24 - April 30)](./#top-ml-papers-of-the-week-april-24---april-30) - [Top ML Papers of the Week (April 17 - April 23)](./#top-ml-papers-of-the-week-april-17---april-23) - [Top ML Papers of the Week (April 10 - April 16)](./#top-ml-papers-of-the-week-april-10---april-16) - [Top ML Papers of the Week (April 3 - April 9)](./#top-ml-papers-of-the-week-april-3---april-9) - [Top ML Papers of the Week (Mar 27 - April 2)](./#top-ml-papers-of-the-week-mar-27---april-2) - [Top ML Papers of the Week (Mar 20-Mar 26)](./#top-ml-papers-of-the-week-mar-20-mar-26) - [Top ML Papers of the Week (Mar 13-Mar 19)](./#top-ml-papers-of-the-week-mar-13-mar-19) - [Top ML Papers of the Week (Mar 6-Mar 12)](./#top-ml-papers-of-the-week-mar-6-mar-12) - [Top ML Papers of the Week (Feb 27-Mar 5)](./#top-ml-papers-of-the-week-feb-27-mar-5) - [Top ML Papers of the Week (Feb 20-26)](./#top-ml-papers-of-the-week-feb-20-26) - [Top ML Papers of the Week (Feb 13 - 19)](./#top-ml-papers-of-the-week-feb-13---19) - [Top ML Papers of the Week (Feb 6 - 12)](./#top-ml-papers-of-the-week-feb-6---12) - [Top ML Papers of the Week (Jan 30-Feb 5)](./#top-ml-papers-of-the-week-jan-30-feb-5) - [Top ML Papers of the Week (Jan 23-29)](./#top-ml-papers-of-the-week-jan-23-29) - [Top ML Papers of the Week (Jan 16-22)](./#top-ml-papers-of-the-week-jan-16-22) - [Top ML Papers of the Week (Jan 9-15)](./#top-ml-papers-of-the-week-jan-9-15) - [Top ML Papers of the Week (Jan 1-8)](./#top-ml-papers-of-the-week-jan-1-8) [Follow us on Twitter](https://twitter.com/dair_ai) [Join our Discord](https://discord.gg/SKgkVT8BGJ) ## Top ML Papers of the Week (May 12 - May 18) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) AlphaEvolve AlphaEvolve is a coding agent developed by Google DeepMind that uses LLM-guided evolution to discover new algorithms and optimize computational systems. It orchestrates a pipeline where LLMs generate code changes, evaluators provide feedback, and an evolutionary loop iteratively improves solutions. 
AlphaEvolve shows that LLMs can go beyond conventional code generation and assist in scientific and algorithmic discovery. Key highlights:
● Novel Algorithm Discovery: AlphaEvolve discovered a new algorithm to multiply 4×4 complex-valued matrices using 48 multiplications, the first improvement over Strassen’s 1969 result (49 multiplications) in this setting.
● Broad Mathematical Impact: Applied to 50+ open problems in mathematics, AlphaEvolve matched or exceeded state-of-the-art in ~95% of cases. For example, it improved bounds on Erdős’s minimum overlap problem and kissing numbers in 11 dimensions.
● Infrastructure Optimization at Google: AlphaEvolve improved key components of Google’s compute stack, including data center scheduling, hardware circuit design, and kernels used in Gemini training.
● Advanced Pipeline Design: AlphaEvolve uses ensembles of Gemini 2.0 Flash and Pro models. It supports rich prompts (past trials, evaluations, explicit context), multi-objective optimization, and evaluation cascades for robust idea filtering. Programs are evolved at full-file scale rather than function-level only, a key differentiator from predecessors like FunSearch.
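To make the pipeline concrete, here is a minimal sketch of an LLM-guided evolutionary loop in the spirit described above; `llm_propose_change` and `evaluate` are hypothetical stand-ins (the real system uses Gemini ensembles, rich prompts, and task-specific evaluation cascades):

```python
import random

def llm_propose_change(parent_program: str, context: list[str]) -> str:
    """Stand-in for an LLM call that rewrites a full program given past trials."""
    return parent_program + f"\n# revision informed by {len(context)} prior evaluations"

def evaluate(program: str) -> float:
    """Stand-in evaluator; the real pipeline scores candidates with task-specific metrics."""
    return random.random()

def evolve(seed_program: str, generations: int = 20, population_size: int = 4) -> str:
    population = [(seed_program, evaluate(seed_program))]
    for _ in range(generations):
        # Pick a strong parent and ask the LLM for a full-file rewrite, conditioned on history.
        parent = max(random.sample(population, k=min(2, len(population))), key=lambda p: p[1])
        history = [f"score={score:.3f}" for _, score in population[-5:]]
        child = llm_propose_change(parent[0], history)
        population.append((child, evaluate(child)))
        # Truncation selection keeps the best candidates for the next generation.
        population = sorted(population, key=lambda p: p[1], reverse=True)[:population_size]
    return population[0][0]

best_program = evolve("def multiply(a, b):\n    ...")
```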
● Ablations Confirm Component Importance: Experiments show that evolution, prompt context, full-file evolution, and using strong LLMs all contribute significantly to performance. Removing any one of these reduces effectiveness. | [Paper](https://storage.googleapis.com/deepmind-DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1922669321559347498) | | 2) LLMs Get Lost in Multi-Turn Conversation Investigates how top LLMs degrade in performance during underspecified, multi-turn interactions, common in real-world usage but rarely evaluated. The authors introduce a novel "sharded simulation" framework that breaks down fully-specified instructions into gradual conversation shards, simulating how users naturally provide information over time. Key findings:
● Massive performance drop: Across 15 top LLMs (e.g., GPT-4.1, Gemini 2.5 Pro, Claude 3.7), average performance dropped 39% in multi-turn vs. single-turn settings. Even a two-turn interaction was enough to cause a significant decline.
● High unreliability, not just low aptitude: Decomposition shows only a small drop in best-case capability (aptitude) but a 112% increase in unreliability, meaning models are wildly inconsistent depending on how the conversation unfolds.
● Root causes of failure: Through log analysis and experiments, the paper identifies four recurring failure modes, including premature attempts at a final answer and over-reliance on earlier, incorrect assumptions.
● Sharded evaluation tasks: The authors built 600+ multi-turn simulations across 6 tasks (coding, math, SQL, API calls, summarization, and table captioning), showing consistent degradation across domains.
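A rough sketch of what a sharded simulation loop could look like, assuming the fully specified instruction has already been split into shards; `assistant_reply` and `is_correct` are hypothetical stand-ins for the model call and the task grader:

```python
def simulate_sharded_conversation(shards: list[str], assistant_reply, is_correct) -> int:
    """Reveal one shard of the instruction per turn; return the 1-indexed turn at which
    the assistant's answer first becomes correct, or -1 if it never does."""
    history: list[tuple[str, str]] = []
    for turn, shard in enumerate(shards, start=1):
        history.append(("user", shard))
        answer = assistant_reply(history)  # the model only sees what has been revealed so far
        history.append(("assistant", answer))
        if is_correct(answer):
            return turn
    return -1

# Example usage with trivial stand-ins.
shards = ["Write a function that sums a list.",
          "It should ignore None entries.",
          "Name it robust_sum."]
result = simulate_sharded_conversation(
    shards,
    assistant_reply=lambda h: "def robust_sum(xs): return sum(x for x in xs if x is not None)",
    is_correct=lambda a: "robust_sum" in a,
)
```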
● Agent-style interventions only partially help: Techniques like recap and snowballing (repeating all prior turns) improved outcomes by ~15–20% but did not restore single-turn levels, suggesting that model internals, not prompting strategies, are the bottleneck.
● Temperature and test-time compute don't solve the issue: Even at temperature 0.0 or with reasoning models (like o3 and DeepSeek-R1), models remained highly unreliable in multi-turn settings. | [Paper](https://arxiv.org/abs/2505.06120), [Tweet](https://x.com/omarsar0/status/1922755721428598988) | | 3) RL for Reasoning in LLMs with One Training Example This paper shows that Reinforcement Learning with Verifiable Rewards (RLVR) can significantly improve mathematical reasoning in LLMs even when trained with just a single example. On the Qwen2.5-Math-1.5B model, one-shot RLVR improves accuracy on the MATH500 benchmark from 36.0% to 73.6%, nearly matching performance achieved with over 1,200 examples. Two-shot RLVR (with two examples) even slightly surpasses that, matching results from full 7.5k example training.
● Extreme data efficiency: A single training example (π₁₃) boosts MATH500 accuracy to 73.6% and average performance across six math benchmarks to 35.7%, rivaling full-dataset RLVR. Two-shot RLVR goes further (74.8% and 36.6%).
● Broad applicability: 1-shot RLVR works not only on Qwen2.5-Math-1.5B, but also on Qwen2.5-Math-7B, Llama3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. It remains effective across GRPO and PPO RL algorithms.
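For context, a minimal sketch of the group-relative advantage computation used in GRPO-style RLVR, with a binary verifiable reward per sampled completion (normalization details vary across implementations):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each completion in a group sampled from the same prompt
    is scored against the group mean (and std), so no learned value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 completions for the single training problem, graded 1 if the final
# answer matches the verified solution and 0 otherwise.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
print(group_relative_advantages(rewards))
```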
● Post-saturation generalization: Despite training accuracy saturating early (within 100 steps), test accuracy continues improving well beyond, reaching gains of +10% after 2,000 steps. The model eventually overfits the single example (mixing gibberish into outputs), yet test performance remains stable.
● Cross-domain and reflection behavior: A single example from one domain (e.g., geometry) improves performance across others (e.g., number theory). Additionally, models trained with 1-shot RLVR exhibit increased self-reflection (e.g., “rethink”, “recalculate”) and longer output sequences.
● Loss function insights: Ablation studies confirm that policy gradient loss is the primary driver of improvements, not weight decay, distinguishing 1-shot RLVR from "grokking". Entropy loss further enhances performance and generalization; even without reward signals, entropy-only training can still yield a 27% performance boost. | [Paper](https://arxiv.org/abs/2504.20571), [Tweet](https://x.com/ypwang61/status/1917596101953348000) | | 4) AM-Thinking-v1 Introduces a dense, open-source 32B language model that achieves state-of-the-art performance in reasoning tasks, rivaling significantly larger Mixture-of-Experts (MoE) models. Built upon Qwen2.5-32B, the model is trained entirely with public data and showcases how a meticulously crafted post-training pipeline can unlock competitive performance at mid-scale sizes. Key points:
● Benchmark performance: AM-Thinking-v1 scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 (671B MoE) and matching or exceeding Qwen3-32B and Seed1.5-Thinking. On Arena-Hard (general chat), it hits 92.5, near the level of OpenAI o1 and o3-mini but behind Qwen3-235B-A22B and Gemini 2.5 Pro.
● Training pipeline: The model uses a two-stage post-training approach combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT emphasizes a “think-then-answer” format and uses 2.84M samples, while RL incorporates difficulty-aware sampling and a two-stage curriculum optimized via Group Relative Policy Optimization (GRPO).
● Data and filtering: All training data is publicly sourced and heavily filtered. Math data goes through LLM-assisted cleaning and cross-model ground-truth validation. Responses are filtered using perplexity, n-gram repetition, and structural checks to ensure coherence and correctness.
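One of the structural checks mentioned above can be sketched as an n-gram repetition filter; the threshold below is illustrative, and the perplexity and structure checks used in the paper are not reproduced here:

```python
from collections import Counter

def ngram_repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that are duplicates; high values indicate degenerate repetition."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def keep_response(text: str, max_repetition: float = 0.2) -> bool:
    # Illustrative threshold; a production filter would also check perplexity
    # and structural constraints (e.g., a well-formed think/answer layout).
    return ngram_repetition_ratio(text) <= max_repetition
```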
● Inference and deployment: The authors implement a custom rollout framework that decouples rollout from inference via a streaming load balancer. This reduces long-tail latency and increases throughput across distributed GPU nodes, enabling scalable RL training at 32k sequence length. | [Paper](https://arxiv.org/abs/2505.08311), [Tweet](https://x.com/omarsar0/status/1922668488826741061) | | 5) HealthBench HealthBench is a benchmark of 5,000 multi-turn health conversations graded against 48,562 rubric criteria written by 262 physicians across 60 countries. Unlike prior multiple-choice evaluations, HealthBench supports open-ended, realistic assessments of LLM responses across diverse health themes (e.g., global health, emergency care, context-seeking) and behavioral axes (accuracy, completeness, communication, context awareness, instruction following).
● Significant frontier model gains: HealthBench reveals rapid performance improvements, with GPT-3.5 Turbo scoring 16%, GPT-4o reaching 32%, and o3 achieving 60%. Notably, smaller models like GPT-4.1 nano outperform GPT-4o while being 25x cheaper.
● Two challenging benchmark variants: HealthBench Consensus focuses on 34 physician-validated criteria (e.g., recognizing emergencies), while HealthBench Hard isolates 1,000 difficult examples on which no model scores above 32%, establishing headroom for future progress.
● Physician comparison baseline: Surprisingly, LLMs like o3 and GPT-4.1 often produce higher-quality responses than unassisted physicians. When provided with model responses as references, physicians improved older model completions but couldn’t improve completions from newer models.
● Reliable model-based grading: Meta-evaluation shows GPT-4.1 as a grader achieves macro F1 scores comparable to physicians. On average, its agreement with other doctors places it in the 51st–88th percentile across themes like emergency triage, communication, and uncertainty handling.
● Safety-relevant insights: The benchmark assesses worst-case performance using "worst-at-k" scores, showing that even the best models have reliability gaps. For example, o3’s worst-at-16 score drops by a third from its average, underscoring the need for further safety work. | [Paper](https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf), [Tweet](https://x.com/OpenAI/status/1921983050138718531) | | 6) Nemotron-Research-Tool-N1 Introduces Tool-N1, a family of tool-using LLMs trained using a rule-based reinforcement learning (R1-style RL) approach, without reliance on supervised reasoning trajectories. The key idea is to enable models to learn to invoke external tools correctly through binary feedback based on functional correctness and format adherence, rather than step-by-step imitation.
● Rule-based RL over SFT: Tool-N1 models are trained using a lightweight binary reward that only evaluates whether the model's tool calls are structurally correct and functionally valid. This allows the model to develop its reasoning process, sidestepping the limitations of mimicking distilled trajectories via supervised fine-tuning (SFT).
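A minimal sketch of this kind of rule-based binary reward, assuming the model emits a JSON tool call and a ground-truth call is available for comparison (the field names are illustrative, not the paper's exact schema):

```python
import json

def tool_call_reward(model_output: str, expected: dict) -> float:
    """Return 1.0 only if the output parses as JSON and the tool name and arguments
    exactly match the ground truth; otherwise 0.0 (no partial credit)."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0  # format check failed
    if call.get("name") != expected["name"]:
        return 0.0
    return 1.0 if call.get("arguments") == expected["arguments"] else 0.0

reward = tool_call_reward(
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',
    {"name": "get_weather", "arguments": {"city": "Paris"}},
)
```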
● Strong benchmark results: Tool-N1-7B and Tool-N1-14B outperform GPT-4o and domain-specialized models on several benchmarks, including BFCL, API-Bank, and ACEBench. For example, Tool-N1-14B beats GPT-4o on BFCL overall (85.97 vs 83.97) and achieves +5% over GPT-4o on API-Bank.
● Pure RL outperforms SFT-then-RL: A systematic comparison on 5,518 distilled trajectories shows that pure RL yields better results than the SFT-then-RL pipeline, challenging the dominant paradigm. For instance, 100% RL achieves 83.24% average vs. 83.17% for SFT+RL.
● Binary vs. fine-grained reward: Ablation studies reveal that strict binary rewards (requiring correct reasoning format and exact tool call) lead to better generalization than partial credit schemes, especially on realistic “Live” data (80.38% vs 76.61%).
● Scaling and generalization: Performance scales well with model size, with the most gains observed in larger models. The method generalizes across backbones, with Qwen2.5-Instruct outperforming LLaMA3 variants at the same scale. | [Paper](https://arxiv.org/abs/2505.00024), [Tweet](https://x.com/ShaokunZhang1/status/1922105694167433501) | | 7) RL for Search-Efficient LLMs Proposes a new RL-based framework (SEM) that explicitly teaches LLMs when to invoke search and when to rely on internal knowledge, aiming to reduce redundant tool use while maintaining answer accuracy. Key points:
● Motivation & Setup: LLMs often overuse external search even for trivial queries. SEM addresses this by using a balanced training dataset (Musique for unknowns, MMLU for knowns) and a structured output format that separates reasoning, searching, and answering.
● Reward Optimization: The authors employ Group Relative Policy Optimization (GRPO) to compare outputs within query groups. The reward function penalizes unnecessary search and rewards correct answers, either without search or with efficient search-and-reasoning when needed.
● Experimental Results: On HotpotQA and MuSiQue, SEM significantly outperforms Naive RAG and ReSearch, achieving higher EM and LLM-Judged (LJ) accuracy with smarter search ratios. On MMLU and GSM8K (where search is often unnecessary), SEM maintains high accuracy while invoking search far less than baseline methods (e.g., 1.77% SR vs 47.98% for Naive RAG on MMLU).
● Case Study & Efficiency: SEM avoids absurd search behavior like querying “What is 1+1?” multiple times. It also uses fewer but more targeted searches for unknowns, enhancing both interpretability and computational efficiency. Training dynamics further show that SEM enables faster and more stable learning than prior methods. | [Paper](https://arxiv.org/abs/2505.07903), [Tweet](https://x.com/omarsar0/status/1922665313117552664) | | 8) Cost-Efficient, Low-Latency Vector Search Integrates DiskANN (a vector indexing library) inside Azure Cosmos DB NoSQL (an operational database), using a single vector index per partition stored in existing index trees. Benefit: It supports < 20ms query latency over an index spanning 10 million vectors, has stable recall over updates, and offers nearly 15× and 41× lower query cost compared to Zilliz and Pinecone serverless enterprise products. It can further scale to billions of vectors with automatic partitioning. | [Paper](https://arxiv.org/abs/2505.05885), [Tweet](https://x.com/omarsar0/status/1921938925142384736) | | 9) AI Agents vs. Agentic AI This review paper distinguishes AI Agents from Agentic AI, presenting a structured taxonomy and comparing their architectures, capabilities, and challenges. AI Agents are defined as modular, task-specific systems powered by LLMs and tools, while Agentic AI represents a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy, with applications and challenges mapped out for both paradigms, along with proposed solutions like RAG, orchestration layers, and causal modeling. | [Paper](https://arxiv.org/abs/2505.10468), [Tweet](https://x.com/omarsar0/status/1923817691455873420) | | 10) CellVerse Introduces a benchmark to evaluate LLMs on single-cell biology tasks by converting multi-omics data into natural language. While generalist LLMs like DeepSeek and GPT-4 families show some reasoning ability, none significantly outperform random guessing on key tasks like drug response prediction, exposing major gaps in biological understanding by current LLMs. | [Paper](https://arxiv.org/abs/2505.07865), [Tweet](https://x.com/omarsar0/status/1922662317986099522) | ## Top ML Papers of the Week (May 5 - May 11) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The Leaderboard Illusion The Leaderboard Illusion investigates systemic distortions in how the Chatbot Arena leaderboard evaluates LLMs, arguing that current practices undermine fair model comparison and scientific progress. Through extensive data analysis covering 2M Arena battles, the authors identify four key issues distorting rankings:
● Selective score reporting through private testing: Some providers (notably Meta, Google, and OpenAI) are allowed to test dozens of model variants privately and only publish the best-performing one. This violates the unbiased sampling assumption of the Bradley-Terry (BT) model, which powers Arena rankings. Simulations show that testing just 10 variants can artificially inflate a model’s Arena score by ~100 points.
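The inflation effect from private variant testing can be illustrated with a tiny Monte Carlo sketch; the numbers are purely illustrative and stand in for the paper's Bradley-Terry-based simulations:

```python
import random

def best_of_n_score(true_score: float, n_variants: int, noise_sd: float = 40.0) -> float:
    """Observed leaderboard score when a provider privately tests n variants whose
    measured scores fluctuate around the same true ability, then publishes the max."""
    return max(random.gauss(true_score, noise_sd) for _ in range(n_variants))

random.seed(0)
trials = 10_000
honest = sum(best_of_n_score(1200, 1) for _ in range(trials)) / trials
selected = sum(best_of_n_score(1200, 10) for _ in range(trials)) / trials
print(f"expected published score: 1 variant ~{honest:.0f}, 10 variants ~{selected:.0f}")
```

Selecting the best of several noisy measurements systematically overstates the underlying ability, which is the bias the paper quantifies.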
● Extreme data asymmetries: Proprietary models are oversampled compared to open-weight and open-source models. OpenAI and Google alone received over 39% of all Arena data, while 83 open-weight models collectively received only 29.7%. These data advantages translate into significant performance gains: a model trained on 70% Arena data outperforms its baseline by 112% on the ArenaHard benchmark.
● Unfair and opaque deprecations: 205 models were silently removed from the leaderboard despite only 47 being officially marked as deprecated. Open-source models are disproportionately affected, breaking the comparison graph and violating BT model assumptions, leading to unreliable rankings.
● Overfitting to Arena-specific dynamics: Due to partial prompt repetition and distributional drift over time, access to Arena data allows providers to tune models specifically for Arena performance. This leads to high win rates on Arena benchmarks, but not on out-of-distribution tasks like MMLU, where gains diminish or reverse. | [Paper](https://arxiv.org/abs/2504.20879) | | 2) Llama-Nemotron NVIDIA introduces the Llama-Nemotron model series, LN-Nano (8B), LN-Super (49B), and LN-Ultra (253B), a family of open, efficient, and high-performing reasoning models. These models rival or outperform DeepSeek-R1 on various benchmarks while offering significantly better inference throughput and memory efficiency. LN-Ultra is noted as the most "intelligent" open model by Artificial Analysis. A key innovation is a dynamic reasoning toggle ("detailed thinking on/off") that allows users to control reasoning behavior at inference time. Highlights:
● Multi-stage training: Models were built via neural architecture search (Puzzle), knowledge distillation, continued pretraining, supervised fine-tuning (SFT), and large-scale RL. LN-Ultra is enhanced with FP8 inference and FFN Fusion for speed and scalability.
● Reasoning Toggle: The models can switch between reasoning and non-reasoning modes via a simple prompt instruction, making them adaptable for various use cases.
● Synthetic dataset: Over 33M examples across math, code, science, and instruction-following were curated, with reasoning-mode samples tagged explicitly. LN-Ultra's training used curriculum RL and GRPO to surpass its teachers on benchmarks like GPQA-D.
● Evaluation dominance: LN-Ultra outperforms DeepSeek-R1 and Llama-3.1-405B in reasoning tasks like AIME25, MATH500, and GPQA-Diamond while also achieving strong chat alignment scores (Arena-Hard: 87.0). LN-Super scores 88.3, beating Claude 3.5 and GPT-4o. NVIDIA provides the weights, training code (NeMo, Megatron-LM, NeMo-Aligner), and the full post-training dataset under a permissive license, aiming to push open research in reasoning models. | [Paper](https://arxiv.org/abs/2505.00949v1), [Models](https://huggingface.co/nvidia) | | 3) Absolute Zero Introduces an LLM training framework that eliminates the need for human-curated data. Key highlights:
● It learns to propose and solve its own reasoning tasks entirely through self-play, guided by verifiable feedback from an execution environment. This zero-data RLVR (RL with Verifiable Rewards) setting achieves SOTA coding and math reasoning performance.
● AZR learns by generating its own code-based reasoning tasks using three core reasoning modes (deduction, abduction, and induction), validating solutions via Python execution, not human labels.
● A single LLM plays both roles, proposing new tasks based on learnability and solving them with feedback-based reinforcement. Rewards favor moderately difficult tasks to maximize the learning signal.
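A highly simplified sketch of the propose-and-solve loop with execution-based verification, showing only the deduction mode (predict a program's output); `llm` is a hypothetical stand-in for the single model playing both roles:

```python
def run_program(code: str, env: dict) -> dict:
    exec(code, env)  # the execution environment acts as the verifier (trusted code only)
    return env

def self_play_step(llm) -> float:
    # Proposer role: the model writes a small program plus an example input.
    task_code = llm("Propose a short Python function f and an example input x.")
    env = run_program(task_code, {})
    ground_truth = env["f"](env["x"])          # verified answer obtained by execution
    # Solver role: the same model predicts the output without running the code.
    prediction = llm(f"Given this code:\n{task_code}\nWhat is f(x)? Answer only the value.")
    return 1.0 if str(prediction) == str(ground_truth) else 0.0

# Example with a trivial canned "LLM".
canned = iter(["def f(y):\n    return y * 2\nx = 21", "42"])
print(self_play_step(lambda prompt: next(canned)))  # reward 1.0
```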
● Despite using zero in-domain examples, AZR outperforms all previous zero-setting models on average by +1.8 points and even beats models trained on tens to hundreds of thousands of curated samples. AZR-Coder-7B achieves the highest average score across all tested models.
● AZR trained in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, far more than expert code models trained with RLVR, showing strong generalization.
● Larger AZR models (3B → 7B → 14B) consistently show greater improvements, confirming scalability and suggesting promise for even larger models.
● AZR develops natural ReAct-like intermediate planning in code (e.g., interleaved comments and logic), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically observed in much larger models.
● Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains (dubbed “uh-oh moments”), highlighting the importance of safety-aware training in autonomous systems. | [Paper](https://arxiv.org/abs/2505.03335), [Tweet](https://x.com/AndrewZ45732491/status/1919920459748909288) | | 4) Discuss-RAG This paper introduces Discuss-RAG, a plug-and-play agent-based framework that enhances retrieval-augmented generation (RAG) for medical question answering by mimicking human-like clinical reasoning. Standard RAG systems rely on embedding-based retrieval and lack mechanisms to verify relevance or logical coherence, often leading to hallucinations or outdated answers. Discuss-RAG addresses these gaps via a modular agent setup that simulates multi-turn medical discussions and performs post-retrieval verification. Key ideas:
● Multi-agent collaboration: A summarizer agent orchestrates a team of medical domain experts who iteratively refine a contextual summary through simulated brainstorming, providing deeper and more structured information to guide retrieval.
● Decision-making agent: After retrieval, a verifier and a decision-making agent assess snippet quality and trigger fallback strategies when relevance is low, improving answer accuracy and contextual grounding.
● Plug-and-play design: Discuss-RAG is training-free and modular, allowing easy integration into existing RAG pipelines.
● Strong performance gains: Across four benchmarks, Discuss-RAG outperforms MedRAG with substantial accuracy improvements, notably +16.67% on BioASQ and +12.20% on PubMedQA. | [Paper](https://arxiv.org/abs/2504.21252) | | 5) The Value of RL in Fine-Tuning This work shows that, in theory, every popular preference-fine-tuning objective collapses to maximum-likelihood estimation (MLE), yet experiments show a consistent RL advantage on real tasks. They reconcile this gap with a generation-verification complexity hypothesis.
● Theory: RLHF ≈ MLE – Under mild assumptions, trajectory-level RLHF, DPO, and related algorithms are equivalent to projecting the data back to likelihood space, so expending compute on on-policy sampling should be unnecessary.
● Empirics contradict naïve theory – On the tl;dr summarization benchmark with Pythia-1.4B/2.8B, a single online-DPO iteration lifts win-rate by 6-10 pts over offline DPO despite identical data, model, and optimizer, confirming that RL can add real value.
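As a reference point for this comparison, a minimal sketch of the per-pair (offline) DPO objective; online DPO re-draws the preference pairs from the current policy each iteration. The log-probabilities below are placeholders rather than outputs of a real model:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Placeholder log-probabilities for one (chosen, rejected) summary pair.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1))
```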
● Takeaways – RL helps when crafting a good answer is harder than checking one. The gap vanishes on two-word summaries (horizon = 1) or when ROUGE-L is used as the reward. RL acts as a shortcut through policy space only when the reward model is simpler than the policy it trains. For tasks where verification is as hard as generation, offline likelihood-based fine-tuning suffices, guiding practitioners on when RLHF is worth its extra cost. | [Paper](https://arxiv.org/abs/2503.01067) | | 6) WebThinker This paper introduces a reasoning agent framework that equips large reasoning models (LRMs) with autonomous web exploration and report writing abilities to overcome limitations of static internal knowledge. WebThinker integrates a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy that lets models search the web, reason through tasks, and generate comprehensive outputs simultaneously. It also incorporates an RL-based training loop using online DPO to improve tool usage. The system supports two modes: complex problem solving and scientific report generation. Key points:
● Superior performance in complex reasoning: On GPQA, GAIA, WebWalkerQA, and HLE, WebThinker-32B-RL achieved new state-of-the-art results among 32B models, outperforming both retrieval-augmented and proprietary systems like GPT-4o and DeepSeek-R1-671B. For example, it reached 70.7% on GPQA and 15.8% on HLE, with gains of up to +21.5% over baselines.
● Best-in-class scientific report writing: On the Glaive dataset, WebThinker outperformed Gemini2.0 Deep Research and Grok3 DeeperSearch, scoring 8.1 in average quality metrics such as completeness and coherence.
● RL refinement matters: The RL-trained version outperformed its base counterpart across all benchmarks, showing that iterative preference-based learning significantly enhances reasoning-tool coordination.
● Ablation validates design: Removing components like Deep Web Explorer or automatic report drafting significantly degraded performance, confirming their necessity. | [Paper](https://arxiv.org/abs/2504.21776) | | 7) Reward Modeling as Reasoning This work proposes a new class of reward models, called ReasRMs, that reformulate reward modeling as a reasoning task. The authors introduce RM-R1, a family of generative reward models that produce interpretable reasoning traces and rubrics during preference judgments. Instead of relying on scalar scores or shallow generation, RM-R1 models leverage structured reasoning and reinforcement learning to improve both interpretability and performance across benchmarks.
● RM-R1 adopts a two-stage training process: (1) distillation of reasoning traces from stronger models, and (2) reinforcement learning with verifiable rewards. The Chain-of-Rubrics (CoR) prompting framework guides the model to either solve reasoning problems or generate evaluation rubrics depending on the task type (reasoning or chat).
● On RewardBench, RM-Bench, and RMB, RM-R1 models achieve state-of-the-art or near-SOTA performance, outperforming models like GPT-4o and Llama3.1-405B by up to 13.8% despite using fewer parameters and less data.
● Ablation studies show that cold-start RL alone is insufficient; task-type classification and high-quality distillation are key. RM-R1's distilled warm-start training leads to more stable learning and longer, more accurate reasoning traces.
● RM-R1 also shows strong generalization across domains and better rubric quality than baseline methods, especially in sensitive contexts like safety and medical judgment. The authors open-sourced six RM-R1 models, training data, and code to support reproducibility. | [Paper](https://arxiv.org/abs/2505.02387) | | 8) Paper2Code Introduces PaperCoder, a multi-agent LLM framework that transforms ML papers into full code repositories without relying on pre-existing implementations.
● PaperCoder decomposes the code generation process into three stages: Planning (roadmap, architecture, file dependencies, config files), Analyzing (file-specific logic extraction), and Coding (dependency-aware file generation). Each step is handled by specialized LLM agents.
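A schematic sketch of the three-stage decomposition, with `call_agent` standing in for the specialized LLM agents and an assumed three-file plan (prompts and file names are illustrative):

```python
def call_agent(role: str, prompt: str) -> str:
    """Stand-in for a specialized LLM agent call."""
    return f"[{role} output for: {prompt[:40]}...]"

def paper_to_repo(paper_text: str) -> dict[str, str]:
    # Stage 1: Planning -- roadmap, architecture, file list, config.
    plan = call_agent("planner", f"Draft a repo plan (files, dependencies, config) for:\n{paper_text}")
    # Stage 2: Analyzing -- file-specific implementation notes extracted from the paper.
    analyses = {f: call_agent("analyst", f"Describe the logic {f} must implement.\nPlan:\n{plan}")
                for f in ["model.py", "train.py", "config.yaml"]}  # file list would come from the plan
    # Stage 3: Coding -- generate files in dependency order, conditioning on earlier files.
    repo: dict[str, str] = {}
    for fname, analysis in analyses.items():
        repo[fname] = call_agent("coder", f"Write {fname}.\nAnalysis:\n{analysis}\nExisting files:\n{list(repo)}")
    return repo

repo = paper_to_repo("(paper text here)")
```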
● It is evaluated using both the proposed Paper2Code benchmark (90 papers from ICML, NeurIPS, and ICLR 2024) and PaperBench Code-Dev. Results show PaperCoder outperforms ChatDev, MetaGPT, and naive baselines across reference-based, reference-free, and human evaluations.
● In human assessments by original paper authors, 77% chose PaperCoder as best implementation; 85% said it helped them reproduce their work. On average, only 0.48% of code lines required changes for executability.
● A detailed ablation study shows consistent performance gains from each stage, especially logic design and file dependency ordering. PaperCoder, using the o3-mini-high backbone, notably outperforms other LLM variants. | [Paper](https://arxiv.org/abs/2504.17192) | | 9) ZeroSearch ZeroSearch is an RL framework that trains LLMs to develop search capabilities without using real search engines. It uses simulated LLM-generated documents with a curriculum-based degradation strategy and outperforms real-search methods like Search-R1 in both performance and cost, achieving better QA accuracy across multiple benchmarks. | [Paper](https://arxiv.org/abs/2505.04588), [Tweet](https://x.com/omarsar0/status/1920469148968362407) | | 10) Practical Efficiency of Muon for Pretraining Discusses how Muon, a simple second-order optimizer, outperforms AdamW in large-batch pretraining by expanding the compute-time Pareto frontier and maintaining better data efficiency. Combined with muP scaling and a novel telescoping algorithm for hyperparameter transfer, it enables faster training with minimal tuning overhead up to 4B parameter models. | [Paper](https://arxiv.org/abs/2505.02222) | ## Top ML Papers of the Week (April 28 - May 4) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Phi-4-Mini-Reasoning Microsoft released Phi-4-Mini-Reasoning to explore small reasoning language models for math. Highlights:
● Phi-4-Mini-Reasoning: The paper introduces Phi-4-Mini-Reasoning, a 3.8B parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly twice its size.
● Unlocking Reasoning: They use a systematic, multi-stage training pipeline to unlock strong reasoning capabilities in compact models, addressing the challenges posed by their limited capacity. It combines large-scale distillation, preference learning, and RL with verifiable rewards.
● Four-Stage Training Pipeline: The model is trained using (1) mid-training with large-scale long CoT data, (2) supervised fine-tuning on high-quality CoT data, (3) rollout-based Direct Preference Optimization (DPO), and (4) RL using verifiable reward signals.
● Math Performance: On MATH-500, Phi-4-Mini-Reasoning reaches 94.6%, surpassing DeepSeek-R1-Distill-Qwen-7B (91.4%) and DeepSeek-R1-Distill-Llama-8B (86.9%), despite being smaller.
● Verifiable Reward Reinforcement Learning: The final RL stage, tailored for small models, includes prompt filtering, oversampling for balanced training signals, and temperature annealing. This improves training stability and aligns exploration with evaluation conditions.
● Massive Synthetic Data Generation: The model is mid-trained on 10M CoT rollouts generated by DeepSeek-R1, filtered for correctness using math verifiers and GPT-4o-mini, and categorized by domain and difficulty to ensure broad generalization.
● Ablation Study: Each phase of the pipeline shows clear gains. Notably, fine-tuning and RL each deliver ~5–7 point improvements after mid-training and DPO, showing the value of the full pipeline over isolated techniques. | [Paper](https://arxiv.org/abs/2504.21233), [Tweet](https://x.com/omarsar0/status/1917954418173247909) | | 2) Building Production-Ready AI Agents with Scalable Long-Term Memory This paper proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, solving the fixed-context window limitation. Main highlights:
● The solution introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships. Both aim to extract, consolidate, and retrieve salient facts over time efficiently.
● Mem0: Uses a two-stage architecture (extraction & update) to maintain salient conversational memories. It detects redundant or conflicting information and manages updates using tool-calls, resulting in a lightweight, highly responsive memory store (7K tokens per conversation).
● Mem0g: By structuring memory as a knowledge graph of entities and relationships, Mem0g improves performance in tasks needing temporal and relational reasoning (e.g., event ordering, preference tracking) while maintaining reasonable latency and memory cost (14K tokens/convo).
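A toy sketch of the extract-then-update pattern behind Mem0: an extraction step pulls candidate facts from recent turns, and an update step decides whether to add or replace memories. The keyword-based conflict check is purely illustrative; the actual system delegates extraction and updates to LLM tool-calls:

```python
def extract_facts(turns: list[str]) -> list[str]:
    """Stand-in extraction: in Mem0 this is an LLM call over the recent turns."""
    return [t.removeprefix("FACT: ") for t in turns if t.startswith("FACT: ")]

def update_memory(memory: list[str], new_facts: list[str]) -> list[str]:
    for fact in new_facts:
        topic = fact.split()[0].lower()
        # Replace an existing memory on the same topic instead of duplicating it.
        conflicts = [m for m in memory if m.split()[0].lower() == topic]
        memory = [m for m in memory if m not in conflicts] + [fact]
    return memory

memory: list[str] = ["Diet: vegetarian"]
memory = update_memory(memory, extract_facts(["FACT: Diet: pescatarian as of May"]))
print(memory)  # the conflicting entry is consolidated rather than appended
```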
● Benchmarking on LOCOMO: Both systems were evaluated against six memory system baselines (e.g., A-Mem, OpenAI, Zep, LangMem, RAG). Mem0g achieves the best overall LLM-as-a-Judge (J) score of 68.44%, outperforming all RAG and memory baselines by 7–28% in J and reducing p95 latency by 91% over full-context methods.
● Latency and efficiency: Mem0 achieves the lowest search and total latencies (p95 = 1.44s), and Mem0g still outperforms other graph-based or RAG systems by large margins in speed and efficiency. Great for real-time deployments.
● Use-case strengths: Mem0 and Mem0g offer a scalable memory architecture for long-term LLM agents to improve factual recall, reasoning depth, and efficiency, making them well suited for production deployments. | [Paper](https://arxiv.org/abs/2504.19413), [Tweet](https://x.com/omarsar0/status/1917247776221700134) | | 3) UniversalRAG UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to single modalities or corpora. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video). Contributions from the paper:
● Modality-aware routing: To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query.
● Granularity-aware retrieval: Each modality is broken into granularity levels (e.g., paragraphs vs. documents for text, clips vs. full-length videos). This allows queries to retrieve content that matches their complexity -- factual queries use short segments while complex reasoning accesses long-form data.
● Flexible routing: It supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers. Trained routers perform better on in-domain data, while GPT-4o generalizes better to out-of-domain tasks. An ensemble router combines both for robust performance.
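A skeletal version of modality- and granularity-aware routing, with the router replaced by a keyword heuristic purely for illustration (the paper uses a zero-shot GPT-4o or a trained T5-Large router, and the corpus names below are invented):

```python
CORPORA = {
    ("text", "paragraph"): "text_paragraph_index",
    ("text", "document"): "text_document_index",
    ("image", "image"): "image_index",
    ("video", "clip"): "video_clip_index",
    ("video", "full"): "video_full_index",
}

def route(query: str) -> str:
    """Pick a (modality, granularity) pair for a query, then return the corpus to search."""
    q = query.lower()
    if any(w in q for w in ("photo", "image", "look like")):
        key = ("image", "image")
    elif any(w in q for w in ("video", "scene", "clip")):
        # Short factual video queries go to clips; "entire"/"whole" suggests full videos.
        key = ("video", "full") if ("entire" in q or "whole" in q) else ("video", "clip")
    else:
        # Multi-hop or comparative questions tend to need long-form documents.
        key = ("text", "document") if ("why" in q or "compare" in q) else ("text", "paragraph")
    return CORPORA[key]

print(route("What does the Eiffel Tower look like at night?"))  # image_index
```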
● Performance: UniversalRAG outperforms modality-specific and unified RAG baselines across 8 benchmarks spanning text (e.g., MMLU, SQuAD), image (WebQA), and video (LVBench, VideoRAG). With T5-Large, it achieves the highest average score across modalities.
● Case study: In WebQA, UniversalRAG correctly routes a visual query to the image corpus (retrieving an actual photo of the event), while TextRAG and VideoRAG fail. Similarly, on HotpotQA and LVBench, it chooses the right granularity, retrieving documents or short clips. Overall, this is a great paper showing the importance of considering modality and granularity in a RAG system. | [Paper](https://arxiv.org/abs/2504.20734), [Tweet](https://x.com/omarsar0/status/1917637837295608180) | | 4) DeepSeek-Prover-V2 DeepSeek-Prover-V2 is an LLM (671B) that significantly advances formal theorem proving in Lean 4. The model is built through a novel cold-start training pipeline that combines informal chain-of-thought reasoning with formal subgoal decomposition, enhanced through reinforcement learning. It surpasses prior state-of-the-art on multiple theorem-proving benchmarks. Key highlights:
● Cold-start data via recursive decomposition: The authors prompt DeepSeek-V3 to generate natural-language proof sketches, decompose them into subgoals, and formalize these steps in Lean with sorry placeholders. A 7B prover model then recursively fills in the subgoal proofs, enabling efficient construction of complete formal proofs and training data.
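To make the `sorry`-placeholder idea concrete, here is a toy Lean 4 shape (not taken from the paper): the proof sketch states a subgoal with `sorry`, which the smaller prover model would later replace with a complete proof:

```lean
-- Toy illustration of subgoal decomposition with a placeholder proof.
theorem toy_example (a b : Nat) : a + b = b + a := by
  -- The sketch generator states the subgoal; `sorry` marks it as unproved,
  -- to be filled in recursively by the smaller prover model.
  have h : a + b = b + a := by sorry
  exact h
```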
● Curriculum learning + RL: A subgoal-based curriculum trains the model on increasingly complex problems. Reinforcement learning with a consistency reward is used to enforce alignment between proof structure and CoT decomposition, improving performance on complex tasks.
● Dual proof generation modes: The model is trained in two modes, non-CoT (efficient, minimal proofs) and CoT (high-precision, interpretable). The CoT mode yields significantly better performance, particularly on hard problems.
● Benchmark results: sets new state-of-the-art results on formal theorem-proving benchmarks, including MiniF2F-test and PutnamBench. | [Paper](https://arxiv.org/abs/2504.21801), [Tweet](https://x.com/zhs05232838/status/1917600755936018715) | | 5) Kimi-Audio Kimi-Audio is a new open-source audio foundation model built for universal audio understanding, generation, and speech conversation. The model architecture uses a hybrid of discrete semantic audio tokens and continuous Whisper-derived acoustic features. It is initialized from a pre-trained LLM and trained on 13M+ hours of audio, spanning speech, sound, and music. It also supports a streaming detokenizer with chunk-wise decoding and a novel look-ahead mechanism for smoother audio generation. Extensive benchmarking shows that Kimi-Audio outperforms other audio LLMs across multiple modalities and tasks. Key highlights:
● Architecture: Kimi-Audio uses a 12.5Hz semantic tokenizer and an LLM with dual heads (text + audio), processing hybrid input (discrete + continuous). The audio detokenizer employs a flow-matching upsampler with BigVGAN vocoder for real-time speech synthesis.
● Massive Training Corpus: Pretrained on 13M+ hours of multilingual, multimodal audio. A rigorous preprocessing pipeline adds speech enhancement, diarization, and transcription using Whisper and Paraformer-Zh. Fine-tuning uses 300K+ hours from 30+ open datasets.
● Multitask Training: Training spans audio-only, text-only, ASR, TTS, and three audio-text interleaving strategies. Fine-tuning is instruction-based, with both audio/text instructions injected via zero-shot TTS.
● Evaluation: On ASR (e.g., LibriSpeech test-clean: 1.28 WER), audio understanding (CochlScene: 80.99), and audio-to-text chat (OpenAudioBench avg: 69.8), Kimi-Audio sets new SOTA results, beating Qwen2.5-Omni and Baichuan-Audio across the board. | [Paper](https://github.com/MoonshotAI/Kimi-Audio/blob/master/assets/kimia_report.pdf), [Tweet](https://x.com/Kimi_Moonshot/status/1915807071960007115) [Model](https://github.com/MoonshotAI/Kimi-Audio) | | 6) MiMo-7B Xiaomi releases MiMo-7B, a new language model for reasoning tasks. MiMo-7B is explicitly designed for advanced reasoning across math and code. Highlights:
● MiMo-7B: MiMo-7B narrows the capability gap with larger 32B-class models through careful pretraining & posttraining. MiMo-7B-Base is trained from scratch on 25T tokens, with a 3-stage mixture skewed toward mathematics and code (70% in stage 2).
● Pre-Training: The team improves HTML and PDF extraction to better preserve STEM data, leverages LLMs to generate diverse synthetic reasoning content, and adds a Multi-Token Prediction (MTP) objective that boosts both quality and inference speed.
● Base Performance: MiMo-7B-Base outperforms other 7B–9B models like Qwen2.5, Gemma-2, and Llama-3.1 across BBH (+5 pts), AIME24 (+22.8 pts), and LiveCodeBench (+27.9 pts). On BBH and LiveCodeBench, it even beats larger models on reasoning-heavy tasks.
● RL: MiMo-7B-RL is trained with a test difficulty–driven reward function and easy-data resampling to tackle sparse-reward issues and instabilities. In some cases, it surpasses o1-mini on math & code. RL from the SFT model reaches higher ceilings than RL-Zero from the base.
● Efficient infrastructure: A Seamless Rollout Engine accelerates RL by 2.29× and validation by 1.96× using continuous rollout, async reward computation, and early termination. MTP layers enable fast speculative decoding, with 90%+ acceptance rates in inference. | [Paper](https://github.com/XiaomiMiMo/MiMo/blob/main/MiMo-7B-Technical-Report.pdf), [Tweet](https://x.com/omarsar0/status/1917582720341008814) | | 7) Advances and Challenges in Foundation Agents A new survey frames intelligent agents with a modular, brain-inspired architecture that integrates ideas from cognitive science, neuroscience, and computational research. Key topics covered:
● Human Brain and LLM Agents: Helps to better understand what differentiates LLM agents from human/brain cognition, and what inspirations we can get from the way humans learn and operate.
● Definitions: Provides a nice, detailed, and formal definition of what makes up an AI agent.
● Reasoning: It has a detailed section on the core components of intelligent agents. There is a deep dive into reasoning, which is one of the key development areas of AI agents and what unlocks things like planning, multi-turn tooling, backtracking, and much more.
● Memory: Agent memory is a challenging area of building agentic systems, but there is already a lot of good literature out there from which to get inspiration.
● Action Systems: You can already build very complex agentic systems today, but the next frontier is agents that take actions and make decisions in the real world. We need better tooling, better training algorithms, and robust operation in different action spaces.
● Self-Evolving Agents: For now, building effective agentic systems requires human effort and careful optimization tricks. However, one of the bigger opportunities in the field is to build AI that can itself build powerful and self-improving AI systems. | [Paper](https://arxiv.org/abs/2504.01990), [Tweet](https://x.com/omarsar0/status/1916542394746421333) | | 8) MAGI MAGI is a multi-agent system designed to automate structured psychiatric interviews by operationalizing the MINI (Mini International Neuropsychiatric Interview) protocol. It involves 4 specialized agents: navigation, question generation, judgment, and diagnosis. Other highlights:
● Multi-Agent Clinical Workflow: MAGI is built with a navigation agent (interview flow control), a question agent (dynamic, empathetic probing), a judgment agent (response validation), and a diagnosis agent using Psychometric CoT to trace diagnoses explicitly to MINI/DSM-5 criteria.
● Explainable Reasoning (PsyCoT): Instead of treating diagnoses as opaque outputs, PsyCoT decomposes psychiatric reasoning into symptom anchoring, syndromal validation, and evidence binding. This helps with auditability for each diagnostic conclusion. CoT put to great use.
● Results: Evaluated on 1,002 real-world interviews, MAGI outperforms baselines (Direct prompting, Role-play, Knowledge-enhanced, and MINI-simulated LLMs) across relevance, accuracy, completeness, and guidance.
● Strong Clinical Agreement: Diagnostic evaluations show PsyCoT consistently improves F1 scores, accuracy, and Cohen’s κ across disorders like depression, generalized anxiety, social anxiety, and suicide risk, reaching clinical-grade reliability (κ ≥ 0.8) in high-risk tasks. | [Paper](https://arxiv.org/abs/2504.18260), [Tweet](https://x.com/omarsar0/status/1916862752410554423) | | 9) A Survey of Efficient LLM Inference Serving This survey reviews recent advancements in optimizing LLM inference, addressing memory and computational bottlenecks. It covers instance-level techniques (like model placement and request scheduling), cluster-level strategies (like GPU deployment and load balancing), and emerging scenario-specific solutions, concluding with future research directions. | [Paper](https://arxiv.org/abs/2504.19720) | | 10) LLM for Engineering This work finds that when RL is used, a 7B parameter model outperforms both SoTA foundation models and human experts at high-powered rocketry design. | [Paper](https://arxiv.org/abs/2504.19394) | ## Top ML Papers of the Week (April 21 - April 27) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Does RL Incentivize Reasoning in LLMs Beyond the Base Model? This paper revisits a key assumption in recent LLM development: that Reinforcement Learning with Verifiable Rewards (RLVR) helps models acquire genuinely new reasoning capabilities. By analyzing models across tasks (math, code, vision) using pass@k metrics (with large k), the authors find that RLVR improves sample efficiency but does not expand reasoning capacity beyond the base model.
● Key insight: RLVR-trained models do better at low *k* (e.g., pass@1), but as *k* increases (up to 256 or more), base models eventually match or outperform them. This suggests RLVR doesn’t generate fundamentally new reasoning paths but just increases the likelihood of sampling already-existing correct ones.
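For reference, the standard unbiased pass@k estimator typically used in this kind of analysis (n samples per problem, c correct); this is the familiar formula from code-generation evaluation rather than anything introduced by this paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given c of the n samples are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 samples per problem, 3 of them correct.
print(pass_at_k(n=256, c=3, k=1), pass_at_k(n=256, c=3, k=256))
```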
● Reasoning already in the base: RLVR models' successful CoTs are shown to be present within the base model's sampling distribution. Perplexity analyses confirm that RL outputs are often high-probability continuations for the base model.
● Efficiency vs. exploration: RLVR narrows the model’s exploration space, improving efficiency but shrinking its coverage of diverse reasoning paths, thereby reducing overall problem-solving reach at scale.
● Distillation helps more: Unlike RLVR, distillation from a stronger teacher model (e.g., DeepSeek-R1) introduces genuinely new reasoning patterns, expanding the model’s capabilities.
● Algorithmic limits: Across PPO, GRPO, Reinforce++, etc., RL algorithms offer similar sample-efficiency improvements, but none closes the gap to the base model’s pass@256—highlighting the limits of current RL strategies. | [Paper](https://arxiv.org/abs/2504.13837), [Tweet](https://x.com/DaveShapi/status/1915408405201629684) | | 2) BitNet b1.58 2B4T This work introduces BitNet b1.58 2B4T, the first open-source, natively trained 1-bit LLM at the 2B parameter scale, achieving strong performance while being extremely efficient. The model uses a custom ternary quantization scheme (1.58 bits per weight), enabling dramatic reductions in memory (0.4 GB), energy (0.028J/token), and latency (29ms), while still competing with state-of-the-art full-precision models across diverse benchmarks.
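To make the pass@k comparison above concrete, here is a minimal sketch of the standard unbiased pass@k estimator (the combinatorial estimator popularized by the Codex paper); the sample counts are illustrative and not taken from the RLVR paper:

```python
# Sketch: unbiased pass@k estimator given n samples per problem, c of them correct.
# pass@k = 1 - C(n - c, k) / C(n, k); numerically stable product form below.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: a base model with only a few correct samples still reaches
# a high pass@k once k is large, which is the effect the paper measures.
print(pass_at_k(n=256, c=4, k=1))    # ~0.016
print(pass_at_k(n=256, c=4, k=256))  # 1.0
```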
● New Pareto frontier in efficiency-performance: Trained from scratch on 4T tokens, BitNet b1.58 2B4T outperforms or matches open full-precision models (e.g., Qwen2.5 1.5B, MiniCPM 2B) on tasks like ARC-Challenge, PIQA, WinoGrande, and GSM8K. It achieves a 54.19% average across 16 benchmarks, comparable to Qwen2.5-1.5B’s 55.23%, but with ~6.5× lower memory and 10× lower energy usage.
● Outperforms quantized baselines: Against INT4 post-training quantized Qwen2.5 models (GPTQ/AWQ), BitNet is both smaller and more accurate, showing the advantage of native 1-bit training over PTQ approaches.
● Architectural & training innovations: It replaces standard linear layers with BitLinear layers using absmean ternary quantization and 8-bit activations, combines RoPE embeddings, squared ReLU activation, and bias-free layers. Training includes cosine LR and weight decay schedules, plus supervised fine-tuning and Direct Preference Optimization (DPO) instead of full RLHF.
● Best-in-class among 1-bit LLMs: When compared to other 1-bit models like OLMo-Bitnet (1B) and post-quantized Falcon3/Llama3 (7B–8B), BitNet b1.58 2B4T is +10 pts stronger on average, establishing a new benchmark for ultra-efficient LLMs. The authors also release optimized CUDA kernels for GPU and a C++ inference library for CPU, enabling practical deployment of 1-bit LLMs on diverse hardware. BitNet b1.58 2B4T demonstrates that extreme quantization does not mean compromised capability, and it opens the door to the broader adoption of LLMs in resource-constrained environments. | [Paper](https://arxiv.org/abs/2504.12285) | | 3) UI-TARS UI-TARS introduces a powerful, end-to-end native GUI agent that operates purely from visual screenshots, performing human-like keyboard and mouse interactions across platforms. Unlike existing modular agent frameworks that rely on prompt engineering and external scripts, UI-TARS integrates perception, action, reasoning, and memory directly into its architecture, achieving strong generalization and adaptability in dynamic real-world settings. Key contributions:
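For a rough picture of the absmean ternary scheme mentioned in the BitNet entry above (weights mapped to {-1, 0, +1} with a per-tensor scale), here is a minimal sketch; it follows the published high-level recipe but is not the authors' implementation:

```python
# Sketch of absmean ternary (1.58-bit) weight quantization: scale by the mean
# absolute weight, round, and clip to {-1, 0, +1}.
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)      # per-tensor absmean scale
    w_q = (w / scale).round().clamp_(-1, 1)    # ternary values in {-1, 0, +1}
    return w_q, scale                          # dequantize as w_q * scale

w = torch.randn(4, 8)
w_q, scale = absmean_ternary(w)
print(w_q.unique())                      # typically tensor([-1., 0., 1.])
print((w - w_q * scale).abs().mean())    # average quantization error
```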
● Enhanced GUI Perception: UI-TARS is trained on a large-scale, richly annotated dataset of screenshots with metadata, enabling dense captioning, state transition understanding, and precise element description. It excels in perception benchmarks like VisualWebBench, scoring 82.8 and outperforming GPT-4o.
● Unified Action Modeling and Grounding: UI-TARS standardizes actions across platforms into a shared action space and learns from large-scale multi-step action traces. It surpasses baselines in grounding tasks with 38.1 on ScreenSpot Pro, the new SOTA.
● System-2 Reasoning via “Thoughts”: Inspired by ReAct-style frameworks, UI-TARS generates internal reasoning steps (thoughts) before actions. These thoughts reflect patterns like task decomposition, reflection, and long-term consistency, significantly improving performance in complex scenarios. For example, in OSWorld, UI-TARS-72B-DPO scores 24.6 with a 50-step budget, outperforming Claude.
● Iterative Self-Improvement with Reflective Learning: UI-TARS continuously refines itself through online trace collection and reflection tuning using error correction and post-error adaptation data. This allows it to recover from mistakes and adapt with minimal human oversight. Overall, UI-TARS marks a significant step forward in GUI automation, setting new benchmarks across more than 10 datasets and outperforming top commercial agents like GPT-4o and Claude. Its open-source release aims to drive further innovation in native agent development. | [Paper](https://arxiv.org/abs/2501.12326), [Blog](https://seed-tars.com/1.5/) | | 4) Describe Anything Introduces DAM, a model that generates fine-grained, region-specific captions in both images and videos. The authors address key limitations in prior vision-language models—namely, the inability to preserve local detail and the lack of suitable datasets and benchmarks for detailed localized captioning (DLC). Key contributions:
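The unified action space from the UI-TARS entry above can be pictured as a tiny typed schema; the names and fields below are hypothetical, purely for illustration:

```python
# Hypothetical sketch of a cross-platform unified action space: each platform's
# primitives are normalized into a shared set of typed actions.
from dataclasses import dataclass
from enum import Enum

class ActionType(Enum):
    CLICK = "click"
    TYPE = "type"
    SCROLL = "scroll"
    HOTKEY = "hotkey"
    FINISH = "finish"

@dataclass
class Action:
    type: ActionType
    x: float | None = None     # normalized screen coordinates in [0, 1]
    y: float | None = None
    text: str | None = None    # payload for TYPE / HOTKEY

# The same action object can then be executed by platform-specific backends
# (desktop, mobile, web) without changing the agent's output format.
print(Action(ActionType.CLICK, x=0.42, y=0.17))
```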
● DAM (Describe Anything Model) uses two main innovations to capture both fine regional detail and global scene context: a focal prompt that provides high-resolution encoding of user-specified regions, and a localized vision backbone that uses gated cross-attention to integrate context from the entire image. This enables DAM to generate multi-granular, accurate descriptions, especially for small or occluded regions.
● DLC-SDP (Semi-supervised Data Pipeline) tackles data scarcity by expanding segmentation datasets with VLM-generated detailed captions, followed by self-training on web images. This produces high-quality, diverse training data, enabling DAM to outperform API-only baselines like GPT-4o across several benchmarks.
● DLC-Bench is a reference-free benchmark that scores models on their ability to accurately include or exclude region-specific details using LLM judges. It provides a more reliable evaluation than traditional caption-matching metrics, which often penalize models for valid but unmatched details.
● Performance: DAM sets a new state-of-the-art on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning tasks in both images and videos. It outperforms GPT-4o, Claude 3.7, and other top VLMs in both zero-shot and in-domain evaluations, achieving up to 33.4% improvement over prior models on detailed image captioning and 19.8% on video captioning. | [Paper](https://arxiv.org/abs/2504.16072) | | 5) UXAgent Introduces a novel framework, UXAgent, for simulating large-scale usability testing using LLM-driven agents. The system empowers UX researchers to test and iterate web design and study protocols before engaging real users. This is achieved through the orchestration of simulated agents with diverse personas interacting in real web environments, providing both behavioral and reasoning data. Key highlights:
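A generic sketch of the gated cross-attention idea used by DAM's localized vision backbone (a standard PyTorch formulation, not the authors' code): region tokens attend to global-image tokens, and a zero-initialized gate controls how much global context gets mixed in.

```python
# Generic gated cross-attention block (sketch): queries come from the focal/region
# stream, keys/values from the global image stream; a tanh gate initialized at 0
# means training starts from the unmodified region features.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # gate starts closed
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_tokens, global_tokens):
        ctx, _ = self.attn(self.norm(region_tokens), global_tokens, global_tokens)
        return region_tokens + torch.tanh(self.gate) * ctx

x = torch.randn(2, 16, 256)   # region tokens
g = torch.randn(2, 64, 256)   # global image tokens
print(GatedCrossAttention(256)(x, g).shape)   # torch.Size([2, 16, 256])
```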
● LLM-Powered Simulation with Personas: UXAgent begins with a Persona Generator that can produce thousands of demographically diverse simulated users based on custom distributions. Each persona is fed into an LLM Agent that embodies user intent and interacts with the website via a Universal Browser Connector—a module capable of interpreting and manipulating real HTML structures.
● Dual-Loop Reasoning Architecture: At the heart of UXAgent is a dual-process agent architecture inspired by cognitive psychology: a Fast Loop for low-latency actions and a Slow Loop for deep reasoning. This design mimics System 1 and System 2 thinking and allows agents to act responsively while maintaining coherent high-level plans and reflections.
● Rich Memory Stream: All observations, actions, plans, reflections, and spontaneous thoughts (“wonders”) are stored in a Memory Stream. These memories are dynamically prioritized for retrieval using a weighted scoring system based on importance, recency, and relevance, tailored separately for fast and slow modules.
● Replay and Interview Interfaces: UX researchers can review simulated sessions via a Simulation Replay Interface and conduct natural language conversations with agents using an Agent Interview Interface. This supports qualitative analysis, such as asking agents about their decisions or presenting mockups for feedback.
● Empirical Evaluation: A case study involving 60 LLM agent simulations on a shopping platform (WebArena) showed that researchers were able to detect usability study flaws and gather early insights. A follow-up user study with five UX professionals found the system helpful for iterating study design, despite some concerns over realism and data noise. Particularly appreciated was the ability to converse with agents and gather qualitative insights that would be infeasible in traditional pilots.
● Future Implications: The authors position LLM agents not as replacements for real participants, but as early-stage collaborators in the design process, reducing the cost and risk of flawed studies. They also discuss extensions to multimodal settings, desktop or mobile interfaces, and broader agentic tasks such as digital twins or simulated A/B testing. | [Paper](https://arxiv.org/abs/2504.09407) | | 6) Test-Time Reinforcement Learning Test-Time Reinforcement Learning (TTRL) is a method that allows LLMs to improve themselves during inference without ground-truth labels. Instead of relying on labeled datasets, TTRL uses majority voting over multiple model generations to estimate pseudo-rewards, enabling reinforcement learning (RL) on unlabeled test data. The method integrates Test-Time Scaling (TTS) and Test-Time Training (TTT) strategies, letting models adapt dynamically to new and challenging inputs. Key highlights:
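The weighted memory retrieval in UXAgent's Memory Stream can be sketched as a simple scoring function; the weights, decay constant, and toy embeddings below are illustrative assumptions, not values from the paper:

```python
# Sketch: rank memories by a weighted sum of importance, recency, and relevance,
# in the spirit of UXAgent's Memory Stream (all weights and constants are made up).
import math
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def score_memory(mem, query_embedding, now, w_imp=1.0, w_rec=1.0, w_rel=1.0,
                 decay_hours=24.0):
    recency = math.exp(-(now - mem["timestamp"]) / (3600 * decay_hours))
    relevance = cosine(mem["embedding"], query_embedding)
    return w_imp * mem["importance"] + w_rec * recency + w_rel * relevance

memories = [
    {"text": "clicked size filter", "importance": 0.3,
     "timestamp": time.time() - 60, "embedding": [0.1, 0.9]},
    {"text": "plans to buy running shoes", "importance": 0.9,
     "timestamp": time.time() - 3600, "embedding": [0.8, 0.2]},
]
query = [0.7, 0.3]
ranked = sorted(memories, key=lambda m: score_memory(m, query, time.time()),
                reverse=True)
print([m["text"] for m in ranked])
```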
● Majority Voting as Reward: TTRL generates multiple candidate outputs for a query and uses majority voting to derive a pseudo-label. Rewards are assigned based on agreement with the consensus answer.
● Significant Performance Gains: Applying TTRL to Qwen2.5-Math-7B leads to a +159% improvement on AIME 2024 and +84% average gains across AIME, AMC, and MATH-500 benchmarks, without using any labeled training data.
● Self-Evolution Beyond Supervision: Remarkably, TTRL surpasses the performance ceiling of its own majority-vote supervision (Maj@N) and approaches the performance of models trained with full label leakage, indicating efficient and stable unsupervised RL.
● Generalization and Robustness: TTRL generalizes well across tasks, maintains effectiveness even under label estimation noise, and is compatible with different RL algorithms like PPO and GRPO.
● Limitations: TTRL may fail when the base model lacks sufficient prior knowledge about the domain or when hyperparameters (like batch size and temperature) are poorly tuned. | [Paper](https://www.arxiv.org/abs/2504.16084) | | 7) Discovering Values in Real-World Language Model Interactions This paper presents the first large-scale empirical analysis of values exhibited by a deployed AI assistant, Claude 3 and 3.5 models, using over 300,000 real-world conversations. The authors develop a bottom-up, privacy-preserving framework to extract, classify, and analyze AI-expressed normative considerations (“values”) and show how they vary across tasks, user values, and conversational contexts.
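The majority-vote pseudo-reward at the core of TTRL is easy to sketch (the answers here are placeholders; in the paper this reward feeds an RL algorithm such as GRPO or PPO):

```python
# Sketch of TTRL-style pseudo-rewards: sample N answers for an unlabeled query,
# take the majority answer as the pseudo-label, and reward each rollout by
# agreement with that consensus.
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# 'answers' would come from sampling the current policy on an unlabeled test query.
answers = ["42", "42", "41", "42", "7", "42"]
label, rewards = majority_vote_rewards(answers)
print(label, rewards)   # 42 [1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```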
● The authors identify 3,307 unique AI values, which are organized into a five-domain taxonomy: Practical, Epistemic, Social, Protective, and Personal. Practical and epistemic values dominate, often aligning with Claude’s training goals around being helpful, harmless, and honest.
● Claude’s most common values, such as helpfulness (23.4%), professionalism, transparency, and clarity, are context-invariant and reflect its role as a service-oriented assistant. In contrast, human values like authenticity and efficiency are more varied.
● Many values are context-specific. For example, healthy boundaries arise in relationship advice, historical accuracy in controversial event discussions, and human agency in AI governance contexts.
● Claude tends to mirror human values in supportive contexts (20.1% mirroring rate), but expresses opposing values during resistance, especially in cases involving unethical or policy-violating requests (e.g., resisting “moral nihilism” with “ethical integrity”).
● Explicit value expression (e.g., “I value transparency”) occurs more often in moments of resistance or reframing, particularly around epistemic and ethical principles like intellectual honesty and harm prevention. This suggests that AI values become most visible when the system is challenged.
● Across Claude variants, 3 Opus expresses more emotionally nuanced and ethically grounded values (e.g., academic rigor, emotional authenticity) and shows a stronger inclination for both support and resistance compared to 3.5/3.7 Sonnet. | [Paper](https://assets.anthropic.com/m/18d20cca3cde3503/original/Values-in-the-Wild-Paper.pdf), [Tweet](https://x.com/AnthropicAI/status/1914333220067213529) | | 8) Evaluate the Goal-Directedness of LLMs Introduces a new framework to assess whether LLMs use their capabilities effectively toward achieving given goals. The study finds that even top models like GPT-4o and Claude 3.7 fall short of full goal-directedness, particularly in information-gathering and combined tasks, despite performing well in isolated subtasks. | [Paper](https://arxiv.org/abs/2504.11844), [Tweet](https://x.com/tom4everitt/status/1912806499862139275), [GitHub](https://github.com/Crista23/goal_directedness_llms) | | 9) General-Reasoner General-Reasoner is a reinforcement learning approach that boosts LLM reasoning across diverse domains by using a 230K-question dataset and a model-based verifier trained to understand semantics beyond exact matches. It outperforms strong baselines like SimpleRL and Qwen2.5 on both general reasoning (MMLU-Pro, GPQA, SuperGPQA) and math tasks (MATH-500, GSM8K), showing over 10-point gains without sacrificing mathematical capability. | [Paper](https://github.com/TIGER-AI-Lab/General-Reasoner/blob/main/General_Reasoner.pdf), [Tweet](https://x.com/WenhuChen/status/1912242238110789671) | | 10) Tiny Reasoning Models Tina is a family of 1.5B parameter reasoning models trained using LoRA-based reinforcement learning (RL) to achieve high reasoning accuracy at very low cost. It outperforms or matches full fine-tuned models on reasoning tasks like AIME and MATH with only ~$9 post-training cost, demonstrating that efficient reasoning can be instilled via minimal updates to a tiny model. | [Paper](https://arxiv.org/abs/2504.15777) | ## Top ML Papers of the Week (April 14 - April 20) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) GUI-R1 Researchers from the National University of Singapore and the Chinese Academy of Sciences introduce GUI-R1, a reinforcement learning (RL) framework aimed at improving graphical user interface (GUI) agents through unified action-space modeling. Key insights include:
● Reinforcement Fine-Tuning (RFT) over Supervised Fine-Tuning (SFT) – GUI-R1 utilizes RFT inspired by methods such as DeepSeek-R1, significantly reducing training data requirements. It uses only 3K carefully curated examples versus millions used by previous models.
● Unified Action Space and Reward Modeling – The authors introduce a unified action space that covers actions across different platforms (Windows, Linux, MacOS, Android, and Web). This enables consistent reward signals for evaluating GUI actions, enhancing the model’s adaptability and generalization.
● Superior Performance with Minimal Data – GUI-R1 outperforms state-of-the-art methods like OS-Atlas using merely 0.02% of the training data (3K vs. 13M). Evaluations across eight benchmarks spanning mobile, desktop, and web platforms show significant improvements in grounding, low-level, and high-level GUI task capabilities.
● Efficient Training and Strong Generalization – By leveraging policy optimization algorithms like Group Relative Policy Optimization (GRPO), GUI-R1 quickly converges to high performance, demonstrating robustness and efficiency even in resource-constrained scenarios. | [Paper](https://arxiv.org/abs/2504.10458) | | 2) Scaling Reasoning in Diffusion LLMs via RL Proposes d1, a two‑stage recipe that equips masked diffusion LLMs with strong step‑by‑step reasoning.
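Since GRPO appears here and in several other entries this week, a minimal sketch of its group-relative advantage: rewards for a group of rollouts from the same prompt are normalized by the group mean and standard deviation, with no learned value model. This is the generic GRPO formulation, not GUI-R1's exact training code:

```python
# Sketch: group-relative advantages as used by GRPO-style training.
# Each prompt gets G rollouts; advantage_i = (r_i - mean(r)) / (std(r) + eps).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0])   # e.g., per-rollout action rewards
print(group_relative_advantages(rewards))        # positive for above-average rollouts
```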
● Two‑stage pipeline (SFT → diffu‑GRPO) – d1 first applies supervised fine‑tuning on the 1K-example s1K dataset and then runs task‑specific RL with the new diffu‑GRPO objective, yielding larger gains than either stage alone.
● diffu‑GRPO: RL for masked dLLMs – Extends GRPO to diffusion LLMs via (i) a mean‑field sequence‑log‑prob approximation and (ii) a one‑step per‑token log‑prob estimator with random prompt masking, enabling many gradient updates from a single generation.
● Consistent gains on four reasoning benchmarks – On GSM8K, MATH500, Countdown, and Sudoku, diffu‑GRPO beats SFT, and the full d1‑LLaDA variant attains the best scores (e.g., 81.1% on GSM8K and 38.6% on MATH500 at 256 tokens, +5–12 pp over baseline).
● Competitive among 7–8B models – d1‑LLaDA outperforms DeepSeek‑7B, Mistral‑7B, and Llama‑3‑8B on GSM8K and ranks second on MATH500 in the same size class.
● Longer decoding unlocks “aha moments” – At 512‑token generation, the model shows self‑verification/backtracking; effective‑token usage grows smoothly, echoing test‑time compute scaling trends.
● Random masking speeds RL – Ablations show that random prompt masking during diffu‑GRPO accelerates convergence and boosts correctness relative to fixed masking, with fewer online generations needed. | [Paper](https://arxiv.org/abs/2504.12216) | | 3) Enhancing Non-Reasoning Models with Reasoning Models Researchers explore how to distill reasoning-intensive outputs (answers and explanations) from top-tier LLMs into more lightweight models that don’t explicitly reason step by step. By fine-tuning smaller models on the high-quality final answers (and optionally summarized thinking traces) from advanced reasoning models, they demonstrate consistent performance boosts across multiple benchmarks.
● Test-time scaling vs. knowledge distillation – While large models like DeepSeek-R1 and OpenAI-o1 can allocate more compute to generate better reasoning traces, this paper focuses on systematically transferring those rich final answers (and possibly a summarized version of the reasoning steps) to more compact models.
● Data curation – The authors construct a 1.3M-instance dataset by pulling prompts from multiple open-source repositories (including Infinity Instruct, CodeContests, FLAN, etc.) and generating final answers plus detailed reasoning from DeepSeek-R1.
● Three fine-tuning strategies – (1) Use the original baseline answers from existing open-source sets, (2) fine-tune on only the final answer portion of a reasoning model, and (3) combine a summarized chain-of-thought with the final answer. Models trained on the second strategy excelled at math/coding tasks, while the third approach proved better for more conversational or alignment-oriented tasks.
● Empirical gains – Fine-tuning Qwen2.5-32B on the reasoning model’s final answers led to notable improvements on GSM8K (92.2%) and HumanEval (90.9%). A think-summarization approach boosted a different set of benchmarks (GPQA and chat-based tests). However, weaving in the “thinking trace” sometimes caused slight drops in instruction strictness (IFEval).
● Trade-offs and future work – Distilling advanced reasoning data definitely helps smaller models, but deciding how much of the reasoning trace to include is domain-dependent. The authors suggest that more refined ways of seamlessly blending reasoning steps into final answers (e.g., specialized prompts or partial merges) could further improve performance and avoid alignment regressions. | [Paper](https://arxiv.org/abs/2504.09639) | | 4) AgentA/B AgentA/B is a fully automated A/B testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors on actual web environments, enabling faster, cheaper, and risk-free UX evaluations — even on real websites like Amazon. Key Insights:
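A small sketch of what the second and third fine-tuning strategies above amount to when assembling training records (field names and the summarizer are illustrative assumptions):

```python
# Sketch: turning a reasoning model's output into SFT records for a non-reasoning model.
# Strategy 2: keep only the final answer. Strategy 3: prepend a summarized trace.
def summarize(trace: str) -> str:
    # Placeholder: in practice another model would condense the chain of thought.
    return trace.split(".")[0] + "."

def make_record(prompt: str, reasoning_trace: str, final_answer: str,
                strategy: str = "answer_only") -> dict:
    if strategy == "answer_only":              # strategy 2
        target = final_answer
    elif strategy == "summary_plus_answer":    # strategy 3
        target = f"{summarize(reasoning_trace)}\n\nAnswer: {final_answer}"
    else:
        raise ValueError(strategy)
    return {"prompt": prompt, "completion": target}

rec = make_record("What is 12*13?", "12*13 = 12*10 + 12*3 = 156. Done.", "156",
                  strategy="summary_plus_answer")
print(rec)
```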
● Modular agent simulation pipeline – Four components—agent generation, condition prep, interaction loop, and post-analysis—allow plug-and-play simulations on live webpages using diverse LLM personas.
● Real-world fidelity – The system parses live DOM into JSON, enabling structured interaction loops (search, filter, click, purchase) executed via LLM reasoning + Selenium.
● Behavioral realism – Simulated agents are more goal-directed than, but otherwise comparable to, 1M real Amazon users in their interaction patterns (e.g., shorter sessions but similar purchase rates).
● Design sensitivity – A/B test comparing full vs. reduced filter panels revealed that agents in the treatment condition clicked more, used filters more often, and purchased more.
● Inclusive prototyping – Agents can represent hard-to-reach populations (e.g., low-tech users), making early-stage UX testing more inclusive and risk-free.
● Notable results – AgentA/B shows how LLM agents can augment — not replace — traditional A/B testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic. | [Paper](https://arxiv.org/abs/2504.09723) | | 5) Reasoning Models Can Be Effective Without Thinking This paper challenges the necessity of long chain-of-thought (CoT) reasoning in LLMs by introducing a simple prompting method called NoThinking, which bypasses explicit "thinking" steps. Surprisingly, NoThinking performs comparably to or better than traditional reasoning under comparable or even lower compute budgets, especially when paired with parallel decoding and best-of-N selection. Key Insights:
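The "parses live DOM into JSON" step from the AgentA/B entry above can be illustrated with a generic HTML parser; this is a toy sketch of the idea, not the paper's schema or its Selenium-driven pipeline:

```python
# Sketch: condense a page's interactive elements into a JSON-like structure that
# an LLM agent can reason over and act on (illustrative only).
from bs4 import BeautifulSoup

html = """
<div><input id="search" placeholder="Search products">
<button id="go">Search</button>
<a href="/item/123">Trail running shoes</a></div>
"""

soup = BeautifulSoup(html, "html.parser")
elements = []
for el in soup.find_all(["a", "button", "input"]):
    elements.append({
        "tag": el.name,
        "id": el.get("id"),
        "text": el.get_text(strip=True) or el.get("placeholder"),
        "href": el.get("href"),
    })
print(elements)   # the agent then picks an element and an action (click, type, ...)
```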
● NoThinking prepends a dummy “Thinking” block and jumps straight to final answers.
● Despite skipping structured reasoning, it outperforms Thinking in pass@k (1–64) on many benchmarks, especially under token constraints.
● With parallel scaling, NoThinking achieves higher pass@1 accuracy than Thinking while using 4× fewer tokens and up to 9× lower latency.
● Tasks evaluated: competitive math (AIME24/25, AMC23, OlympiadBench), coding (LiveCodeBench), and formal theorem proving (MiniF2F, ProofNet).
● NoThinking is shown to provide superior accuracy–latency tradeoffs and generalizes across diverse tasks. Results:
● Low-budget wins: On AMC23 (700 tokens), NoThinking achieves 51.3% vs. 28.9% (Thinking).
● Better scaling: As k increases, NoThinking consistently surpasses Thinking.
● Efficiency frontier: Across benchmarks, NoThinking dominates the accuracy–cost Pareto frontier.
● Parallel wins: With simple confidence-based or majority vote strategies, NoThinking + best-of-N beats full Thinking on pass@1 with significantly less latency. | [Paper](https://www.arxiv.org/abs/2504.09858) | | 6) SocioVerse Researchers from Fudan University and collaborators propose SocioVerse, a large-scale world model for social simulation using LLM agents aligned with real-world user behavior. Key ideas include:
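A sketch of the NoThinking idea described above: the prompt ships with an already-closed "thinking" block so generation jumps straight to the answer, and parallel samples are aggregated with a simple vote (the dummy text and tags below are paraphrased placeholders, not quoted from the paper):

```python
# Sketch of NoThinking-style prompting plus best-of-N aggregation (illustrative).
from collections import Counter

DUMMY_THINKING = "<think>\nOkay, I have finished thinking.\n</think>\n"  # paraphrased

def build_prompt(question: str) -> str:
    # Prepend an already-closed thinking block so generation starts at the answer.
    return f"{question}\n{DUMMY_THINKING}Final answer:"

def best_of_n(answers: list[str]) -> str:
    # Simple majority vote over N parallel samples; the entry above also mentions
    # confidence-based selection as an alternative.
    return Counter(answers).most_common(1)[0][0]

print(build_prompt("What is 7 * 8?"))
samples = ["56", "56", "54", "56"]   # would come from N parallel generations
print(best_of_n(samples))            # 56
```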
● Four-fold alignment framework – SocioVerse tackles major challenges in aligning simulated environments with reality across four dimensions:
● Three representative simulations – SocioVerse showcases its generalizability through:
● Impressive empirical accuracy –
● Ablation insights – Removing prior demographic distribution and user knowledge severely degrades election prediction accuracy (Acc drops from 0.80 → 0.60), highlighting the value of realistic population modeling.
● Toward trustworthy virtual societies – SocioVerse not only standardizes scalable social simulations but also provides a sandbox for testing sociopolitical hypotheses (e.g., fairness, policy change), bridging AI agent systems with traditional social science. | [Paper](https://arxiv.org/abs/2504.10157) | | 7) DocAgent Researchers from Meta AI present DocAgent, a tool‑integrated, dependency‑aware framework that turns large, complex codebases into well‑written docstrings. Key ideas include:
● Topological Navigator for context building – DocAgent parses the repository’s AST, builds a dependency DAG, and documents components in topological order, so each function/class is visited only after its prerequisites, enabling incremental context accumulation and preventing context‑length explosions.
● Role‑specialised agent team – Five agents work together: Reader analyses code, Searcher gathers internal & external references, Writer drafts docstrings, Verifier critiques and revises them, while the Orchestrator manages iterations until quality converges.
● Adaptive context management – When retrieved context exceeds the model’s token budget, the Orchestrator trims low‑priority segments while preserving overall structure, keeping generation efficient and faithful.
● Three‑facet automatic evaluation – A new framework scores Completeness (section coverage), Helpfulness (LLM‑as‑judge semantic utility), and Truthfulness (entity grounding against the code DAG) for every docstring.
● Substantial gains over baselines – On 366 components across nine Python repos, DocAgent + GPT‑4o‑mini lifts Completeness to 0.934 vs 0.815, Helpfulness to 3.88 / 5 vs 2.95, and Truthfulness (existence ratio) to 95.7 % vs 61.1 % compared with a Chat‑GPT baseline; FIM baselines fare far worse.
● Navigator is crucial – An ablation that randomises processing order drops helpfulness by ‑0.44 and truthfulness by ‑7.9 pp, confirming the importance of dependency‑aware traversal. | [Paper](https://arxiv.org/abs/2504.08725) | | 8) SWE-PolyBench SWE-PolyBench is a new multi-language benchmark for evaluating coding agents on real-world software tasks across Java, JavaScript, TypeScript, and Python. It introduces execution-based assessments, syntax tree metrics, and reveals that current agents struggle with complex tasks and show inconsistent performance across languages. | [Paper](https://arxiv.org/abs/2504.08703v1) | | 9) A Survey of Frontiers in LLM Reasoning This survey categorizes LLM reasoning methods by when reasoning occurs (inference-time vs. training) and the system's architecture (standalone vs. agentic or multi-agent). It highlights trends like learning-to-reason (e.g., DeepSeek-R1) and agentic workflows (e.g., OpenAI Deep Research), covering prompt engineering, output refinement, and learning strategies such as PPO and verifier training. | [Paper](https://arxiv.org/abs/2504.09037) | | 10) Advances in Embodied Agents, Smart Cities, and Earth Science This paper surveys how spatial intelligence manifests across disciplines—from embodied agents to urban and global systems—by connecting human spatial cognition with how LLMs handle spatial memory, representations, and reasoning. It offers a unifying framework to bridge research in AI, robotics, urban planning, and earth science, highlighting LLMs’ evolving spatial capabilities and their interdisciplinary potential. | [Paper](https://arxiv.org/abs/2504.09848) | ## Top ML Papers of the Week (April 6 - April 13) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The AI Scientist V2 The AI Scientist-v2 refines and extends its predecessor to achieve a new milestone: autonomously generating a workshop-accepted research manuscript. The system removes dependencies on human-authored code templates, incorporates agentic tree-search methods for deeper exploration, uses Vision-Language Models to refine figures, and demonstrates impressive real-world outcomes by passing the peer-review bar.
● Enhanced Autonomy – Eliminates reliance on human-crafted code templates, enabling out-of-the-box deployment across diverse ML domains.
● Agentic Tree Search – Systematically searches and refines hypotheses through a branching exploration, managed by a new experiment manager agent.
● VLM Feedback Loop – Integrates Vision-Language Models in the reviewing process to critique and improve experimental figures and paper aesthetics.
● Workshop Acceptance – Generated three fully autonomous manuscripts for an ICLR workshop; one was accepted, showcasing the feasibility of AI-driven end-to-end scientific discovery. | [Paper](https://pub.sakana.ai/ai-scientist-v2/paper/paper.pdf), [Tweet](https://x.com/SakanaAILabs/status/1909497165925536212) | | 2) Benchmarking Browsing Agents OpenAI introduces BrowseComp, a benchmark with 1,266 questions that require AI agents to locate hard-to-find, entangled information on the web. Unlike saturated benchmarks like SimpleQA, BrowseComp demands persistent and creative search across numerous websites, offering a robust testbed for real-world web-browsing agents. Key insights:
● Extremely difficult questions: Benchmark tasks were verified to be unsolvable within 10 minutes by humans, and also unsolvable by GPT-4o (with/without browsing), OpenAI o1, and earlier Deep Research models.
● Human performance is low: Only 29.2% of problems were solved by humans (even with 2-hour limits). 70.8% were abandoned.
● Model performance:
● Test-time scaling matters: Accuracy improves with more browsing attempts. With 64 parallel samples and best-of-N aggregation, Deep Research significantly boosts its performance (15–25% gain over a single attempt).
● Reasoning browsing: OpenAI o1 (no browsing but better reasoning) outperforms GPT-4.5 with browsing, showing that tool use alone isn't enough—strategic reasoning is key.
● Calibration struggles: Models with browsing access often exhibit overconfidence in incorrect answers, revealing current limits in uncertainty estimation.
● Dataset diversity: Includes a wide topical spread: TV/movies, science, art, sports, politics, geography, etc. | [Paper](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf), [Blog](https://openai.com/index/browsecomp/), [Tweet](https://x.com/OpenAI/status/1910393421652520967) | | 3) OLMOTrace Allen Institute for AI & University of Washington present OLMOTRACE, a real-time system that traces LLM-generated text back to its verbatim sources in the original training data, even across multi-trillion-token corpora.
● What it does: For a given LM output, OLMOTRACE highlights exact matches with training data segments and lets users inspect full documents for those matches. Think "reverse-engineering" a model’s response via lexical lookup.
● How it works:
● Supported models: Works with OLMo models (e.g., OLMo-2-32B-Instruct) and their full pre/mid/post-training datasets, totaling 4.6T tokens.
● Use cases:
● Benchmarked:
● Not RAG: It retrieves after generation, without changing output, unlike retrieval-augmented generation. | [Paper](https://arxiv.org/abs/2504.07096), [Tweet](https://x.com/omarsar0/status/1910323386603262316), [Blog](https://5910970.hs-sites.com/olmotrace-points-model-output-back-to-training-data?ecid=ACsprvuggQcD4yCdO--rKTZKDvmczdSQkb96ct95zLH9eiysrXjF_WuKgsmIMaz8byfiL1H1-2A6&utm_campaign=AI2%20Newsletter&utm_medium=email&_hsenc=p2ANqtz-__MqUAVPXfHPpHpf2xC86iZG8qC3J-z5nW141VBN9gZW4j61ymW3dM7mhkiHGTWtjQt3Eao7Cqf7pB1k24CfEhYe9fmA&_hsmi=355925505) | | 4) Concise Reasoning via RL This new paper proposes a new training strategy that promotes concise and accurate reasoning in LLMs using RL. It challenges the belief that long responses improve accuracy; it offers both theoretical and empirical evidence showing that conciseness often correlates with better performance.
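As a toy illustration of the verbatim-matching idea behind OLMOTRACE above: find spans of a model's output that appear word-for-word in training documents. The real system searches indexed multi-trillion-token corpora; the linear scan below only conveys the idea:

```python
# Toy sketch: find the longest token spans of a model output that appear verbatim
# in a tiny, in-memory corpus. OLMOTRACE does this at corpus scale with proper
# indexing; this brute-force scan is only to illustrate the concept.
def longest_verbatim_spans(output: str, corpus: list[str], min_len: int = 4):
    out_tokens = output.split()
    hits = []
    for doc_id, doc in enumerate(corpus):
        doc_text = " " + doc + " "
        # Try long spans first; keep the longest match starting at each position.
        for i in range(len(out_tokens)):
            for j in range(len(out_tokens), i + min_len - 1, -1):
                span = " ".join(out_tokens[i:j])
                if f" {span} " in doc_text:
                    hits.append({"span": span, "doc": doc_id})
                    break
    return hits

corpus = ["the quick brown fox jumps over the lazy dog",
          "language models memorize surprisingly long sequences"]
output = "we find that models memorize surprisingly long sequences of text"
print(longest_verbatim_spans(output, corpus))
```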
● Long ≠ better reasoning – The authors mathematically show that RL with PPO tends to generate unnecessarily long responses, especially when answers are wrong. Surprisingly, shorter outputs correlate more with correct answers, across reasoning and non-reasoning models.
● Two-phase RL for reasoning + conciseness – They introduce a two-phase RL strategy: (1) train on hard problems to build reasoning ability (length may increase), then (2) fine-tune on occasionally solvable tasks to enforce concise CoT without hurting accuracy. The second phase alone dramatically reduces token usage by over 50%, with no loss in accuracy.
● Works with tiny data – Their method succeeds with as few as 4–8 training examples, showing large gains in both math and STEM benchmarks (MATH, AIME24, MMLU-STEM). For instance, on MMLU-STEM, they improved accuracy by +12.5% while cutting response length by over 2×.
● Better under low sampling – Post-trained models remain robust even when the temperature is reduced to 0. At temperature=0, the fine-tuned model outperformed the baseline by 10–30%, showing enhanced deterministic performance.
● Practical implications – Besides improving model output, their method reduces latency, cost, and token usage, making LLMs more deployable. The authors also recommend setting λ < 1 during PPO to avoid instability and encourage correct response shaping. | [Paper](https://arxiv.org/abs/2504.05185), [Tweet](https://x.com/omarsar0/status/1909634850304503977) | | 5) Rethinking Reflection in Pre-Training Reflection — the ability of LLMs to identify and correct their own reasoning — has often been attributed to reinforcement learning or fine-tuning. This paper argues otherwise: reflection emerges during pre-training. The authors introduce adversarial reasoning tasks to show that self-reflection and correction capabilities steadily improve as compute increases, even in the absence of supervised post-training. Key contributions:
● Propose two kinds of reflection:
● Build six adversarial datasets (GSM8K, TriviaQA, CruxEval, BBH) to test reflection across math, coding, logic, and knowledge domains. On GSM8K-Platinum, explicit reflection rates grow from ~10% to 60% with increasing pre-training tokens.
● Demonstrate that simple triggers like “Wait,” reliably induce reflection.
● Evaluate 40 OLMo-2 and Qwen2.5 checkpoints, finding a strong correlation between pre-training compute and both accuracy and reflection rate. Why it matters:
● Reflection is a precursor to reasoning and can develop before RLHF or test-time decoding strategies.
● Implication: We can instill advanced reasoning traits with better pre-training data and scale, rather than relying entirely on post-training tricks.
● They also show a trade-off: more training compute reduces the need for expensive test-time compute like long CoT traces. | [Paper](https://arxiv.org/abs/2504.04022), [Tweet](https://x.com/ashVaswani/status/1909642828554387675) | | 6) Efficient KG Reasoning for Small LLMs LightPROF is a lightweight framework that enables small-scale language models to perform complex reasoning over knowledge graphs (KGs) using structured prompts. Key highlights:
● Retrieve-Embed-Reason pipeline – LightPROF introduces a three-stage architecture:
● Plug-and-play & parameter-efficient – LightPROF trains only the adapter and projection modules, allowing seamless integration with any open-source LLM (e.g., LLaMa2-7B, LLaMa3-8B) without expensive fine-tuning.
● Outperforms larger models – Despite using small LLMs, LightPROF beats baselines like StructGPT (ChatGPT) and ToG (LLaMa2-70B) on KGQA tasks: 83.8% (vs. 72.6%) on WebQSP and 59.3% (vs. 57.6%) on CWQ.
● Extreme efficiency – Compared to StructGPT, LightPROF reduces token input by 98% and runtime by 30%, while maintaining accuracy and stable output even in complex multi-hop questions.
● Ablation insights – Removing structural signals or training steps severely degrades performance, confirming the critical role of the Knowledge Adapter and retrieval strategy. | [Paper](https://arxiv.org/abs/2504.03137), [Tweet](https://x.com/omarsar0/status/1910319109096747191) | | 7) Compute Agent Arena Computer Agent Arena is a new open platform for benchmarking LLM and VLM-based agents on real-world computer-use tasks, like coding, editing, and web navigation, using a virtual desktop environment. Initial results show that OpenAI and Anthropic are leading with modest success rates, while the platform aims to grow through crowdsourced tasks, agent submissions, and open-sourcing of its infrastructure. [Report](https://arena.xlang.ai/blog/computer-agent-arena) | [Tweet](https://x.com/BowenWangNLP/status/1909618451259572328) | | 8) Agentic Knowledgeable Self-awareness KnowSelf is a new framework that introduces agentic knowledgeable self-awareness, enabling LLM agents to dynamically decide when to reflect or seek knowledge based on situational complexity, mimicking human cognition. Using special tokens for "fast," "slow," and "knowledgeable" thinking, KnowSelf reduces inference costs and achieves state-of-the-art performance on ALFWorld and WebShop tasks with minimal external knowledge. | [Paper](https://arxiv.org/abs/2504.03553v1) | | 9) One-Minute Video Generation with Test-Time Training One-Minute Video Generation with Test-Time Training introduces TTT layers, a novel sequence modeling component where hidden states are neural networks updated via self-supervised loss at test time. By integrating these into a pre-trained diffusion model, the authors enable single-shot generation of one-minute, multi-scene videos from storyboards, achieving 34 Elo points higher than strong baselines like Mamba 2 and DeltaNet in human evaluations | [Paper](https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf), [Tweet](https://x.com/karansdalal/status/1909312851795411093) | | 10) NoProp NoProp is a novel gradient-free learning method where each neural network layer independently learns to denoise a noisy version of the target, inspired by diffusion and flow matching. Unlike backpropagation, it avoids hierarchical representation learning and achieves competitive performance and efficiency on image classification benchmarks like MNIST and CIFAR. | [Paper](https://arxiv.org/abs/2503.24322) | ## Top ML Papers of the Week (March 31 - April 6) - 2025 | **Paper** | **Links** | | 1) PaperBench OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers, from scratch. ● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors. ● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. ● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. 
By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. ● CodeDev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. ● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. ● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | [Paper](https://arxiv.org/abs/2504.01848), [Tweet](https://x.com/OpenAI/status/1907481490457506235), [GitHub](https://github.com/openai/preparedness) | | 2) Command A: An Enterprise-Ready LLM Cohere announced Command A, a 111B parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: ● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter soup techniques. This preserves most expert performance with just ~1.8% average drop. ● Hybrid architecture for long-context efficiency – Command A interleaves sliding window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B at 128k. It scores 95.0% on RULER, outperforming most long-context peers. ● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). ● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy. ● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. ● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). ● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face. | [Paper](https://arxiv.org/abs/2504.00698), [Tweet](https://x.com/nrehiew_/status/1908181303339471020), [Models](https://huggingface.co/CohereForAI/c4ai-command-a-03-2025) | | 3) CodeScientist Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. 
It’s among the first to produce validated discoveries with minimal human input. Key ideas: ● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. ● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples: ● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. ● Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | [Paper](https://arxiv.org/abs/2503.22708), [Blog](https://allenai.org/blog/codescientist), [GitHub](https://github.com/allenai/codescientist) | | 4) Retrieval-Augmented Reasoning Model Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: ● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. ● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. ● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%, CoVERT: 74.14% vs. GPT-4’s 65.67%). ● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. ● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(kx, R(x))) and reasoning (p(rx, R(x), k)) as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | [Paper](https://arxiv.org/abs/2503.23513), [Tweet](https://x.com/omarsar0/status/1907796990966247484) | | 5) Why do LLMs Attend to First Token? This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers. ● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). 
Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. ● Sharp experiments on Gemma & LLaMa – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMa 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. ● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself, but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. ● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. ● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target, unless a special pattern (e.g., apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | [Paper](https://arxiv.org/abs/2504.02732), [Tweet](https://x.com/omarsar0/status/1908187563422261411) | | 6) Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions Presents MedAgentSim is a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: ● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. ● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. ● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. ● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. ● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation. | [Paper](https://arxiv.org/abs/2503.22678), [Tweet](https://x.com/omarsar0/status/1906719555482702147), [Code](https://github.com/MAXNORM8650/MedAgentSim) | | 7) Open Deep Search Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar. 
Key insights: ● Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). ● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. ● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. ● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to CoT-based ReAct in ODS-v1. | [Paper](https://arxiv.org/abs/2503.20201), [Tweet](https://x.com/sewoong79/status/1906595129965912341), [GitHub](https://github.com/sentient-agi/OpenDeepSearch) | | 8) Efficient Test-time Scaling with Code Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs with short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: ● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. ● Shifted Thinking Window – A new test-time strategy that eliminates explicit