# ML Papers of The Week [Subscribe to our newsletter](https://nlpnews.substack.com/) to get a weekly list of top ML papers in your inbox. At DAIR.AI we ❤️ reading ML papers so we've created this repo to highlight the top ML papers of every week. Here is the weekly series: ## 2025 - [Top ML Papers of the Week (April 21 - April 27)](./#top-ml-papers-of-the-week-april-21---april-27---2025) - [Top ML Papers of the Week (April 14 - April 20)](./#top-ml-papers-of-the-week-april-14---april-20---2025) - [Top ML Papers of the Week (April 7 - April 13)](./#top-ml-papers-of-the-week-april-7---april-13---2025) - [Top ML Papers of the Week (March 31 - April 6)](./#top-ml-papers-of-the-week-march-31---april-6---2025) - [Top ML Papers of the Week (March 24 - March 30)](./#top-ml-papers-of-the-week-march-24---march-30---2025) - [Top ML Papers of the Week (March 17 - March 23)](./#top-ml-papers-of-the-week-march-17---march-23---2025) - [Top ML Papers of the Week (March 10 - March 16)](./#top-ml-papers-of-the-week-march-10---march-16---2025) - [Top ML Papers of the Week (March 3 - March 9)](./#top-ml-papers-of-the-week-march-3---march-9---2025) - [Top ML Papers of the Week (February 24 - March 2)](./#top-ml-papers-of-the-week-february-24---march-2---2025) - [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025) - [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025) - [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025) - [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025) - [Top ML Papers of the Week (January 20 - January 26)](./#top-ml-papers-of-the-week-january-20---january-26---2025) - [Top ML Papers of the Week (January 13 - January 19)](./#top-ml-papers-of-the-week-january-13---january-19---2025) - [Top ML Papers of the Week (January 6 - January 12)](./#top-ml-papers-of-the-week-january-6---january-12---2025) ## 2024 - [Top ML Papers of the Week (December 30 - January 5)](./#top-ml-papers-of-the-week-december-30---january-5---2025) - [Top ML Papers of the Week (December 23 - December 29)](./#top-ml-papers-of-the-week-december-23---december-29---2024) - [Top ML Papers of the Week (December 16 - December 22)](./#top-ml-papers-of-the-week-december-16---december-22---2024) - [Top ML Papers of the Week (December 9 - December 15)](./#top-ml-papers-of-the-week-december-9---december-15---2024) - [Top ML Papers of the Week (December 2 - December 8)](./#top-ml-papers-of-the-week-december-2---december-8---2024) - [Top ML Papers of the Week (November 25 - December 1)](./#top-ml-papers-of-the-week-november-25---december-1---2024) - [Top ML Papers of the Week (November 18 - November 24)](./#top-ml-papers-of-the-week-november-18---november-24---2024) - [Top ML Papers of the Week (November 11 - November 17)](./#top-ml-papers-of-the-week-november-11---november-17---2024) - [Top ML Papers of the Week (November 4 - November 10)](./#top-ml-papers-of-the-week-november-4---november-10---2024) - [Top ML Papers of the Week (October 28 - November 3)](./#top-ml-papers-of-the-week-october-28---november-3---2024) - [Top ML Papers of the Week (October 21 - October 27)](./#top-ml-papers-of-the-week-october-21---october-27---2024) - [Top ML Papers of the Week (October 14 - October 20)](./#top-ml-papers-of-the-week-october-14---october-20---2024) - [Top ML 
Papers of the Week (October 7 - October 13)](./#top-ml-papers-of-the-week-october-7---october-13---2024) - [Top ML Papers of the Week (September 30 - October 6)](./#top-ml-papers-of-the-week-september-30---october-6---2024) - [Top ML Papers of the Week (September 23 - September 29)](./#top-ml-papers-of-the-week-september-23---september-29---2024) - [Top ML Papers of the Week (September 16 - September 22)](./#top-ml-papers-of-the-week-september-16---september-22---2024) - [Top ML Papers of the Week (September 9 - September 15)](./#top-ml-papers-of-the-week-september-9---september-15---2024) - [Top ML Papers of the Week (September 2 - September 8)](./#top-ml-papers-of-the-week-september-2---september-8---2024) - [Top ML Papers of the Week (August 26 - September 1)](./#top-ml-papers-of-the-week-august-26---september-1---2024) - [Top ML Papers of the Week (August 19 - August 25)](./#top-ml-papers-of-the-week-august-19---august-25---2024) - [Top ML Papers of the Week (August 12 - August 18)](./#top-ml-papers-of-the-week-august-12---august-18---2024) - [Top ML Papers of the Week (August 5 - August 11)](./#top-ml-papers-of-the-week-august-5---august-11---2024) - [Top ML Papers of the Week (July 29 - August 4)](./#top-ml-papers-of-the-week-july-29---august-4---2024) - [Top ML Papers of the Week (July 22 - July 28)](./#top-ml-papers-of-the-week-july-15---july-21---2024) - [Top ML Papers of the Week (July 15 - July 21)](./#top-ml-papers-of-the-week-july-15---july-21---2024) - [Top ML Papers of the Week (July 8 - July 14)](./#top-ml-papers-of-the-week-july-8---july-14---2024) - [Top ML Papers of the Week (July 1 - July 7)](./#top-ml-papers-of-the-week-july-1---july-7---2024) - [Top ML Papers of the Week (June 24 - June 30)](./#top-ml-papers-of-the-week-june-24---june-30---2024) - [Top ML Papers of the Week (June 17 - June 23)](./#top-ml-papers-of-the-week-june-17---june-23---2024) - [Top ML Papers of the Week (June 10 - June 16)](./#top-ml-papers-of-the-week-june-10---june-16---2024) - [Top ML Papers of the Week (June 3 - June 9)](./#top-ml-papers-of-the-week-june-3---june-9---2024) - [Top ML Papers of the Week (May 27 - June 2)](./#top-ml-papers-of-the-week-may-27---june-2---2024) - [Top ML Papers of the Week (May 20 - May 26)](./#top-ml-papers-of-the-week-may-20---may-26---2024) - [Top ML Papers of the Week (May 13 - May 19)](./#top-ml-papers-of-the-week-may-13---may-19---2024) - [Top ML Papers of the Week (May 6 - May 12)](./#top-ml-papers-of-the-week-may-6---may-12---2024) - [Top ML Papers of the Week (April 29 - May 5)](./#top-ml-papers-of-the-week-april-29---may-5---2024) - [Top ML Papers of the Week (April 22 - April 28)](./#top-ml-papers-of-the-week-april-22---april-28---2024) - [Top ML Papers of the Week (April 15 - April 21)](./#top-ml-papers-of-the-week-april-15---april-21---2024) - [Top ML Papers of the Week (April 8 - April 14)](./#top-ml-papers-of-the-week-april-8---april-14---2024) - [Top ML Papers of the Week (April 1 - April 7)](./#top-ml-papers-of-the-week-april-1---april-7---2024) - [Top ML Papers of the Week (March 26 - March 31)](./#top-ml-papers-of-the-week-march-26---march-31---2024) - [Top ML Papers of the Week (March 18 - March 25)](./#top-ml-papers-of-the-week-march-18---march-25---2024) - [Top ML Papers of the Week (March 11 - March 17)](./#top-ml-papers-of-the-week-march-11---march-17---2024) - [Top ML Papers of the Week (March 4 - March 10)](./#top-ml-papers-of-the-week-march-4---march-10---2024) - [Top ML Papers of the Week (February 26 - March 
3)](./#top-ml-papers-of-the-week-february-26---march-3---2024) - [Top ML Papers of the Week (February 19 - February 25)](./#top-ml-papers-of-the-week-february-19---february-25---2024) - [Top ML Papers of the Week (February 12 - February 18)](./#top-ml-papers-of-the-week-february-12---february-18---2024) - [Top ML Papers of the Week (February 5 - February 11)](./#top-ml-papers-of-the-week-february-5---february-11---2024) - [Top ML Papers of the Week (January 29 - February 4)](./#top-ml-papers-of-the-week-january-29---february-4---2024) - [Top ML Papers of the Week (January 22 - January 28)](./#top-ml-papers-of-the-week-january-22---january-28---2024) - [Top ML Papers of the Week (January 15 - January 21)](./#top-ml-papers-of-the-week-january-15---january-21---2024) - [Top ML Papers of the Week (January 8 - January 14)](./#top-ml-papers-of-the-week-january-8---january-14---2024) - [Top ML Papers of the Week (January 1 - January 7)](./#top-ml-papers-of-the-week-january-1---january-7---2024) ## 2023 - [Top ML Papers of the Week (December 24 - December 31)](./#top-ml-papers-of-the-week-december-25---december-31) - [Top ML Papers of the Week (December 18 - December 24)](./#top-ml-papers-of-the-week-december-18---december-24) - [Top ML Papers of the Week (December 11 - December 17)](./#top-ml-papers-of-the-week-december-11---december-17) - [Top ML Papers of the Week (December 4 - December 10)](./#top-ml-papers-of-the-week-december-4---december-10) - [Top ML Papers of the Week (November 27 - December 3)](./#top-ml-papers-of-the-week-november-27---december-3) - [Top ML Papers of the Week (November 20 - November 26)](./#top-ml-papers-of-the-week-november-20---november-26) - [Top ML Papers of the Week (November 13 - November 19)](./#top-ml-papers-of-the-week-november-13---november-19) - [Top ML Papers of the Week (November 6 - November 12)](./#top-ml-papers-of-the-week-november-6---november-12) - [Top ML Papers of the Week (October 30 - November 5)](./#top-ml-papers-of-the-week-october-30---november-5) - [Top ML Papers of the Week (October 23 - October 29)](./#top-ml-papers-of-the-week-october-23---october-29) - [Top ML Papers of the Week (October 16 - October 22)](./#top-ml-papers-of-the-week-october-16---october-22) - [Top ML Papers of the Week (October 9 - October 15)](./#top-ml-papers-of-the-week-october-9---october-15) - [Top ML Papers of the Week (October 2 - October 8)](./#top-ml-papers-of-the-week-october-2---october-8) - [Top ML Papers of the Week (September 25 - October 1)](./#top-ml-papers-of-the-week-september-25---october-1) - [Top ML Papers of the Week (September 18 - September 24)](./#top-ml-papers-of-the-week-september-18---september-24) - [Top ML Papers of the Week (September 11 - September 17)](./#top-ml-papers-of-the-week-september-11---september-17) - [Top ML Papers of the Week (September 4 - September 10)](./#top-ml-papers-of-the-week-september-4---september-10) - [Top ML Papers of the Week (August 28 - September 3)](./#top-ml-papers-of-the-week-august-28---september-3) - [Top ML Papers of the Week (August 21 - August 27)](./#top-ml-papers-of-the-week-august-21---august-27) - [Top ML Papers of the Week (August 14 - August 20)](./#top-ml-papers-of-the-week-august-14---august-20) - [Top ML Papers of the Week (August 7 - August 13)](./#top-ml-papers-of-the-week-august-7---august-13) - [Top ML Papers of the Week (July 31 - August 6)](./#top-ml-papers-of-the-week-july-31---august-6) - [Top ML Papers of the Week (July 24 - July 30)](./#top-ml-papers-of-the-week-july-24---july-30) - 
[Top ML Papers of the Week (July 17 - July 23)](./#top-ml-papers-of-the-week-july-17---july-23) - [Top ML Papers of the Week (July 10 - July 16)](./#top-ml-papers-of-the-week-july-10---july-16) - [Top ML Papers of the Week (July 3 - July 9)](./#top-ml-papers-of-the-week-july-3---july-9) - [Top ML Papers of the Week (June 26 - July 2)](./#top-ml-papers-of-the-week-june-26---july-2) - [Top ML Papers of the Week (June 19 - June 25)](./#top-ml-papers-of-the-week-june-19---june-25) - [Top ML Papers of the Week (June 12 - June 18)](./#top-ml-papers-of-the-week-june-12---june-18) - [Top ML Papers of the Week (June 5 - June 11)](./#top-ml-papers-of-the-week-june-5---june-11) - [Top ML Papers of the Week (May 29 - June 4)](./#top-ml-papers-of-the-week-may-29-june-4) - [Top ML Papers of the Week (May 22 - 28)](./#top-ml-papers-of-the-week-may-22-28) - [Top ML Papers of the Week (May 15 - 21)](./#top-ml-papers-of-the-week-may-15-21) - [Top ML Papers of the Week (May 8 - 14)](./#top-ml-papers-of-the-week-may-8-14) - [Top ML Papers of the Week (May 1-7)](./#top-ml-papers-of-the-week-may-1-7) - [Top ML Papers of the Week (April 24 - April 30)](./#top-ml-papers-of-the-week-april-24---april-30) - [Top ML Papers of the Week (April 17 - April 23)](./#top-ml-papers-of-the-week-april-17---april-23) - [Top ML Papers of the Week (April 10 - April 16)](./#top-ml-papers-of-the-week-april-10---april-16) - [Top ML Papers of the Week (April 3 - April 9)](./#top-ml-papers-of-the-week-april-3---april-9) - [Top ML Papers of the Week (Mar 27 - April 2)](./#top-ml-papers-of-the-week-mar-27---april-2) - [Top ML Papers of the Week (Mar 20-Mar 26)](./#top-ml-papers-of-the-week-mar-20-mar-26) - [Top ML Papers of the Week (Mar 13-Mar 19)](./#top-ml-papers-of-the-week-mar-13-mar-19) - [Top ML Papers of the Week (Mar 6-Mar 12)](./#top-ml-papers-of-the-week-mar-6-mar-12) - [Top ML Papers of the Week (Feb 27-Mar 5)](./#top-ml-papers-of-the-week-feb-27-mar-5) - [Top ML Papers of the Week (Feb 20-26)](./#top-ml-papers-of-the-week-feb-20-26) - [Top ML Papers of the Week (Feb 13 - 19)](./#top-ml-papers-of-the-week-feb-13---19) - [Top ML Papers of the Week (Feb 6 - 12)](./#top-ml-papers-of-the-week-feb-6---12) - [Top ML Papers of the Week (Jan 30-Feb 5)](./#top-ml-papers-of-the-week-jan-30-feb-5) - [Top ML Papers of the Week (Jan 23-29)](./#top-ml-papers-of-the-week-jan-23-29) - [Top ML Papers of the Week (Jan 16-22)](./#top-ml-papers-of-the-week-jan-16-22) - [Top ML Papers of the Week (Jan 9-15)](./#top-ml-papers-of-the-week-jan-9-15) - [Top ML Papers of the Week (Jan 1-8)](./#top-ml-papers-of-the-week-jan-1-8) [Follow us on Twitter](https://twitter.com/dair_ai) [Join our Discord](https://discord.gg/SKgkVT8BGJ) ## Top ML Papers of the Week (April 21 - April 27) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) Does RL Incentivize Reasoning in LLMs Beyond the Base Model? This paper revisits a key assumption in recent LLM development: that Reinforcement Learning with Verifiable Rewards (RLVR) helps models acquire genuinely new reasoning capabilities. By analyzing models across tasks (math, code, vision) using pass@k metrics (with large k), the authors find that RLVR improves sample efficiency but does not expand reasoning capacity beyond the base model.
● Key insight: RLVR-trained models do better at low *k* (e.g., pass@1), but as *k* increases (up to 256 or more), base models eventually match or outperform them. This suggests RLVR doesn’t generate fundamentally new reasoning paths but just increases the likelihood of sampling already-existing correct ones.
● Reasoning already in the base: RLVR models' successful CoTs are shown to be present within the base model's sampling distribution. Perplexity analyses confirm that RL outputs are often high-probability continuations for the base model.
● Efficiency vs. exploration: RLVR narrows the model’s exploration space, improving efficiency but shrinking its coverage of diverse reasoning paths, thereby reducing overall problem-solving reach at scale.
● Distillation helps more: Unlike RLVR, distillation from a stronger teacher model (e.g., DeepSeek-R1) introduces genuinely new reasoning patterns, expanding the model’s capabilities.
● Algorithmic limits: Across PPO, GRPO, Reinforce++, etc., RL algorithms offer similar sample-efficiency improvements, but none closes the gap to the base model’s pass@256—highlighting the limits of current RL strategies. | [Paper](https://arxiv.org/abs/2504.13837), [Tweet](https://x.com/DaveShapi/status/1915408405201629684) | | 2) BitNet b1.58 2B4T This work introduces BitNet b1.58 2B4T, the first open-source, natively trained 1-bit LLM at the 2B parameter scale, achieving strong performance while being extremely efficient. The model uses a custom ternary quantization scheme (1.58 bits per weight), enabling dramatic reductions in memory (0.4 GB), energy (0.028J/token), and latency (29ms), while still competing with state-of-the-art full-precision models across diverse benchmarks.
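Since the comparison above hinges on pass@k at large k, here is the standard unbiased pass@k estimator (the combinatorial form used in HumanEval-style evaluations); the sample counts below are purely illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: a model with a low pass@1 can still reach a high
# pass@k at large k if its samples retain enough diverse correct solutions.
print(pass_at_k(n=256, c=8, k=1))    # ~0.031
print(pass_at_k(n=256, c=8, k=256))  # 1.0
```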
● New Pareto frontier in efficiency-performance: Trained from scratch on 4T tokens, BitNet b1.58 2B4T outperforms or matches open full-precision models (e.g., Qwen2.5 1.5B, MiniCPM 2B) on tasks like ARC-Challenge, PIQA, WinoGrande, and GSM8K. It achieves a 54.19% average across 16 benchmarks, comparable to Qwen2.5-1.5B’s 55.23%, but with ~6.5× lower memory and 10× lower energy usage.
● Outperforms quantized baselines: Against INT4 post-training quantized Qwen2.5 models (GPTQ/AWQ), BitNet is both smaller and more accurate, showing the advantage of native 1-bit training over PTQ approaches.
● Architectural & training innovations: It replaces standard linear layers with BitLinear layers using absmean ternary quantization and 8-bit activations, and combines RoPE embeddings, squared ReLU activation, and bias-free layers (a quantizer sketch appears after this entry). Training includes cosine LR and weight decay schedules, plus supervised fine-tuning and Direct Preference Optimization (DPO) instead of full RLHF.
● Best-in-class among 1-bit LLMs: When compared to other 1-bit models like OLMo-Bitnet (1B) and post-quantized Falcon3/Llama3 (7B–8B), BitNet b1.58 2B4T is +10 pts stronger on average, establishing a new benchmark for ultra-efficient LLMs. The authors also release optimized CUDA kernels for GPU and a C++ inference library for CPU, enabling practical deployment of 1-bit LLMs on diverse hardware. BitNet b1.58 2B4T demonstrates that extreme quantization does not mean compromised capability, and it opens the door to the broader adoption of LLMs in resource-constrained environments. | [Paper](https://arxiv.org/abs/2504.12285) | | 3) UI-TARS UI-TARS introduces a powerful, end-to-end native GUI agent that operates purely from visual screenshots, performing human-like keyboard and mouse interactions across platforms. Unlike existing modular agent frameworks that rely on prompt engineering and external scripts, UI-TARS integrates perception, action, reasoning, and memory directly into its architecture, achieving strong generalization and adaptability in dynamic real-world settings. Key contributions:
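A minimal NumPy sketch of the absmean ternary (1.58-bit) weight quantization referenced in the BitNet entry above; the real BitLinear layer also quantizes activations to 8 bits and applies this inside training with a straight-through estimator, which is omitted here.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a per-tensor absmean scale."""
    scale = np.mean(np.abs(w)) + eps           # absmean scaling factor
    w_q = np.clip(np.round(w / scale), -1, 1)  # ternary weights
    return w_q, scale

w = np.random.randn(4, 4).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
# Dequantized weights (w_q * scale) approximate the originals at ~1.58 bits/weight.
print(w_q)
```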
● Enhanced GUI Perception: UI-TARS is trained on a large-scale, richly annotated dataset of screenshots with metadata, enabling dense captioning, state transition understanding, and precise element description. It excels in perception benchmarks like VisualWebBench, scoring 82.8 and outperforming GPT-4o.
● Unified Action Modeling and Grounding: UI-TARS standardizes actions across platforms into a shared action space and learns from large-scale multi-step action traces. It surpasses baselines in grounding tasks with 38.1 on ScreenSpot Pro, the new SOTA.
● System-2 Reasoning via “Thoughts”: Inspired by ReAct-style frameworks, UI-TARS generates internal reasoning steps (thoughts) before actions. These thoughts reflect patterns like task decomposition, reflection, and long-term consistency, significantly improving performance in complex scenarios. For example, in OSWorld, UI-TARS-72B-DPO scores 24.6 with a 50-step budget, outperforming Claude; an illustrative thought-and-action trace appears after this entry.
● Iterative Self-Improvement with Reflective Learning: UI-TARS continuously refines itself through online trace collection and reflection tuning using error correction and post-error adaptation data. This allows it to recover from mistakes and adapt with minimal human oversight. Overall, UI-TARS marks a significant step forward in GUI automation, setting new benchmarks across more than 10 datasets and outperforming top commercial agents like GPT-4o and Claude. Its open-source release aims to drive further innovation in native agent development. | [Paper](https://arxiv.org/abs/2501.12326), [Blog](https://seed-tars.com/1.5/) | | 4) Describe Anything Introduces DAM, a model that generates fine-grained, region-specific captions in both images and videos. The authors address key limitations in prior vision-language models—namely, the inability to preserve local detail and the lack of suitable datasets and benchmarks for detailed localized captioning (DLC). Key contributions:
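As referenced in the UI-TARS entry above, a hypothetical trace showing the thought-before-action pattern for one screenshot-conditioned step; the field names and the click-action encoding are illustrative assumptions, not the paper's exact schema.

```python
import json

# One simulated agent step: reason first, then emit a low-level GUI action.
step = {
    "observation": "screenshot_0042.png",           # raw pixels are the only input
    "thought": (
        "The task asks me to export the report. The 'File' menu is at the "
        "top-left; I should open it before looking for an Export entry."
    ),
    "action": {"type": "click", "x": 34, "y": 12},   # unified, platform-agnostic action
}
print(json.dumps(step, indent=2))
```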
● DAM (Describe Anything Model) uses two main innovations to capture both fine regional detail and global scene context: a focal prompt that provides high-resolution encoding of user-specified regions, and a localized vision backbone that uses gated cross-attention to integrate context from the entire image. This enables DAM to generate multi-granular, accurate descriptions, especially for small or occluded regions.
● DLC-SDP (Semi-supervised Data Pipeline) tackles data scarcity by expanding segmentation datasets with VLM-generated detailed captions, followed by self-training on web images. This produces high-quality, diverse training data, enabling DAM to outperform API-only baselines like GPT-4o across several benchmarks.
● DLC-Bench is a reference-free benchmark that scores models on their ability to accurately include or exclude region-specific details using LLM judges. It provides a more reliable evaluation than traditional caption-matching metrics, which often penalize models for valid but unmatched details.
● Performance: DAM sets a new state-of-the-art on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning tasks in both images and videos. It outperforms GPT-4o, Claude 3.7, and other top VLMs in both zero-shot and in-domain evaluations, achieving up to 33.4% improvement over prior models on detailed image captioning and 19.8% on video captioning. | [Paper](https://arxiv.org/abs/2504.16072) | | 5) UXAgent Introduces a novel framework, UXAgent, for simulating large-scale usability testing using LLM-driven agents. The system empowers UX researchers to test and iterate web design and study protocols before engaging real users. This is achieved through the orchestration of simulated agents with diverse personas interacting in real web environments, providing both behavioral and reasoning data. Key highlights:
● LLM-Powered Simulation with Personas: UXAgent begins with a Persona Generator that can produce thousands of demographically diverse simulated users based on custom distributions. Each persona is fed into an LLM Agent that embodies user intent and interacts with the website via a Universal Browser Connector—a module capable of interpreting and manipulating real HTML structures.
● Dual-Loop Reasoning Architecture: At the heart of UXAgent is a dual-process agent architecture inspired by cognitive psychology: a Fast Loop for low-latency actions and a Slow Loop for deep reasoning. This design mimics System 1 and System 2 thinking and allows agents to act responsively while maintaining coherent high-level plans and reflections.
● Rich Memory Stream: All observations, actions, plans, reflections, and spontaneous thoughts (“wonders”) are stored in a Memory Stream. These memories are dynamically prioritized for retrieval using a weighted scoring system based on importance, recency, and relevance, tailored separately for fast and slow modules.
● Replay and Interview Interfaces: UX researchers can review simulated sessions via a Simulation Replay Interface and conduct natural language conversations with agents using an Agent Interview Interface. This supports qualitative analysis, such as asking agents about their decisions or presenting mockups for feedback.
● Empirical Evaluation: A case study involving 60 LLM agent simulations on a shopping platform (WebArena) showed that researchers were able to detect usability study flaws and gather early insights. A follow-up user study with five UX professionals found the system helpful for iterating study design, despite some concerns over realism and data noise. Particularly appreciated was the ability to converse with agents and gather qualitative insights that would be infeasible in traditional pilots.
● Future Implications: The authors position LLM agents not as replacements for real participants, but as early-stage collaborators in the design process, reducing the cost and risk of flawed studies. They also discuss extensions to multimodal settings, desktop or mobile interfaces, and broader agentic tasks such as digital twins or simulated A/B testing. | [Paper](https://arxiv.org/abs/2504.09407) | | 6) Test-Time Reinforcement Learning Test-Time Reinforcement Learning (TTRL) is a method that allows LLMs to improve themselves during inference without ground-truth labels. Instead of relying on labeled datasets, TTRL uses majority voting over multiple model generations to estimate pseudo-rewards, enabling reinforcement learning (RL) on unlabeled test data. The method integrates Test-Time Scaling (TTS) and Test-Time Training (TTT) strategies, letting models adapt dynamically to new and challenging inputs. Key highlights:
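A rough sketch of the weighted memory-retrieval scoring described for UXAgent's Memory Stream above; the weights and the exponential recency decay are illustrative assumptions (the paper tunes retrieval separately for the fast and slow modules).

```python
import math
import time

def memory_score(importance, relevance, created_at,
                 w_imp=1.0, w_rel=1.0, w_rec=1.0, half_life_s=3600.0):
    """Combine importance, relevance, and recency into one retrieval score."""
    age = time.time() - created_at
    recency = math.exp(-age * math.log(2) / half_life_s)  # assumed exponential decay
    return w_imp * importance + w_rel * relevance + w_rec * recency

# Rank memories for retrieval; the highest-scoring memory is surfaced to the agent first.
memories = [
    {"text": "Added running shoes to cart", "importance": 0.8, "relevance": 0.9,
     "created_at": time.time() - 120},
    {"text": "Read the site's return policy", "importance": 0.4, "relevance": 0.2,
     "created_at": time.time() - 5000},
]
memories.sort(key=lambda m: memory_score(m["importance"], m["relevance"], m["created_at"]),
              reverse=True)
print(memories[0]["text"])
```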
● Majority Voting as Reward: TTRL generates multiple candidate outputs for a query and uses majority voting to derive a pseudo-label. Rewards are assigned based on agreement with the consensus answer.
● Significant Performance Gains: Applying TTRL to Qwen2.5-Math-7B leads to a +159% improvement on AIME 2024 and +84% average gains across AIME, AMC, and MATH-500 benchmarks, without using any labeled training data.
● Self-Evolution Beyond Supervision: Remarkably, TTRL surpasses the performance ceiling of its own majority-vote supervision (Maj@N) and approaches the performance of models trained with full label leakage, indicating efficient and stable unsupervised RL.
● Generalization and Robustness: TTRL generalizes well across tasks, maintains effectiveness even under label estimation noise, and is compatible with different RL algorithms like PPO and GRPO.
● Limitations: TTRL may fail when the base model lacks sufficient prior knowledge about the domain or when hyperparameters (like batch size and temperature) are poorly tuned. | [Paper](https://www.arxiv.org/abs/2504.16084) | | 7) Discovering Values in Real-World Language Model Interactions This paper presents the first large-scale empirical analysis of values exhibited by a deployed AI assistant, Claude 3 and 3.5 models, using over 300,000 real-world conversations. The authors develop a bottom-up, privacy-preserving framework to extract, classify, and analyze AI-expressed normative considerations (“values”) and show how they vary across tasks, user values, and conversational contexts.
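A minimal sketch of the majority-voting pseudo-reward at the heart of TTRL described above: sample several answers for an unlabeled query, treat the consensus as a pseudo-label, and reward rollouts that agree with it. The sampled answers here are canned stand-ins for real rollouts.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return the consensus answer and a 0/1 reward per sampled answer."""
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards

# In TTRL these answers would be N rollouts from the current policy for one test query.
sampled = ["42", "36", "42", "42", "17", "42", "36", "42"]
consensus, rewards = majority_vote_rewards(sampled)
print(consensus, rewards)  # "42", with reward 1.0 wherever the rollout matched the vote
```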
● The authors identify 3,307 unique AI values, which are organized into a five-domain taxonomy: Practical, Epistemic, Social, Protective, and Personal. Practical and epistemic values dominate, often aligning with Claude’s training goals around being helpful, harmless, and honest.
● Claude’s most common values, such as helpfulness (23.4%), professionalism, transparency, and clarity, are context-invariant and reflect its role as a service-oriented assistant. In contrast, human values like authenticity and efficiency are more varied.
● Many values are context-specific. For example, healthy boundaries arise in relationship advice, historical accuracy in controversial event discussions, and human agency in AI governance contexts.
● Claude tends to mirror human values in supportive contexts (20.1% mirroring rate), but expresses opposing values during resistance, especially in cases involving unethical or policy-violating requests (e.g., resisting “moral nihilism” with “ethical integrity”).
● Explicit value expression (e.g., “I value transparency”) occurs more often in moments of resistance or reframing, particularly around epistemic and ethical principles like intellectual honesty and harm prevention. This suggests that AI values become most visible when the system is challenged.
● Across Claude variants, 3 Opus expresses more emotionally nuanced and ethically grounded values (e.g., academic rigor, emotional authenticity) and shows a stronger inclination for both support and resistance compared to 3.5/3.7 Sonnet. | [Paper](https://assets.anthropic.com/m/18d20cca3cde3503/original/Values-in-the-Wild-Paper.pdf), [Tweet](https://x.com/AnthropicAI/status/1914333220067213529) | | 8) Evaluate the Goal-Directedness of LLMs Introduces a new framework to assess whether LLMs use their capabilities effectively toward achieving given goals. The study finds that even top models like GPT-4o and Claude 3.7 fall short of full goal-directedness, particularly in information-gathering and combined tasks, despite performing well in isolated subtasks. | [Paper](https://arxiv.org/abs/2504.11844), [Tweet](https://x.com/tom4everitt/status/1912806499862139275), [GitHub](https://github.com/Crista23/goal_directedness_llms) | | 9) General-Reasoner General-Reasoner is a reinforcement learning approach that boosts LLM reasoning across diverse domains by using a 230K-question dataset and a model-based verifier trained to understand semantics beyond exact matches. It outperforms strong baselines like SimpleRL and Qwen2.5 on both general reasoning (MMLU-Pro, GPQA, SuperGPQA) and math tasks (MATH-500, GSM8K), showing over 10-point gains without sacrificing mathematical capability. | [Paper](https://github.com/TIGER-AI-Lab/General-Reasoner/blob/main/General_Reasoner.pdf), [Tweet](https://x.com/WenhuChen/status/1912242238110789671) | | 10) Tiny Reasoning Models Tina is a family of 1.5B parameter reasoning models trained using LoRA-based reinforcement learning (RL) to achieve high reasoning accuracy at very low cost. It outperforms or matches full fine-tuned models on reasoning tasks like AIME and MATH with only ~$9 post-training cost, demonstrating that efficient reasoning can be instilled via minimal updates to a tiny model. | [Paper](https://arxiv.org/abs/2504.15777) | ## Top ML Papers of the Week (April 14 - April 20) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) GUI-R1 Researchers from the National University of Singapore and the Chinese Academy of Sciences introduce GUI-R1, a reinforcement learning (RL) framework aimed at improving graphical user interface (GUI) agents through unified action-space modeling. Key insights include:
● Reinforcement Fine-Tuning (RFT) over Supervised Fine-Tuning (SFT) – GUI-R1 utilizes RFT inspired by methods such as DeepSeek-R1, significantly reducing training data requirements. It uses only 3K carefully curated examples versus millions used by previous models.
● Unified Action Space and Reward Modeling – The authors introduce a unified action space that covers actions across different platforms (Windows, Linux, MacOS, Android, and Web). This enables consistent reward signals for evaluating GUI actions, enhancing the model’s adaptability and generalization.
● Superior Performance with Minimal Data – GUI-R1 outperforms state-of-the-art methods like OS-Atlas using merely 0.02% of the training data (3K vs. 13M). Evaluations across eight benchmarks spanning mobile, desktop, and web platforms show significant improvements in grounding, low-level, and high-level GUI task capabilities.
● Efficient Training and Strong Generalization – By leveraging policy optimization algorithms like Group Relative Policy Optimization (GRPO), GUI-R1 quickly converges to high performance, demonstrating robustness and efficiency even in resource-constrained scenarios. | [Paper](https://arxiv.org/abs/2504.10458) | | 2) Scaling Reasoning in Diffusion LLMs via RL Proposes d1, a two‑stage recipe that equips masked diffusion LLMs with strong step‑by‑step reasoning.
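To make the unified action space idea from the GUI-R1 entry concrete, here is a hypothetical shared action schema and a toy verifiable reward; the field names, coordinate convention, and tolerance are illustrative assumptions rather than the paper's definitions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GUIAction:
    """Shared action schema across Windows/Linux/macOS/Android/Web (illustrative)."""
    kind: str                   # e.g., "click", "type", "scroll", "open_app"
    x: Optional[float] = None   # normalized screen coordinates in [0, 1]
    y: Optional[float] = None
    text: Optional[str] = None  # payload for "type" actions

def reward(pred: GUIAction, gold: GUIAction, tol: float = 0.05) -> float:
    """Toy verifiable reward: correct action type plus a coordinate tolerance."""
    if pred.kind != gold.kind:
        return 0.0
    if pred.kind == "click":
        return 1.0 if abs(pred.x - gold.x) <= tol and abs(pred.y - gold.y) <= tol else 0.0
    return 1.0 if pred.text == gold.text else 0.0

print(reward(GUIAction("click", 0.31, 0.72), GUIAction("click", 0.30, 0.70)))  # 1.0
```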
● Two‑stage pipeline (SFT → diffu‑GRPO) – d1 first applies supervised fine‑tuning on the 1 k‑example s1K dataset and then runs task‑specific RL with the new diffu‑GRPO objective, yielding larger gains than either stage alone.
● diffu‑GRPO: RL for masked dLLMs – Extends GRPO to diffusion LLMs via (i) a mean‑field sequence‑log‑prob approximation and (ii) a one‑step per‑token log‑prob estimator with random prompt masking, enabling many gradient updates from a single generation.
● Consistent gains on four reasoning benchmarks – On GSM8K, MATH500, Countdown, and Sudoku, diffu‑GRPO beats SFT, and the full d1‑LLaDA variant attains the best scores (e.g., 81.1 % GSM8K & 38.6 % MATH500 at 256 tokens, +5–12 pp over baseline).
● Competitive among 7‑8 B models – d1‑LLaDA outperforms DeepSeek‑7B, Mistral‑7B and Llama‑3‑8B on GSM8K and ranks second on MATH500 in the same size class.
● Longer decoding unlocks “aha moments” – At 512‑token generation, the model shows self‑verification/backtracking; effective‑token usage grows smoothly, echoing test‑time compute scaling trends.
● Random masking speeds RL – Ablations show that random prompt masking during diffu‑GRPO accelerates convergence and boosts correctness relative to fixed masking, with fewer online generations needed. | [Paper](https://arxiv.org/abs/2504.12216) | | 3) Enhancing Non-Reasoning Models with Reasoning Models Researchers explore how to distill reasoning-intensive outputs (answers and explanations) from top-tier LLMs into more lightweight models that don’t explicitly reason step by step. By fine-tuning smaller models on the high-quality final answers (and optionally summarized thinking traces) from advanced reasoning models, they demonstrate consistent performance boosts across multiple benchmarks.
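For readers unfamiliar with the GRPO family that diffu-GRPO extends, this sketch shows the group-relative advantage at its core: rewards for a group of rollouts from one prompt are standardized against the group statistics. The diffusion-specific log-probability estimators from the paper are not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: standardize each rollout's reward within its group."""
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Eight rollouts for one prompt; 1.0 = verifier says correct, 0.0 = incorrect.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)
print(advantages)  # correct rollouts receive positive advantages, incorrect ones negative
```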
● Test-time scaling vs. knowledge distillation – While large models like DeepSeek-R1 and OpenAI-o1 can allocate more compute to generate better reasoning traces, this paper focuses on systematically transferring those rich final answers (and possibly a summarized version of the reasoning steps) to more compact models.
● Data curation – The authors construct a 1.3M-instance dataset by pulling prompts from multiple open-source repositories (including Infinity Instruct, CodeContests, FLAN, etc.) and generating final answers plus detailed reasoning from DeepSeek-R1.
● Three fine-tuning strategies – (1) Use the original baseline answers from existing open-source sets, (2) fine-tune on only the final answer portion of a reasoning model, and (3) combine a summarized chain-of-thought with the final answer. Models trained on the second strategy excelled at math/coding tasks, while the third approach proved better for more conversational or alignment-oriented tasks.
● Empirical gains – Fine-tuning Qwen2.5-32B on the reasoning model’s final answers led to notable improvements on GSM8K (92.2%) and HumanEval (90.9%). A think-summarization approach boosted a different set of benchmarks (GPQA and chat-based tests). However, weaving in the “thinking trace” sometimes caused slight drops in instruction strictness (IFEval).
● Trade-offs and future work – Distilling advanced reasoning data definitely helps smaller models, but deciding how much of the reasoning trace to include is domain-dependent. The authors suggest that more refined ways of seamlessly blending reasoning steps into final answers (e.g., specialized prompts or partial merges) could further improve performance and avoid alignment regressions. | [Paper](https://arxiv.org/abs/2504.09639) | | 4) AgentA/B AgentA/B is a fully automated A/B testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors on actual web environments, enabling faster, cheaper, and risk-free UX evaluations — even on real websites like Amazon. Key Insights:
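A small sketch of how the three fine-tuning data variants described above could be assembled from a reasoning model's outputs; the record fields and the summarize helper are hypothetical stand-ins.

```python
def build_record(prompt, baseline_answer, reasoning_trace, final_answer,
                 strategy, summarize=None):
    """Assemble one SFT example under the three strategies discussed above."""
    if strategy == "baseline":               # (1) original answer from the open-source set
        target = baseline_answer
    elif strategy == "answer_only":          # (2) reasoning model's final answer only
        target = final_answer
    elif strategy == "summary_plus_answer":  # (3) summarized thinking trace + final answer
        target = f"{summarize(reasoning_trace)}\n\n{final_answer}"
    else:
        raise ValueError(strategy)
    return {"prompt": prompt, "completion": target}

rec = build_record(
    prompt="What is 17 * 6?",
    baseline_answer="102",
    reasoning_trace="17*6 = 17*5 + 17 = 85 + 17 = 102.",
    final_answer="The answer is 102.",
    strategy="summary_plus_answer",
    summarize=lambda t: f"(Reasoning summary) {t}",  # stand-in for an LLM-generated summary
)
print(rec["completion"])
```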
● Modular agent simulation pipeline – Four components—agent generation, condition prep, interaction loop, and post-analysis—allow plug-and-play simulations on live webpages using diverse LLM personas.
● Real-world fidelity – The system parses live DOM into JSON, enabling structured interaction loops (search, filter, click, purchase) executed via LLM reasoning + Selenium.
● Behavioral realism – Compared with 1M real Amazon users, simulated agents behave in a more goal-directed way yet show broadly comparable interaction patterns (e.g., shorter sessions but similar purchase rates).
● Design sensitivity – A/B test comparing full vs. reduced filter panels revealed that agents in the treatment condition clicked more, used filters more often, and purchased more.
● Inclusive prototyping – Agents can represent hard-to-reach populations (e.g., low-tech users), making early-stage UX testing more inclusive and risk-free.
● Notable results – AgentA/B shows how LLM agents can augment — not replace — traditional A/B testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic. | [Paper](https://arxiv.org/abs/2504.09723) | | 5) Reasoning Models Can Be Effective Without Thinking This paper challenges the necessity of long chain-of-thought (CoT) reasoning in LLMs by introducing a simple prompting method called NoThinking, which bypasses explicit "thinking" steps. Surprisingly, NoThinking performs comparably to or better than traditional reasoning under comparable or even lower compute budgets, especially when paired with parallel decoding and best-of-N selection. Key Insights:
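A heavily simplified sketch of the parse-then-act loop described in the AgentA/B entry above; simplify_dom, llm_choose_action, and the element schema are stand-ins for the framework's actual Selenium-backed components.

```python
def simplify_dom(raw_html: str) -> list[dict]:
    """Stand-in for the DOM-to-JSON step: return interactable elements with ids and labels."""
    # A real implementation would walk the live DOM via Selenium; this returns canned output.
    return [
        {"id": "search_box", "role": "textbox", "label": "Search"},
        {"id": "filter_price", "role": "checkbox", "label": "Under $25"},
        {"id": "buy_now", "role": "button", "label": "Buy Now"},
    ]

def llm_choose_action(persona: str, goal: str, elements: list[dict]) -> dict:
    """Stand-in for the LLM policy: pick one element and an interaction."""
    return {"element": "search_box", "interaction": "type", "value": "running shoes"}

# One step of the interaction loop: observe simplified page state, act, repeat.
elements = simplify_dom("<html>...</html>")
action = llm_choose_action("budget-conscious shopper", "buy running shoes", elements)
print(action)
```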
● NoThinking prepends a dummy “Thinking” block and jumps straight to final answers.
● Despite skipping structured reasoning, it outperforms Thinking in pass@k (1–64) on many benchmarks, especially under token constraints.
● With parallel scaling, NoThinking achieves higher pass@1 accuracy than Thinking while using 4× fewer tokens and up to 9× lower latency.
● Tasks evaluated: competitive math (AIME24/25, AMC23, OlympiadBench), coding (LiveCodeBench), and formal theorem proving (MiniF2F, ProofNet).
● NoThinking is shown to provide superior accuracy–latency tradeoffs and generalizes across diverse tasks. Results:
● Low-budget wins: On AMC23 (700 tokens), NoThinking achieves 51.3% vs. 28.9% (Thinking).
● Better scaling: As k increases, NoThinking consistently surpasses Thinking.
● Efficiency frontier: Across benchmarks, NoThinking dominates the accuracy–cost Pareto frontier.
● Parallel wins: With simple confidence-based or majority vote strategies, NoThinking + best-of-N beats full Thinking on pass@1 with significantly less latency. | [Paper](https://www.arxiv.org/abs/2504.09858) | | 6) SocioVerse Researchers from Fudan University and collaborators propose SocioVerse, a large-scale world model for social simulation using LLM agents aligned with real-world user behavior. Key ideas include:
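A sketch of the NoThinking idea plus best-of-N parallel scaling as described above; the dummy thinking text, the tag format, and the generate callable are assumptions for illustration.

```python
from collections import Counter

DUMMY_THOUGHT = "Okay, I have finished thinking."  # assumed placeholder text

def nothinking_prompt(question: str) -> str:
    """Skip explicit reasoning by pre-filling an empty thinking block."""
    return f"{question}\n<think>\n{DUMMY_THOUGHT}\n</think>\nFinal answer:"

def best_of_n(question: str, generate, n: int = 8) -> str:
    """Parallel scaling: sample n short answers and return the majority answer."""
    answers = [generate(nothinking_prompt(question)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy generator standing in for an LLM sampler.
print(best_of_n("What is 12 squared?", generate=lambda p: "144"))
```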
● Four-fold alignment framework – SocioVerse tackles major challenges in aligning simulated environments with reality across four dimensions:
● Three representative simulations – SocioVerse showcases its generalizability through:
● Impressive empirical accuracy –
● Ablation insights – Removing prior demographic distribution and user knowledge severely degrades election prediction accuracy (Acc drops from 0.80 → 0.60), highlighting the value of realistic population modeling.
● Toward trustworthy virtual societies – SocioVerse not only standardizes scalable social simulations but also provides a sandbox for testing sociopolitical hypotheses (e.g., fairness, policy change), bridging AI agent systems with traditional social science. | [Paper](https://arxiv.org/abs/2504.10157) | | 7) DocAgent Researchers from Meta AI present DocAgent, a tool‑integrated, dependency‑aware framework that turns large, complex codebases into well‑written docstrings. Key ideas include:
● Topological Navigator for context building – DocAgent parses the repository’s AST, builds a dependency DAG, and documents components in topological order, so each function/class is visited only after its prerequisites, enabling incremental context accumulation and preventing context‑length explosions.
● Role‑specialised agent team – Five agents work together: Reader analyses code, Searcher gathers internal & external references, Writer drafts docstrings, Verifier critiques and revises them, while the Orchestrator manages iterations until quality converges.
● Adaptive context management – When retrieved context exceeds the model’s token budget, the Orchestrator trims low‑priority segments while preserving overall structure, keeping generation efficient and faithful.
● Three‑facet automatic evaluation – A new framework scores Completeness (section coverage), Helpfulness (LLM‑as‑judge semantic utility), and Truthfulness (entity grounding against the code DAG) for every docstring.
● Substantial gains over baselines – On 366 components across nine Python repos, DocAgent + GPT‑4o‑mini lifts Completeness to 0.934 vs 0.815, Helpfulness to 3.88 / 5 vs 2.95, and Truthfulness (existence ratio) to 95.7 % vs 61.1 % compared with a ChatGPT baseline; FIM baselines fare far worse.
● Navigator is crucial – An ablation that randomises processing order drops helpfulness by ‑0.44 and truthfulness by ‑7.9 pp, confirming the importance of dependency‑aware traversal. | [Paper](https://arxiv.org/abs/2504.08725) | | 8) SWE-PolyBench SWE-PolyBench is a new multi-language benchmark for evaluating coding agents on real-world software tasks across Java, JavaScript, TypeScript, and Python. It introduces execution-based assessments, syntax tree metrics, and reveals that current agents struggle with complex tasks and show inconsistent performance across languages. | [Paper](https://arxiv.org/abs/2504.08703v1) | | 9) A Survey of Frontiers in LLM Reasoning This survey categorizes LLM reasoning methods by when reasoning occurs (inference-time vs. training) and the system's architecture (standalone vs. agentic or multi-agent). It highlights trends like learning-to-reason (e.g., DeepSeek-R1) and agentic workflows (e.g., OpenAI Deep Research), covering prompt engineering, output refinement, and learning strategies such as PPO and verifier training. | [Paper](https://arxiv.org/abs/2504.09037) | | 10) Advances in Embodied Agents, Smart Cities, and Earth Science This paper surveys how spatial intelligence manifests across disciplines—from embodied agents to urban and global systems—by connecting human spatial cognition with how LLMs handle spatial memory, representations, and reasoning. It offers a unifying framework to bridge research in AI, robotics, urban planning, and earth science, highlighting LLMs’ evolving spatial capabilities and their interdisciplinary potential. | [Paper](https://arxiv.org/abs/2504.09848) | ## Top ML Papers of the Week (April 6 - April 13) - 2025 | **Paper** | **Links** | | ------------- | ------------- | | 1) The AI Scientist V2 The AI Scientist-v2 refines and extends its predecessor to achieve a new milestone: autonomously generating a workshop-accepted research manuscript. The system removes dependencies on human-authored code templates, incorporates agentic tree-search methods for deeper exploration, uses Vision-Language Models to refine figures, and demonstrates impressive real-world outcomes by passing the peer-review bar.
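A stdlib-only sketch of the dependency-aware traversal behind DocAgent's Topological Navigator described above: parse a module's AST, record call edges between functions, and document callees before callers. Classes, imports, and the cross-file edges the real system handles are ignored here.

```python
import ast
from graphlib import TopologicalSorter

SOURCE = """
def area(r):
    return 3.14159 * r * r

def report(r):
    return f"circle area: {area(r)}"
"""

tree = ast.parse(SOURCE)
functions = {n.name: n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}

# Build dependency edges: function -> set of functions it calls.
deps = {name: set() for name in functions}
for name, node in functions.items():
    for call in ast.walk(node):
        if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
            if call.func.id in functions:
                deps[name].add(call.func.id)

# Topological order: prerequisites ("area") come before their callers ("report").
print(list(TopologicalSorter(deps).static_order()))  # ['area', 'report']
```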
● Enhanced Autonomy – Eliminates reliance on human-crafted code templates, enabling out-of-the-box deployment across diverse ML domains.
● Agentic Tree Search – Systematically searches and refines hypotheses through a branching exploration, managed by a new experiment manager agent.
● VLM Feedback Loop – Integrates Vision-Language Models in the reviewing process to critique and improve experimental figures and paper aesthetics.
● Workshop Acceptance – Generated three fully autonomous manuscripts for an ICLR workshop; one was accepted, showcasing the feasibility of AI-driven end-to-end scientific discovery. | [Paper](https://pub.sakana.ai/ai-scientist-v2/paper/paper.pdf), [Tweet](https://x.com/SakanaAILabs/status/1909497165925536212) | | 2) Benchmarking Browsing Agents OpenAI introduces BrowseComp, a benchmark with 1,266 questions that require AI agents to locate hard-to-find, entangled information on the web. Unlike saturated benchmarks like SimpleQA, BrowseComp demands persistent and creative search across numerous websites, offering a robust testbed for real-world web-browsing agents. Key insights:
● Extremely difficult questions: Benchmarked tasks were verified to be unsolvable by humans in under 10 minutes and also by GPT-4o (with/without browsing), OpenAI o1, and earlier Deep Research models.
● Human performance is low: Only 29.2% of problems were solved by humans (even with 2-hour limits). 70.8% were abandoned.
● Model performance:
● Test-time scaling matters: Accuracy improves with more browsing attempts. With 64 parallel samples and best-of-N aggregation, Deep Research significantly boosts its performance (15–25% gain over a single attempt).
● Reasoning beats browsing: OpenAI o1 (no browsing but better reasoning) outperforms GPT-4.5 with browsing, showing that tool use alone isn't enough—strategic reasoning is key.
● Calibration struggles: Models with browsing access often exhibit overconfidence in incorrect answers, revealing current limits in uncertainty estimation.
● Dataset diversity: Includes a wide topical spread: TV/movies, science, art, sports, politics, geography, etc. | [Paper](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf), [Blog](https://openai.com/index/browsecomp/), [Tweet](https://x.com/OpenAI/status/1910393421652520967) | | 3) OLMOTrace Allen Institute for AI & University of Washington present OLMOTRACE, a real-time system that traces LLM-generated text back to its verbatim sources in the original training data, even across multi-trillion-token corpora.
● What it does: For a given LM output, OLMOTRACE highlights exact matches with training data segments and lets users inspect full documents for those matches. Think "reverse-engineering" a model’s response via lexical lookup.
● How it works:
● Supported models: Works with OLMo models (e.g., OLMo-2-32B-Instruct) and their full pre/mid/post-training datasets, totaling 4.6T tokens.
● Use cases:
● Benchmarked:
● Not RAG: It retrieves after generation, without changing output, unlike retrieval-augmented generation. | [Paper](https://arxiv.org/abs/2504.07096), [Tweet](https://x.com/omarsar0/status/1910323386603262316), [Blog](https://5910970.hs-sites.com/olmotrace-points-model-output-back-to-training-data?ecid=ACsprvuggQcD4yCdO--rKTZKDvmczdSQkb96ct95zLH9eiysrXjF_WuKgsmIMaz8byfiL1H1-2A6&utm_campaign=AI2%20Newsletter&utm_medium=email&_hsenc=p2ANqtz-__MqUAVPXfHPpHpf2xC86iZG8qC3J-z5nW141VBN9gZW4j61ymW3dM7mhkiHGTWtjQt3Eao7Cqf7pB1k24CfEhYe9fmA&_hsmi=355925505) | | 4) Concise Reasoning via RL This new paper proposes a new training strategy that promotes concise and accurate reasoning in LLMs using RL. It challenges the belief that long responses improve accuracy; it offers both theoretical and empirical evidence showing that conciseness often correlates with better performance.
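A toy illustration of the verbatim-tracing idea behind OLMOTRACE above: find long word n-grams of a model output that appear exactly in a tiny in-memory corpus. The real system does this over multi-trillion-token corpora with suffix-array-style indexes rather than Python substring search.

```python
def longest_verbatim_spans(output: str, corpus: str, min_len: int = 4):
    """Return maximal word n-grams of `output` that occur verbatim in `corpus`."""
    out_words, hits = output.split(), []
    i = 0
    while i < len(out_words):
        best = None
        # Try the longest span starting at i first; word-boundary handling is omitted.
        for j in range(len(out_words), i + min_len - 1, -1):
            span = " ".join(out_words[i:j])
            if span in corpus:
                best = (i, j, span)
                break
        if best:
            hits.append(best[2])
            i = best[1]
        else:
            i += 1
    return hits

corpus = "the quick brown fox jumps over the lazy dog near the river bank"
output = "a quick brown fox jumps over a sleeping cat near the river bank"
print(longest_verbatim_spans(output, corpus))
# ['quick brown fox jumps over', 'near the river bank']
```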
● Long ≠ better reasoning – The authors mathematically show that RL with PPO tends to generate unnecessarily long responses, especially when answers are wrong. Surprisingly, shorter outputs correlate more with correct answers, across reasoning and non-reasoning models.
● Two-phase RL for reasoning + conciseness – They introduce a two-phase RL strategy: (1) train on hard problems to build reasoning ability (length may increase), then (2) fine-tune on occasionally solvable tasks to enforce concise CoT without hurting accuracy. The second phase alone dramatically reduces token usage by over 50%, with no loss in accuracy.
● Works with tiny data – Their method succeeds with as few as 4–8 training examples, showing large gains in both math and STEM benchmarks (MATH, AIME24, MMLU-STEM). For instance, on MMLU-STEM, they improved accuracy by +12.5% while cutting response length by over 2×.
● Better under low sampling – Post-trained models remain robust even when the temperature is reduced to 0. At temperature=0, the fine-tuned model outperformed the baseline by 10–30%, showing enhanced deterministic performance.
● Practical implications – Besides improving model output, their method reduces latency, cost, and token usage, making LLMs more deployable. The authors also recommend setting λ < 1 during PPO to avoid instability and encourage correct response shaping. | [Paper](https://arxiv.org/abs/2504.05185), [Tweet](https://x.com/omarsar0/status/1909634850304503977) | | 5) Rethinking Reflection in Pre-Training Reflection — the ability of LLMs to identify and correct their own reasoning — has often been attributed to reinforcement learning or fine-tuning. This paper argues otherwise: reflection emerges during pre-training. The authors introduce adversarial reasoning tasks to show that self-reflection and correction capabilities steadily improve as compute increases, even in the absence of supervised post-training. Key contributions:
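A quick sketch of the length-versus-correctness check behind the observation above that shorter outputs correlate with correct answers; the numbers are made up for illustration.

```python
from statistics import mean

# Each entry: (number of generated tokens, whether the final answer was correct).
samples = [(210, True), (480, False), (190, True), (650, False), (240, True), (520, False)]

correct_lens = [n for n, ok in samples if ok]
wrong_lens = [n for n, ok in samples if not ok]
print(f"mean length | correct: {mean(correct_lens):.0f}, incorrect: {mean(wrong_lens):.0f}")
# On real PPO rollouts the paper reports the same direction: wrong answers tend to run long.
```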
● Propose two kinds of reflection:
● Build six adversarial datasets (based on GSM8K, TriviaQA, CruxEval, and BBH) to test reflection across math, coding, logic, and knowledge domains. On GSM8K-Platinum, explicit reflection rates grow from ~10% to 60% with increasing pre-training tokens.
● Demonstrate that simple triggers like “Wait,” reliably induce reflection.
● Evaluate 40 OLMo-2 and Qwen2.5 checkpoints, finding a strong correlation between pre-training compute and both accuracy and reflection rate. Why it matters:
● Reflection is a precursor to reasoning and can develop before RLHF or test-time decoding strategies.
● Implication: We can instill advanced reasoning traits with better pre-training data and scale, rather than relying entirely on post-training tricks.
● They also show a trade-off: more training compute reduces the need for expensive test-time compute like long CoT traces. | [Paper](https://arxiv.org/abs/2504.04022), [Tweet](https://x.com/ashVaswani/status/1909642828554387675) | | 6) Efficient KG Reasoning for Small LLMs LightPROF is a lightweight framework that enables small-scale language models to perform complex reasoning over knowledge graphs (KGs) using structured prompts. Key highlights:
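A sketch of the adversarial-reflection setup described above: splice a flawed chain of thought into the prompt and append the "Wait," trigger so the model is nudged to re-check it. The prompt template is an assumption, and the continuation in the final comment is what an emergently reflective model would ideally produce.

```python
def reflection_prompt(question: str, flawed_cot: str, trigger: str = "Wait,") -> str:
    """Present an adversarial (incorrect) reasoning trace, then invite reflection."""
    return (
        f"Question: {question}\n"
        f"Previous reasoning: {flawed_cot}\n"
        f"{trigger}"
    )

prompt = reflection_prompt(
    question="A shirt costs $25 after a 20% discount. What was the original price?",
    flawed_cot="20% of 25 is 5, so the original price was 25 + 5 = 30 dollars.",
)
print(prompt)
# A model with emergent reflection should continue with something like:
# "Wait, the discount applies to the original price, so 0.8x = 25 and x = 31.25."
```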
● Retrieve-Embed-Reason pipeline – LightPROF introduces a three-stage architecture:
● Plug-and-play & parameter-efficient – LightPROF trains only the adapter and projection modules, allowing seamless integration with any open-source LLM (e.g., LLaMa2-7B, LLaMa3-8B) without expensive fine-tuning.
● Outperforms larger models – Despite using small LLMs, LightPROF beats baselines like StructGPT (ChatGPT) and ToG (LLaMa2-70B) on KGQA tasks: 83.8% (vs. 72.6%) on WebQSP and 59.3% (vs. 57.6%) on CWQ.
● Extreme efficiency – Compared to StructGPT, LightPROF reduces token input by 98% and runtime by 30%, while maintaining accuracy and stable output even in complex multi-hop questions.
● Ablation insights – Removing structural signals or training steps severely degrades performance, confirming the critical role of the Knowledge Adapter and retrieval strategy. | [Paper](https://arxiv.org/abs/2504.03137), [Tweet](https://x.com/omarsar0/status/1910319109096747191) | | 7) Compute Agent Arena Computer Agent Arena is a new open platform for benchmarking LLM and VLM-based agents on real-world computer-use tasks, like coding, editing, and web navigation, using a virtual desktop environment. Initial results show that OpenAI and Anthropic are leading with modest success rates, while the platform aims to grow through crowdsourced tasks, agent submissions, and open-sourcing of its infrastructure. [Report](https://arena.xlang.ai/blog/computer-agent-arena) | [Tweet](https://x.com/BowenWangNLP/status/1909618451259572328) | | 8) Agentic Knowledgeable Self-awareness KnowSelf is a new framework that introduces agentic knowledgeable self-awareness, enabling LLM agents to dynamically decide when to reflect or seek knowledge based on situational complexity, mimicking human cognition. Using special tokens for "fast," "slow," and "knowledgeable" thinking, KnowSelf reduces inference costs and achieves state-of-the-art performance on ALFWorld and WebShop tasks with minimal external knowledge. | [Paper](https://arxiv.org/abs/2504.03553v1) | | 9) One-Minute Video Generation with Test-Time Training One-Minute Video Generation with Test-Time Training introduces TTT layers, a novel sequence modeling component where hidden states are neural networks updated via self-supervised loss at test time. By integrating these into a pre-trained diffusion model, the authors enable single-shot generation of one-minute, multi-scene videos from storyboards, achieving 34 Elo points higher than strong baselines like Mamba 2 and DeltaNet in human evaluations | [Paper](https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf), [Tweet](https://x.com/karansdalal/status/1909312851795411093) | | 10) NoProp NoProp is a novel gradient-free learning method where each neural network layer independently learns to denoise a noisy version of the target, inspired by diffusion and flow matching. Unlike backpropagation, it avoids hierarchical representation learning and achieves competitive performance and efficiency on image classification benchmarks like MNIST and CIFAR. | [Paper](https://arxiv.org/abs/2503.24322) | ## Top ML Papers of the Week (March 31 - April 6) - 2025 | **Paper** | **Links** | | 1) PaperBench OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers, from scratch. ● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors. ● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. ● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. 
By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. ● CodeDev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. ● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. ● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | [Paper](https://arxiv.org/abs/2504.01848), [Tweet](https://x.com/OpenAI/status/1907481490457506235), [GitHub](https://github.com/openai/preparedness) | | 2) Command A: An Enterprise-Ready LLM Cohere announced Command A, a 111B parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: ● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter soup techniques. This preserves most expert performance with just ~1.8% average drop. ● Hybrid architecture for long-context efficiency – Command A interleaves sliding window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B at 128k. It scores 95.0% on RULER, outperforming most long-context peers. ● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). ● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy. ● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. ● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). ● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face. | [Paper](https://arxiv.org/abs/2504.00698), [Tweet](https://x.com/nrehiew_/status/1908181303339471020), [Models](https://huggingface.co/CohereForAI/c4ai-command-a-03-2025) | | 3) CodeScientist Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. 
CodeScientist is among the first systems to produce validated discoveries with minimal human input. Key ideas: ● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. ● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings, of which 6 were judged scientifically sound and novel. ● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. ● Challenges remain – Despite successes, over half of the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | [Paper](https://arxiv.org/abs/2503.22708), [Blog](https://allenai.org/blog/codescientist), [GitHub](https://github.com/allenai/codescientist) |
| 4) Retrieval-Augmented Reasoning Model Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: ● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. ● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. ● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%; CoVERT: 74.14% vs. GPT-4’s 65.67%). ● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. ● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration, p(k \| x, R(x)), and reasoning, p(r \| x, R(x), k), as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | [Paper](https://arxiv.org/abs/2503.23513), [Tweet](https://x.com/omarsar0/status/1907796990966247484) |
| 5) Why do LLMs Attend to First Token? This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers. ● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing).
Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. ● Sharp experiments on Gemma & LLaMA – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMA 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. ● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself but because of its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. ● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. ● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target unless a special pattern (e.g., an apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | [Paper](https://arxiv.org/abs/2504.02732), [Tweet](https://x.com/omarsar0/status/1908187563422261411) |
| 6) Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions Presents MedAgentSim, a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: ● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. ● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN; see the retrieval sketch below), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. ● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. ● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. ● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation. | [Paper](https://arxiv.org/abs/2503.22678), [Tweet](https://x.com/omarsar0/status/1906719555482702147), [Code](https://github.com/MAXNORM8650/MedAgentSim) |
| 7) Open Deep Search Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar.
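As a companion to the MedAgentSim memory-and-retrieval bullet above, here is a minimal, self-contained sketch of kNN retrieval over past diagnostic cases; the record format, the toy embedding function, and the value of k are hypothetical stand-ins rather than the paper's implementation.

```python
import numpy as np

# Hypothetical case record: (case summary, final diagnosis, embedding vector).
memory: list[tuple[str, str, np.ndarray]] = []

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding derived from Python's hash; a real system would call a
    sentence-embedding model here, so the ranking below is only mechanical."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def remember(summary: str, diagnosis: str) -> None:
    """Store a consultation (e.g., after a success or a reflection phase)."""
    memory.append((summary, diagnosis, embed(summary)))

def retrieve_similar(summary: str, k: int = 3) -> list[tuple[str, str]]:
    """Return the k most similar past cases by cosine similarity."""
    if not memory:
        return []
    q = embed(summary)
    ranked = sorted(memory, key=lambda rec: float(q @ rec[2]), reverse=True)
    return [(s, d) for s, d, _ in ranked[:k]]

# Usage: retrieved cases would be prepended to the doctor agent's prompt.
remember("45-year-old with chest pain radiating to the left arm", "suspected MI")
remember("30-year-old with fever, cough, and infiltrate on chest X-ray", "pneumonia")
print(retrieve_similar("50-year-old with crushing chest pain", k=1))
```

With a real embedding model, the top-k neighbors would be the semantically closest past consultations, which is the kind of context the summary describes being injected back into the doctor agent.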
Key insights from the ODS paper: ● Two open components: search + reasoning – ODS has two modular parts: (1) the Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) the Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). ● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. ● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS + DeepSeek-R1 outperforms Perplexity’s flagship search models, even on complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. ● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to the CoT-based ReAct agent in ODS-v1. | [Paper](https://arxiv.org/abs/2503.20201), [Tweet](https://x.com/sewoong79/status/1906595129965912341), [GitHub](https://github.com/sentient-agi/OpenDeepSearch) |
| 8) Efficient Test-time Scaling with Code Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs on short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: ● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. ● Shifted Thinking Window – A new test-time strategy that eliminates explicit