# ML Papers of The Week

[Subscribe to our newsletter](https://nlpnews.substack.com/) to get a weekly list of top ML papers in your inbox.

At DAIR.AI we ❤️ reading ML papers so we've created this repo to highlight the top ML papers of every week. Here is the weekly series:

## 2025

- [Top ML Papers of the Week (March 31 - April 6)](./#top-ml-papers-of-the-week-march-31---april-6---2025)
- [Top ML Papers of the Week (March 24 - March 30)](./#top-ml-papers-of-the-week-march-24---march-30---2025)
- [Top ML Papers of the Week (March 17 - March 23)](./#top-ml-papers-of-the-week-march-17---march-23---2025)
- [Top ML Papers of the Week (March 10 - March 16)](./#top-ml-papers-of-the-week-march-10---march-16---2025)
- [Top ML Papers of the Week (March 3 - March 9)](./#top-ml-papers-of-the-week-march-3---march-9---2025)
- [Top ML Papers of the Week (February 24 - March 2)](./#top-ml-papers-of-the-week-february-24---march-2---2025)
- [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025)
- [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025)
- [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025)
- [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025)
- [Top ML Papers of the Week (January 20 - January 26)](./#top-ml-papers-of-the-week-january-20---january-26---2025)
- [Top ML Papers of the Week (January 13 - January 19)](./#top-ml-papers-of-the-week-january-13---january-19---2025)
- [Top ML Papers of the Week (January 6 - January 12)](./#top-ml-papers-of-the-week-january-6---january-12---2025)

## 2024

- [Top ML Papers of the Week (December 30 - January 5)](./#top-ml-papers-of-the-week-december-30---january-5---2025)
- [Top ML Papers of the Week (December 23 - December 29)](./#top-ml-papers-of-the-week-december-23---december-29---2024)
- [Top ML Papers of the Week (December 16 - December 22)](./#top-ml-papers-of-the-week-december-16---december-22---2024)
- [Top ML Papers of the Week (December 9 - December 15)](./#top-ml-papers-of-the-week-december-9---december-15---2024)
- [Top ML Papers of the Week (December 2 - December 8)](./#top-ml-papers-of-the-week-december-2---december-8---2024)
- [Top ML Papers of the Week (November 25 - December 1)](./#top-ml-papers-of-the-week-november-25---december-1---2024)
- [Top ML Papers of the Week (November 18 - November 24)](./#top-ml-papers-of-the-week-november-18---november-24---2024)
- [Top ML Papers of the Week (November 11 - November 17)](./#top-ml-papers-of-the-week-november-11---november-17---2024)
- [Top ML Papers of the Week (November 4 - November 10)](./#top-ml-papers-of-the-week-november-4---november-10---2024)
- [Top ML Papers of the Week (October 28 - November 3)](./#top-ml-papers-of-the-week-october-28---november-3---2024)
- [Top ML Papers of the Week (October 21 - October 27)](./#top-ml-papers-of-the-week-october-21---october-27---2024)
- [Top ML Papers of the Week (October 14 - October 20)](./#top-ml-papers-of-the-week-october-14---october-20---2024)
- [Top ML Papers of the Week (October 7 - October 13)](./#top-ml-papers-of-the-week-october-7---october-13---2024)
- [Top ML Papers of the Week (September 30 - October 6)](./#top-ml-papers-of-the-week-september-30---october-6---2024)
- [Top ML Papers of the Week (September 23 - September 29)](./#top-ml-papers-of-the-week-september-23---september-29---2024)
- [Top ML Papers of the Week (September 16 - September 22)](./#top-ml-papers-of-the-week-september-16---september-22---2024)
- [Top ML Papers of the Week (September 9 - September 15)](./#top-ml-papers-of-the-week-september-9---september-15---2024)
- [Top ML Papers of the Week (September 2 - September 8)](./#top-ml-papers-of-the-week-september-2---september-8---2024)
- [Top ML Papers of the Week (August 26 - September 1)](./#top-ml-papers-of-the-week-august-26---september-1---2024)
- [Top ML Papers of the Week (August 19 - August 25)](./#top-ml-papers-of-the-week-august-19---august-25---2024)
- [Top ML Papers of the Week (August 12 - August 18)](./#top-ml-papers-of-the-week-august-12---august-18---2024)
- [Top ML Papers of the Week (August 5 - August 11)](./#top-ml-papers-of-the-week-august-5---august-11---2024)
- [Top ML Papers of the Week (July 29 - August 4)](./#top-ml-papers-of-the-week-july-29---august-4---2024)
- [Top ML Papers of the Week (July 22 - July 28)](./#top-ml-papers-of-the-week-july-22---july-28---2024)
- [Top ML Papers of the Week (July 15 - July 21)](./#top-ml-papers-of-the-week-july-15---july-21---2024)
- [Top ML Papers of the Week (July 8 - July 14)](./#top-ml-papers-of-the-week-july-8---july-14---2024)
- [Top ML Papers of the Week (July 1 - July 7)](./#top-ml-papers-of-the-week-july-1---july-7---2024)
- [Top ML Papers of the Week (June 24 - June 30)](./#top-ml-papers-of-the-week-june-24---june-30---2024)
- [Top ML Papers of the Week (June 17 - June 23)](./#top-ml-papers-of-the-week-june-17---june-23---2024)
- [Top ML Papers of the Week (June 10 - June 16)](./#top-ml-papers-of-the-week-june-10---june-16---2024)
- [Top ML Papers of the Week (June 3 - June 9)](./#top-ml-papers-of-the-week-june-3---june-9---2024)
- [Top ML Papers of the Week (May 27 - June 2)](./#top-ml-papers-of-the-week-may-27---june-2---2024)
- [Top ML Papers of the Week (May 20 - May 26)](./#top-ml-papers-of-the-week-may-20---may-26---2024)
- [Top ML Papers of the Week (May 13 - May 19)](./#top-ml-papers-of-the-week-may-13---may-19---2024)
- [Top ML Papers of the Week (May 6 - May 12)](./#top-ml-papers-of-the-week-may-6---may-12---2024)
- [Top ML Papers of the Week (April 29 - May 5)](./#top-ml-papers-of-the-week-april-29---may-5---2024)
- [Top ML Papers of the Week (April 22 - April 28)](./#top-ml-papers-of-the-week-april-22---april-28---2024)
- [Top ML Papers of the Week (April 15 - April 21)](./#top-ml-papers-of-the-week-april-15---april-21---2024)
- [Top ML Papers of the Week (April 8 - April 14)](./#top-ml-papers-of-the-week-april-8---april-14---2024)
- [Top ML Papers of the Week (April 1 - April 7)](./#top-ml-papers-of-the-week-april-1---april-7---2024)
- [Top ML Papers of the Week (March 26 - March 31)](./#top-ml-papers-of-the-week-march-26---march-31---2024)
- [Top ML Papers of the Week (March 18 - March 25)](./#top-ml-papers-of-the-week-march-18---march-25---2024)
- [Top ML Papers of the Week (March 11 - March 17)](./#top-ml-papers-of-the-week-march-11---march-17---2024)
- [Top ML Papers of the Week (March 4 - March 10)](./#top-ml-papers-of-the-week-march-4---march-10---2024)
- [Top ML Papers of the Week (February 26 - March 3)](./#top-ml-papers-of-the-week-february-26---march-3---2024)
- [Top ML Papers of the Week (February 19 - February 25)](./#top-ml-papers-of-the-week-february-19---february-25---2024)
- [Top ML Papers of the Week (February 12 - February 18)](./#top-ml-papers-of-the-week-february-12---february-18---2024)
- [Top ML Papers of the Week (February 5 - February 11)](./#top-ml-papers-of-the-week-february-5---february-11---2024)
- [Top ML Papers of the Week (January 29 - February 4)](./#top-ml-papers-of-the-week-january-29---february-4---2024)
- [Top ML Papers of the Week (January 22 - January 28)](./#top-ml-papers-of-the-week-january-22---january-28---2024)
- [Top ML Papers of the Week (January 15 - January 21)](./#top-ml-papers-of-the-week-january-15---january-21---2024)
- [Top ML Papers of the Week (January 8 - January 14)](./#top-ml-papers-of-the-week-january-8---january-14---2024)
- [Top ML Papers of the Week (January 1 - January 7)](./#top-ml-papers-of-the-week-january-1---january-7---2024)

## 2023

- [Top ML Papers of the Week (December 24 - December 31)](./#top-ml-papers-of-the-week-december-25---december-31)
- [Top ML Papers of the Week (December 18 - December 24)](./#top-ml-papers-of-the-week-december-18---december-24)
- [Top ML Papers of the Week (December 11 - December 17)](./#top-ml-papers-of-the-week-december-11---december-17)
- [Top ML Papers of the Week (December 4 - December 10)](./#top-ml-papers-of-the-week-december-4---december-10)
- [Top ML Papers of the Week (November 27 - December 3)](./#top-ml-papers-of-the-week-november-27---december-3)
- [Top ML Papers of the Week (November 20 - November 26)](./#top-ml-papers-of-the-week-november-20---november-26)
- [Top ML Papers of the Week (November 13 - November 19)](./#top-ml-papers-of-the-week-november-13---november-19)
- [Top ML Papers of the Week (November 6 - November 12)](./#top-ml-papers-of-the-week-november-6---november-12)
- [Top ML Papers of the Week (October 30 - November 5)](./#top-ml-papers-of-the-week-october-30---november-5)
- [Top ML Papers of the Week (October 23 - October 29)](./#top-ml-papers-of-the-week-october-23---october-29)
- [Top ML Papers of the Week (October 16 - October 22)](./#top-ml-papers-of-the-week-october-16---october-22)
- [Top ML Papers of the Week (October 9 - October 15)](./#top-ml-papers-of-the-week-october-9---october-15)
- [Top ML Papers of the Week (October 2 - October 8)](./#top-ml-papers-of-the-week-october-2---october-8)
- [Top ML Papers of the Week (September 25 - October 1)](./#top-ml-papers-of-the-week-september-25---october-1)
- [Top ML Papers of the Week (September 18 - September 24)](./#top-ml-papers-of-the-week-september-18---september-24)
- [Top ML Papers of the Week (September 11 - September 17)](./#top-ml-papers-of-the-week-september-11---september-17)
- [Top ML Papers of the Week (September 4 - September 10)](./#top-ml-papers-of-the-week-september-4---september-10)
- [Top ML Papers of the Week (August 28 - September 3)](./#top-ml-papers-of-the-week-august-28---september-3)
- [Top ML Papers of the Week (August 21 - August 27)](./#top-ml-papers-of-the-week-august-21---august-27)
- [Top ML Papers of the Week (August 14 - August 20)](./#top-ml-papers-of-the-week-august-14---august-20)
- [Top ML Papers of the Week (August 7 - August 13)](./#top-ml-papers-of-the-week-august-7---august-13)
- [Top ML Papers of the Week (July 31 - August 6)](./#top-ml-papers-of-the-week-july-31---august-6)
- [Top ML Papers of the Week (July 24 - July 30)](./#top-ml-papers-of-the-week-july-24---july-30)
- [Top ML Papers of the Week (July 17 - July 23)](./#top-ml-papers-of-the-week-july-17---july-23)
- [Top ML Papers of the Week (July 10 - July 16)](./#top-ml-papers-of-the-week-july-10---july-16)
- [Top ML Papers of the Week (July 3 - July 9)](./#top-ml-papers-of-the-week-july-3---july-9)
- [Top ML Papers of the Week (June 26 - July 2)](./#top-ml-papers-of-the-week-june-26---july-2)
- [Top ML Papers of the Week (June 19 - June 25)](./#top-ml-papers-of-the-week-june-19---june-25)
- [Top ML Papers of the Week (June 12 - June 18)](./#top-ml-papers-of-the-week-june-12---june-18)
- [Top ML Papers of the Week (June 5 - June 11)](./#top-ml-papers-of-the-week-june-5---june-11)
- [Top ML Papers of the Week (May 29 - June 4)](./#top-ml-papers-of-the-week-may-29-june-4)
- [Top ML Papers of the Week (May 22 - 28)](./#top-ml-papers-of-the-week-may-22-28)
- [Top ML Papers of the Week (May 15 - 21)](./#top-ml-papers-of-the-week-may-15-21)
- [Top ML Papers of the Week (May 8 - 14)](./#top-ml-papers-of-the-week-may-8-14)
- [Top ML Papers of the Week (May 1-7)](./#top-ml-papers-of-the-week-may-1-7)
- [Top ML Papers of the Week (April 24 - April 30)](./#top-ml-papers-of-the-week-april-24---april-30)
- [Top ML Papers of the Week (April 17 - April 23)](./#top-ml-papers-of-the-week-april-17---april-23)
- [Top ML Papers of the Week (April 10 - April 16)](./#top-ml-papers-of-the-week-april-10---april-16)
- [Top ML Papers of the Week (April 3 - April 9)](./#top-ml-papers-of-the-week-april-3---april-9)
- [Top ML Papers of the Week (Mar 27 - April 2)](./#top-ml-papers-of-the-week-mar-27---april-2)
- [Top ML Papers of the Week (Mar 20-Mar 26)](./#top-ml-papers-of-the-week-mar-20-mar-26)
- [Top ML Papers of the Week (Mar 13-Mar 19)](./#top-ml-papers-of-the-week-mar-13-mar-19)
- [Top ML Papers of the Week (Mar 6-Mar 12)](./#top-ml-papers-of-the-week-mar-6-mar-12)
- [Top ML Papers of the Week (Feb 27-Mar 5)](./#top-ml-papers-of-the-week-feb-27-mar-5)
- [Top ML Papers of the Week (Feb 20-26)](./#top-ml-papers-of-the-week-feb-20-26)
- [Top ML Papers of the Week (Feb 13 - 19)](./#top-ml-papers-of-the-week-feb-13---19)
- [Top ML Papers of the Week (Feb 6 - 12)](./#top-ml-papers-of-the-week-feb-6---12)
- [Top ML Papers of the Week (Jan 30-Feb 5)](./#top-ml-papers-of-the-week-jan-30-feb-5)
- [Top ML Papers of the Week (Jan 23-29)](./#top-ml-papers-of-the-week-jan-23-29)
- [Top ML Papers of the Week (Jan 16-22)](./#top-ml-papers-of-the-week-jan-16-22)
- [Top ML Papers of the Week (Jan 9-15)](./#top-ml-papers-of-the-week-jan-9-15)
- [Top ML Papers of the Week (Jan 1-8)](./#top-ml-papers-of-the-week-jan-1-8)

[Follow us on Twitter](https://twitter.com/dair_ai)

[Join our Discord](https://discord.gg/SKgkVT8BGJ)

## Top ML Papers of the Week (March 31 - April 6) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
| 1) **PaperBench** - OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers from scratch. ● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks total) co-designed with the original authors. ● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. ● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. ● Code-Dev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. ● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. ● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | [Paper](https://arxiv.org/abs/2504.01848), [Tweet](https://x.com/OpenAI/status/1907481490457506235), [GitHub](https://github.com/openai/preparedness) |
| 2) **Command A: An Enterprise-Ready LLM** - Cohere announced Command A, a 111B-parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: ● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline. Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter soup techniques. This preserves most expert performance with just a ~1.8% average drop. ● Hybrid architecture for long-context efficiency – Command A interleaves sliding window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B's KV cache at 128k. It scores 95.0% on RULER, outperforming most long-context peers. ● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). ● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with a 94.2% pass rate, 4.73 correctness, and 91% unanswerable QA accuracy. ● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. ● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). ● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face. | [Paper](https://arxiv.org/abs/2504.00698), [Tweet](https://x.com/nrehiew_/status/1908181303339471020), [Models](https://huggingface.co/CohereForAI/c4ai-command-a-03-2025) |
| 3) **CodeScientist** - Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas: ● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. ● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings; of these, 6 were judged scientifically sound and novel. ● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. ● Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | [Paper](https://arxiv.org/abs/2503.22708), [Blog](https://allenai.org/blog/codescientist), [GitHub](https://github.com/allenai/codescientist) |
| 4) **Retrieval-Augmented Reasoning Model** - Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: ● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”). It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. ● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. ● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%, CoVERT: 74.14% vs. GPT-4’s 65.67%). ● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. ● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(k | x, R(x))) and reasoning (p(r | x, R(x), k)) as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | [Paper](https://arxiv.org/abs/2503.23513), [Tweet](https://x.com/omarsar0/status/1907796990966247484) |
| 5) **Why do LLMs Attend to the First Token?** - This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers. ● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. ● Sharp experiments on Gemma & LLaMA – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMA 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. ● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself, but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. ● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. ● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target, unless a special pattern (e.g., apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | [Paper](https://arxiv.org/abs/2504.02732), [Tweet](https://x.com/omarsar0/status/1908187563422261411) |
| 6) **Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions** - Presents MedAgentSim, a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings. Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: ● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. ● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. ● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. ● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. ● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation. | [Paper](https://arxiv.org/abs/2503.22678), [Tweet](https://x.com/omarsar0/status/1906719555482702147), [Code](https://github.com/MAXNORM8650/MedAgentSim) |
| 7) **Open Deep Search** - Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar. Key insights: ● Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). ● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. ● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. ● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to CoT-based ReAct in ODS-v1. | [Paper](https://arxiv.org/abs/2503.20201), [Tweet](https://x.com/sewoong79/status/1906595129965912341), [GitHub](https://github.com/sentient-agi/OpenDeepSearch) |
| 8) **Efficient Test-time Scaling with Code** - Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs with short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: ● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems. Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. ● Shifted Thinking Window – A new test-time strategy that eliminates explicit