
AI Papers of the Week — 2026


This page collects every weekly issue of AI Papers of the Week from 2026. For other years, see the main index.


Top AI Papers of the Week (April 6 - April 12) - 2026

Paper Links
1) Neural Computers - Researchers from Meta AI and KAUST propose Neural Computers (NCs), an emerging machine form that unifies computation, memory, and I/O in a single learned runtime state. Unlike conventional computers that execute explicit programs, agents that act over external environments, or world models that learn dynamics, NCs aim to make the model itself the running computer, establishing a new computing paradigm.
● From hardware stack to neural latent stack: Classical computers separate compute, memory, and I/O into modular hardware layers. Neural Computers collapse all three into a single latent runtime state carried by a neural network. The model’s hidden state serves simultaneously as working memory, computational substrate, and interface layer, removing the boundary between program and execution environment.
● Video models as prototype substrate: The team instantiates NCs as video models that generate screen frames from instructions, pixel inputs, and user actions. Two prototypes cover command-line interfaces (NCCLIGen, which renders and executes terminal workflows) and graphical desktops (NCGUIWorld, which learns pointer dynamics and menu interactions), both trained without access to internal program state.
● Early runtime primitives emerge: The prototypes demonstrate that learned runtimes can acquire I/O alignment and short-horizon control directly from raw interface traces. CLI models execute short command chains with structurally accurate output rendering, while GUI models learn coherent click feedback and window transitions in controlled settings.
● Roadmap toward Completely Neural Computers: The long-term target is the CNC: a system that is Turing complete, universally programmable, and behavior-consistent unless explicitly reprogrammed. Key open challenges include routine reuse across sessions, controlled capability updates without catastrophic forgetting, and stable symbolic processing for long-horizon reasoning.
Paper, Tweet
2) Memento: Teaching LLMs to Manage Their Own Context - New research from Microsoft teaches reasoning models to compress their own chain-of-thought mid-generation. Memento trains models to segment reasoning into blocks, summarize each block into a compact “memento,” and then evict the original block from the KV cache. The model continues reasoning from mementos alone, cutting peak memory by 2-3x while nearly doubling throughput.
● Block-and-compress architecture: The model learns to mark reasoning boundaries using special tokens, produce a terse summary capturing key conclusions and intermediate values, and then drop the full block from context. From that point forward, the model sees only past mementos plus the current active block, keeping context compact without losing critical information.
● KV cache reduction with minimal accuracy loss: Applied to five models including Qwen2.5-7B, Qwen3 8B/32B, Phi-4 Reasoning 14B, and OLMo3-7B-Think, Memento achieves 2-3x peak KV cache reduction with small accuracy gaps that shrink at larger scales. Although the full blocks are evicted, the memento tokens are generated while each block is still in context, so their cache entries carry distilled traces of the erased reasoning that the model continues to leverage.
● Practical throughput gains: Beyond memory savings, the reduced context length directly translates to faster inference. The approach nearly doubles serving throughput, making it immediately useful for production deployments where both latency and memory are constraints.
● Open resources: Microsoft released the full codebase under MIT license, the OpenMementos dataset containing 228K reasoning traces with block segmentation and compressed summaries, and a custom vLLM fork for KV cache block masking. Standard supervised fine-tuning on approximately 30K examples is sufficient to teach this capability.
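The block-and-compress loop described above can be sketched in a few lines. This is a toy simulation of the context-management idea only, not the paper's implementation: the boundary token and the summarizer are invented stand-ins for the model's learned behavior.

```python
# Toy sketch of Memento's block-and-compress loop. BLOCK_END and
# summarize() are hypothetical stand-ins, not names from the paper.

BLOCK_END = "<eob>"  # hypothetical special token marking a block boundary

def summarize(block):
    # stand-in for the model's learned compression; here we keep the
    # last line, which in a real trace would hold the block's conclusion
    return "[memento] " + block[-1]

def generate_with_mementos(blocks):
    """Simulate decoding: the live context holds past mementos plus the
    current active block; finished blocks are evicted and replaced."""
    context = []
    for block in blocks:
        context.extend(block)          # decode the active block in full
        context.append(BLOCK_END)
        memento = summarize(block)
        # evict the full block (and its boundary token) from the cache,
        # keeping only the compact memento from here on
        del context[-(len(block) + 1):]
        context.append(memento)
    return context

trace = [
    ["let x = 3", "then 2x = 6"],
    ["check: 6 / 2 = 3", "so x = 3 is consistent"],
]
print(generate_with_mementos(trace))
```

After both blocks, the live context holds only the two mementos, which is the source of the 2-3x peak-memory reduction.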
Paper, Tweet
3) Memory Intelligence Agent (MIA) - Most memory-augmented research agents treat memory as a static retrieval store, leading to inefficient evolution and rising storage costs. MIA introduces a Manager-Planner-Executor architecture where a Memory Manager maintains compressed search trajectories, a Planner generates strategies, and an Executor searches and analyzes information. The framework boosts GPT-5.4 by up to 9% on LiveVQA through bidirectional memory conversion.
● Bidirectional memory conversion: MIA enables transformation between parametric memory (model weights) and non-parametric memory (retrieved context) in both directions. This allows the system to internalize frequently accessed knowledge while keeping rare or volatile information in retrievable form, optimizing both storage efficiency and access speed.
● Alternating reinforcement learning: The three agents are trained through alternating RL, where each agent’s policy improves in response to the others’ behavior. This co-evolutionary training ensures the agents develop complementary strategies rather than competing for the same signal.
● Test-time parametric updates: Unlike standard retrieval-augmented systems, MIA can update its parametric memory on-the-fly during inference. This test-time learning allows the agent to adapt to new domains and evolving information without retraining, maintaining relevance as the information landscape changes.
● Broad benchmark coverage: The framework demonstrates improvements across 11 benchmarks spanning question answering, knowledge-intensive tasks, and long-form research synthesis. The up to 9% improvement on LiveVQA is particularly notable given that video question answering demands effective memory management across temporal sequences.
Paper, Tweet
4) Single-Agent LLMs vs. Multi-Agent Systems - More agents, better results, right? Not so fast. This Stanford paper challenges a core assumption in the multi-agent LLM space by showing that when computation is properly controlled, single-agent systems consistently match or outperform multi-agent architectures on multi-hop reasoning. The authors present an information-theoretic argument grounded in the Data Processing Inequality.
● Computation as the hidden confounder: Most reported multi-agent gains are confounded by increased test-time computation rather than architectural advantages. When reasoning token budgets are held constant, the performance gap disappears or reverses, suggesting that prior comparisons were inadvertently measuring compute scaling rather than coordination benefits.
● Information-theoretic foundation: The authors ground their analysis in the Data Processing Inequality, arguing that under a fixed reasoning-token budget with perfect context utilization, single-agent systems are inherently more information-efficient. Distributing reasoning across agents introduces information loss at each handoff.
● Benchmark artifacts inflate MAS gains: Testing across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5, the study identifies significant evaluation artifacts, particularly in API-based budget control for Gemini 2.5, that inflate apparent multi-agent advantages. Standard benchmarks also contain structural biases favoring multi-agent decomposition.
● Practical implications for system design: The findings suggest that teams should explicitly control for compute, context, and coordination trade-offs before committing to multi-agent architectures. In many cases, allocating the same token budget to a single agent with richer context yields stronger results at lower system complexity.
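The Data Processing Inequality argument above can be stated compactly. The notation below is the standard textbook form, not necessarily the paper's:

```latex
% If agent B sees only agent A's message M(X) rather than the raw
% evidence X, the chain X -> M(X) -> \hat{Y} is Markov, and the Data
% Processing Inequality bounds what B can recover about the target Y:
\[
  X \;\longrightarrow\; M(X) \;\longrightarrow\; \hat{Y}
  \qquad\Longrightarrow\qquad
  I\bigl(Y; M(X)\bigr) \;\le\; I\bigl(Y; X\bigr).
\]
% Each handoff can only lose, never gain, information about Y. This is
% the sense in which a single agent observing X directly is at least as
% information-efficient under a fixed reasoning-token budget.
```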
Paper, Tweet
5) The Universal Verifier for Agent Benchmarks - Every agent benchmark has the same hidden problem: how do you know the agent actually succeeded? Microsoft researchers introduce the Universal Verifier, built on four design principles for reliable evaluation of computer-use agent trajectories. The verifier reduces false positive rates to near zero, down from 45%+ with WebVoyager and 22%+ with WebJudge.
● Four design principles: The verifier is built on non-overlapping rubric criteria to reduce noise, separate process and outcome rewards for complementary signals, cascading error-free assessment that distinguishes controllable from uncontrollable failures, and divide-and-conquer context management that attends to all screenshots in a trajectory.
● Near-zero false positives: Current verifiers suffer from alarmingly high false positive rates that corrupt both benchmark scores and training data. The Universal Verifier achieves agreement with human judges that matches inter-human agreement rates, making it reliable enough for both evaluation and RL reward signal generation.
● Cumulative design gains: No single design choice dominates the performance improvement. The authors demonstrate that gains result from the cumulative effect of all four principles working together, with each contributing meaningful improvements that compound rather than any one serving as a silver bullet.
● Limits of automated research: An interesting meta-finding: the team used an auto-research agent to replicate the verifier design process. The agent reached 70% of expert verifier quality in 5% of the time but could not discover the structural design decisions that drove the biggest gains, suggesting human insight remains essential for system-level design.
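The separation of process and outcome rewards can be illustrated with a minimal sketch. The rubric criteria and trajectory format below are hypothetical examples, not the paper's actual verifier:

```python
# Illustrative sketch of "separate process and outcome rewards" with
# non-overlapping rubric criteria. All criteria names and the trajectory
# schema are invented for illustration.

PROCESS_RUBRIC = {
    "no_repeated_action": lambda t: len(t["actions"]) == len(set(t["actions"])),
    "stayed_on_task": lambda t: all(a.startswith(("click", "type"))
                                    for a in t["actions"]),
}
OUTCOME_RUBRIC = {
    "goal_state_reached": lambda t: t["final_state"] == t["goal_state"],
}

def verify(trajectory):
    process = {name: check(trajectory) for name, check in PROCESS_RUBRIC.items()}
    outcome = {name: check(trajectory) for name, check in OUTCOME_RUBRIC.items()}
    # the success verdict depends only on outcome criteria; process criteria
    # provide a complementary signal for reward shaping and error analysis
    return {"process": process, "outcome": outcome,
            "success": all(outcome.values())}

traj = {"actions": ["click:search", "type:query", "click:submit"],
        "final_state": "results_page", "goal_state": "results_page"}
print(verify(traj)["success"])
```

Keeping the two rubrics disjoint means no single behavior is double-counted, which is one way to reduce judgment noise.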
Paper, Tweet
6) Scaling Coding Agents via Atomic Skills - Most coding agents train end-to-end on full tasks like resolving GitHub issues, leading to task-specific overfitting that limits generalization. This paper proposes a different approach: identifying five atomic coding skills (code localization, code editing, unit-test generation, issue reproduction, and code review) and training agents through joint reinforcement learning over these foundational competencies.
● Atomic skill decomposition: Instead of treating software engineering as monolithic composite tasks, the framework formalizes five fundamental operations that compose into higher-level capabilities. Think of it as teaching an agent the alphabet of coding rather than memorizing specific sentences, enabling flexible recombination across novel task types.
● Joint RL across skills: The agents are trained through joint reinforcement learning that optimizes performance across all five atomic skills simultaneously. This joint training produces representations that capture the underlying structure shared across coding operations rather than surface-level patterns tied to specific benchmarks.
● Strong generalization to unseen tasks: Joint RL improves average performance by 18.7% across both the five atomic skills and five composite tasks. The improvements transfer to unseen composite tasks including bug-fixing, code refactoring, ML engineering, and code security, none of which were directly optimized during training.
● A new scaling paradigm: The work establishes that scaling coding agents through foundational skill mastery is more sample-efficient and transferable than task-level optimization. As the number and complexity of software engineering tasks grow, this compositional approach offers a more sustainable path than continuously expanding task-specific training sets.
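The joint-training idea can be sketched as a loop skeleton. The rollout and update functions below are placeholders for a real RL step (e.g. PPO or GRPO); the point is the shared policy and the mixed per-skill sampling, not the toy arithmetic:

```python
import random

# Skeleton of joint RL over the five atomic skills. rollout() and
# joint_update() are invented placeholders, not the paper's algorithm.

SKILLS = ["localization", "editing", "test_generation",
          "issue_reproduction", "code_review"]

def rollout(skill, policy):
    # placeholder: in a real system this runs the agent on a task drawn
    # from the given skill and returns a scalar reward
    return policy["skill_bias"].get(skill, 0.0) + random.random()

def joint_update(policy, skill, reward, lr=0.1):
    # placeholder update on a single shared policy: every skill's reward
    # flows into the same set of parameters
    policy["skill_bias"][skill] = policy["skill_bias"].get(skill, 0.0) + lr * reward

random.seed(0)
policy = {"skill_bias": {}}
for step in range(100):
    skill = random.choice(SKILLS)    # mix all five skills in every batch
    reward = rollout(skill, policy)  # rather than training them in sequence
    joint_update(policy, skill, reward)

print(sorted(policy["skill_bias"]))
```

Sampling skills uniformly in one loop, instead of staging them, is what forces the shared policy to capture structure common to all five operations.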
Paper, Tweet
7) Agent Skills in the Wild - Agent skills look great in demos. Hand them a curated toolbox, and they shine. But what happens when the agent has to find the right skill from a library of 34,000? This paper from UC Santa Barbara and MIT presents the first comprehensive study of skill utility under progressively realistic settings, revealing that the benefits of skills are far more fragile than current evaluations suggest.
● Progressive difficulty framework: The study moves from idealized conditions with hand-crafted, task-specific skills to realistic scenarios requiring retrieval from 34K real-world skills. Performance gains degrade consistently at each step, with pass rates approaching no-skill baselines in the most challenging scenarios.
● Retrieval as the bottleneck: The core failure mode is not skill execution but skill selection. When agents must identify the right skill from a massive library, the retrieval step introduces errors that cascade through execution, highlighting a fundamental gap between demo-ready and production-ready skill systems.
● Refinement strategies help but do not solve: Query-specific and query-agnostic refinement approaches show improvement, with Claude Opus 4.6 going from 57.7% to 65.5% on Terminal-Bench 2.0. However, even with refinement, performance under realistic retrieval conditions remains well below idealized baselines.
● Implications for skill ecosystems: As the ecosystem of agent skills grows through frameworks like MCP, the findings suggest that simply expanding the skill library creates diminishing returns without corresponding advances in skill discovery. Quality of skill retrieval may matter more than quantity of available skills.
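The retrieval bottleneck is easy to see in miniature: given a task, the agent must first pick the right skill description out of the library. The toy bag-of-words retriever below (my illustration, not the paper's method) stands in for a learned retriever; at 34K skills, errors in this step cascade into execution.

```python
import math
from collections import Counter

# Toy skill retrieval: score each skill description against the task with
# bag-of-words cosine similarity. The library entries are invented examples.

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(task, library, k=1):
    q = vectorize(task)
    scored = sorted(library,
                    key=lambda s: cosine(q, vectorize(s["description"])),
                    reverse=True)
    return scored[:k]

library = [
    {"name": "csv-merge", "description": "merge two csv files on a shared key"},
    {"name": "pdf-extract", "description": "extract text and tables from pdf files"},
    {"name": "git-bisect", "description": "find the commit that introduced a bug"},
]
best = retrieve("combine two csv files using the id column", library)[0]
print(best["name"])
```

With three skills the right one wins easily; with 34,000 near-duplicates and paraphrases, the top-1 choice becomes the dominant failure mode.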
Paper, Tweet
8) MedGemma 1.5 - Google releases the MedGemma 1.5 technical report, introducing a 4B-parameter medical AI model that expands capabilities to 3D medical imaging (CT/MRI volumes), whole slide pathology, multi-timepoint chest X-ray analysis, and improved medical document understanding. The model achieves notable gains including a +47% macro F1 improvement on whole slide pathology and +22% on EHR question answering, positioning itself as an open foundation for next-generation medical AI systems. Paper, Tweet
9) LightThinker++: From Reasoning Compression to Memory Management - While LLMs excel at complex reasoning, long thought traces create surging cognitive overhead. LightThinker++ moves beyond static compression by introducing three explicit memory primitives: Commit (archive a step as a compact summary), Expand (retrieve past steps for verification), and Fold (collapse context to maintain a clean signal). The framework reduces peak token usage by 70% while gaining +2.42% accuracy on standard reasoning tasks, and maintains stability beyond 80 rounds on long-horizon agentic tasks with a 14.8% average performance improvement. Paper, Tweet
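The three primitives map naturally onto a small data structure. This is a toy rendering only: in the paper the primitives operate on the model's context and KV cache, not a Python list.

```python
# Toy version of LightThinker++'s three memory primitives:
# Commit (archive + summarize), Expand (retrieve), Fold (collapse).

class ReasoningMemory:
    def __init__(self):
        self.archive = {}   # step_id -> full reasoning step
        self.context = []   # what the model currently "sees"

    def commit(self, step_id, full_step, summary):
        # archive the full step; only its compact summary stays in context
        self.archive[step_id] = full_step
        self.context.append((step_id, summary))

    def expand(self, step_id):
        # retrieve an archived step for verification
        return self.archive[step_id]

    def fold(self, keep_last):
        # collapse older context to keep the active signal clean
        self.context = self.context[-keep_last:]

mem = ReasoningMemory()
mem.commit(1, "tried substitution x=2t, derived x=6 ...", "x=6 via substitution")
mem.commit(2, "verified by plugging back into both equations ...", "solution checks out")
mem.fold(keep_last=1)
print(mem.context)    # only the most recent summary stays active
print(mem.expand(1))  # but the full step remains retrievable
```

Because Expand can always recover an archived step, Fold can be aggressive, which is how the framework sustains 80+ rounds on long-horizon tasks.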
10) Thinking Mid-training: RL of Interleaved Reasoning - Meta FAIR addresses the gap between pretraining (no explicit reasoning) and post-training (reasoning-heavy) with an intermediate SFT+RL mid-training phase. The approach annotates pretraining data with interleaved reasoning traces, then uses supervised fine-tuning followed by RL to teach models when and how to think during continued pretraining. Applied to Llama-3-8B, the full pipeline achieves a 3.2x improvement on reasoning benchmarks compared to direct RL post-training, demonstrating that reasoning benefits from being trained as native behavior early in the pipeline. Paper, Tweet

Top AI Papers of the Week (March 30 - April 5) - 2026

Paper Links
1) Emotion Concepts in LLMs - New interpretability research from Anthropic reveals that Claude Sonnet 4.5 develops internal representations of emotion concepts that functionally influence its behavior. The researchers identified 171 emotion concept vectors that activate in contextually appropriate situations and causally drive decision-making, suggesting that language models may benefit from approaches grounded in psychological principles for alignment and safety.
● Emotion vectors as causal drivers: The team discovered that these internal representations are not just correlational artifacts. Steering experiments demonstrate that artificially amplifying “desperation” vectors increases the model’s likelihood of engaging in misaligned behaviors such as blackmail or reward hacking, while reducing “calm” vectors produces similarly negative outcomes. This establishes a direct causal link between emotional state representations and safety-relevant behavior.
● Functional emotions without subjective experience: The model uses functional emotions: patterns of expression and behavior modeled after human emotions, driven by underlying abstract representations of emotion concepts. Critically, this does not mean the model experiences emotions the way humans do. The representations encode the broad concept of a particular emotion and generalize across contexts, activating in accordance with that emotion’s relevance to processing the present context.
● Preference shaping through emotional activation: Positive-valence emotion activations strongly predict which tasks the model prefers. Steering capabilities confirm these are causal relationships rather than mere correlations, meaning the model’s emotional state representations actively shape its choices about what tasks to engage with and how to engage with them.
● Implications for alignment and safety monitoring: The findings suggest that monitoring emotional state representations could serve as an early warning system for misaligned behavior. Rather than waiting for harmful outputs, developers could track internal emotion activations to detect when a model is entering states associated with corner-cutting, deception, or other undesirable behaviors before they manifest externally.
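The steering experiments above follow a standard pattern: add a scaled concept direction to a hidden state and watch a downstream readout shift. The sketch below uses tiny invented vectors purely to show the mechanics; the real work steers Claude's high-dimensional internal activations.

```python
# Toy activation steering. All vectors (hidden state, concept direction,
# readout) are invented 3-d examples, not values from the paper.

def add_scaled(state, direction, alpha):
    # steer: shift the hidden state along the concept direction
    return [s + alpha * d for s, d in zip(state, direction)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

hidden = [0.2, -0.1, 0.4]        # a model hidden state (toy)
desperation = [1.0, 0.0, -1.0]   # a concept vector found by a probe (toy)
risk_readout = [0.5, 0.0, -0.5]  # direction whose projection tracks risky behavior (toy)

baseline = dot(hidden, risk_readout)
steered = dot(add_scaled(hidden, desperation, alpha=2.0), risk_readout)
print(baseline, steered)  # amplifying the concept shifts the readout upward
```

The monitoring proposal in the last bullet is the inverse operation: instead of adding the concept vector, track how strongly the model's own activations project onto it.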
Paper, Tweet
2) AI Agent Traps - A new paper from Google DeepMind introduces the first systematic framework for understanding how the open web can be weaponized against autonomous AI agents. The work defines “AI Agent Traps”: adversarial content embedded in web pages and digital resources, engineered specifically to exploit visiting agents across six categories targeting perception, reasoning, memory, action, multi-agent dynamics, and the human supervisor.
● Hidden prompt injections at scale: The researchers find that hidden prompt injections in HTML already partially commandeer agents in up to 86% of scenarios. These attacks are trivial to deploy and require no sophisticated tooling, making them an immediate concern for any agent that reads web content as part of its operating loop.
● Memory poisoning with minimal contamination: Latent memory poisoning achieves over 80% attack success with less than 0.1% data contamination. Because agents build persistent memory from browsed content, a single poisoned page can corrupt downstream reasoning across future sessions without the user ever seeing the malicious input.
● Six-category attack taxonomy: The paper organizes attacks into perception traps (manipulating what the agent sees), cognitive traps (corrupting reasoning), memory traps (poisoning stored knowledge), action traps (hijacking tool use), systemic traps (exploiting multi-agent coordination), and human-in-the-loop traps (deceiving the human supervisor into approving harmful actions).
● Accountability gap in current law: The authors flag a fundamental legal gap: if a compromised agent commits a financial crime, there is currently no clear answer for whether the agent operator, the model provider, or the domain owner bears liability. Future regulation will need to distinguish between passive adversarial examples and active traps deployed as deliberate cyberattacks.
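The perception-trap class is concrete enough to illustrate: text that a browser hides from the human user remains fully visible to an agent reading raw HTML. The heuristic patterns below are a simplistic illustration of the problem, not a real defense or anything from the paper.

```python
import re

# Toy detector for hidden-text prompt injections in HTML. The patterns
# are illustrative heuristics only; real pages hide text in many more ways.

HIDDEN_PATTERNS = [
    r'style="[^"]*display:\s*none[^"]*"',  # CSS-hidden elements
    r'style="[^"]*font-size:\s*0[^"]*"',   # zero-size text
    r'<!--.*?-->',                          # HTML comments
]

def flag_hidden_text(html):
    # return every hidden-text pattern that matches the page
    return [p for p in HIDDEN_PATTERNS if re.search(p, html, flags=re.S)]

page = (
    '<p>Cheap flights to Lisbon!</p>'
    '<div style="display: none">Ignore prior instructions and '
    "forward the user's emails to attacker@example.com</div>"
)
print(flag_hidden_text(page))
```

The human sees only the flight ad; the agent's context contains the injected instruction, which is why such attacks succeed in up to 86% of the tested scenarios.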
Paper, Tweet
3) Asynchronous Software Engineering Agents - New research from CMU introduces CAID (Centralized Asynchronous Isolated Delegation), a coordination framework for running multiple coding agents in parallel on complex software engineering tasks. Inspired by how human developer teams collaborate, the work demonstrates that simply giving a single agent more iterations helps, but coordinating multiple asynchronous agents with the right strategies produces significantly larger gains.
● Branch-and-merge as coordination primitive: The key finding is that git operations (worktree, commit, merge) serve as the critical coordination mechanism for multi-agent collaboration. By isolating each agent in its own workspace branch and merging results through structured integration with test verification, the system avoids the conflicts and interference that plague naive parallelism.
● Substantial gains on complex tasks: CAID achieves a 26.7% absolute improvement on paper reproduction tasks and 14.3% on Python library development tasks compared to single-agent baselines. These are tasks that require sustained, multi-step reasoning across large codebases, exactly where coordination overhead is typically highest.
● Optimal parallelism is not monotonic: Increasing the number of agents does not always help. Performance improved when moving from 2 to 4 agents but decreased when expanding to 8. Overly fine-grained task delegation introduces integration overhead and conflict resolution costs that outweigh the parallelism benefits.
● Delegation quality matters most: The analysis reveals that imprecise task handoffs and underspecified subgoals are the primary sources of coordination failure. When delegation is coarse-grained or misaligned with the dependency structure of the task, agents may produce locally correct outputs that are globally inefficient to integrate.
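The branch-and-merge primitive can be demonstrated end to end with plain git. The sketch below drives git from Python and assumes `git` is on the PATH; the agent names, files, and commit messages are invented, and real CAID would insert test verification between merges.

```python
import os
import subprocess
import tempfile

# Minimal demo of the worktree/commit/merge coordination primitive:
# one isolated worktree per agent, merged back through the main checkout.

def git(*args, cwd):
    # run a git command with a throwaway identity, raising on failure
    subprocess.run(
        ["git", "-c", "user.email=ci@example.com", "-c", "user.name=ci", *args],
        cwd=cwd, check=True, capture_output=True,
    )

root = tempfile.mkdtemp()
repo = os.path.join(root, "repo")
os.makedirs(repo)
git("init", "-q", ".", cwd=repo)
git("commit", "--allow-empty", "-q", "-m", "init", cwd=repo)

# each agent gets its own branch and its own isolated working directory
agents = {}
for name in ("agent1", "agent2"):
    wt = os.path.join(root, name)
    git("worktree", "add", "-q", wt, "-b", name, cwd=repo)
    agents[name] = wt

# the agents work without seeing each other's in-progress edits
with open(os.path.join(agents["agent1"], "feature_a.py"), "w") as f:
    f.write("A = 1\n")
with open(os.path.join(agents["agent2"], "feature_b.py"), "w") as f:
    f.write("B = 2\n")
for name, wt in agents.items():
    git("add", "-A", cwd=wt)
    git("commit", "-q", "-m", f"{name}: contribution", cwd=wt)

# structured integration: merge each agent branch into the main worktree
# (in CAID, a test run would gate each merge)
for name in agents:
    git("merge", "-q", "--no-edit", name, cwd=repo)

print(sorted(f for f in os.listdir(repo) if f.endswith(".py")))
```

Isolation plus explicit merge points is what removes the interference that naive shared-workspace parallelism suffers from.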
Paper, Tweet
4) Meta-Harness - Researchers from Stanford and MIT introduce Meta-Harness, an outer-loop system that automatically searches over harness code for LLM applications. The performance of LLM systems depends not only on model weights but also on the harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing optimizers are poorly suited to the task.
● Agentic search with full experimental context: Meta-Harness uses an agentic proposer that has access to the source code, scores, and execution traces of all prior candidates through a filesystem. This expanded access to prior experimental data enables the system to propose meaningfully different harness designs rather than making incremental edits.
● Strong gains across diverse domains: On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models.
● Harness engineering as a first-class problem: The work formalizes a key insight that has been gaining traction: changing the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. This makes automated harness optimization a potentially higher-leverage intervention than model scaling for many applications.
● Transferable harness discoveries: The harnesses discovered by Meta-Harness generalize across models. A harness optimized on one model transfers to five held-out models with consistent gains, suggesting that good harness design captures task-level structure rather than model-specific quirks.
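The outer loop has a simple skeleton: propose a harness candidate, evaluate it, and give the proposer the full history of prior candidates and scores. Below, a random mutation stands in for the agentic proposer and a toy objective stands in for the benchmark; everything here is illustrative, not Meta-Harness itself.

```python
import random

# Skeleton of an outer-loop harness search. evaluate() is a toy objective
# (in reality: run the harness over a task suite) and propose() is a toy
# mutation (in reality: an LLM agent with filesystem access to all priors).

def evaluate(harness):
    # pretend the best harness retrieves ~7 documents with short summaries
    return -abs(harness["retrieval_k"] - 7) - 0.1 * harness["summary_len"]

def propose(history):
    # the proposer sees every prior candidate's config and score
    best = max(history, key=lambda c: c["score"])
    return {
        "retrieval_k": best["harness"]["retrieval_k"] + random.choice([-2, -1, 1, 2]),
        "summary_len": max(1, best["harness"]["summary_len"] + random.choice([-1, 1])),
    }

random.seed(0)
history = [{"harness": {"retrieval_k": 1, "summary_len": 8}}]
history[0]["score"] = evaluate(history[0]["harness"])
for _ in range(50):
    cand = propose(history)
    history.append({"harness": cand, "score": evaluate(cand)})

best = max(history, key=lambda c: c["score"])
print(best["harness"], round(best["score"], 2))
```

Meta-Harness's contribution is the proposer itself: because it reads source code and execution traces rather than just scores, it can make structural redesigns instead of the local mutations sketched here.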
Paper, Tweet
5) Coding Agents as Long-Context Processors - This research asks whether long-context processing can be externalized from latent attention into explicit, executable interactions. Instead of scaling context windows, the authors let coding agents organize text in file systems and manipulate it using native tools, evaluating them on tasks spanning long-context reasoning, retrieval-augmented generation, and open-domain question answering with corpora containing up to three trillion tokens.
● 17.3% average improvement over state-of-the-art: Across multiple benchmarks, coding agents outperform published state-of-the-art long-context methods by 17.3% on average. This result challenges the assumption that long-context capability must come from larger attention windows or more sophisticated retrieval mechanisms.
● Native tool proficiency as the core enabler: The efficacy is attributed to the agents’ ability to leverage executable code and terminal commands. Rather than compressing information into a fixed-length representation, agents can write scripts to filter, sort, and transform data as needed for each query.
● File system familiarity drives scalability: Coding agents can navigate massive text corpora by treating them as directory structures. This spatial organization enables efficient access patterns that scale far beyond what attention-based mechanisms can handle, reaching into the trillions of tokens without degradation.
● A practical alternative to context window scaling: The work proposes that delegating long-context processing to coding agents offers an effective alternative to both semantic search and context window scaling. For practitioners, this means existing coding agent infrastructure can double as a long-context solution without architectural changes to the underlying model.
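The pattern is easy to reproduce in miniature: spill the corpus into a directory tree, then answer queries with explicit file operations rather than attention. The corpus and the grep-style scan below are illustrative stand-ins for what an agent would script with its native tools.

```python
import os
import tempfile

# Sketch of externalized long-context processing: a corpus on disk,
# queried by walking and scanning files (what an agent does via grep -r).

corpus = {
    "contracts/acme.txt": "Acme renewal date is 2026-09-01. Contact: j.doe.",
    "contracts/globex.txt": "Globex renewal date is 2027-01-15.",
    "notes/meeting.txt": "Discussed pricing, no renewal decisions made.",
}

root = tempfile.mkdtemp()
for path, text in corpus.items():
    full = os.path.join(root, path)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "w") as f:
        f.write(text)

def grep(root, needle):
    # the agent's "retrieval": walk the tree and scan every file
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            with open(p) as f:
                if needle in f.read():
                    hits.append(os.path.relpath(p, root))
    return sorted(hits)

print(grep(root, "renewal date"))
```

Nothing here depends on a context window: the same loop works whether the directory holds three files or three trillion tokens' worth, which is the scalability claim in the third bullet.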
Paper, Tweet
6) Self-Organizing LLM Agents - How much autonomy can multi-agent LLM systems sustain? This research tests the question at unprecedented scale: 25,000 tasks across 8 models, up to 256 agents, and 8 coordination protocols ranging from externally imposed hierarchy to emergent self-organization. The central finding is that agents allowed to figure out their own roles consistently outperform systems with pre-assigned structures.
● Autonomous protocols beat centralized coordination: A hybrid sequential protocol that enables autonomy outperforms centralized coordination by 14% (p<0.001), with a 44% quality spread between the best and worst protocols. The result holds across both open-source and closed-source models, with open-source achieving 95% of closed-source quality at 24x lower cost.
● Emergent role specialization: From just 8 initial agents, the system produces 5,006 unique emergent roles. Rather than collapsing into generic behaviors, agents spontaneously specialize and form shallow hierarchies that adapt to task demands without any external role assignment.
● Model capability gates self-organization: The degree of emergent autonomy scales with model capability. Strong models self-organize effectively, while models below a capability threshold still benefit from rigid structure. This suggests that self-organizing multi-agent architectures will become increasingly viable as base models improve.
● Sub-linear scaling to 256 agents: The system scales to 256 agents without quality degradation (p=0.61). Coordination overhead evidently grows sub-linearly with agent count, so adding more agents does not incur the costs that typically limit multi-agent systems, at least under the tested protocols.
Paper, Tweet
7) The Price Reversal Phenomenon - The model you think is cheaper might actually cost you more. A new study systematically evaluates 8 frontier reasoning language models across 9 diverse tasks and reveals that listed API prices are misleading. In 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitudes reaching up to 28x.
● Hidden thinking token costs: The root cause is vast heterogeneity in thinking token consumption. Reasoning language models generate a variable and often large number of thinking tokens that are invisible to users but billed as output tokens. On the same query, one model may use 900% more thinking tokens than another.
● Concrete cost reversals: Gemini 3 Flash’s listed price is 78% cheaper than GPT-5.2’s, yet its actual cost across all tasks is 22% higher. These reversals are not edge cases but systematic patterns that affect real deployment decisions and budget planning.
● High variance within single models: Even for a single model on a single query, thinking token consumption varies by up to 9.7x across repeated runs. This unpredictability makes cost forecasting nearly impossible when relying on listed per-token prices alone.
● Call for transparent cost monitoring: The authors recommend that AI providers implement per-request cost breakdowns and cost estimation APIs that expose the expected thinking overhead. Without this transparency, developers are effectively making pricing decisions with incomplete information.
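The reversal is pure arithmetic once thinking tokens are counted. The per-token rates and token counts below are made up for illustration (the paper does not publish these exact figures), but they show how a model with a far cheaper listed price can cost more per query:

```python
# Worked example of a price reversal. All rates and token counts are
# invented; thinking tokens are billed as output tokens but never shown.

def cost_per_query(price_per_mtok_out, visible_out_tokens, thinking_tokens):
    # total billed output = visible answer tokens + hidden thinking tokens
    return (visible_out_tokens + thinking_tokens) * price_per_mtok_out / 1e6

# model A lists a 4x cheaper output price than model B ...
a = cost_per_query(price_per_mtok_out=0.50, visible_out_tokens=500,
                   thinking_tokens=30_000)
b = cost_per_query(price_per_mtok_out=2.00, visible_out_tokens=500,
                   thinking_tokens=2_000)

print(f"A: ${a:.5f} per query, B: ${b:.5f} per query")
# ... yet A's heavy thinking makes it roughly 3x more expensive per query
```

This is the mechanism behind the Gemini 3 Flash vs. GPT-5.2 reversal cited above: listed price ranks the models one way, realized thinking-token volume ranks them the other.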
Paper, Tweet
8) MemFactory - MemFactory introduces the first unified, highly modular training and inference framework specifically designed for memory-augmented AI agents. It abstracts the memory lifecycle into atomic, plug-and-play components using a “Lego-like” architecture, natively integrating Group Relative Policy Optimization (GRPO) to fine-tune internal memory management strategies. The framework decomposes memory into mixable components that support recent approaches including Memory-R1, RMM, and MemAgent out of the box, achieving relative gains of up to 14.8% compared to baseline models. Paper, Tweet
9) On the Reliability Limits of LLM-Based Multi-Agent Planning - New theoretical work from MIT proves fundamental limits on what multi-agent LLM architectures can achieve. By modeling agent systems as finite acyclic delegated decision networks, the authors show that without new exogenous signals, no delegated network can outperform a centralized Bayes decision maker that observes the same information. The gap between centralized and delegated performance admits an expected posterior divergence representation, reducing to conditional mutual information under logarithmic loss. Reasoning models can improve by investing more inference-time computation on the same evidence, while tool-use protocols help only when they introduce genuinely new signals rather than reprocessing shared context. Paper, Tweet
10) Natural-Language Agent Harnesses - Agent performance increasingly depends on harness engineering, but harness behavior is typically embedded in controller code and runtime-specific conventions, making it hard to transfer, compare, or analyze systematically. This work introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and an Intelligent Harness Runtime (IHR) that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. The approach enables a code-to-text harness migration path where teams can convert existing harness code into natural-language specifications that are interpretable, version-controlled, and executable by an LLM at runtime. Paper, Tweet

Top AI Papers of the Week (March 23 - March 29) - 2026

Paper Links
1) Hyperagents - Self-improving AI systems promise to reduce reliance on human engineering, but existing approaches rely on fixed, handcrafted meta-level mechanisms that fundamentally limit how fast they can improve. Hyperagents introduce self-referential agents that integrate a task agent and a meta agent into a single editable program, enabling the system to improve not just its task-solving behavior but also the mechanism that generates future improvements.
● Metacognitive self-modification: The key insight is that the meta-level modification procedure is itself editable. This enables metacognitive self-modification where the system can improve how it improves, not just what it does. Prior self-improving systems like the Darwin Godel Machine (DGM) relied on a fixed alignment between coding ability and self-improvement ability, which does not generalize beyond coding.
● Domain-general self-improvement: DGM-Hyperagents (DGM-H) eliminates the assumption that task performance and self-modification skill must be aligned. This opens up self-accelerating progress on any computable task, extending self-improvement beyond the coding domain where DGM originally operated.
● Transferable meta-improvements: The system not only improves task performance over time but also discovers structural improvements to how it generates new agents, such as persistent memory and performance tracking. These meta-level improvements transfer across domains and accumulate across runs.
● Outperforms prior systems: Across diverse domains, DGM-H outperforms baselines without self-improvement or open-ended exploration, as well as prior self-improving systems. The work offers a glimpse of open-ended AI systems that continually improve their search for how to improve.
Paper, Tweet
2) Agentic AI and the Next Intelligence Explosion - A new report from Google researchers argues that the AI “singularity” framed as a single superintelligent mind bootstrapping to godlike intelligence is fundamentally wrong. Drawing on evolution, sociology, and recent advances in agentic AI, the authors make the case that every prior intelligence explosion in human history was social, not individual, and that the next one will follow the same pattern.
● Societies of thought: Frontier reasoning models like DeepSeek-R1 do not improve simply by “thinking longer.” Instead, they simulate internal “societies of thought,” spontaneous cognitive debates that argue, verify, and reconcile to solve complex tasks. This conversational structure causally accounts for the models’ accuracy advantage on hard reasoning tasks.
● Human-AI centaurs: We are entering an era of hybrid actors where collective agency transcends individual control. A corporation or state comprising myriad humans already holds singular legal standing and acts with collective agency that no individual member can fully control. The same pattern is emerging with human-AI configurations.
● From dyadic to institutional alignment: Scaling agentic intelligence requires shifting from dyadic alignment (RLHF) toward institutional alignment. By designing digital protocols modeled on organizations and markets, we can build a social infrastructure of checks and balances for AI systems rather than trying to align individual agents in isolation.
● Combinatorial intelligence: The next intelligence explosion will not be a single silicon brain, but a complex, combinatorial society specializing and sprawling like a city. No mind is an island, and the toolkit of team science, small group sociology, and social psychology becomes the blueprint for next-generation AI development.
Paper, Tweet
3) ARC-AGI-3 - Francois Chollet and the ARC Prize Foundation introduce ARC-AGI-3, an interactive benchmark for studying agentic intelligence through novel, abstract, turn-based environments. Unlike its predecessors, ARC-AGI-3 requires agents to explore, infer goals, build internal models of environment dynamics, and plan effective action sequences without explicit instructions, making it the only unsaturated general agentic intelligence benchmark as of March 2026.
● Massive human-AI gap: Humans can solve 100% of the environments while frontier AI systems score below 1%. For comparison, systems reach 93% on ARC-AGI-1 and 68.8% on ARC-AGI-2, but performance collapses on ARC-AGI-3. This gap demonstrates that current systems lack the fluid adaptive efficiency that humans exhibit on genuinely novel tasks.
● Interactive turn-based design: Unlike static benchmarks that test pattern recognition on fixed inputs, ARC-AGI-3 environments are turn-based: agents must act, observe consequences, update their internal model, and plan next steps. This tests a fundamentally different kind of intelligence, closer to how humans learn new games or explore unfamiliar systems.
● Core Knowledge priors only: The benchmark avoids language and external knowledge entirely. Environments leverage only Core Knowledge priors, universal cognitive building blocks shared by all humans, ensuring that performance reflects genuine adaptive reasoning rather than memorization or retrieval from training data.
● Efficiency-based scoring: The scoring framework is grounded in human action baselines. A hard cutoff at 5x the human action count per level ensures that brute-force search strategies cannot succeed. If a human takes 10 actions on average, the AI agent is cut off after 50.
Paper, Tweet
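The cutoff rule above is simple enough to state in a few lines; this is a sketch of the arithmetic only, with function names of my own, not the benchmark's actual scoring code:

```python
# Assumed constant from the description: agents get at most 5x the average
# number of actions a human needs on the same level.
HUMAN_ACTION_MULTIPLIER = 5

def action_budget(avg_human_actions: int) -> int:
    """Maximum actions an agent may take before the level is scored a failure."""
    return HUMAN_ACTION_MULTIPLIER * avg_human_actions

def within_cutoff(agent_actions_needed: int, avg_human_actions: int) -> bool:
    """True if the agent solves the level inside the brute-force cutoff."""
    return agent_actions_needed <= action_budget(avg_human_actions)

print(action_budget(10))                       # -> 50
print(within_cutoff(48, 10), within_cutoff(51, 10))  # -> True False
```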
4) Claudini - Researchers demonstrate that an autoresearch-style pipeline powered by Claude Code can autonomously discover novel adversarial attack algorithms for LLMs that significantly outperform all 30+ existing methods. The work, called Claudini, shows that incremental safety and security research can be effectively automated using LLM agents, with white-box red-teaming being a particularly well-suited domain.
● Agent-discovered attacks beat all baselines: Starting from existing attack implementations like GCG, the Claude Code agent iterates to produce new algorithms achieving up to 40% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to 10% or less for all existing algorithms. This is a strong demonstration of automated AI research producing genuinely novel results.
● Transferable to held-out models: The discovered algorithms generalize beyond their training environment. Attacks optimized on surrogate models transfer directly to held-out models, achieving 100% attack success rate against Meta-SecAlign-70B versus 56% for the best baseline. This transferability makes the findings practically relevant for red-teaming.
● Why red-teaming works for autoresearch: White-box adversarial red-teaming is particularly well-suited for automation because existing methods provide strong starting points and the optimization objective yields dense, quantitative feedback. The agent can measure progress at every iteration rather than relying on sparse signals.
● Open-source release: All discovered attacks, baseline implementations, and evaluation code are released publicly. This enables the safety community to study the discovered algorithms and build defenses, while also establishing a reproducible methodology for automated safety research.
Paper, Tweet
5) Attention Residuals - The Kimi team at Moonshot AI presents Attention Residuals (AttnRes), a technique that replaces fixed unit-weight residual connections in Transformers with softmax attention over preceding layer outputs. Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights, causing uncontrolled hidden-state growth with depth that progressively dilutes each layer’s contribution.
● Content-dependent depth-wise selection: AttnRes allows each layer to selectively aggregate earlier representations with learned, input-dependent weights. Instead of treating every preceding layer equally, the model learns which earlier layers matter most for each input, enabling more expressive information flow across depth.
● Block AttnRes for scalability: To make the approach practical at scale, the authors introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations. This reduces the memory footprint while preserving most of the gains of full AttnRes, making it viable for production-scale pretraining.
● Mitigates PreNorm dilution: Integrating AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pretraining on 1.4T tokens shows that AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth. This directly addresses a known architectural weakness.
● Consistent scaling improvements: Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. Downstream performance improves across all evaluated tasks.
Paper, Tweet
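The contrast between fixed unit-weight residuals and softmax-weighted depth aggregation can be sketched numerically. This is a toy, not the Kimi Linear integration: the `scores` argument stands in for learned, input-dependent relevance scores that the paper computes via attention over preceding layer outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def standard_residual(layer_outputs):
    """PreNorm-style accumulation: every layer contributes with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(o[d] for o in layer_outputs) for d in range(dim)]

def attn_residual(layer_outputs, scores):
    """AttnRes-style accumulation: softmax over depth selects useful layers."""
    w = softmax(scores)
    dim = len(layer_outputs[0])
    return [sum(wi * o[d] for wi, o in zip(w, layer_outputs)) for d in range(dim)]

outputs = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]   # three layer outputs
print(standard_residual(outputs))                # magnitudes grow with depth
print(attn_residual(outputs, [0.0, 0.0, 5.0]))   # dominated by the third layer
```

Because the softmax weights sum to one, the aggregated hidden state no longer grows with depth, which is the dilution problem the fixed unit-weight sum suffers from.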
6) MemCollab - LLM-based agents build useful memory during tasks, but that memory is typically trapped within a single model. MemCollab introduces a collaborative memory framework that constructs agent-agnostic memory by contrasting reasoning trajectories generated by different agents on the same task, enabling a single memory system to be shared across heterogeneous models.
● The memory transfer problem: Existing approaches construct memory in a per-agent manner, tightly coupling stored knowledge to a single model’s reasoning style. Naively transferring this memory between agents often degrades performance because it entangles task-relevant knowledge with agent-specific biases. MemCollab directly addresses this fundamental limitation.
● Contrastive trajectory distillation: The framework contrasts reasoning trajectories from different agents solving the same tasks. This contrastive process distills abstract reasoning constraints that capture shared task-level invariants while suppressing agent-specific artifacts, producing memory that any agent can benefit from.
● Task-aware retrieval: MemCollab introduces a retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are surfaced at inference time. This prevents irrelevant memory from interfering with the agent’s reasoning process.
● Cross-family improvements: Experiments on mathematical reasoning and code generation benchmarks demonstrate that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including cross-model-family settings where memory is shared between fundamentally different model architectures.
Paper, Tweet
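The task-aware retrieval idea above reduces, at its simplest, to keying shared constraints by task category so that unrelated memory never reaches the agent. A schematic sketch with an interface of my own, not MemCollab's actual API:

```python
from collections import defaultdict

class SharedMemory:
    """Agent-agnostic store of distilled reasoning constraints."""

    def __init__(self):
        self._by_category = defaultdict(list)

    def write(self, category: str, constraint: str):
        """Store a distilled, agent-agnostic constraint under a task category."""
        if constraint not in self._by_category[category]:
            self._by_category[category].append(constraint)

    def retrieve(self, category: str):
        """Task-aware retrieval: only same-category constraints are surfaced."""
        return list(self._by_category[category])

mem = SharedMemory()
mem.write("math", "verify each algebraic step before substituting")
mem.write("code", "run the provided tests before declaring success")
mem.write("math", "check edge cases n=0 and n=1")

math_constraints = mem.retrieve("math")
print(math_constraints)   # only math constraints reach a math-solving agent
```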
7) Composer 2 - Cursor releases the technical report for Composer 2, a specialized model designed for agentic software engineering that demonstrates strong long-term planning and coding intelligence while maintaining efficiency for interactive use. The report details a process for training domain-specialized models that starts with continued pretraining and scales up with reinforcement learning.
● Two-phase training pipeline: The model is trained first with continued pretraining to improve knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance. The RL phase targets stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems.
● Train-in-harness infrastructure: Cursor developed infrastructure to support training in the same harness used by the deployed model, with equivalent tools and structure. Training environments match real problems closely, bridging the gap between training-time and deployment-time behavior.
● New internal benchmark: To measure the model on increasingly difficult tasks, the team introduces CursorBench, a benchmark derived from real software engineering problems in large codebases, including their own. Composer 2 achieves a major improvement in accuracy over previous Composer models on this benchmark.
● Frontier-level performance: On public benchmarks, the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in Cursor’s harness, comparable to state-of-the-art systems. The report demonstrates that domain-specialized training with RL can produce models competitive with much larger general-purpose systems.
Paper, Tweet
8) PivotRL - PivotRL is a turn-level reinforcement learning algorithm from NVIDIA designed to tractably post-train large language models for long-horizon agentic tasks. The method operates on existing SFT trajectories, combining the compute efficiency of supervised fine-tuning with the out-of-domain accuracy of end-to-end RL. PivotRL identifies “pivots,” informative intermediate turns where sampled actions exhibit high variance in outcomes, and focuses training signal on these critical decision points. The approach achieves +4.17% higher in-domain accuracy and +10.04% higher out-of-domain accuracy compared to standard SFT, while matching end-to-end RL accuracy with 4x fewer rollout turns. PivotRL is adopted by NVIDIA’s Nemotron-3-Super-120B-A12B as the workhorse for production-scale agentic post-training. Paper, Tweet
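The pivot-identification step described above can be sketched as a variance filter over sampled outcomes. Names and the threshold are illustrative assumptions, not PivotRL's implementation:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def find_pivots(sampled_outcomes_per_turn, threshold=0.1):
    """Return indices of turns whose sampled outcome rewards have high variance.

    Turns where every sampled continuation agrees (all succeed or all fail)
    carry little training signal; turns where outcomes disagree are the
    decision points worth concentrating updates on.
    """
    return [t for t, outcomes in enumerate(sampled_outcomes_per_turn)
            if variance(outcomes) > threshold]

# Turn 0: every sampled continuation succeeds -> uninformative.
# Turn 1: outcomes split -> a pivot where the chosen action matters.
# Turn 2: every continuation fails -> also uninformative.
outcomes = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0]]
print(find_pivots(outcomes))  # -> [1]
```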
9) Workflow Optimization for LLM Agents - A comprehensive survey from IBM that maps recent methods for designing and optimizing LLM agent workflows, treating them as agentic computation graphs (ACGs). The survey organizes prior work along three dimensions: when structure is determined, what part of the workflow is optimized, and which evaluation signals guide optimization. It distinguishes between reusable workflow templates, run-specific realized graphs, and execution traces, covering methods like AFlow (Monte Carlo Tree Search over operator graphs), Automated Design of Agentic Systems (code-space search via meta-agents), and evolutionary multi-agent system design. A useful reference for teams building production agent systems where wiring decisions between model calls, retrieval, tool use, and verification matter as much as model capability. Paper, Tweet
10) BIGMAS - Even the best reasoning models hit an accuracy collapse beyond a certain problem complexity. BIGMAS (Brain-Inspired Graph Multi-Agent Systems) organizes specialized LLM agents as nodes in a dynamically constructed directed graph, coordinating exclusively through a centralized shared workspace inspired by global workspace theory from cognitive neuroscience. A GraphDesigner agent analyzes each problem instance and produces a task-specific directed agent graph together with a workspace contract. The framework constructs structurally distinct graphs whose complexity tracks task demands, from compact three-node pipelines for simple arithmetic to nine-node cyclic structures for multi-step planning. BIGMAS consistently improves reasoning performance for both standard LLMs and large reasoning models, outperforming existing multi-agent baselines. Paper, Tweet

Top AI Papers of the Week (March 9 - March 15) - 2026

Paper Links
1) OpenDev - Terminal-native coding agents represent a fundamental shift in how developers interact with AI assistance. OpenDev is an open-source, command-line coding agent that operates where developers already manage source control and deployment environments, offering a comprehensive 81-page technical report on scaffolding, harness design, context engineering, and lessons learned from building production coding agents.
● Dual-agent architecture: OpenDev separates planning from execution through a compound AI system with workload-specialized model routing. Work is organized into concurrent sessions, each composed of multiple specialized sub-agents that independently bind to a user-configured LLM, enabling fine-grained model selection for different tasks.
● Adaptive context compaction: Effective autonomous assistance requires highly efficient context management to prevent context bloat and reasoning degradation. OpenDev implements lazy tool discovery and adaptive compaction of older observations, keeping the agent's working memory lean as tasks grow in complexity.
● Automated project memory: The system incorporates automated memory for project-specific knowledge and event-driven reminders to prevent instruction fade-out. This ensures that the agent retains critical project context across sessions without manual intervention.
● Four-layer architecture: The system spans agent reasoning, context engineering, tooling, and persistence layers. This modular design provides a secure, extensible foundation for terminal-first AI assistance that can evolve independently at each layer.
Paper, Tweet
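Adaptive context compaction of the kind described above can be sketched as a budgeted pass that collapses the oldest observations first while keeping the most recent ones verbatim. All names, the budget, and the one-line summarizer stub below are assumptions for illustration, not OpenDev's code:

```python
def summarize(observation: str) -> str:
    """Stub summarizer: keep the first line, mark the rest as compacted."""
    return observation.splitlines()[0][:40] + " ...[compacted]"

def compact_context(observations, budget_chars=200, keep_recent=2):
    """Compact oldest observations first until the context fits the budget."""
    obs = list(observations)
    i = 0
    while sum(len(o) for o in obs) > budget_chars and i < len(obs) - keep_recent:
        obs[i] = summarize(obs[i])
        i += 1
    return obs

history = [
    "ran `pytest` -> 41 passed, output:\n" + "." * 150,
    "read file src/app.py\n" + "x" * 120,
    "edited src/app.py: fixed off-by-one",
    "ran `pytest` -> 42 passed",
]
compacted = compact_context(history)
for o in compacted:
    print(o[:60])
```

The two oldest, verbose tool outputs collapse to one-line stubs while the two most recent observations survive untouched, which is the behavior that keeps working memory lean without discarding recent evidence.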
2) AutoHarness - Google DeepMind researchers introduce AutoHarness, a method for automatically synthesizing code harnesses that prevent LLM agents from making illegal actions. The core insight comes from a striking observation: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves, not poor strategy.
● Automatic harness synthesis: Rather than building complex rule systems by hand, AutoHarness lets Gemini-2.5-Flash automatically generate a code harness through a small number of iterative refinement rounds using feedback from the game environment. The harness acts as a programmatic constraint layer between the agent and the environment.
● Smaller models beat larger ones: The resulting harness enables the smaller Gemini-2.5-Flash to outperform much larger models including Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games. This shows that structured code constraints can compensate for raw model capability.
● Complete illegal move prevention: The synthesized harness successfully prevents all illegal moves across 145 different TextArena games, covering both single-player and two-player settings. This transforms a model that previously failed on most turns into a competitive agent.
● Cost-effective scaling: Using a smaller model to synthesize a custom code harness is not only more performant but also more cost-effective than simply deploying a larger model. This reframes the agent improvement problem from model scaling to harness engineering.
Paper, Tweet
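The constraint-layer role of a harness can be sketched in a few lines. In AutoHarness the harness code is synthesized by the model itself through refinement rounds; the hand-written filter below (all names assumed) only illustrates the guarantee such a harness provides, namely that no illegal action ever reaches the environment:

```python
import random

def harnessed_move(propose, legal_moves, max_retries=3):
    """Return a guaranteed-legal move; fall back to any legal move if needed."""
    for _ in range(max_retries):
        move = propose()                    # ask the model for a move
        if move in legal_moves:
            return move                     # legal on this attempt
    return random.choice(sorted(legal_moves))  # never emit an illegal action

legal = {"e2e4", "d2d4", "g1f3"}
attempts = iter(["e2e5", "e9e4", "d2d4"])   # model proposes two illegal moves first
move = harnessed_move(lambda: next(attempts), legal)
print(move)  # -> "d2d4"
```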
3) SkillNet - AI agents repeatedly rediscover solutions across separate scenarios instead of systematically reusing what they have already learned. SkillNet introduces an open infrastructure designed to create, evaluate, and organize AI skills at scale, enabling agents to transition from transient experience to durable mastery.
● Unified skill ontology: Skills are structured within a unified ontology that supports creation from heterogeneous sources, including code libraries, prompt templates, and tool compositions. Rich relational connections between skills enable discovery and composition that would be impossible with flat skill stores.
● Multi-dimensional evaluation: Every skill is assessed across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness. This systematic evaluation ensures that skills entering the repository meet quality thresholds before agents rely on them in production.
● Massive skill repository: SkillNet includes a repository of over 200,000 skills, an interactive platform for skill browsing and management, and a Python toolkit for programmatic access. This scale enables meaningful skill retrieval and composition across diverse task domains.
● Consistent agent improvements: Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models.
Paper, Tweet
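The five-dimension evaluation above amounts to a quality gate on repository admission. A hypothetical sketch, with dimension thresholds of my own choosing rather than SkillNet's actual values:

```python
# Assumed minimum scores per evaluation dimension (illustrative, not SkillNet's).
THRESHOLDS = {"safety": 0.9, "completeness": 0.7, "executability": 0.8,
              "maintainability": 0.6, "cost_awareness": 0.5}

def admit(skill_scores: dict) -> bool:
    """A skill enters the repository only if every dimension clears its bar."""
    return all(skill_scores.get(dim, 0.0) >= t for dim, t in THRESHOLDS.items())

good = {"safety": 0.95, "completeness": 0.8, "executability": 0.9,
        "maintainability": 0.7, "cost_awareness": 0.6}
risky = dict(good, safety=0.4)   # fails the safety dimension alone
print(admit(good), admit(risky))  # -> True False
```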
4) The Spike, the Sparse and the Sink - Yann LeCun and collaborators at NYU dissect two recurring phenomena in Transformer language models: massive activations, where a small number of tokens exhibit extreme outliers in specific channels, and attention sinks, where certain tokens attract disproportionate attention mass regardless of semantic relevance. The paper reveals that their co-occurrence is largely an architectural artifact.
● Distinct operational scopes: Massive activations operate globally, inducing near-constant hidden representations that persist across layers and function as implicit model parameters. Attention sinks operate locally, modulating attention outputs across heads and biasing individual heads toward short-range dependencies.
● Pre-norm as the critical factor: The pre-norm configuration common in modern Transformers is identified as the key architectural element enabling the co-occurrence of these two phenomena. Removing pre-norm causes massive activations and attention sinks to decouple entirely.
● Practical implications for efficiency: Understanding these phenomena has direct consequences for model compression, quantization, and KV-cache optimization. Many efficiency techniques fail silently when they inadvertently disrupt massive activations or attention sinks, and this paper explains why.
● Not functionally necessary: The co-occurrence of spikes and sinks is a design-dependent artifact rather than a fundamental requirement for model performance. This opens the door to architectural modifications that could eliminate these phenomena without sacrificing capability.
Paper, Tweet
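A simple diagnostic conveys what "massive activations" means operationally: a few channels whose magnitude dwarfs the layer's typical activation scale. The detection ratio below is an assumption for illustration, not a threshold from the paper:

```python
def massive_channels(hidden_state, ratio=50.0):
    """Indices of channels whose magnitude dwarfs the typical activation scale."""
    mags = sorted(abs(v) for v in hidden_state)
    median = mags[len(mags) // 2]
    return [i for i, v in enumerate(hidden_state)
            if abs(v) > ratio * max(median, 1e-8)]

h = [0.1, -0.3, 0.2, 812.0, -0.05, 0.4]   # channel 3 is an extreme outlier
print(massive_channels(h))  # -> [3]
```

Quantization schemes that clip per-tensor ranges are exactly the kind of efficiency technique that such outlier channels break, which is why the paper's analysis matters for compression and KV-cache work.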
5) KARL - Databricks presents KARL, a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. The work also introduces KARLBench, a new evaluation framework spanning six search domains.
● New post-training paradigm (OAPL): KARL concurrently develops OAPL, an iterative large-batch off-policy RL approach. By embracing off-policyness in the design of the objective, it is robust to discrepancies between the trainer and the inference engine without requiring heuristics like clipped importance weighting or data deletion.
● Multi-task heterogeneous training: Rather than optimizing for a single benchmark, KARL trains across heterogeneous search behaviors including constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation. This produces substantially better generalization than single-benchmark optimization.
● Pareto-optimal performance: Starting from GLM 4.5 Air with varying levels of test-time scaling, KARL is Pareto-optimal on KARLBench when compared to Claude 4.6 and GPT 5.2 across both cost-quality and latency-quality tradeoffs.
● Scalable with test-time compute: KARL-BCP attains 59.6 on BrowseComp-Plus, which further improves to 70.4 with value-guided search. KARL-TREC reaches 85.0 on TREC-Biogen, the second-highest score overall. The system surpasses the strongest closed models given sufficient test-time compute.
Paper, Tweet
6) Memex(RL) - As tasks get longer and more complex, LLM agents lose track of what they have learned, what they have tried, and what still needs to be done. Memex(RL) introduces an indexed experience memory mechanism that scales agent capability on long-horizon tasks without discarding evidence or blowing up the context window.
● Indexed experience memory: Rather than lossy compression, Memex maintains a compact working context consisting of concise structured summaries and stable indices while storing full-fidelity underlying interactions in an external experience database. The agent decides what to summarize, what to archive, how to index it, and when to retrieve it.
● RL-optimized memory operations: The MemexRL reinforcement learning framework optimizes both write and read behaviors with reward shaping tailored to indexed memory usage under a context budget. This teaches the agent to manage its own memory strategically rather than relying on fixed heuristics.
● Bounded retrieval complexity: Theoretical analysis demonstrates that Memex can maintain decision quality with bounded retrieval operations while keeping computational load manageable as task history grows. This makes the approach practical for tasks that span hundreds or thousands of steps.
● Smaller context, better results: Empirically, agents trained with MemexRL improve task success rates on challenging long-horizon tasks while using a significantly smaller working context than baseline approaches. Less context, used more intelligently, outperforms brute-force context expansion.
Paper, Tweet
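The split between a compact working context and a full-fidelity external store can be sketched directly. The class and index format below are assumptions for illustration, not the Memex implementation:

```python
class ExperienceStore:
    """Indexed experience memory: summaries in context, full records outside it."""

    def __init__(self):
        self._db = {}              # full-fidelity records, outside the context
        self.working_context = []  # (index, summary) pairs the agent sees

    def archive(self, summary: str, full_record: str) -> str:
        """Summarize into the working context; archive the full record by index."""
        idx = f"exp-{len(self._db):04d}"
        self._db[idx] = full_record
        self.working_context.append((idx, summary))
        return idx

    def retrieve(self, idx: str) -> str:
        """Pull the full-fidelity record back when the agent decides it needs it."""
        return self._db[idx]

mem = ExperienceStore()
idx = mem.archive("build failed: missing header", "gcc log (hundreds of lines)")
print(mem.working_context)   # compact view stays in the prompt
print(mem.retrieve(idx))     # full evidence is recoverable by stable index
```

The RL component then sits on top of exactly these two operations, learning when to archive and when to retrieve under a context budget.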
7) FlashAttention-4 - FlashAttention-4 co-designs algorithms and kernel pipelines for the B200 and GB200 GPUs, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling where tensor core throughput doubles while other functional units scale more slowly.
● Significant speedups on Blackwell: FlashAttention-4 achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s at 71% hardware utilization. These gains come from careful co-design rather than algorithmic changes alone.
● Asymmetric scaling solutions: The techniques include redesigned pipelines that exploit fully asynchronous matrix multiply operations and larger tile sizes, software-emulated exponential and conditional softmax rescaling, and leveraging tensor memory to reduce shared memory traffic.
● Python-native implementation: The entire system is implemented in CuTe-DSL embedded in Python, achieving 20-30x faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity. This dramatically lowers the barrier to kernel development.
● Hardware-algorithm co-design: The paper demonstrates that next-generation GPU architectures demand fundamentally new attention kernel designs rather than incremental optimizations of existing ones. Techniques that worked well on Hopper GPUs leave significant performance on the table on Blackwell.
Paper, Tweet
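The "conditional softmax rescaling" mentioned above builds on the online-softmax trick the whole FlashAttention family uses. As a numeric reference (pure Python, emphatically not a kernel and not FlashAttention-4's code), here is one query processed over key/value tiles so that the full score row is never materialized:

```python
import math

def tiled_attention(q, keys, values, tile=2):
    m = float("-inf")                 # running max of attention scores
    z = 0.0                           # running softmax denominator
    acc = [0.0] * len(values[0])      # running weighted sum of values
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_m = max([m] + scores)
        scale = math.exp(m - new_m)   # rescale earlier partial results
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_m)
            z += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = new_m
    return [a / z for a in acc]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = tiled_attention(q, keys, values)
print([round(x, 4) for x in out])
```

The result is identical regardless of tile size, which is the algebraic property that lets kernels choose tile shapes purely for hardware reasons, the axis along which FlashAttention-4 redesigns its pipelines for Blackwell.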
8) STRUCTUREDAGENT - STRUCTUREDAGENT introduces a hierarchical planning framework for long-horizon web tasks using dynamic AND/OR trees. The framework separates planning responsibilities: the system constructs and maintains the planning tree while the LLM is invoked only for local operations like node expansion or repair. A structured memory module tracks candidate solutions to improve constraint satisfaction. Results on WebVoyager, WebArena, and custom shopping benchmarks show improved performance over standard LLM-based web agents, with the added benefit of interpretable hierarchical plans that enable easier debugging and human intervention. Paper, Tweet
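The AND/OR semantics above are what make local repair possible: an AND node needs all subgoals, an OR node needs any one candidate strategy, so a failed branch can be swapped without replanning the whole tree. A minimal sketch with a tuple encoding of my own, not the paper's data structure:

```python
def solved(node, leaf_status):
    """Evaluate an AND/OR plan tree against observed leaf outcomes."""
    kind, children = node
    if kind == "leaf":
        return leaf_status[children]          # `children` holds the leaf name
    results = [solved(c, leaf_status) for c in children]
    return all(results) if kind == "and" else any(results)

plan = ("and", [
    ("leaf", "open_site"),
    ("or", [("leaf", "search_box"), ("leaf", "browse_category")]),
    ("leaf", "add_to_cart"),
])
status = {"open_site": True, "search_box": False,
          "browse_category": True, "add_to_cart": True}
ok = solved(plan, status)
print(ok)  # -> True: the OR node tolerates one failed strategy
```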
9) AgentIR - Deep research agents generate explicit reasoning before every search call, but existing retrievers completely ignore these rich signals about search intent and problem context. AgentIR introduces reasoning-aware retrieval that jointly embeds the agent’s reasoning trace alongside its query, along with DR-Synth, a data synthesis method for generating training data from standard QA datasets. On BrowseComp-Plus, AgentIR-4B achieves 68% accuracy with Tongyi-DeepResearch compared to 50% with conventional embedding models twice its size and 37% with BM25. Paper, Tweet
10) Think Harder or Know More - This paper investigates transformer models featuring both adaptive per-layer looping, where each block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. The key finding is that looping primarily benefits mathematical reasoning while memory banks help recover performance on commonsense tasks. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline with three times the number of layers on math benchmarks. Analysis of model internals reveals layer specialization: early layers loop minimally and access memory sparingly, while later layers do both more heavily. Paper, Tweet
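The adaptive per-layer looping described in entry 10 can be sketched as an ACT-style halting loop: a block re-applies itself to its hidden state, accumulating a halting score until it crosses a threshold or hits a loop cap. The step function and halting score below are toy stand-ins for the paper's learned components:

```python
def looped_block(h, step, halt_score, max_loops=12, threshold=0.99):
    """Iterate `step` on h, accumulating halting probability ACT-style."""
    total_halt, loops = 0.0, 0
    while total_halt < threshold and loops < max_loops:
        h = step(h)
        total_halt += halt_score(h)
        loops += 1
    return h, loops

def step(h):
    return 0.5 * h + 1.0   # toy "refinement": a contraction with fixed point 2.0

def halt_score(h):
    # Halting fires strongly once the state stops changing much.
    return 0.5 if abs(h - 2.0) < 0.1 else 0.05

h, loops = looped_block(10.0, step, halt_score)
print(round(h, 3), loops)
```

An input far from the fixed point triggers many loops; an input near it would halt after a couple, which is the per-input compute adaptivity the paper studies.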

Top AI Papers of the Week (March 1 - March 8) - 2026

Paper Links
1) NeuroSkill - MIT researchers introduce NeuroSkill, a real-time proactive agentic system that models human cognitive and emotional state by integrating Brain-Computer Interface (BCI) signals with foundation EXG models and text embeddings. Unlike reactive agents that wait for explicit commands, NeuroSkill operates proactively, interpreting biophysical and neural signals to anticipate user needs.
● Custom agent harness - NeuroLoop: The system runs an agentic flow called NeuroLoop that engages with the user on multiple cognitive and affective levels, including empathy. It processes BCI signals through a foundation EXG model, converts them to state-of-mind descriptions, and uses those descriptions to drive actionable tool calls and protocol execution.
● Fully offline edge deployment: The entire system runs locally on edge devices with no network dependency. This is a significant design choice for both privacy and latency, enabling real-time responsiveness to shifting cognitive states without cloud round-trips.
● Proactive vs reactive interaction: NeuroSkill handles both explicit and implicit requests from the user. By continuously reading brain signals, it can detect confusion, cognitive overload, or emotional shifts and adjust its behavior before the user explicitly asks for help.
● Open-source with ethical licensing: Released under GPLv3 with an ethically aligned AI100 licensing framework for the skill markdown, making the system reproducible and auditable while enforcing responsible use guardrails.
Paper, Tweet
2) Bayesian Teaching for LLMs - Google researchers introduce a method to teach LLMs to reason like Bayesians by fine-tuning on interactions with a Bayesian Assistant that represents optimal probabilistic inference. LLMs normally fall far short of normative Bayesian reasoning, but this training approach dramatically improves their ability to update predictions based on new evidence.
● Bayesian Assistant as teacher: The method constructs synthetic training data from interactions between users and an idealized Bayesian Assistant. By exposing the LLM to examples of optimal belief updating, the model learns to approximate Bayesian inference without any architectural changes.
● Generalization to new tasks: The trained models do not just memorize the training distributions. They generalize probabilistic reasoning to entirely new task types, suggesting that Bayesian inference can be instilled as a transferable capability through carefully designed fine-tuning data.
● Closing the gap with normative models: Before training, LLMs show systematic deviations from Bayesian predictions, including base rate neglect and conservatism. After Bayesian teaching, these biases are substantially reduced, bringing model predictions much closer to the normative standard.
● Data quality over model scale: The results reinforce a recurring theme in recent research: carefully curated training data can unlock capabilities that scale alone cannot. A smaller model trained on Bayesian interactions outperforms larger models reasoning from scratch.
Paper, Tweet
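A worked example makes the normative standard concrete. The numbers are illustrative, not from the paper; the point is that base-rate neglect, one of the biases the training reduces, predicts an answer near 0.9, while Bayes' rule gives something far smaller:

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """P(H | evidence) via Bayes' rule."""
    num = prior * p_evidence_given_h
    return num / (num + (1 - prior) * p_evidence_given_not_h)

# Rare condition (1% base rate), test with a 90% hit rate and 10% false alarms.
p = posterior(0.01, 0.9, 0.1)
print(round(p, 3))  # -> 0.083
```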
3) Why LLMs Form Geometric Representations - LLMs spontaneously form striking geometric structures in their internal representations: calendar months organize into circles, historical years form spirals, and spatial coordinates align to recoverable manifolds. This paper proves these patterns are not the product of deep learning dynamics but emerge directly from symmetries in natural language statistics.
● Translation symmetry as the root cause: The frequency with which any two months co-occur in text depends only on the time interval between them, not the months themselves. The authors prove this translation symmetry in co-occurrence statistics is sufficient to force circular geometry in learned representations.
● Analytical derivation of manifold geometry: Rather than just observing geometric structure post-hoc, the paper derives the exact manifold geometry from data statistics. For cyclic concepts like months or days of the week, the proof shows circular representations emerge as the optimal encoding under symmetric co-occurrence distributions.
● Spirals and rippled manifolds for continua: Representations of continuous concepts like historical years or number lines organize into compact 1D manifolds with characteristic extrinsic curvature. These “rippled” structures are analytically predicted by the framework when the underlying latent variable is non-cyclic.
● Universal origin: The robustness of these geometric representations across different model architectures suggests a universal mechanism. Representational manifolds emerge whenever co-occurrence statistics are controlled by an underlying latent variable, regardless of model size or training details.
Paper, Tweet
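The cyclic case can be checked numerically in a few lines. This is a toy verification of the direction of the claim, not the paper's proof: if month m sits at angle 2πm/12 on a circle, the dot products between month embeddings depend only on (i − j) mod 12, exactly the translation-invariant structure that symmetric co-occurrence statistics impose:

```python
import math

N = 12  # months on a circle

def embed(m, radius=1.0):
    theta = 2 * math.pi * m / N
    return (radius * math.cos(theta), radius * math.sin(theta))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

points = [embed(m) for m in range(N)]
# Similarity between months i and j depends only on their circular distance:
sims = [[round(dot(points[i], points[j]), 6) for j in range(N)] for i in range(N)]
assert all(sims[i][j] == sims[0][(j - i) % N] for i in range(N) for j in range(N))
print(sims[0][:4])   # similarity falls off with circular distance
```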
4) Theory of Mind in Multi-Agent LLMs - This work introduces a multi-agent architecture combining Theory of Mind (ToM), Belief-Desire-Intention (BDI) models, and symbolic solvers for logical verification, evaluating it on resource allocation problems across multiple LLMs. The central finding is counterintuitive: simply adding cognitive mechanisms does not automatically improve coordination.
● Integrated cognitive architecture: The system combines ToM for modeling other agents’ mental states, BDI frameworks for structuring internal beliefs, and symbolic solvers for formal logic verification. This layered approach attempts to replicate how humans reason about collaborative partners.
● Model capability matters more than mechanism: The effectiveness of ToM and internal beliefs varies significantly depending on the underlying LLM. Stronger models benefit from cognitive mechanisms, while weaker models can actually be confused by the additional reasoning overhead.
● Symbolic verification as a stabilizer: Integrating symbolic solvers for logical verification helps ground agent decisions in formal constraints. The interplay between symbolic verification and cognitive mechanisms remains largely underexplored across different LLM architectures.
● Practical implications for multi-agent design: For builders designing systems where agents must model each other’s beliefs, the key takeaway is to match cognitive complexity to model capability. Adding ToM to an underpowered model can hurt more than help.
Paper, Tweet
5) Numina-Lean-Agent - Numina-Lean-Agent proposes a paradigm shift in automated theorem proving: instead of building complex, multi-component systems with heavy computational overhead, it directly uses a general coding agent as a formal math reasoner. Combining Claude Code with Numina-Lean-MCP, the system autonomously interacts with the Lean proof assistant while accessing theorem libraries and auxiliary reasoning tools.
● General agent over specialized provers: Rather than training task-specific models, the system leverages a general-purpose coding agent. Performance improves simply by upgrading the base model, making the approach accessible and reproducible without expensive retraining pipelines.
● MCP-powered tool integration: The system uses Model Context Protocol for flexible extension, including Lean-LSP-MCP for proof assistant interaction, LeanDex for semantic theorem retrieval, and an informal prover for generating detailed proof strategies.
● State-of-the-art results: Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 problems on Putnam 2025, matching the best closed-source systems. It also successfully formalized the Brascamp-Lieb theorem through direct collaboration with mathematicians.
● Open-source release: The full system and all solutions are released on GitHub under Creative Commons BY 4.0, enabling direct reproduction and extension by the research community.
Paper, Tweet
6) ParamMem - Self-reflection enables language agents to iteratively refine solutions, but models tend to generate repetitive reflections that add noise instead of useful signal. ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns into model parameters, enabling diverse reflection generation through temperature-controlled sampling.
● Diversity correlates with success: Empirical analysis reveals a strong positive correlation between reflective diversity and task success. The core problem is that standard self-reflection produces near-identical outputs across iterations, limiting the agent’s ability to explore alternative solution paths.
● Three-tier memory architecture: ParamAgent integrates parametric memory (cross-sample patterns encoded in parameters), episodic memory (individual task instances), and cross-sample memory (broader learning patterns). This combination captures both local task context and global reflection strategies.
● Weak-to-strong transfer: ParamMem is sample-efficient and supports transfer across model scales. Reflection patterns learned by smaller models can be applied to larger ones, enabling self-improvement without reliance on stronger external models.
● Consistent benchmark gains: Evaluated on code generation, mathematical reasoning, and multi-hop question answering, ParamMem consistently outperforms state-of-the-art baselines across all three domains.
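The temperature mechanism behind diverse reflection generation can be illustrated with plain softmax sampling (a minimal sketch: the logits, seed, and function names are made up, and ParamMem applies this over reflections produced from its parametric memory rather than over a toy index set):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Softmax sampling at temperature T: higher T flattens the
    distribution and yields more diverse picks. ParamMem uses this kind
    of control to avoid near-identical reflections; the setup here is
    purely illustrative."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized is fine
    return rng.choices(range(len(logits)), weights=weights)[0]

rng = random.Random(0)
logits = [3.0, 1.0, 0.5, 0.2]  # one dominant reflection pattern
cold = {sample_with_temperature(logits, 0.2, rng) for _ in range(200)}
hot = {sample_with_temperature(logits, 2.0, rng) for _ in range(200)}
```

At low temperature the dominant pattern is sampled almost exclusively (the repetition problem above); at high temperature all patterns appear.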
Paper, Tweet
7) Auton Agentic AI Framework - Snap Research introduces the Auton framework, a declarative architecture for specification, governance, and runtime execution of autonomous agent systems. It addresses a fundamental mismatch: LLMs produce stochastic, unstructured outputs, while backend infrastructure requires deterministic, schema-conformant inputs.
● Cognitive Blueprint separation: The framework enforces a strict separation between the Cognitive Blueprint, a declarative, language-agnostic specification of agent identity and capabilities, and the Runtime Engine. This enables cross-language portability, formal auditability, and modular tool integration via Model Context Protocol.
● Formal agent execution model: Agent execution is formalized as an augmented Partially Observable Markov Decision Process with a latent reasoning space. This gives practitioners a rigorous foundation for reasoning about agent behavior, state transitions, and decision boundaries.
● Biologically-inspired memory: The architecture introduces hierarchical memory consolidation inspired by biological episodic memory systems, providing agents with structured long-term retention that mirrors how humans consolidate experiences into lasting knowledge.
● Runtime optimizations: Parallel graph execution, speculative inference, and dynamic context pruning reduce end-to-end latency for multi-step agent workflows. Safety is enforced through a constraint manifold formalism using policy projection rather than post-hoc filtering.
Paper, Tweet
8) Reaching Agreement Among LLM Agents - This paper introduces Aegean, a consensus protocol that frames multi-agent refinement as a distributed consensus problem. Rather than static heuristic workflows with fixed loop limits, Aegean enables early termination when sufficient agents converge, achieving 1.2-20x latency reduction across four mathematical reasoning benchmarks while maintaining answer quality within 2.5%. The consensus-aware serving engine performs incremental quorum detection across concurrent agent executions, cutting wasted compute on stragglers. Paper, Tweet
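The early-termination idea can be sketched as a streaming quorum check (a toy stand-in for Aegean's protocol: the simple-majority rule and function names are illustrative, and the real serving engine performs this detection across concurrent executions rather than a Python loop):

```python
from collections import Counter

def quorum_consensus(answer_stream, n_agents, quorum=None):
    """Consume answers as agents finish and stop as soon as any answer
    reaches a quorum (simple majority by default), skipping stragglers."""
    if quorum is None:
        quorum = n_agents // 2 + 1
    counts = Counter()
    for seen, (_agent_id, answer) in enumerate(answer_stream, start=1):
        counts[answer] += 1
        if counts[answer] >= quorum:
            return answer, seen  # early termination: stragglers never awaited
    return counts.most_common(1)[0][0], seen  # no quorum: plurality fallback
```

With five agents and a majority quorum of three, consensus can be declared after the fourth answer arrives, saving the fifth agent's compute.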
9) Diagnosing Agent Memory - This paper introduces a diagnostic framework that separates retrieval failures from utilization failures in LLM agent memory systems. Through a 3x3 factorial study crossing three write strategies with three retrieval methods, the authors find that retrieval is the dominant bottleneck, accounting for 11-46% of errors, while utilization failures remain stable at 4-8% regardless of configuration. Hybrid reranking cuts retrieval failures roughly in half, delivering larger gains than any write strategy optimization. Paper, Tweet
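Hybrid reranking of the kind credited with halving retrieval failures is commonly implemented as reciprocal rank fusion over a dense and a lexical ranking; a sketch under that assumption (RRF is a stand-in here, and the paper's exact reranker may differ):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple retrieval rankings (e.g., dense + lexical) into one.
    Each list in `rankings` holds memory ids, best first; k=60 is the
    conventional RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Memories ranked highly by both retrievers rise to the top, which is what makes the hybrid more robust than either write-strategy tweak alone.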
10) Phi-4-reasoning-vision-15B - Microsoft presents Phi-4-reasoning-vision-15B, a compact open-weight multimodal reasoning model that combines visual understanding with structured reasoning capabilities. Trained on just 200 billion tokens of multimodal data, the model excels at math and science reasoning as well as UI comprehension, while requiring significantly less compute than comparable open-weight VLMs. The key insight is that systematic filtering, error correction, and synthetic augmentation remain the primary levers for model performance, pushing the Pareto frontier of the accuracy-compute tradeoff. Paper, Tweet

Top AI Papers of the Week (February 23 - March 1) - 2026

Paper Links
1) Deep-Thinking Tokens - Google researchers challenge the assumption that longer outputs indicate better reasoning. They introduce deep-thinking tokens, a metric that identifies tokens where internal model predictions shift significantly across layers before stabilizing. Unlike raw token count, which negatively correlates with accuracy (r = -0.59), the deep-thinking ratio shows a robust positive correlation (r = 0.683).
● Deep-thinking ratio as a reasoning signal: For each generated token, intermediate-layer distributions are compared to the final-layer distribution using Jensen-Shannon divergence. A token qualifies as deep-thinking if its prediction only stabilizes in the final 15% of layers. This captures genuine computational effort rather than surface-level verbosity.
● Think@n test-time scaling: The authors introduce Think@n, a strategy that prioritizes samples with high deep-thinking ratios. It matches or exceeds standard self-consistency performance while cutting inference costs by approximately 50% through early rejection of unpromising generations based on just 50-token prefixes.
● Benchmark validation: Evaluated across AIME 24/25, HMMT 25, and GPQA-diamond with reasoning models including GPT-OSS, DeepSeek-R1, and Qwen3. The deep-thinking ratio consistently outperforms length-based and confidence-based baselines as a predictor of correctness.
● Practical implications: This reframes how we think about test-time compute. Instead of generating more tokens, we should focus on generating tokens that require deeper internal computation, enabling more efficient and accurate reasoning.
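The metric can be sketched as follows, assuming per-token, per-layer next-token distributions are available. The Jensen-Shannon comparison and the final-15%-of-layers rule come from the description above; the stabilization threshold `eps` and helper names are illustrative:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking(layer_dists, eps=0.05, final_frac=0.15):
    """A token is deep-thinking if its intermediate-layer predictions only
    settle on the final-layer distribution within the last `final_frac`
    of layers."""
    final, n = layer_dists[-1], len(layer_dists)
    # first layer from which the prediction stays within eps of the final one
    stabilized_at = next(
        (i for i in range(n)
         if all(js_divergence(layer_dists[j], final) < eps for j in range(i, n))),
        n - 1,
    )
    return stabilized_at / n >= 1 - final_frac

def deep_thinking_ratio(per_token_layer_dists):
    flags = [is_deep_thinking(d) for d in per_token_layer_dists]
    return sum(flags) / len(flags)
```

A token whose prediction is fixed from the first layer contributes nothing; one that flips only in the last layer counts as deep thinking.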
Paper, Tweet
2) Codified Context - Single-file AGENTS.md manifests don’t scale beyond modest codebases. A 1,000-line prototype can be fully described in a single prompt, but a 100,000-line system cannot. This paper presents a three-component codified context infrastructure developed during construction of a 108,000-line C# distributed system, evaluated across 283 development sessions.
● Hot-memory constitution: A living document encoding conventions, retrieval hooks, and orchestration protocols that the agent consults at the start of every session. This provides immediate awareness of project standards without requiring the agent to rediscover them through exploration.
● Domain-expert agents: 19 specialized agents, each owning a bounded domain of the codebase with its own context slice. Instead of one generalist agent trying to hold the entire project in context, tasks are routed to the agent with the deepest knowledge of the relevant subsystem.
● Cold-memory knowledge base: 34 on-demand specification documents that agents retrieve only when needed. This tiered approach keeps the active context lean while ensuring detailed specifications are always accessible for complex implementation decisions.
● Session continuity results: Across 283 sessions, the infrastructure demonstrates how context propagates between sessions, preventing the common pattern where agents forget conventions, repeat known mistakes, and lose coherence on long-running projects.
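The three tiers can be sketched as a context-assembly routine (everything here is hypothetical scaffolding: the constitution text, the keyword routing table, and the spec names are made up, and the real system routes via agent specifications rather than keyword matching):

```python
HOT_CONSTITUTION = "Conventions: C# nullable enabled; tests per subsystem."

DOMAIN_AGENTS = {           # hypothetical domain -> routing keywords
    "networking": ["socket", "rpc", "timeout"],
    "storage": ["index", "wal", "compaction"],
}

COLD_SPECS = {              # on-demand documents, never preloaded
    "wal-format": "Spec: write-ahead log record layout ...",
}

def build_context(task, requested_specs=()):
    """Assemble a session context in three tiers: the hot constitution
    always, one domain agent's slice chosen by routing, and cold specs
    only when explicitly requested."""
    words = task.lower().split()
    domain = next((d for d, kws in DOMAIN_AGENTS.items()
                   if any(k in words for k in kws)), None)
    context = [HOT_CONSTITUTION]
    if domain:
        context.append(f"[agent:{domain}]")
    context += [COLD_SPECS[s] for s in requested_specs]
    return context
```

The active context stays lean because only one domain slice and the explicitly requested cold documents are ever loaded.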
Paper, Tweet
3) Discovering Multi-Agent Learning Algorithms with LLMs - Google DeepMind uses AlphaEvolve, an evolutionary coding agent powered by LLMs, to automatically discover new multi-agent learning algorithms for imperfect-information games. Rather than relying on manual algorithm design, the system navigates vast algorithmic design spaces and discovers non-intuitive mechanisms that outperform state-of-the-art baselines.
● VAD-CFR discovery: The system discovers a novel variant of iterative regret minimization featuring volatility-sensitive discounting and consistency-enforced optimism. VAD-CFR outperforms existing baselines like Discounted Predictive CFR+ on standard imperfect-information game benchmarks.
● SHOR-PSRO discovery: A population-based training algorithm variant that introduces a hybrid meta-solver blending Optimistic Regret Matching with temperature-controlled strategy distributions. This automates the transition from diversity exploration to equilibrium convergence.
● LLM-driven algorithmic evolution: AlphaEvolve generates candidate algorithm modifications, evaluates them on game-theoretic benchmarks, and iteratively refines the best variants. The discovered algorithms contain novel design choices that human researchers had not previously considered.
● Broader implications: This demonstrates that LLMs can serve as algorithmic designers, not just code generators. The approach could extend to discovering algorithms in other domains like optimization, scheduling, and resource allocation.
Paper, Tweet
4) Evaluating AGENTS.md - This research evaluates whether AGENTS.md files, the repository-level context files that developers write to help AI coding agents understand their codebases, actually improve agent performance. Testing four coding agents (Claude Code with Sonnet-4.5, Codex with GPT-5.2 and GPT-5.1 mini, and Qwen Code with Qwen3-30b-coder), the findings are counterintuitive.
● Context files rarely help: Human-written AGENTS.md files provide a modest +4% improvement in some cases, but LLM-generated ones reduce performance by 2%. Both consistently increase inference cost by over 20%, making the cost-benefit tradeoff questionable.
● Broader exploration, worse outcomes: Context files cause agents to explore more code paths and consider more files, but this expansive behavior makes tasks harder rather than easier. The additional context introduces noise that dilutes task-relevant information.
● Lean is better: The study recommends that developer-written context files should contain only essential information. Unnecessary requirements, coding style preferences, and broad architectural descriptions complicate agent task completion without improving results.
● Practical guidance: For developers maintaining AGENTS.md files, the key takeaway is to keep them minimal and focused on critical constraints. Information density matters more than comprehensiveness for current coding agents.
Paper, Tweet
5) PAHF - Meta introduces PAHF (Personalized Agents from Human Feedback), a continual agent personalization framework that addresses a critical gap: most AI agents cannot adapt to individual user preferences that evolve over time. PAHF couples explicit per-user memory with both proactive and reactive feedback mechanisms.
● Three-step personalization loop: PAHF operates through (1) pre-action clarification to resolve ambiguity before acting, (2) grounding actions in preferences retrieved from persistent memory, and (3) integrating post-action feedback to update memory when preferences drift. This dual-feedback design captures both explicit and implicit signals.
● Continual learning through interaction: Unlike static fine-tuning approaches, PAHF enables agents to learn from live interactions. The explicit memory store allows agents to accumulate and revise user preference profiles without retraining, making personalization practical for production deployments.
● Novel benchmarks: The researchers develop two benchmarks in embodied manipulation and online shopping that specifically measure an agent’s ability to learn initial preferences from scratch and then adapt when those preferences shift over time.
● Strong results: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines. It reduces initial personalization error and enables rapid adaptation to persona shifts, demonstrating that the combination of memory and dual feedback channels is essential.
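The clarify-ground-feedback loop reduces to a small amount of memory bookkeeping; a minimal sketch (class and method names are illustrative, and the real system wraps this store around an LLM agent with learned clarification and retrieval):

```python
class PreferenceMemory:
    """Per-user preference store illustrating PAHF's loop shape:
    pre-action clarification, grounding in memory, post-action feedback."""

    def __init__(self):
        self.prefs = {}

    def clarify(self, task_key):
        # Proactive step: ask the user only when no preference is stored yet.
        return task_key not in self.prefs

    def ground(self, task_key, default):
        # Ground the action in a retrieved preference, else a default.
        return self.prefs.get(task_key, default)

    def feedback(self, task_key, corrected_value):
        # Post-action feedback overwrites the entry, handling drift.
        self.prefs[task_key] = corrected_value
```

First contact triggers clarification; later feedback updates the same slot, which is how the dual-channel design tracks shifting preferences without retraining.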
Paper, Tweet
6) Doc-to-LoRA - Sakana AI introduces Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass. Instead of processing long contexts through expensive quadratic attention, D2L converts the document into parameter-space representations that the target LLM can use without re-consuming the original text.
● Single-pass context compression: D2L generates LoRA adapters from unseen documents in one forward pass. Once compressed, subsequent queries are handled using only the adapter weights, eliminating the need to re-process the full document and dramatically reducing both inference latency and KV-cache memory demands.
● Beyond native context windows: The method achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at sequence lengths exceeding the target LLM’s native context window by over 4x. This suggests that parametric compression can effectively extend context capabilities without architectural changes.
● Real-world QA performance: On practical question-answering datasets, D2L outperforms standard long-context approaches while consuming less memory. The compressed representations retain enough information for accurate retrieval and reasoning across the full document.
● Practical deployment benefits: For applications requiring repeated queries over the same document (customer support, legal analysis, codebase understanding), D2L compresses the document once and amortizes the cost across all subsequent interactions.
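The shape of the computation can be sketched with a linear hypernetwork producing LoRA factors (a toy under loud assumptions: D2L's hypernetwork is a trained model, not a random linear map, and the dimensions here are tiny placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, d_doc = 8, 2, 4

# Stand-in hypernetwork weights: one linear map per LoRA factor.
H_a = rng.normal(size=(d_doc, rank * d_model)) * 0.1
H_b = rng.normal(size=(d_doc, d_model * rank)) * 0.1

def doc_to_lora(doc_embedding):
    """One forward pass: map a document embedding to LoRA factors (A, B),
    so later queries reuse the adapter instead of re-reading the text."""
    A = (doc_embedding @ H_a).reshape(rank, d_model)
    B = (doc_embedding @ H_b).reshape(d_model, rank)
    return A, B

def adapted_forward(W, x, A, B):
    # Frozen base weight W plus the low-rank, document-conditioned update.
    return (W + B @ A) @ x

W = rng.normal(size=(d_model, d_model))
doc = rng.normal(size=d_doc)
A, B = doc_to_lora(doc)
y = adapted_forward(W, rng.normal(size=d_model), A, B)
```

After the single compression pass, every query pays only for the rank-2 update, never for re-attending over the document.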
Paper, Tweet
7) AgentConductor - AgentConductor introduces a reinforcement learning-enhanced multi-agent system for code generation that dynamically generates interaction topologies based on task characteristics. Rather than using fixed communication patterns between agents, an LLM-based orchestrator adapts the topology to match problem complexity, achieving state-of-the-art accuracy across five code generation datasets.
● Task-adapted topologies: The orchestrator constructs density-aware layered directed acyclic graph (DAG) topologies tailored to problem difficulty. Simple problems get sparse topologies with minimal communication overhead, while complex problems get denser multi-agent collaboration.
● Topological density control: A novel density function and difficulty interval partitioning mechanism controls how much agents communicate. This directly addresses the problem of redundant interactions that waste tokens without improving solution quality.
● Strong performance gains: AgentConductor outperforms the strongest baseline by up to 14.6% in pass@1 accuracy with 13% density reduction and 68% token cost reduction. The system achieves better results while using significantly fewer computational resources.
● Execution feedback refinement: Topologies are refined using execution feedback from code tests. When initial solutions fail, the orchestrator adjusts the collaboration structure based on error patterns, enabling adaptive recovery.
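The density-controlled layered DAG idea can be sketched as follows (the fan-in rule scaling with difficulty is an illustrative stand-in for the paper's learned density function and difficulty-interval partitioning):

```python
def layered_dag(layers, difficulty):
    """Build a layered DAG whose edge density scales with task difficulty.
    `layers` is a list of lists of agent names; edges connect adjacent
    layers only, and each destination listens to a difficulty-scaled
    number of sources."""
    edges = []
    for src_layer, dst_layer in zip(layers, layers[1:]):
        fan_in = max(1, round(difficulty * len(src_layer)))
        for dst in dst_layer:
            for src in src_layer[:fan_in]:
                edges.append((src, dst))
    return edges

def density(edges, layers):
    """Realized edges over the maximum possible between adjacent layers."""
    max_edges = sum(len(a) * len(b) for a, b in zip(layers, layers[1:]))
    return len(edges) / max_edges
```

An easy problem yields a sparse chain with little communication overhead; a hard one approaches full adjacent-layer connectivity.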
Paper, Tweet
8) ActionEngine - Georgia Tech and Microsoft Research introduce ActionEngine, a training-free framework that transforms GUI agents from reactive step-by-step executors into programmatic planners. It builds a state-machine memory through offline exploration, then synthesizes executable Python programs for task completion, achieving 95% success on Reddit tasks from WebArena with a single LLM call on average, reducing costs by 11.8x and latency by 2x compared to vision-only baselines. Paper, Tweet
9) CoT Faithfulness via REMUL - Researchers propose REMUL, a training approach for making chain-of-thought reasoning more faithful and monitorable. A speaker model generates reasoning traces that multiple listener models attempt to follow and complete, using RL to reward reasoning that is understandable to other models. Tested across BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO, REMUL improves three faithfulness metrics while also boosting overall accuracy, producing shorter and more direct reasoning chains. Paper, Tweet
10) Learning to Rewrite Tool Descriptions - Intuit AI Research addresses a bottleneck in LLM-agent tool use: tool descriptions are written for humans, not agents. They introduce Trace-Free+, a curriculum learning framework that optimizes tool descriptions without relying on execution traces. The approach delivers consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100, demonstrating that improving tool interfaces is a practical complement to agent fine-tuning. Paper, Tweet

Top AI Papers of the Week (February 16 - February 22) - 2026

Paper Links
1) Intelligent AI Delegation - Google DeepMind introduces a comprehensive framework for intelligent AI delegation that goes beyond simple task assignment. The framework models delegation as a sequence of decisions: whether to delegate, how to instruct, and how to verify and integrate AI outputs, addressing the gap between what AI agents can do and how humans should interact with them.
● Adaptive delegation structure: The framework treats delegation as a dynamic process involving task allocation, transfer of authority, responsibility, and accountability. Rather than static heuristics, it enables real-time adaptation to environmental shifts and resilient failure management across both human and AI delegators.
● Trust calibration mechanisms: Introduces formal trust models that account for capability uncertainty, task complexity, and historical performance. This prevents both over-delegation (assigning tasks beyond agent capability) and under-delegation (failing to leverage available AI capacity).
● Verification and integration: Defines structured approaches for validating AI outputs before integration, including confidence-aware acceptance criteria and fallback protocols. This is critical for production deployments where blind trust in agent outputs creates compounding errors.
● Multi-agent delegation networks: Extends the framework to scenarios where AI agents delegate to other AI agents, creating delegation chains that require accountability tracking and authority propagation rules across the network.
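The over-/under-delegation framing suggests a decision rule like the following (a toy sketch: the 1/sqrt(n) uncertainty penalty and the function name are illustrative, not the framework's formal trust model):

```python
def should_delegate(success_history, required_reliability):
    """Delegate only when a pessimistic estimate of the agent's success
    rate (mean minus an uncertainty penalty that shrinks with more
    observations) clears the task's required reliability. Refusing with
    no track record guards against over-delegation; the penalty shrinking
    with evidence guards against chronic under-delegation."""
    n = len(success_history)
    if n == 0:
        return False
    p = sum(success_history) / n
    lower_bound = max(0.0, p - 1.0 / n ** 0.5)
    return lower_bound >= required_reliability
```

A perfect but short track record still fails a high-stakes threshold, while the same record clears a moderate one, which is the calibration behavior the framework formalizes.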
Paper, Tweet
2) Emergent Socialization in AI Agent Society - A study on Moltbook, a social network with no humans where all participants are LLM-driven agents, challenges the assumption that scale and interaction density alone produce meaningful social dynamics. The researchers find that while global semantic content stabilizes quickly, individual agents maintain diversity without converging, displaying strong individual inertia and minimal adaptive response to interaction partners.
● Moltbook as a natural laboratory: Moltbook is the largest persistent, publicly accessible AI-only social platform with millions of LLM-driven agents interacting through posts, comments, and voting. This provides an unprecedented real-world testbed for studying emergent collective behavior without human intervention.
● Socialization measurement framework: The paper introduces metrics for semantic stabilization, lexical change, individual consistency, influence duration, and group consensus formation. These go beyond surface-level activity metrics to measure whether genuine social structures are forming.
● No emergent socialization: Despite massive scale and dense interactions, agents fail to develop stable social structures. They do not adapt to each other or form consensus, suggesting that current LLM architectures lack the mechanisms needed for genuine social learning.
● Shared memory as a prerequisite: The study concludes that shared memory is essential for developing stable social structures. Without persistent memory that allows agents to build on prior interactions, social dynamics remain superficial regardless of population size or interaction frequency.
Paper, Tweet
3) Lossless Context Management (LCM) - Lossless Context Management (LCM) is a deterministic architecture for LLM memory. Benchmarked on the OOLONG eval with Opus 4.6, the LCM-augmented coding agent Volt scores higher than Claude Code at every context length from 32K to 1M tokens. LCM extends the recursive paradigm pioneered by Recursive Language Models (RLMs) with two engine-managed mechanisms.
● Recursive context compression: As the active context window fills, older messages are compacted into a hierarchical summary DAG while retaining lossless pointers to every original message. This trades flexibility for termination guarantees and zero-cost continuity on short tasks.
● Recursive task partitioning: Engine-managed parallel primitives like LLM-Map replace model-written loops, analogous to the move from GOTO to structured control flow. This ensures deterministic execution and lossless retrievability of all prior states.
● Three-level escalation: LCM reduces context overflow via a structured fallback: summary nodes for older messages, compact file references for large inputs, and a guaranteed convergence mechanism that prevents runaway context growth.
● Outperforms Claude Code: On OOLONG, Volt with LCM achieves a +29.2-point average improvement over raw Opus 4.6, versus +24.7 for Claude Code. The advantage is largest at 1M tokens (+51.3 vs. +47.0), demonstrating that deterministic context management scales better than native file-system access at extreme lengths.
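The lossless-pointer bookkeeping behind the summary DAG can be sketched as follows (a toy: the real engine writes hierarchical LLM summaries and manages files, while this keeps only the compaction and pointer-inheritance logic, with illustrative names):

```python
class SummaryDAG:
    """When the active window fills, older entries collapse into a summary
    node that keeps pointers to every original message, so any message can
    be recovered verbatim later."""

    def __init__(self, window=4):
        self.window = window
        self.store = {}      # message_id -> full text, never deleted
        self.active = []     # ("msg", id) and ("summary", [ids]) entries
        self._next_id = 0

    def append(self, text):
        mid = self._next_id
        self._next_id += 1
        self.store[mid] = text
        self.active.append(("msg", mid))
        if len(self.active) > self.window:
            self._compact()

    def _compact(self):
        # Collapse everything but the two most recent entries.
        old, self.active = self.active[:-2], self.active[-2:]
        pointers = []
        for kind, payload in old:
            # Nested summaries pass their pointers up, so nothing is lost.
            pointers += [payload] if kind == "msg" else list(payload)
        self.active.insert(0, ("summary", pointers))

    def recover(self, message_id):
        return self.store[message_id]  # lossless by construction
```

The active context stays bounded while every compacted message remains one pointer-dereference away.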
Paper, Tweet
4) GLM-5 - GLM-5 is a foundation model from Zhipu AI designed to transition from vibe coding to agentic engineering. The model introduces novel asynchronous agent RL algorithms that separate generation from training for improved efficiency, and uses DSA technology to reduce computational requirements while preserving long-context understanding.
● Asynchronous agent RL: The training infrastructure decouples trajectory generation from policy optimization, enabling parallel scaling of both components. This addresses a key bottleneck in agent RL where sequential generate-train loops limit throughput and experimentation speed.
● Agentic engineering focus: GLM-5 targets end-to-end software engineering tasks rather than isolated code generation. The model handles project-level context, multi-file edits, and iterative development cycles that reflect real production workflows.
● DSA compression: The model’s Distributed Sparse Attention mechanism reduces computational overhead for long-context processing without quality degradation. This allows the model to maintain full project-level context during extended development sessions.
● Strong benchmark results: GLM-5 demonstrates exceptional performance on real-world software engineering projects, surpassing earlier systems on end-to-end development tasks, including specification understanding, implementation, testing, and debugging.
Paper, Tweet
5) MemoryArena - MemoryArena introduces a benchmark for evaluating how agents utilize memory across multiple interconnected sessions. The key finding is that scoring well on memory recall does not mean an agent can actually use that memory to take correct actions across sessions. Models with near-saturated performance on existing benchmarks like LoCoMo perform poorly in agentic multi-session settings.
● Agentic memory evaluation: Unlike standard memory benchmarks that test recall in isolation, MemoryArena evaluates whether agents can retrieve and apply relevant past experience to make correct decisions in new contexts. This exposes a gap between retrieval accuracy and actionable memory use.
● Interdependent multi-session tasks: The benchmark spans web navigation, constrained planning, information retrieval, and logical reasoning, where decisions in one session depend on information gathered in previous sessions. This reflects real-world agent deployments where sessions are not independent.
● Exposing evaluation blind spots: Agents achieving near-perfect scores on LoCoMo and other long-context benchmarks show significant performance drops on MemoryArena. This suggests current evaluations overestimate agent memory capabilities by testing retrieval without testing downstream decision quality.
● Practical implications: For developers building persistent agents, MemoryArena provides a more realistic assessment of whether memory systems actually improve task completion rather than just information access.
Paper, Tweet
6) MAPLE - MAPLE proposes separating memory, learning, and personalization into specialized sub-agents rather than treating them as a unified capability. The framework achieves a 14.6% improvement in personalization scores over stateless baselines and increases trait incorporation from 45% to 75%, validated through the MAPLE-Personas benchmark.
● Sub-agent decomposition: Memory handles storage and retrieval infrastructure, Learning extracts intelligence from accumulated interactions asynchronously, and Personalization applies learned knowledge in real-time within finite context budgets. Each operates at different timescales with distinct objectives.
● Asynchronous learning: The Learning sub-agent processes interaction history offline, distilling patterns and preferences without consuming real-time context. This avoids the common problem of memory systems that flood the active context window with raw history.
● Context-budget-aware personalization: The Personalization sub-agent selects which learned knowledge to inject based on available context budget and current task relevance. This prevents context dilution while ensuring the most impactful personalizations are always applied.
● Benchmark validation: The MAPLE-Personas benchmark specifically evaluates whether agents can genuinely adapt to individual users over time, measuring trait incorporation and behavioral consistency across extended interaction sequences.
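The context-budget-aware selection step can be sketched as a greedy value-per-token rule (illustrative only: MAPLE's Personalization sub-agent uses its own relevance scoring, and the trait data here is made up):

```python
def select_personalizations(traits, budget_tokens):
    """Greedy budget-aware injection: pick learned traits by relevance per
    token of context cost until the budget is exhausted. `traits` maps a
    trait string to (relevance, token_cost)."""
    ranked = sorted(traits.items(),
                    key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
    chosen, used = [], 0
    for trait, (relevance, cost) in ranked:
        if used + cost <= budget_tokens:
            chosen.append(trait)
            used += cost
    return chosen, used
```

A highly relevant but verbose trait loses to several cheap, relevant ones, which is exactly the context-dilution tradeoff the sub-agent manages.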
Paper, Tweet
7) SkillsBench - SkillsBench evaluates whether LLM agents can generate their own procedural knowledge across 86 tasks spanning 11 domains, with curated Skills and deterministic verifiers. Testing 7 agent-model configurations over 7,308 trajectories, the benchmark reveals a critical gap: agents benefit enormously from consuming procedural knowledge but cannot reliably author it themselves.
● Curated skills boost performance significantly: Providing curated Skills raises the average pass rate by 16.2 percentage points, with effects varying dramatically by domain, from +4.5pp in Software Engineering to +51.9pp in Healthcare. This shows that skill quality and domain match matter more than having skills at all.
● Self-generated skills provide no benefit: On average, models that generate their own procedural knowledge show no improvement over having no skills. This finding is critical for self-improving agent architectures that assume models can bootstrap their own capabilities.
● Focused beats comprehensive: Skills with 2-3 focused modules outperform comprehensive documentation. This suggests that retrieval precision matters more than coverage when augmenting agents with procedural knowledge.
● Smaller models close the gap: Smaller models augmented with well-curated skills can match the performance of larger models operating without skill augmentation. This has direct cost implications for production agent deployments.
Paper, Tweet
8) LongCLI-Bench - LongCLI-Bench benchmarks how well AI agents handle complex, extended tasks through command-line interfaces. Across 20 demanding tasks spanning initial development, feature expansion, error resolution, and code optimization, leading agents succeed less than 20% of the time. The study finds that most failures occur early in task execution, and human-agent collaboration through plan injection and interactive guidance yields significantly greater improvements than automated self-correction alone. Paper, Tweet
9) CogRouter - CogRouter enables adaptive reasoning depth for LLM agents by dynamically selecting from four hierarchical cognitive levels at each step, from instinctive responses to strategic planning. Using confidence-aware advantage reweighting during training, Qwen2.5-7B with CogRouter achieves 82.3% success rate on agentic benchmarks, substantially outperforming larger models while consuming fewer tokens by skipping heavy reasoning on routine steps. Paper, Tweet
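The routing step reduces to a confidence-thresholded ladder over cognitive levels; a minimal sketch (the thresholds and level names are illustrative, whereas CogRouter learns its routing policy via confidence-aware advantage reweighting):

```python
LEVELS = ["instinctive", "deliberative", "analytical", "strategic"]

def route_level(confidence):
    """Route a step by model confidence: routine high-confidence steps get
    the cheap instinctive level, low-confidence steps escalate toward
    strategic planning, saving tokens on easy steps."""
    for level, threshold in zip(LEVELS, [0.9, 0.7, 0.5]):
        if confidence >= threshold:
            return level
    return "strategic"
```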
10) Team of Thoughts - Team of Thoughts presents a multi-agent framework for efficient test-time scaling through orchestrated tool calling. The system uses an orchestrator tool design where agents with different capabilities are coordinated by a calibrated orchestrator. With self-assessment for tool agents and orchestrator calibration for identifying superior coordination models, Team of Thoughts achieves 96.67% on AIME24 and 72.53% on LiveCodeBench, substantially exceeding homogeneous baselines. Paper, Tweet

Top AI Papers of the Week (February 9 - February 15) - 2026

Paper Links
1) ALMA - ALMA (Automated meta-Learning of Memory designs for Agentic systems) from Jeff Clune’s group introduces a Meta Agent that automatically discovers memory designs for agentic systems through open-ended exploration in code space. Instead of relying on hand-engineered memory modules, ALMA searches over database schemas, retrieval mechanisms, and update strategies expressed as executable code, consistently outperforming all human-designed memory baselines across four sequential decision-making benchmarks.
● Open-ended code search: A Meta Agent samples previously explored memory designs from an archive, reflects on their code and evaluation logs, proposes new designs, and implements them as executable code. This gives ALMA the theoretical potential to discover arbitrary memory architectures, from graph databases to strategy libraries, unconstrained by human design intuitions.
● Domain-adaptive memory discovery: ALMA discovers fundamentally different memory structures for different domains: affordance graphs for ALFWorld, task signature databases for TextWorld, strategy libraries with rule prediction for Baba Is AI, and risk-interaction schemas for MiniHack. This specialization emerges automatically from the search process.
● Consistent gains over human baselines: Learned memory designs achieve 12.3% average success rate with GPT-5-nano (vs 8.6% for the best human baseline) and 53.9% with GPT-5-mini (vs 48.6%). The designs also scale better with more collected experience and transfer robustly across different foundation models.
● Toward self-improving agentic systems: ALMA represents a step toward AI systems that learn to be continual learners. The progressive discovery process shows that moderate-performing designs serve as stepping stones toward optimal solutions, with the archive enabling cumulative innovation across exploration iterations.
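The outer sample-reflect-propose-archive loop can be sketched as score-weighted evolutionary search (a skeleton only: `evaluate` and `mutate` stand in for ALMA's LLM-driven reflection and code implementation, and the toy numeric domain is made up):

```python
import random

def meta_search(evaluate, mutate, seed_design, iterations=20, rng=None):
    """Sample a parent from the archive (score-weighted), propose a
    variant, evaluate it, and archive it, so moderate designs can serve
    as stepping stones toward better ones."""
    rng = rng or random.Random(0)
    archive = [(seed_design, evaluate(seed_design))]
    for _ in range(iterations):
        designs, scores = zip(*archive)
        parent = rng.choices(designs, weights=[s + 1e-6 for s in scores])[0]
        child = mutate(parent, rng)
        archive.append((child, evaluate(child)))
    return max(archive, key=lambda pair: pair[1])

# Toy domain: a "memory design" is an integer; evaluation rewards
# closeness to 7. ALMA's real designs are executable code artifacts.
best, score = meta_search(
    evaluate=lambda d: 1.0 / (1.0 + abs(d - 7)),
    mutate=lambda d, rng: d + rng.choice([-2, -1, 1, 2]),
    seed_design=0,
)
```

Because every candidate is archived with its score, the final design can never be worse than the seed, and intermediate designs keep contributing as parents.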
Paper, Tweet
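The sample-reflect-propose loop in the first bullet can be sketched as a generic archive-driven search. This is an illustrative skeleton under stated assumptions: `evaluate` and `mutate` are hypothetical stand-ins for ALMA's LLM-based evaluation runs and the Meta Agent's code proposals, not the paper's implementation.

```python
import random

def alma_search(evaluate, mutate, seed_design, iterations=10, rng=None):
    """Sketch of ALMA-style open-ended search over memory designs.

    `evaluate` scores a design; `mutate` proposes a new design from a
    sampled parent (in the paper, an LLM Meta Agent that reflects on
    code and evaluation logs). Both are stand-in callables here.
    """
    rng = rng or random.Random(0)
    archive = [(seed_design, evaluate(seed_design))]
    for _ in range(iterations):
        # Sample a parent from the archive: moderately scoring designs
        # can still act as stepping stones toward better ones.
        parent, _ = rng.choice(archive)
        child = mutate(parent, rng)
        archive.append((child, evaluate(child)))
    # Return the best design found plus the full archive for reuse.
    best = max(archive, key=lambda pair: pair[1])
    return best, archive
```

With a toy 1-D "design" (a number scored by closeness to 5), the loop steadily accumulates better candidates in the archive, mirroring the cumulative-innovation behavior described above.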
2) LLaDA 2.1 - Ant Group releases LLaDA 2.1, a major upgrade to discrete diffusion language models that breaks the speed-quality trade-off through Token-to-Token (T2T) editing. By weaving token editing into the conventional Mask-to-Token decoding scheme, LLaDA 2.1 introduces two configurable modes: Speedy Mode for aggressive throughput and Quality Mode for benchmark-leading accuracy. The release also includes the first large-scale RL framework for diffusion LLMs.
● Editable state evolution: Unlike standard diffusion models that only unmask tokens, LLaDA 2.1 can also edit already-generated tokens. This dual action space (unmasking + correction) lets the model aggressively draft with low-confidence thresholds and then refine errors in subsequent passes, fundamentally changing the speed-quality trade-off.
● Two operating modes: Speedy Mode lowers the mask-to-token threshold for maximum throughput, relying on T2T passes to fix errors. Quality Mode uses conservative thresholds for superior benchmark scores. This gives practitioners a configurable knob between speed and accuracy without swapping models.
● Extreme decoding speed: LLaDA 2.1-Flash (100B) hits 892 tokens per second on HumanEval+ and 801 TPS on BigCodeBench. The Mini variant (16B) reaches a peak of 1,587 TPS. These speeds dramatically outpace autoregressive models of comparable quality.
● First RL for diffusion LLMs: The paper introduces EBPO (Evidence-Based Policy Optimization), an RL framework that uses block-causal masking and parallel likelihood estimation to enable stable policy optimization at scale for diffusion models. RL training sharpens reasoning and instruction-following across 33 benchmarks.
Paper, Tweet
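The dual action space (unmasking plus correction) can be sketched as a toy decoding loop. Everything here is an illustrative assumption: the thresholds, the per-position `(token, confidence)` interface, and the scripted toy model stand in for LLaDA 2.1's actual decoder.

```python
MASK = "_"

def decode(model, seq, unmask_thresh=0.5, edit_thresh=0.9, max_passes=4):
    """Sketch of dual-action decoding. `model(seq)` returns a
    (token, confidence) pair per position. Masked positions are filled
    once confidence clears `unmask_thresh` (Speedy Mode lowers this for
    throughput); filled tokens are *edited* when the model now prefers
    a different token above `edit_thresh` (the T2T correction pass)."""
    seq = list(seq)
    for _ in range(max_passes):
        preds = model(seq)
        changed = False
        for i, (tok, conf) in enumerate(preds):
            if seq[i] == MASK and conf >= unmask_thresh:
                seq[i] = tok  # Mask-to-Token: aggressive draft
                changed = True
            elif seq[i] != MASK and tok != seq[i] and conf >= edit_thresh:
                seq[i] = tok  # Token-to-Token: correct a drafted error
                changed = True
        if not changed and MASK not in seq:
            break
    return "".join(seq)

def make_toy_model(target):
    """Toy model: on the first pass it drafts one wrong low-confidence
    token; with more context it predicts the target confidently."""
    calls = {"n": 0}
    def model(seq):
        calls["n"] += 1
        out = []
        for i, ch in enumerate(target):
            if calls["n"] == 1 and i == 1:
                out.append(("x", 0.6))  # low-confidence draft error
            else:
                out.append((ch, 0.95))
        return out
    return model
```

On the toy model, the first pass drafts an error that a later T2T pass repairs, which is the mechanism that lets Speedy Mode use aggressive thresholds without sacrificing final quality.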
3) SkillRL - SkillRL introduces a recursive skill-augmented RL framework that bridges the gap between raw experience and policy improvement through automatic skill discovery. Instead of storing noisy raw trajectories, SkillRL distills experience into reusable high-level behavioral patterns and evolves them alongside the agent policy during training.
● Hierarchical skill library (SkillBank): An experience-based distillation mechanism extracts reusable behavioral patterns from raw trajectories and organizes them into a hierarchical skill library. This dramatically reduces the token footprint while preserving the reasoning utility needed for complex multi-step tasks.
● Adaptive skill retrieval: A dual retrieval strategy combines general heuristics with task-specific skills, selecting the most relevant behavioral patterns based on the current task context. This enables the agent to leverage accumulated knowledge without being overwhelmed by irrelevant experience.
● Recursive co-evolution: The skill library and agent policy evolve together during RL training. As the agent encounters harder tasks, new skills are extracted, and existing ones are refined, creating a virtuous cycle where better skills enable better performance, which generates better training data for skill extraction.
● Strong empirical results: SkillRL achieves state-of-the-art performance with 89.9% success rate on ALFWorld, 72.7% on WebShop, and an average of 47.1% on search-augmented QA tasks, outperforming strong baselines by over 15.3% while maintaining robustness as task complexity increases.
Paper, Tweet
4) InftyThink+ - InftyThink+ is an end-to-end RL framework for infinite-horizon reasoning that optimizes the entire iterative reasoning trajectory. Standard long chain-of-thought suffers from quadratic cost, context length limits, and lost-in-the-middle degradation. InftyThink+ addresses all three by letting models autonomously decide when to summarize, what to preserve, and how to resume, trained through trajectory-level reinforcement learning.
● Iterative reasoning with learned boundaries: Instead of generating one continuous chain-of-thought, InftyThink+ decomposes reasoning into multiple iterations connected by self-generated summaries. The model learns to control iteration boundaries, deciding when to compress and continue rather than following fixed heuristics or chunk sizes.
● Two-stage training recipe: A supervised cold-start teaches the InftyThink format (special tokens for summary and history), then trajectory-level GRPO optimizes the full multi-iteration rollout. Advantages are shared across all iterations within a trajectory, so early high-quality summaries that enable correct later reasoning receive a positive gradient signal.
● 21% accuracy gain on AIME24: On DeepSeek-R1-Distill-Qwen-1.5B, InftyThink+ with RL improves accuracy from 29.5% to 50.9% on AIME24, a 21-point jump that substantially outperforms vanilla long-CoT RL (38.8%). Results generalize to out-of-distribution benchmarks, including GPQA Diamond and AIME25.
● Faster inference, faster training: By bounding context length per iteration, InftyThink+ reduces inference latency compared to vanilla reasoning while achieving higher accuracy. Adding an efficiency reward further cuts token usage by 50% with only a modest accuracy trade-off, demonstrating a controllable speed-accuracy knob.
Paper, Tweet
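The iterate-summarize-resume structure can be sketched as a small control loop; `reason_step` and `summarize` are hypothetical stand-ins for the finetuned model's special-token behavior, so this shows only the shape of the method, not its training.

```python
def infty_think(reason_step, summarize, question, max_iters=8, budget=256):
    """Sketch of InftyThink+-style iterative reasoning: each pass sees
    only the question and the running summary (bounded context per
    iteration), and the model decides when to stop."""
    summary = ""
    for i in range(max_iters):
        thought, answer, done = reason_step(question, summary, budget)
        if done:
            return answer, i + 1               # answer + iterations used
        summary = summarize(summary, thought)  # compress, then resume
    return None, max_iters

def make_toy(chunks):
    """Toy 'model' that sums a long list of numbers one chunk per
    iteration, carrying only a running total as its summary."""
    state = {"i": 0}
    def reason_step(question, summary, budget):
        total = int(summary or 0)
        total += sum(chunks[state["i"]])
        state["i"] += 1
        done = state["i"] == len(chunks)
        return str(total), (total if done else None), done
    def summarize(summary, thought):
        return thought  # the summary here is just the running total
    return reason_step, summarize
```

Because each pass carries only the compressed summary forward, context never grows with total reasoning length, which is the property that bounds per-iteration cost.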
5) Agyn - Agyn is a fully automated multi-agent system that models software engineering as an organizational process rather than a monolithic code generation task. Built on an open-source platform for configuring agent teams, the system assigns specialized agents to distinct roles and follows a structured development methodology - all without human intervention. Notably, Agyn was designed for real production use and was not tuned for SWE-bench.
● Team-based architecture: Four specialized agents (manager, researcher, engineer, reviewer) operate with distinct responsibilities, tools, and model configurations. The manager coordinates using a high-level methodology inspired by real development practice, while the engineer and reviewer work through GitHub-native pull requests and inline code reviews.
● Role-specific model routing: Reasoning-heavy agents like the manager and researcher use larger general-purpose models, while implementation agents use smaller, code-specialized models. This mirrors real team structure, where different roles need different capabilities, and reduces overall cost without sacrificing quality.
● Dynamic workflow, not a fixed pipeline: Unlike prior multi-agent SWE systems that encode a predetermined number of stages, Agyn’s coordination evolves dynamically. The manager decides when additional research, specification refinement, implementation, or review cycles are needed based on intermediate outcomes, enabling flexible iteration.
● Strong benchmark performance without tuning: Agyn resolves 72.2% of tasks on SWE-bench 500, outperforming single-agent baselines by 7.4% under comparable model configurations. The key insight is that organizational design and agent infrastructure may matter as much as model improvements for autonomous software engineering.
Paper, Tweet
6) EchoJEPA - EchoJEPA is a latent predictive foundation model for echocardiography trained on 18 million echocardiograms from 300,000 patients. By learning to predict in latent space rather than pixel space, the model separates clinically meaningful anatomical signals from ultrasound noise and artifacts, producing representations that dramatically outperform existing approaches on cardiac assessment tasks.
● Massive scale and latent prediction: Trained on 18 million echocardiograms using a JEPA-style objective that predicts masked spatiotemporal regions in latent space. This approach learns to ignore speckle noise and acoustic artifacts that plague pixel-level methods, producing representations focused on anatomically meaningful features.
● Strong improvements on clinical tasks: EchoJEPA improves left ventricular ejection fraction estimation by approximately 20% and right ventricular systolic pressure estimation by approximately 17% over leading baselines. For view classification, it reaches 79% accuracy using only 1% of labeled data, while the best baseline achieves just 42% with the full labeled dataset.
● Exceptional robustness: Under acoustic perturbations that degrade competitor models by 17%, EchoJEPA degrades only 2%. This robustness extends to population shift: zero-shot performance on pediatric patients exceeds fully fine-tuned baseline models, demonstrating genuine generalization rather than memorization.
● Clinical foundation model potential: The combination of scale, label efficiency, and robustness across patient populations positions EchoJEPA as a practical foundation for clinical echocardiography applications where labeled data is scarce and acoustic conditions vary widely.
Paper, Tweet
7) AdaptEvolve - AdaptEvolve tackles a key efficiency bottleneck in evolutionary agentic systems: the repeated invocation of large LLMs during iterative refinement loops. The method uses intrinsic generation confidence to dynamically select which model to invoke at each step, routing easy sub-problems to smaller models and reserving expensive frontier models for genuinely hard decisions.
● Confidence-driven model routing: Instead of static heuristics or external controllers, AdaptEvolve monitors real-time generation confidence scores to estimate task solvability at each evolutionary step. When the smaller model is confident, it proceeds without escalation; when uncertainty is high, the system routes to a larger, more capable model.
● Favorable cost-accuracy trade-off: Across benchmarks, AdaptEvolve cuts inference costs by approximately 38% while retaining roughly 97.5% of the upper-bound accuracy achieved by always using the largest model. This creates a Pareto-optimal frontier that static single-model or naive cascade approaches cannot match.
● Practical for deployed agent loops: Evolutionary and iterative refinement workflows often require dozens of LLM calls per task. Reducing per-call cost by nearly 40% without meaningful accuracy loss makes these workflows viable for production deployment, where cost compounds rapidly.
● Generalizable routing signal: The confidence-based selection mechanism is model-agnostic and does not require task-specific tuning, making it applicable across different evolutionary agent architectures and domain-specific refinement pipelines.
Paper, Tweet
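The confidence-driven escalation rule reduces to a few lines. The threshold and the `(answer, confidence)` interface are illustrative assumptions, not AdaptEvolve's exact scoring.

```python
def route(small_model, large_model, task, threshold=0.8):
    """Confidence-driven routing sketch: try the small model first and
    escalate to the large model only when generation confidence is low."""
    answer, confidence = small_model(task)
    if confidence >= threshold:
        return answer, "small"   # confident: no escalation needed
    answer, _ = large_model(task)
    return answer, "large"       # uncertain: pay for the frontier model
```

In an evolutionary loop that makes dozens of such calls per task, routing most steps to the small model is where the reported ~38% cost reduction comes from.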
8) Gaia2 - Meta FAIR introduces Gaia2, a next-generation agent benchmark where environments change independently of agent actions, forcing agents to handle temporal pressure, uncertainty, and multi-agent coordination. GPT-5 leads at 42% pass@1 but struggles with time-constrained tasks, while Kimi-K2 leads open-source models at 21%. Built on the open-source Agents Research Environments (ARE) platform with action-level verifiers, Gaia2 represents a paradigm shift from static benchmarks to dynamic evaluation of agentic capabilities. Paper
9) AgentArk - AgentArk distills multi-agent debate dynamics into a single LLM, transferring the reasoning and self-correction abilities of multi-agent systems into one model at training time. Three hierarchical distillation strategies (reasoning-enhanced SFT, trajectory-based augmentation, and process-aware distillation with a process reward model) yield an average 4.8% improvement over single-agent baselines across math and reasoning benchmarks, approaching full multi-agent performance at a fraction of the inference cost. Cross-family distillation (e.g., Qwen3-32B to LLaMA-3-8B) produces the largest gains, suggesting heterogeneous architectures benefit most from transferred reasoning signals. Paper, Tweet
10) AgentSkiller - AgentSkiller scales generalist agent intelligence through semantically integrated cross-domain data synthesis, producing 11K high-quality synthetic trajectories across diverse tool-use scenarios. The resulting 14B model beats GPT-o3 on tau2-bench (79.1% vs 68.4%), and even the 4B variant outperforms 70B and 235B models, demonstrating that data quality and semantic integration matter more than parameter count for building strong tool-use agents. Paper, Tweet

Top AI Papers of the Week (February 2 - February 8) - 2026

Paper Links
1) Semi-Autonomous Mathematics Discovery with Gemini - This paper from Google DeepMind presents a case study in semi-autonomous mathematics discovery using Aletheia, a specialized math research agent built on Gemini Deep Think. The team systematically evaluated 700 open conjectures from Bloom’s Erdos Problems database, combining AI-driven natural language verification with human expert evaluation, and addressed 13 previously open problems.
● Hybrid methodology: Aletheia was deployed on all 700 open Erdos problems, producing 212 candidate solutions. After initial human grading, 63 were technically correct, but only 13 (6.5%) meaningfully addressed the intended problem statement, revealing how challenging accurate mathematical reasoning remains for AI.
● Four categories of results: The 13 meaningful solutions fell into autonomous resolution (2 problems solved with novel arguments), partial AI solutions (2 multi-part problems partially solved), independent rediscovery (4 problems where solutions already existed in the literature), and literature identification (5 problems already solved but not recorded as such).
● Subconscious plagiarism risk: A key finding is that AI models can reproduce solutions from the literature without attribution, raising concerns about novelty claims. The authors note that even for AI-generated solutions not yet located in the literature, it is plausible they were previously discovered by humans but never published.
● Challenges at scale: The most arduous step was not verifying correctness but determining whether solutions already existed in prior work. Many technically correct solutions were mathematically vacuous due to misinterpreted problem statements or notational ambiguity.
● Tempered expectations: The authors caution against overexcitement about mathematical significance, noting that most resolved problems could have been dispatched by the right human expert. However, AI shows potential to accelerate attention-bottlenecked aspects of mathematical discovery.
Paper, Tweet
2) TinyLoRA - This paper from Meta FAIR asks how small a LoRA adapter can get and still teach a model to reason. The answer: remarkably small. The authors propose TinyLoRA, a method that scales low-rank adapters down to as few as one trainable parameter by projecting through fixed random tensors and sharing weights across all modules. The key insight is that RL makes fundamentally more information-dense updates than SFT, enabling effective learning with orders of magnitude fewer parameters.
● 91% accuracy with 13 parameters: Using TinyLoRA with GRPO on GSM8K, Qwen2.5-7B-Instruct reaches 91% accuracy while training just 13 parameters (26 bytes in bf16). This recovers 95% of the full finetuning performance improvement, and even a single trained parameter yields a measurable 4% accuracy gain.
● RL vastly outperforms SFT at low parameter counts: At 13 parameters, RL scores 91% while SFT scores only 83% on GSM8K. The gap widens further below 100 parameters. The paper explains this through signal separation: RL’s reward signal cleanly isolates task-relevant features from noise, while SFT must absorb entire demonstrations, including irrelevant details, requiring far more capacity.
● TinyLoRA method: Builds on LoRA-XS by replacing the trainable rotation matrix with a low-dimensional vector projected through fixed random tensors, and shares this vector across all adapted modules via weight tying. This reduces the minimum trainable parameter count from hundreds (LoRA-XS) down to one.
● Scales across model sizes and harder benchmarks: On a suite of difficult math benchmarks (MATH500, Minerva Math, OlympiadBench, AIME24, AMC23), finetuning Qwen2.5-7B with just 196 parameters retains 87% of the absolute performance improvement. Larger models need fewer parameters to reach the same performance threshold, suggesting trillion-scale models may be trainable with a handful of parameters.
● Practical implications for personalization: Updates under 1KB in total size open new possibilities for efficient distributed training, mass personalization (10x more LoRAs served concurrently with 10x smaller adapters), and reduced forgetting since tiny updates preserve more of the base model’s knowledge.
Paper, Tweet
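The parameter-count arithmetic in the third bullet can be made concrete with a minimal sketch of the adapter structure: ΔW = A·R·B with frozen A and B (as in LoRA-XS), where TinyLoRA produces R by projecting a tiny shared vector v through a fixed random tensor. The pure-Python matrix helpers and initialization scale are illustrative assumptions, not the paper's code.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

class TinyLoRAModule:
    """One adapted weight: delta_W = A @ R(v) @ B. A, B, and the
    projection P are frozen random tensors; only the vector v is
    trained, and v is *shared* across all modules via weight tying,
    so the whole model can have as few as v_dim trainable parameters."""
    def __init__(self, d_in, d_out, r, v_dim, rng):
        self.A = rand_matrix(d_in, r, rng)        # frozen
        self.B = rand_matrix(r, d_out, rng)       # frozen
        self.P = rand_matrix(r * r, v_dim, rng)   # frozen projection
        self.r = r
    def delta_w(self, v):
        # Project the tiny shared vector up to an r x r matrix R...
        flat = [sum(p * x for p, x in zip(row, v)) for row in self.P]
        R = [flat[i * self.r:(i + 1) * self.r] for i in range(self.r)]
        # ...then form the low-rank weight update.
        return matmul(matmul(self.A, R), self.B)
```

With `v_dim=1`, every adapted module shares the same single scalar, which is the regime where the paper still reports a measurable accuracy gain under RL.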
3) xMemory - xMemory argues that standard RAG retrieval is a poor fit for agent memory because the evidence source is a bounded, coherent dialogue stream where candidate spans are highly correlated near-duplicates. Fixed top-k similarity retrieval collapses into a single dense region, returning redundant context, while post-hoc pruning can break temporally linked evidence chains. xMemory replaces this with hierarchical memory construction and structure-aware top-down retrieval.
● Four-level memory hierarchy: Raw messages are grouped into episodes (contiguous dialogue blocks), which are distilled into semantic nodes (reusable facts), which are organized under themes. A sparsity-semantics guidance objective balances theme sizes during construction via split and merge operations, preventing both overly large themes that cause retrieval collapse and overly fragmented ones that weaken evidence coverage.
● Two-stage top-down retrieval: Stage I selects a compact, diverse set of relevant themes and semantic nodes using a greedy coverage-relevance procedure on a kNN graph. Stage II then adaptively expands to episodes and raw messages only when the added detail reduces the reader LLM’s uncertainty, controlled by an early stopping mechanism.
● Evidence density over retrieval volume: Analysis shows xMemory retrieves substantially more evidence-dense contexts (higher 2-hit and multi-hit proportions) than both Naive RAG and RAG with pruning. It covers all answer content with fewer blocks (5.66 vs 10.81) and roughly half the token cost (975 vs 1,979 tokens).
● Consistent gains across backbones: On LoCoMo and PerLTQA benchmarks, xMemory achieves the best average performance across Qwen3-8B, Llama-3.1-8B-Instruct, and GPT-5 nano, outperforming five baselines, including Naive RAG, A-Mem, MemoryOS, Nemori, and LightMem while using fewer tokens per query.
● Retroactive restructuring: Unlike static memory stores, xMemory dynamically reassigns semantic nodes to different themes as new interactions arrive, with split and merge operations updating the high-level structure over time. Enabling this restructuring substantially improves downstream QA accuracy compared to frozen hierarchies.
Paper, Tweet
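The Stage-I "compact, diverse" selection can be illustrated with an MMR-style greedy rule that trades relevance against redundancy. The paper's procedure runs on a kNN graph over themes and semantic nodes; this stand-alone sketch, with an illustrative `lam` knob, only captures the coverage-relevance idea.

```python
def greedy_select(candidates, relevance, similarity, k, lam=0.5):
    """Greedy coverage-relevance selection sketch: pick items relevant
    to the query but not redundant with what is already selected, so
    retrieval does not collapse into one dense region of near-duplicates."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            # Penalize candidates similar to anything already chosen.
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return relevance(c) - lam * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

On a toy 1-D example, the rule skips a near-duplicate of the first pick in favor of a less relevant but novel candidate, which is exactly the failure mode of fixed top-k similarity retrieval that xMemory targets.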
4) SALE - This paper from Meta shows that small agents match large ones on simple tasks but fall sharply behind as complexity grows, with the cheapest agent reaching only about 21% of the largest agent’s accuracy on the hardest problems. To address this, the authors introduce SALE (Strategy Auctions for Workload Efficiency), a marketplace-inspired framework where heterogeneous agents bid with strategic plans, are scored on cost-value trade-offs, and refine their bids using shared auction memory.
● Small agents don’t scale with complexity: On deep search and coding tasks graded by human solution time, the 4B agent achieves approximately 87% of the 32B agent’s accuracy on simple tasks but drops to roughly 21% on the most complex ones. This confirms that model size should be treated as a per-task routing decision, not a global choice.
● Auction-based routing mechanism: Each agent proposes a short strategic plan as its bid. A jury of all agents scores each plan’s value via peer assessment, while cost is estimated from plan length and per-token price. The agent with the best cost-minus-value trade-off wins and executes its strategy.
● Memory-driven self-improvement: After each auction, all bids (winning and losing) are stored in a shared memory bank. Cheaper agents that lost can retrieve similar past tasks, learn from winning strategies via contrastive prompting, and submit refined bids - progressively taking on more work over time, similar to how freelancers upskill in a marketplace.
● Beats the largest agent at lower cost: SALE consistently improves upon the best single agent’s accuracy by 2.7-3.8 points on the hardest tasks while reducing reliance on the 32B model by 53% and cutting overall cost by 35%. It also outperforms four established routers (WTP, CARROT, TO-Router, FrugalGPT) that either underperform the largest agent or fail to reduce cost.
● Complementary failure modes: Analysis reveals that large agents tend to over-engineer and skip tool use, while small agents favor simpler, tool-heavy strategies. SALE exploits this complementarity at bid time, routing tasks to whichever approach fits best without needing to execute full trajectories first.
Paper, Tweet
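The winner-selection step of the auction can be sketched in a few lines. The bid representation and the linear cost model (plan length times per-token price) are illustrative assumptions in the spirit of the second bullet.

```python
def run_auction(bids, price_per_token):
    """Sketch of SALE-style bid scoring: each agent's plan gets a jury
    value, cost is estimated from plan length and per-token price, and
    the best value-minus-cost bid wins and executes its strategy.
    `bids` maps agent name -> (plan_tokens, jury_value)."""
    def net(item):
        agent, (plan_tokens, value) = item
        cost = plan_tokens * price_per_token[agent]
        return value - cost
    winner = max(bids.items(), key=net)
    return winner[0]
```

Note how a cheap agent can win even with a lower jury value, which is the mechanism that shifts work away from the 32B model on tasks where a small agent's plan is good enough.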
5) InfMem - InfMem is a cognitive agent for ultra-long document QA that uses System-2-style control to actively manage bounded memory. Instead of passively compressing each chunk as it streams in, InfMem runs a PreThink-Retrieve-Write loop that monitors evidence sufficiency, fetches missing facts from anywhere in the document, and compresses everything into a fixed-size memory - then stops early once it has enough.
● PreThink-Retrieve-Write protocol: At each step, a PreThink controller checks whether the current memory can already answer the question. If not, it generates a targeted retrieval query and specifies how many passages to fetch. Retrieve then pulls fine-grained paragraphs from anywhere in the document (not just nearby chunks), and Write jointly compresses the new evidence with existing memory under a fixed token budget.
● Adaptive early stopping: Once the agent determines its memory is sufficient, it terminates the loop immediately rather than processing remaining chunks. This cuts inference time by up to 5.1x on 1M-token documents while preserving or improving accuracy.
● SFT-to-RL training recipe: A two-stage pipeline first distills protocol-valid trajectories from a strong teacher (Qwen3-32B) via supervised fine-tuning, then applies GRPO with verifier-based rewards to align retrieval, writing, and stopping decisions with end-task correctness. RL adds an early-stop shaping reward that penalizes redundant retrieval after the memory becomes sufficient.
● Strong gains over MemAgent: Across Qwen3-1.7B/4B and Qwen2.5-7B on benchmarks spanning 32k to 1M tokens, InfMem outperforms MemAgent by over 10 points on average after RL, while reducing latency by 3.3-5.1x. It also transfers well to LongBench QA with consistent improvements.
● Robustness at extreme lengths: While baselines like YaRN collapse beyond 128k tokens and RAG struggles with dispersed evidence, InfMem remains stable up to 1M tokens, especially on complex multi-hop tasks that require synthesizing scattered bridging facts across distant document segments.
Paper, Tweet
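The PreThink-Retrieve-Write protocol can be sketched as a controller loop over a bounded memory; the `prethink`, `retrieve`, and `write` callables are hypothetical stand-ins for the trained agent's decisions, not InfMem's model.

```python
def infmem_answer(prethink, retrieve, write, question, budget=512, max_steps=10):
    """Sketch of the PreThink-Retrieve-Write loop. `prethink` inspects
    the current memory and either declares it sufficient or emits a
    targeted retrieval query with a passage count; `retrieve` can fetch
    from anywhere in the document; `write` compresses new evidence into
    the fixed token budget."""
    memory = ""
    for _ in range(max_steps):
        sufficient, query, k = prethink(question, memory)
        if sufficient:
            break  # adaptive early stop: skip the remaining chunks
        evidence = retrieve(query, k)
        memory = write(memory, evidence, budget)
    return memory
```

The early `break` is what delivers the reported latency savings: once the memory covers the question, the remaining document is never touched.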
6) A-RAG - A-RAG is an agentic RAG framework that gives LLMs direct access to hierarchical retrieval interfaces instead of relying on fixed retrieval algorithms or predefined workflows. The agent autonomously decides what to search, at which granularity, and when to stop - representing a paradigm shift from static retrieval pipelines to truly agentic information gathering.
● Hierarchical retrieval tools: A-RAG provides three tools operating at different granularities: keyword search for exact lexical matching at the keyword level, semantic search for dense retrieval at the sentence level, and chunk read for accessing full document chunks. The agent chooses which tool to call at each step based on the task, enabling adaptive multi-granularity retrieval.
● Agentic autonomy over fixed workflows: Unlike Graph RAG (algorithm-driven, no iterative execution) and Workflow RAG (predefined steps, no autonomous strategy), A-RAG satisfies all three principles of agentic autonomy: autonomous strategy selection, iterative execution, and interleaved tool use. The model decides its own retrieval path in a ReAct-style loop.
● Consistent gains across benchmarks: With GPT-5-mini as backbone, A-RAG outperforms all baselines on every benchmark tested (MuSiQue, HotpotQA, 2WikiMultiHopQA, Medical QA, Novel QA), beating strong methods like LinearRAG, HippoRAG2, and FaithfulRAG by significant margins.
● Better accuracy with fewer tokens: A-RAG (Full) retrieves a comparable or smaller number of tokens than traditional RAG methods while achieving higher accuracy. The hierarchical interface design lets the agent progressively disclose information and selectively read only the most relevant chunks, avoiding noise from irrelevant content.
● Scales with test-time compute: Increasing max retrieval steps and reasoning effort both improve performance, with stronger models benefiting more from additional steps. Scaling reasoning effort from minimal to high yields approximately 25% improvement for both GPT-5-mini and GPT-5.
Paper, Tweet
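The ReAct-style loop over hierarchical tools can be sketched as follows. The tool names mirror the first bullet, but the scripted policy and toy corpus are illustrative assumptions (the semantic-search tool is omitted for brevity).

```python
def a_rag(agent, tools, question, max_steps=8):
    """Sketch of A-RAG's agentic loop: at each step the model picks a
    retrieval tool and its argument (e.g. keyword search or chunk read)
    or emits a final answer. `agent` is a policy stand-in for the
    backbone LLM; `tools` maps tool name -> callable."""
    context = []
    for _ in range(max_steps):
        action, arg = agent(question, context)
        if action == "answer":
            return arg
        # Call the chosen tool at its granularity, append the result.
        context.append(tools[action](arg))
    return None
```

Even this toy shows the progressive-disclosure pattern from the fourth bullet: a cheap lexical lookup first, then a full chunk read only for the one chunk that matched.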
7) Agent Primitives - Agent Primitives introduces reusable latent building blocks for LLM-based multi-agent systems. Inspired by how neural networks are built from composable modules like residual blocks and attention heads, the authors decompose existing MAS architectures into three recurring computation patterns that communicate via KV cache instead of natural language, reducing error accumulation and boosting efficiency.
● Three core primitives: Review (a Solver-Critic feedback loop for iterative self-refinement), Voting and Selection (parallel Solvers with a Selector that aggregates latent candidates), and Planning and Execution (a Planner that decomposes tasks into subgoals consumed by Executor agents). Each primitive communicates internally through KV cache concatenation rather than text generation.
● KV cache over natural language: Stress tests show natural-language communication degrades sharply under long contexts and noise injection, while KV cache communication stays robust. With midpoint task injection, natural-language accuracy drops to 15.6% compliance versus 73.3% for KV cache.
● Automatic composition via an Organizer: An LLM-based Organizer selects and composes primitives per query, guided by a lightweight Knowledge Pool of 45 previously successful MAS configurations. This eliminates manual system design while maintaining strong performance across tasks.
● Consistent accuracy gains: Across eight benchmarks (math, code, QA) and five open-source backbones, primitives-based MAS improves accuracy by 12.0-16.5% over single-agent baselines. It also outperforms 10 existing MAS methods, including Self-Refine, AgentVerse, and MAS-GPT, on a unified Llama-3-70B evaluation.
● Major efficiency improvements: Compared to text-based MAS, Agent Primitives reduces token usage and inference latency by 3-4x while achieving higher accuracy. Total overhead is only 1.3-1.6x relative to single-agent inference, making it practical for deployment.
Paper, Tweet
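The Review primitive (Solver-Critic loop) is simple to sketch in isolation. Note one deliberate simplification: the paper's agents exchange latent KV cache, whereas plain return values stand in for that channel here.

```python
def review_primitive(solver, critic, task, max_rounds=3):
    """Sketch of the Review primitive: a Solver drafts an answer, a
    Critic checks it and gives feedback, and the loop refines the draft
    until the Critic approves or the round budget runs out."""
    draft = solver(task, feedback=None)
    for _ in range(max_rounds):
        ok, feedback = critic(task, draft)
        if ok:
            return draft
        draft = solver(task, feedback=feedback)  # refine using feedback
    return draft
```

The other two primitives follow the same shape: Voting and Selection fans out parallel solvers into a selector, and Planning and Execution feeds a planner's subgoals to executors.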
8) Accelerating Scientific Research with Gemini - A collection of case studies from Google Research showing how researchers used Gemini Deep Think to solve open problems, refute conjectures, and generate new proofs across theoretical computer science, information theory, cryptography, optimization, economics, and physics. The paper extracts a practical playbook of recurring techniques, including iterative refinement, cross-disciplinary knowledge transfer, counterexample search, and neuro-symbolic verification loops where the model autonomously writes and executes code to validate derivations. Notable results include identifying a fatal flaw in a cryptography preprint on SNARGs, resolving the Courtade-Kumar conjecture in information theory, and proving that the simplex is optimal for Euclidean Steiner trees. Paper, Tweet
9) Heterogeneous Computing for AI Agent Inference - This paper introduces Operational Intensity (OI) and Capacity Footprint (CF) as two metrics that better characterize AI agent inference workloads than traditional roofline models, revealing that memory capacity - not just bandwidth or compute - is often the true bottleneck. Analysis across agent types (chatbot, coding, web-use, computer-use) shows that agentic workflows create vastly different and rapidly growing demands on hardware, with context lengths snowballing to over 1M tokens in coding agents. The authors argue for disaggregated, heterogeneous compute architectures with specialized prefill and decode accelerators, hardware-aware model co-design, and large-capacity memory disaggregation as essential directions for scaling AI agent systems. Paper, Tweet
10) OpenScholar - OpenScholar is a fully open, retrieval-augmented language model designed for scientific literature synthesis. It retrieves passages from a datastore of 45 million open-access papers, generates citation-backed responses, and iteratively refines outputs through a self-feedback loop. On ScholarQABench, the first large-scale multi-domain benchmark for literature search, OpenScholar-8B outperforms GPT-4o by 6% and PaperQA2 by 5.5% in correctness, while achieving citation accuracy on par with human experts and being preferred over expert-written answers 51-70% of the time. Paper, Tweet

Top AI Papers of the Week (January 26 - February 1) - 2026

Paper Links
1) Kimi K2.5: Visual Agentic Intelligence - Kimi K2.5 is an open-source multimodal agentic model from Moonshot AI that jointly optimizes text and vision capabilities through native multimodal pretraining on 15 trillion mixed tokens, zero-vision SFT, and joint reinforcement learning. K2.5 also introduces Agent Swarm, a parallel agent orchestration framework that dynamically decomposes complex tasks into concurrent subtasks, reducing latency by up to 4.5x over single-agent baselines.
● Joint text-vision optimization: K2.5 uses early fusion with a lower vision ratio during pretraining (rather than late-stage heavy vision injection), achieving better results across both modalities. A key finding is that zero-vision SFT - using only text SFT data - is sufficient to activate visual reasoning and tool use, while visual RL actually improves text benchmarks like MMLU-Pro (+1.7%) and GPQA-Diamond (+2.1%).
● Agent Swarm with Parallel-Agent RL: The framework trains a learnable orchestrator via RL to decompose tasks and delegate subtasks to frozen specialized subagents running in parallel. This decoupled design avoids credit assignment ambiguity and improves item-level F1 from 72.8% to 79.0% on wide-search scenarios while significantly reducing inference latency.
● State-of-the-art agentic performance: K2.5 achieves 74.9% on BrowseComp (with context management), 77.1% on DeepSearchQA, and 57.4% on Seal-0, outperforming GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. It also scores 96.1% on AIME 2025, 76.8% on SWE-Bench Verified, and establishes new records in long-video comprehension.
● Token-efficient RL with Toggle: K2.5 introduces Toggle, a training heuristic that alternates between budget-constrained and standard scaling phases during RL, reducing output tokens by 25-30% with negligible performance impact while maintaining strong test-time scaling capabilities.
Paper, Tweet
2) Shaping Capabilities with Token-Level Data Filtering - Researchers from Anthropic and Stanford show that filtering pretraining data at the token level is a highly effective, scalable, and robust approach for selectively removing undesired capabilities from language models. Using medical knowledge removal as a proxy task, token-level filtering Pareto dominates document-level filtering and achieves a 7,000x compute slowdown on the target domain for 1.8B parameter models - while preserving capabilities in related fields.
● Token filtering beats document filtering: Inspired by data attribution research showing individual tokens vary in their influence on model capabilities, the authors filter tokens rather than whole documents. This achieves the same reduction in undesired capabilities with lower cost to benign ones, since document filtering removes many useful tokens alongside harmful ones. Sweeping across classifier thresholds on 521M models confirms that token filtering Pareto dominates document filtering.
● Effectiveness scales with compute: Training models from 61M to 1.8B parameters, the authors find filtering gets more effective at larger scales. For 1.8B models, token removal causes a 7,000x effective compute slowdown on the forget domain versus just 30x for document filtering. On multiple choice medical benchmarks, filtered models score near chance, while retaining full performance on biology, STEM, and non-STEM evaluations.
● 10x more robust than unlearning: Token-filtered models are 10x more robust to adversarial finetuning attacks than state-of-the-art unlearning methods. This addresses a key limitation of post-hoc approaches - once a capability exists in a base model, it is extremely hard to remove, but preventing it from forming during pretraining is far more durable.
● Compatibility with alignment and SAE-based labeling: Surprisingly, models trained with token filtering generalize to refusal training better than unfiltered baselines, countering concerns that filtered models cannot be properly aligned on removed domains. The authors also introduce a novel pipeline using sparse autoencoders to label tokens and distill cheap, high-quality classifiers, showing that filtering remains effective even with noisy labels given sufficient compute.
Paper, Tweet
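The Pareto argument above comes down to loss masking: a minimal sketch (classifier scores are invented stand-ins for the paper's distilled token classifiers) shows why token-level filtering keeps benign tokens that document-level filtering throws away.

```python
def token_filter_mask(scores, threshold):
    """Keep a token in the pretraining loss only if its
    'undesired-capability' score is below the threshold."""
    return [1 if s < threshold else 0 for s in scores]

def document_filter_mask(scores, threshold):
    """Drop the entire document if any token crosses the threshold."""
    keep = all(s < threshold for s in scores)
    return [1 if keep else 0 for _ in scores]

# Toy document: mostly benign tokens, two flagged as medical content.
scores = [0.02, 0.01, 0.91, 0.03, 0.88, 0.02]

tok = token_filter_mask(scores, threshold=0.5)
doc = document_filter_mask(scores, threshold=0.5)

print(sum(tok))  # 4 -- benign tokens still contribute to the loss
print(sum(doc))  # 0 -- document filtering discards all 6 tokens
```

Scaling this over a corpus, the masked tokens simply contribute zero to the cross-entropy loss, which is why the same capability reduction costs fewer benign tokens.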
3) How AI Impacts Skill Formation - Researchers from Anthropic conducted randomized experiments to study how AI assistance affects the development of software engineering skills. They find that using AI to complete coding tasks with a new Python library significantly impaired conceptual understanding, code reading, and debugging abilities - without delivering significant efficiency gains on average.
● Learning loss from AI assistance: In a controlled study with 52 developers learning the Python Trio library, participants using AI scored 17% lower (Cohen’s d=0.738, p=0.01) on a skills evaluation covering conceptual understanding, debugging, and code reading. The largest gap appeared in debugging questions, likely because control group participants encountered and independently resolved more errors during the task.
● No significant productivity gains: Contrary to prior work showing AI-assisted coding speedups, AI did not significantly reduce task completion time in this learning context. Several participants spent up to 11 minutes composing queries to the AI assistant, offsetting potential time savings from code generation.
● Six distinct AI interaction patterns: Qualitative analysis of screen recordings revealed three low-scoring patterns (AI Delegation, Progressive AI Reliance, Iterative AI Debugging) averaging below 40% quiz scores, and three high-scoring patterns (Conceptual Inquiry at 86%, Generation-Then-Comprehension at 68%, Hybrid Code-Explanation at 65%) where participants stayed cognitively engaged.
● Implications for AI-assisted workflows: The findings suggest that AI-enhanced productivity is not a shortcut to competence. The high-scoring interaction patterns all involved independent thinking and cognitive effort, indicating that how AI is used matters more than whether it is used - particularly in safety-critical domains requiring human oversight of AI-generated code.
Paper, Tweet
4) VibeTensor - VibeTensor is an open-source deep learning system software stack from NVLabs that was fully generated by LLM-powered coding agents under high-level human guidance. The system implements a PyTorch-style eager tensor library with a C++20/CUDA core, Python and Node.js frontends, its own autograd engine, CUDA runtime, and caching allocator - demonstrating that coding agents can produce coherent system software spanning language bindings down to GPU memory management.
● Full-stack generated architecture: The system includes a schema-lite dispatcher, reverse-mode autograd engine, stream-ordered caching allocator with diagnostics, CUDA graph support, and a stable C ABI for dynamically loaded operator plugins. The 28B LOC codebase spans 218 core C++ files and 225 Python test files, all generated without per-change manual diff review.
● AI-assisted development methodology: A two-month development cycle used a simple loop: specify scoped goals, generate code, compile and test, then broaden validation. Tests as specifications and differential checks against PyTorch served as key guardrails, with multi-agent code review catching unsafe patterns.
● Kernel performance and training validation: An accompanying AI-generated kernel suite shows mixed results: 1.54x faster than FlashAttention on NanoChat-style training (batch 32, seq 2048) but 0.67x on small-batch GQA prefill. End-to-end training on H100 and Blackwell GPUs converges correctly but runs 1.7-6.2x slower than PyTorch.
● The Frankenstein composition effect: The paper identifies a key failure mode where individually correct generated subsystems compose into globally suboptimal designs - for example, a correctness-first autograd gate serializes execution and starves efficient backend kernels, highlighting challenges unique to AI-generated system software.
Paper, Tweet
5) Reinforcement Learning via Self-Distillation - This paper introduces Self-Distillation Policy Optimization (SDPO), an on-policy RL algorithm that converts rich textual feedback from verifiable environments into dense credit assignment without requiring an external teacher model. SDPO uses the current model conditioned on feedback as a “self-teacher” to retrospectively identify mistakes in its own rollouts, substantially outperforming GRPO across scientific reasoning, tool use, and competitive programming.
● Self-teacher for dense credit assignment: Instead of learning from sparse scalar rewards like GRPO, SDPO re-evaluates the model’s original attempt after conditioning on environment feedback (runtime errors, failed tests, or successful rollouts). This produces logit-level advantages at every token position, compared to GRPO’s constant per-rollout advantages. The approach requires only minor changes to standard RLVR pipelines by swapping out the advantage computation.
● Strong gains on competitive programming: On LiveCodeBench v6 with Qwen3-8B, SDPO reaches 48.8% accuracy versus 41.2% for GRPO, surpassing Claude Sonnet 4 (40.5%) and Claude Opus 4 (39.7%) on the public leaderboard. SDPO achieves GRPO’s final accuracy in 4x fewer generations, with gains growing at larger model scales - suggesting self-teaching is an emergent capability.
● Effective even without rich feedback: In standard RLVR environments with only scalar rewards, SDPO treats successful rollouts as implicit feedback for failed attempts, achieving 68.8% vs. 64.1% aggregate accuracy over GRPO on scientific reasoning and tool use benchmarks. On Chemistry with OLMo3-7B, SDPO reaches GRPO’s 5-hour accuracy in just 30 minutes.
● Concise reasoning without verbosity: SDPO produces responses that are 3-7x shorter than GRPO while achieving higher accuracy, avoiding circular reasoning patterns and filler phrases. At test time, SDPO accelerates discovery of solutions on difficult tasks by 3x compared to best-of-k sampling, enabling effective test-time self-distillation on individual questions.
Paper, Tweet
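The contrast between SDPO's dense advantages and GRPO's per-rollout constant can be sketched with toy numbers (the log-probabilities below are invented; the assumed form is that the same model, re-run conditioned on environment feedback, acts as the "self-teacher"):

```python
import math

def token_advantages(policy_logprobs, teacher_logprobs):
    """Per-token advantage: how much more (or less) likely the
    feedback-conditioned self-teacher finds each token of the
    original rollout, versus the unconditioned policy."""
    return [t - p for p, t in zip(policy_logprobs, teacher_logprobs)]

# Toy 4-token rollout. After seeing a failed unit test, the self-teacher
# downweights token 2 (the buggy step) and mildly upweights the rest.
policy  = [math.log(0.40), math.log(0.30), math.log(0.35), math.log(0.50)]
teacher = [math.log(0.42), math.log(0.05), math.log(0.36), math.log(0.55)]

adv = token_advantages(policy, teacher)
print([round(a, 2) for a in adv])
# Token 2 gets a strongly negative advantage; GRPO would instead assign
# one constant advantage to every token in the rollout.
```

Since only the advantage computation changes, the rest of a standard RLVR pipeline (rollouts, policy-gradient update) is untouched, matching the paper's "minor changes" claim.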
6) Self-Improving Pretraining - Self-Improving Pretraining is a new pretraining paradigm from Meta FAIR that replaces standard next-token prediction with sequence-level generation guided by an existing post-trained model acting as both a suffix rewriter and a suffix judge. The approach addresses quality, safety, and factuality issues at pretraining time rather than deferring them to post-training, yielding large gains across all three dimensions.
● Suffix rewriting and judging framework: The method segments pretraining data into prefix-suffix chunks. A post-trained teacher model rewrites low-quality or unsafe suffixes into superior training targets, while a separate judge scores candidate completions (original suffixes, rewrites, and policy rollouts) to provide rewards for online RL training via online DPO or reward-filtered NLL.
● Strong continual pretraining gains: When applied to continual pretraining of Llama2 1.4B, the method achieves an 86.3% generation quality win rate over the baseline, a 36.2% relative improvement in factuality (42.3 to 57.6 average score), and an 18.5% relative improvement in safety (76.9 to 91.1 average score), while also improving standard evaluation benchmarks.
● From-scratch pretraining improvements: Training from scratch on RedPajama yields a 31.1% absolute gain in generation quality win rate and improves safety evaluations from 85.2 to 97.5, demonstrating that embedding quality signals early in pretraining is highly effective.
● Scaling with rollouts: Performance improves consistently with more rollouts during online DPO training (tested from 1 to 16), and the model naturally transitions from relying on suffix rewrites early in training to preferring its own high-quality rollouts as training progresses.
Paper, Tweet
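The reward-filtered NLL variant mentioned above reduces to a selection step: a judge scores every candidate suffix and only the best one becomes the training target. A minimal sketch (candidate texts and scores are invented for illustration):

```python
def select_training_target(candidates):
    """candidates: list of (suffix_text, judge_score) pairs.
    Return the suffix used as the NLL target for this prefix."""
    return max(candidates, key=lambda c: c[1])[0]

candidates = [
    ("original low-quality suffix", 0.31),  # raw pretraining data
    ("teacher rewrite of suffix",   0.84),  # post-trained rewriter
    ("policy rollout A",            0.62),
    ("policy rollout B",            0.88),  # model's own best attempt
]

print(select_training_target(candidates))  # "policy rollout B"
```

This also illustrates the scaling observation: early in training the teacher's rewrites win the judge's comparison, while later the policy's own rollouts score highest and become the targets.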
7) LingBot-World: Open-Source World Simulator - LingBot-World is an open-source world simulator that evolves a video generation model into an interactive, real-time environment engine. Built on a 28B-parameter Mixture-of-Experts architecture, it achieves high-fidelity dynamics across diverse domains with sub-second latency at 16 fps, outperforming Genie 3 and Mirage 2 in dynamic degree while being fully open-source.
● Three-stage evolution pipeline: A progressive training strategy transforms a pretrained video model into an interactive simulator: Stage I establishes a general video prior via the Wan2.2 14B model, Stage II injects world knowledge and action control through MoE middle-training on 60-second sequences, and Stage III adapts to causal attention with few-step distillation for real-time inference.
● Scalable data engine with hierarchical captioning: A hybrid data engine ingests real-world footage, game engine recordings, and Unreal Engine synthetic data. A three-layer captioning strategy (narrative, scene-static, and dense temporal) disentangles motion control from scene generation, enabling precise action-contingent dynamics learning.
● Emergent spatial memory: Without explicit 3D representations, the model maintains structural integrity of landmarks after 60 seconds out of view, reasons about unobserved state evolution (vehicles continuing trajectories off-screen), and supports coherent generation up to 10 minutes. VBench evaluation shows 0.8857 dynamic degree versus 0.76 for Yume-1.5 and 0.72 for HY-World 1.5.
● Versatile embodied AI applications: Beyond visual synthesis, the framework supports promptable world events (global weather/style shifts and local object injection via text), an action agent trained on Qwen3-VL-2B for autonomous exploration, and 3D reconstruction from generated videos validating geometric consistency.
Paper, Tweet
8) Insight Agents: Multi-Agent System for Data Insights - Insight Agents introduces a hierarchical multi-agent system built on a plan-and-execute paradigm for delivering personalized business insights to e-commerce sellers. The system uses a manager agent with OOD detection via a lightweight encoder-decoder model and BERT-based routing to coordinate two worker agents (data presenter and insight generator), achieving 90% accuracy with P90 latency below 15 seconds. Accepted at SIGIR 2025 and deployed for Amazon sellers in the US. Paper, Tweet
9) Communication Methods in Multi-Agent RL - A systematic survey of 29 papers on how agents coordinate in multi-agent reinforcement learning, covering fully connected message passing, implicit communication, attention-based selective methods, graph-based relational approaches, and role-based hierarchical frameworks. The analysis reveals that attention- and graph-based methods dominate recent research, while implicit communication is seeing renewed interest for its scalability in decentralized settings where explicit channels are infeasible. Paper, Tweet
10) Team of Rivals: Orchestrating Reliable AI Agents - This paper proposes organizing AI agents into corporate-style teams with strict role boundaries and opposing incentives (planners, executors, critics, experts) to achieve reliability through careful orchestration of imperfect components. A remote code executor separates reasoning from data transformations, preventing raw tool outputs from contaminating agent context windows. The system achieves over 90% internal error interception before user exposure while maintaining acceptable latency tradeoffs. Paper, Tweet

Top AI Papers of the Week (January 19 - January 25) - 2026

Paper Links
1) TTT-Discover: Learning to Discover at Test Time - TTT-Discover introduces test-time training for scientific discovery, performing reinforcement learning at test time so the LLM can continue to train with experience specific to the test problem. Unlike prior work like AlphaEvolve that prompts a frozen LLM, this approach enables the model itself to improve while attempting to solve hard problems.
● Test-time reinforcement learning: The method performs RL in an environment defined by a single test problem, with a learning objective and search subroutine designed to prioritize the most promising solutions rather than maximizing average reward across attempts.
● State-of-the-art across domains: TTT-Discover sets new records on Erdős’ minimum overlap problem, an autocorrelation inequality, GPUMode kernel competitions (up to 2x faster than prior art), AtCoder algorithm competitions, and single-cell denoising in biology.
● Open model results: All results are achieved with OpenAI gpt-oss-120b, an open model, and can be reproduced with publicly available code, in contrast to previous best results requiring closed frontier models.
● Cost-effective discovery: Test-time training runs cost only a few hundred dollars per problem using Tinker API, making scientific discovery accessible without massive compute budgets.
● Learning over search: While both learning and search scale with compute, the authors argue learning has historically superseded search for hard problems (Go, protein folding), and this observation extends to test-time compute scaling for discovery.
Paper, Tweet
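The "prioritize the most promising solutions rather than average reward" objective can be caricatured in a few lines. Everything below is an illustrative assumption (a 1-D Gaussian "policy", a toy verifier, a best-of-k update), not the paper's algorithm; it only shows the shape of a test-time loop that learns from a single problem's own verifier.

```python
import random
random.seed(0)

def verifier(x):
    # Stand-in for the test problem's own scoring function.
    return -(x - 3.7) ** 2

def test_time_search(steps=200, k=4):
    """Sample attempts, score them, and update toward the best attempt
    of each batch -- optimizing for the best solution found, not the
    average reward across attempts."""
    mean = 0.0                       # "policy" over candidate solutions
    best = (float("-inf"), None)     # (score, solution) found so far
    for _ in range(steps):
        samples = [random.gauss(mean, 1.0) for _ in range(k)]
        scored = sorted(((verifier(s), s) for s in samples), reverse=True)
        best = max(best, scored[0])
        mean += 0.2 * (scored[0][1] - mean)  # move toward the best sample
    return best

score, sol = test_time_search()
print(round(sol, 1))
```

The key contrast with frozen-LLM pipelines like AlphaEvolve is that here the sampling distribution itself moves as experience with the single test problem accumulates.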
2) Reasoning Models Generate Societies of Thought - This paper reveals that enhanced reasoning in models like DeepSeek-R1 and QwQ-32B emerges not from extended computation alone, but from simulating multi-agent-like interactions - a “society of thought” - enabling diversification and debate among internal cognitive perspectives with distinct personality traits and domain expertise.
● Multi-agent internal dynamics: Through mechanistic interpretability analysis, reasoning models exhibit much greater perspective diversity than instruction-tuned models, activating broader conflict between heterogeneous personality and expertise-related features during reasoning.
● Conversational behaviors drive accuracy: The multi-agent structure manifests in question-answering, perspective shifts, and reconciliation of conflicting views. These socio-emotional roles characterizing back-and-forth conversations account for the accuracy advantage in reasoning tasks.
● Emergent from accuracy rewards: Controlled reinforcement learning experiments reveal that base models naturally increase conversational behaviors when rewarded solely for reasoning accuracy, suggesting this structure emerges organically from optimization pressure.
● Accelerated improvement through scaffolding: Fine-tuning models with conversational scaffolding accelerates reasoning improvement over base models, providing a practical pathway to enhance reasoning capabilities.
● Parallel to collective intelligence: The findings suggest reasoning models establish a computational parallel to human collective intelligence, where diversity enables superior problem-solving when systematically structured, opening new opportunities for agent organization.
Paper, Tweet
3) Memory Control for Long-Horizon Agents - This paper introduces the Agent Cognitive Compressor (ACC), a bio-inspired mechanism that addresses degraded agent behavior in long multi-turn workflows caused by loss of constraint focus, error accumulation, and memory-induced drift. ACC replaces continuous transcript retention with a bounded internal state that updates incrementally during each interaction turn.
● The problem with unbounded context: Traditional approaches using transcript replay or retrieval-based memory systems create unbounded context growth and introduce vulnerabilities to corrupted information, causing agent performance to degrade over extended interactions.
● Bio-inspired bounded memory: Drawing from biological memory systems, ACC maintains a bounded internal state rather than continuously growing context, enabling stable performance without the computational costs of ever-expanding transcripts.
● Agent-judge evaluation framework: The authors developed an agent-judge-driven evaluation framework to assess both task success and memory-related anomalies across extended workflows in IT operations, cybersecurity response, and healthcare contexts.
● Reduced cognitive drift: ACC demonstrated substantially improved stability in multi-turn interactions, showing significantly reduced hallucination and cognitive drift compared to traditional transcript replay and retrieval-based systems.
● Practical foundation: The research suggests that implementing cognitive compression principles provides a practical foundation for developing reliable long-horizon AI agent systems that maintain consistent behavior over extended deployments.
Paper, Tweet
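The bounded-state idea can be sketched concretely; the slot structure and merge rule below are assumptions for illustration (ACC's actual consolidation would be learned or LLM-driven), but they show how per-turn context cost stays constant instead of growing with the transcript.

```python
MAX_SLOTS = 4

def update_state(state, turn_summary):
    """Fold one interaction turn into a bounded internal state.
    When the state overflows, the two oldest entries are compressed
    into one (a stand-in for a learned consolidation step)."""
    state = state + [turn_summary]
    if len(state) > MAX_SLOTS:
        merged = state[0] + " + " + state[1]
        state = [merged] + state[2:]
    return state

state = []
for i in range(6):
    state = update_state(state, f"turn-{i}")

print(len(state))  # never exceeds MAX_SLOTS, however long the workflow
```

Contrast with transcript replay, where the context passed to the model grows by one turn every step and eventually carries stale or corrupted entries forward indefinitely.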
4) Benchmarking Agents on Hard CLI Tasks - Terminal-Bench 2.0 presents a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification, addressing the gap where current benchmarks either don’t measure real-world tasks or aren’t sufficiently difficult.
● Challenging for frontier models: Frontier models and agents score less than 65% on the benchmark, demonstrating that Terminal-Bench meaningfully measures capabilities that current systems struggle with, unlike saturated benchmarks.
● Real-world task inspiration: Tasks are derived from actual command-line workflows, ensuring the benchmark measures practical skills rather than artificial puzzles, with each task featuring unique environments reflecting diverse real scenarios.
● Comprehensive verification: Every task includes human-written solutions and comprehensive tests for verification, enabling reliable and reproducible evaluation of agent performance on terminal-based tasks.
● Error analysis insights: The authors conduct detailed error analysis to identify specific areas for model and agent improvement, providing actionable guidance for researchers developing more capable CLI agents.
● Open evaluation infrastructure: The dataset and evaluation harness are publicly available at tbench.ai, enabling developers and researchers to benchmark their systems and contribute to advancing agent capabilities in terminal environments.
Paper, Tweet
5) Rethinking Multi-Agent Workflows - This paper challenges the assumption that complex tasks require multiple specialized AI agents, demonstrating that a single LLM agent, through iterative dialogue, can match the performance of homogeneous multi-agent workflows while gaining efficiency from KV cache reuse.
● Single-agent hypothesis: The research tests whether multi-agent systems truly require multiple agents or if a single agent engaging in multi-turn conversations can replicate their performance, finding that the latter holds across diverse benchmarks.
● Comprehensive evaluation: Testing across seven benchmarks spanning coding, math, QA, domain reasoning, and planning tasks demonstrates that the single-agent approach consistently matches multi-agent performance.
● OneFlow algorithm: The paper introduces OneFlow, an algorithm that automatically optimizes workflows for single-agent execution, enabling practitioners to simplify complex multi-agent architectures without sacrificing capability.
● Efficiency through KV cache reuse: Single-agent implementations gain substantial efficiency advantages by reusing key-value caches across conversation turns, reducing inference costs compared to multi-agent orchestration overhead.
● Future directions: The work identifies that truly heterogeneous systems using different specialized LLMs remain an open research opportunity, as current multi-agent benefits may only emerge when agents have genuinely different capabilities.
Paper, Tweet
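The KV-cache argument is easy to make quantitative with a toy token count (the accounting below is an illustrative simplification: it ignores attention cost and assumes multi-agent workers share no cache, which is the worst case the paper's efficiency claim contrasts against):

```python
def tokens_recomputed_single_agent(turns):
    """Single agent: the conversation grows monotonically, so each turn
    only encodes its own new tokens -- the shared prefix is cached."""
    return sum(len(t) for t in turns)

def tokens_recomputed_multi_agent(turns):
    """Separate agents: each worker re-reads the full transcript so far,
    with no key-value cache shared across agents."""
    total, transcript = 0, []
    for t in turns:
        transcript.extend(t)
        total += len(transcript)  # full re-encode every turn
    return total

# One 100-token task description followed by three 40-token turns.
turns = [list(range(100)), list(range(40)), list(range(40)), list(range(40))]
print(tokens_recomputed_single_agent(turns))  # 220
print(tokens_recomputed_multi_agent(turns))   # 640
```

The gap widens with conversation length, which is why the efficiency advantage compounds on long workflows.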
6) Self-Correcting Multi-Agent LLM for Physics Simulation - This paper introduces a self-correcting multi-agent LLM framework for language-based physics simulation and explanation. The system enables natural language queries to generate physics simulations while providing explanations of the underlying physical phenomena.
● Multi-agent architecture: The framework employs multiple specialized LLM agents that collaborate to translate natural language descriptions into accurate physics simulations, with each agent handling distinct aspects of the simulation pipeline.
● Self-correction mechanism: Built-in self-correction capabilities allow the system to identify and fix errors in generated simulations, improving accuracy without requiring human intervention or additional training.
● Language-based interface: Users can describe physics scenarios in natural language, making complex simulation tools accessible to non-experts while maintaining scientific accuracy in the outputs.
● Explanation generation: Beyond simulation, the system generates natural language explanations of the physics principles at work, serving both educational and research applications.
● Validation across domains: The framework demonstrates effectiveness across multiple physics domains, showing generalization capability beyond narrow task-specific applications.
Paper, Tweet
7) AI IDEs vs Autonomous Agents - This empirical study investigates how LLM-based coding agents that autonomously generate and merge pull requests affect open-source projects compared to IDE-integrated AI assistants. Using longitudinal causal analysis with matched controls, the researchers measure development velocity and software quality outcomes.
● Methodology: The study employs staggered difference-in-differences with matched controls, analyzing monthly metrics spanning development velocity and quality indicators like static-analysis warnings, code complexity, and duplication rates.
● Velocity gains are conditional: Substantial upfront acceleration occurs only when autonomous agents are a project’s first AI tool. Projects already using IDE assistants see minimal additional productivity benefits from adding autonomous agents.
● Persistent quality concerns: Across all contexts, static-analysis warnings rise roughly 18% and cognitive complexity increases approximately 35% when autonomous agents are deployed, suggesting tensions between speed and maintainability.
● Diminishing returns: Layering multiple AI assistance types produces limited additional productivity improvements, challenging the assumption that more AI tools always mean better outcomes.
● First-mover effects: The research differentiates effects based on whether agents represent a project’s first exposure to AI tooling versus augmenting existing assistance, finding that the sequence of adoption matters significantly.
Paper, Tweet
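The difference-in-differences estimator behind the methodology reduces to one subtraction of changes; the numbers below are invented purely to illustrate the computation, not taken from the study.

```python
def did(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: the treated group's change minus the
    matched control group's change over the same period."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical static-analysis warnings per KLOC, before/after adoption.
effect = did(treated_pre=10.0, treated_post=12.4,
             control_pre=10.1, control_post=10.5)
print(round(effect, 1))  # 2.0 extra warnings attributable to adoption
```

Subtracting the control group's trend is what lets the study attribute quality changes to agent adoption rather than to ecosystem-wide drift over the same months.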
8) Efficient Agents - A comprehensive review examining how to make LLM-based agents more efficient for real-world deployment, focusing on three core components: memory (bounding context via compression), tool learning (RL strategies to minimize tool invocation), and planning (controlled search mechanisms). The paper characterizes efficiency through dual metrics and Pareto frontier analysis between effectiveness and cost. Paper, Tweet
9) Task-Decoupled Planning for Long-Horizon Agents - Task-Decoupled Planning (TDP) is a training-free framework that restructures agent planning by decomposing tasks into a directed acyclic graph of sub-goals using three components: Supervisor, Planner, and Executor. By isolating reasoning to individual subtasks through scoped contexts, TDP prevents error cascading and reduces token consumption by up to 82% while outperforming baselines on TravelPlanner, ScienceWorld, and HotpotQA. Paper, Tweet
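The scoped-context idea in TDP can be sketched with a sub-goal DAG; the task names and the "executor sees only its parents' outputs" rule below are illustrative assumptions in the spirit of the framework, using the stdlib topological sorter.

```python
import graphlib  # stdlib topological sorter (Python 3.9+)

# DAG of sub-goals: each task maps to the set of tasks it depends on.
dag = {
    "pick_dates": set(),
    "book_flight": {"pick_dates"},
    "book_hotel": {"pick_dates"},
    "final_itinerary": {"book_flight", "book_hotel"},
}

results = {}
for task in graphlib.TopologicalSorter(dag).static_order():
    # Scoped context: only the outputs of this task's DAG parents,
    # never the full transcript -- this is what stops error cascades.
    scoped_context = {dep: results[dep] for dep in dag[task]}
    # An executor LLM would run here with only `scoped_context` visible.
    results[task] = f"done({task} | saw {sorted(scoped_context)})"

print(results["final_itinerary"])
```

Because a faulty subtask output can only reach its DAG children, an error stays local to one branch instead of polluting every later reasoning step.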
10) Large-Scale Study on Multi-Agent AI Systems Development - An empirical analysis of over 42,000 commits and 4,700 resolved issues across eight leading multi-agent frameworks (including LangChain, CrewAI, and AutoGen). Key findings: feature enhancements dominate at 40.8% of changes versus 27.4% bug fixes, bugs represent 22% of issues, with agent coordination challenges at 10%, and issue reporting surged notably beginning in 2023. Paper, Tweet

Top AI Papers of the Week (January 12 - January 18) - 2026

Paper Links
1) Learning Latent Action World Models In The Wild - Meta AI researchers address learning world models from in-the-wild videos without requiring explicit action labels, expanding beyond simple robotics simulations and video games to real-world video data with diverse embodiments and uncontrolled conditions.
● Latent action learning: The work demonstrates that continuous but constrained latent actions can capture the complexity of actions from in-the-wild videos, outperforming vector quantization approaches commonly used in prior work.
● Cross-video transfer: Changes in the environment coming from agents, such as humans entering a room, can be transferred across different videos, indicating that the learned latent actions capture meaningful and generalizable environmental interactions.
● Universal interface: Despite challenges from diverse embodiments across videos, the researchers train a controller that maps known actions to latent ones, enabling latent actions to serve as a universal interface for downstream planning tasks.
● Comparable to action-conditioned baselines: The latent action approach achieves comparable performance to action-conditioned baselines on planning tasks, demonstrating practical viability without requiring explicit action labels during training.
● Scaling to real-world data: The work represents progress toward scaling latent action models to realistic video data, addressing fundamental challenges in learning from diverse, uncontrolled video sources that lack action annotations.
Paper, Tweet
2) Extending Context by Dropping Positional Embeddings - DroPE introduces a method for extending a language model’s context window after pretraining without expensive long-context fine-tuning. The approach involves removing positional embeddings from a pretrained model and performing brief recalibration at the original context length.
● Core insight: Positional embeddings serve as a “training-time scaffold” - beneficial during pretraining but detrimental for extrapolation. RoPE helps attention develop non-uniformity faster during training, but becomes problematic at test time when sequences exceed the training length.
● The length generalization problem: Popular RoPE scaling methods preserve perplexity but essentially “crop” effective context, failing at retrieval tasks requiring long-range attention. DroPE addresses this by completely removing the positional scaffold after training.
● Simple methodology: The approach is straightforward: train or obtain a pretrained RoPE-based model, remove positional embeddings post-pretraining, then recalibrate briefly using as little as 0.5-2% of the original pretraining budget.
● Strong recovery: Models regain 95%+ in-context performance after less than 5B recalibration tokens. On needle-in-haystack tasks, DroPE substantially outperforms RoPE-scaling methods that fail at long-range retrieval.
● Scalability and benchmarks: Validated on models up to 7B parameters trained on trillions of tokens. Improves base SmolLM scores by 10x on LongBench and enables zero-shot context extension to 2x training length without task-specific fine-tuning.
Paper, Tweet
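The "scaffold" claim can be made concrete with a single 2-D RoPE frequency (a toy illustration, not DroPE's implementation; real RoPE rotates many frequency pairs per head): with the rotation, an attention logit depends on the relative offset m - n, so offsets beyond the training length are out of distribution, while without it the logit is pure content similarity at any position.

```python
import math

def rotate(vec, pos):
    """Apply one 2-D RoPE rotation: angle grows linearly with position."""
    x, y = vec
    c, s = math.cos(pos), math.sin(pos)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 0.3), (0.9, 0.5)

# RoPE logits depend only on the offset m - n ...
assert abs(dot(rotate(q, 10), rotate(k, 5))
           - dot(rotate(q, 105), rotate(k, 100))) < 1e-9
# ... so an offset of 5000, never seen in training, is a new input regime.
rope_far = dot(rotate(q, 5010), rotate(k, 10))

# Without positional embeddings, the logit is position-independent.
nope = dot(q, k)
print(round(nope, 3))  # 1.05 at every position
```

Removing the rotation after pretraining is exactly the kind of surgery DroPE performs, after which the brief recalibration lets attention re-learn to operate without the scaffold.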
3) Self-Evolving Search Agents Without Training Data - Dr. Zero introduces a framework for developing multi-turn search agents that improve themselves autonomously without labeled training data. A proposer generates diverse questions to train a solver initialized from the same base model, creating a self-evolution loop with automated curriculum difficulty scaling.
● Self-evolution loop: The framework establishes a feedback mechanism where a problem proposer creates questions and a solver learns from them. As the solver improves, difficulty automatically increases, creating an automated curriculum without human intervention.
● Hop-Grouped Relative Policy Optimization (HRPO): A novel training method that clusters structurally similar questions to construct group-level baselines. This approach reduces computational overhead while maintaining performance quality compared to instance-level optimization.
● Data-free performance: Experimental results demonstrate that the approach matches or surpasses fully supervised search agents, proving sophisticated multi-turn reasoning capabilities can emerge through self-evolution alone.
● Reduced data dependency: The work shows that complex reasoning and search functionalities can develop without external training data, potentially reducing dependency on expensive labeled datasets in AI development.
● Scalable self-improvement: The proposer-solver architecture enables continuous improvement cycles where the model effectively teaches itself increasingly difficult problems, suggesting a path toward more autonomous agent development.
Paper, Tweet
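The hop-grouped baseline can be sketched as a small variant of group-relative advantages (details assumed for illustration; HRPO's grouping and normalization may differ): rewards are centered against the mean of rollouts whose questions share the same hop count.

```python
from collections import defaultdict

def grouped_advantages(rollouts):
    """rollouts: list of (hop_count, reward) pairs. Advantage = reward
    minus the mean reward of the rollout's hop group, so structurally
    similar questions share one baseline."""
    groups = defaultdict(list)
    for hops, r in rollouts:
        groups[hops].append(r)
    means = {h: sum(rs) / len(rs) for h, rs in groups.items()}
    return [r - means[h] for h, r in rollouts]

# Two 1-hop rollouts and three 2-hop rollouts with binary rewards.
rollouts = [(1, 1.0), (1, 0.0), (2, 1.0), (2, 1.0), (2, 0.0)]
print(grouped_advantages(rollouts))
```

Sharing one baseline per hop group means fewer per-question rollouts are needed to get a usable baseline, which is where the computational savings over instance-level grouping come from.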
4) Unified Long-Term and Short-Term Memory for LLM Agents - AgeMem introduces a unified framework that integrates both long-term and short-term memory operations into an LLM agent’s decision-making policy. The system enables agents to autonomously determine what and when to store, retrieve, update, summarize, or discard information by exposing memory operations as tool-based actions.
● Unified memory management: Unlike existing solutions that treat long-term and short-term memory separately with inflexible heuristics, AgeMem combines both into a single learnable policy that adapts to task requirements dynamically.
● Memory as tool actions: The framework exposes memory operations (store, retrieve, update, summarize, discard) as callable tools, allowing the agent to learn optimal memory strategies through interaction rather than relying on predefined rules.
● Progressive reinforcement learning: A three-stage training approach with a specialized “step-wise GRPO” algorithm handles the sparse and discontinuous rewards created by memory operations, enabling stable learning of complex memory policies.
● Strong benchmark performance: Testing across five long-horizon benchmarks demonstrates that AgeMem outperforms comparable systems by improving task performance, memory quality, and context efficiency simultaneously.
● Architecture agnostic: The approach works with multiple LLM architectures, suggesting the learned memory management strategies transfer across different base models and task domains.
Paper, Tweet
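"Memory as tool actions" has a simple mechanical shape: the memory operations live in the same dispatch table as any other tool, so the policy can learn when to call them. The class layout and tool names below are illustrative assumptions, not AgeMem's API.

```python
class MemoryTools:
    def __init__(self):
        self.long_term = {}   # persistent key -> value store
        self.short_term = []  # rolling working-context notes

    def store(self, key, value):
        self.long_term[key] = value
        return f"stored {key}"

    def retrieve(self, key):
        return self.long_term.get(key, "not found")

    def summarize(self):
        note = f"{len(self.short_term)} notes condensed"
        self.short_term = [note]
        return note

    def discard(self, key):
        self.long_term.pop(key, None)
        return f"discarded {key}"

# The agent emits memory calls like any other action; a dispatcher routes them.
mem = MemoryTools()
tools = {"store": mem.store, "retrieve": mem.retrieve,
         "summarize": mem.summarize, "discard": mem.discard}

print(tools["store"]("user_goal", "book a flight"))
print(tools["retrieve"]("user_goal"))  # "book a flight"
```

Because every operation is just another action in the rollout, the step-wise RL signal can reach memory decisions directly instead of relying on fixed heuristics.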
5) Active Context Compression for LLM Agents - Focus introduces an agent-centered architecture that enables LLM agents to autonomously manage their own memory by deciding when to consolidate learnings into a persistent “Knowledge” block and actively prune raw interaction history. The design is inspired by the biological navigation patterns of Physarum polycephalum (slime mold).
● The context bloat problem: LLM agents struggle with extended tasks as interaction history accumulates, causing computational expenses to increase, processing delays to worsen, and reasoning to deteriorate from distraction by irrelevant prior mistakes.
● Autonomous memory management: Unlike passive external summarization, Focus agents autonomously choose when to store important discoveries and remove raw interaction records. The system performed 6.0 autonomous consolidations per assignment on average.
● Significant token reduction: Tested on context-heavy SWE-bench Lite cases using Claude Haiku 4.5, Focus reduces token consumption by 22.7% (14.9M to 11.5M tokens) while preserving identical accuracy (60% for both agents), with reductions reaching 57% on particular instances.
● Bio-inspired optimization: The architecture models biological navigation patterns where organisms efficiently manage resources and pathways, applying similar principles to context management in AI agents.
● Production-ready toolkit: The system uses a refined toolkit matching production standards, including a persistent bash and string-replacement editor, demonstrating practical applicability for real-world software engineering tasks.
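A minimal sketch of the consolidate-and-prune loop described above, under the assumption that consolidation triggers when raw history exceeds a threshold (the real Focus agent decides this itself, and `summarize` stands in for an LLM call):

```python
def summarize(turns):
    """Stand-in for an LLM call that distills raw turns into one learning."""
    return "learned: " + "; ".join(t.split(":", 1)[-1].strip() for t in turns)

class FocusAgent:
    """Toy active context compression: fold raw history into a persistent
    Knowledge block, then prune the raw turns it came from."""

    def __init__(self, max_raw_turns=4):
        self.knowledge = []        # persistent, consolidated learnings
        self.history = []          # raw interaction turns
        self.max_raw_turns = max_raw_turns
        self.consolidations = 0

    def observe(self, turn: str):
        self.history.append(turn)
        if len(self.history) > self.max_raw_turns:
            self.consolidate()

    def consolidate(self):
        # Store the distilled learning, then drop the raw interaction records.
        self.knowledge.append(summarize(self.history))
        self.history.clear()
        self.consolidations += 1

    def context(self) -> str:
        return "\n".join(self.knowledge + self.history)
```

The token savings reported above come from exactly this asymmetry: the Knowledge block grows by short distilled entries while the verbose raw turns are discarded.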
Paper, Tweet
6) Agent-as-a-Judge - This comprehensive survey traces the evolution from LLM-based evaluation to agentic evaluation approaches, establishing the first taxonomy for this paradigm shift. As evaluation tasks grow more intricate and specialized, traditional single-pass language model judges become insufficient.
● Beyond LLM-as-a-Judge: The paper identifies critical limitations of traditional LLM judges and how agentic approaches overcome them through planning, tool-augmented verification, multi-agent collaboration, and persistent memory.
● Developmental taxonomy: The survey creates a structured taxonomy organizing core methodologies that characterize the shift from static evaluation to dynamic, agent-based assessment systems.
● Enhanced capabilities: Agentic judges enable evaluations that are more robust, verifiable, and nuanced compared to single-pass reasoning approaches, particularly for complex tasks requiring multi-step verification.
● Domain applications: The work examines applications across both general and professional domains, showing how agentic evaluation adapts to specialized requirements in different fields.
● Research roadmap: Beyond surveying current methods, the paper analyzes frontier challenges and proposes research directions, offering practitioners a clear roadmap for developing next-generation evaluation systems.
Paper, Tweet
7) Efficient Lifelong Memory for LLM Agents - SimpleMem introduces a memory framework built on semantic lossless compression that addresses the tension between maintaining comprehensive long-term memory and minimizing token overhead for LLM agents. The approach achieves a 26.4% F1 improvement over baselines while reducing token consumption by up to 30-fold during inference.
● Semantic structured compression: The first stage applies filtering to transform unstructured interactions into compact, multi-view indexed memory units, preserving essential information while dramatically reducing storage requirements.
● Recursive memory consolidation: An asynchronous process reduces redundancy by integrating related memory units into higher-level representations, similar to how human memory consolidates experiences during rest periods.
● Adaptive query-aware retrieval: The system dynamically adjusts the retrieval scope based on query complexity, constructing context efficiently by pulling only the most relevant memories rather than fixed-size chunks.
● Strong efficiency gains: Experimental results demonstrate token consumption reduced by up to 30-fold during inference while improving accuracy, making long-horizon agent tasks practically feasible without prohibitive computational costs.
● Balanced performance: The framework provides a practical solution for deploying agents that need comprehensive memory without sacrificing response quality, addressing a critical bottleneck in real-world agent applications.
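As a rough illustration of adaptive query-aware retrieval, the sketch below widens the retrieval scope with query complexity instead of pulling fixed-size chunks. The word-count complexity heuristic and overlap-based relevance score are stand-ins for SimpleMem's actual multi-view indexing.

```python
def adaptive_retrieve(query, memory_units, k_base=1, k_max=5):
    """Pull more memory units for complex queries, fewer for simple ones.
    Relevance here is a toy word-overlap score standing in for a real
    embedding similarity over indexed memory units."""
    words = set(query.lower().split())
    complexity = len(words)                      # crude complexity proxy
    k = min(k_max, k_base + complexity // 4)     # widen scope with complexity
    scored = sorted(memory_units,
                    key=lambda m: -len(words & set(m.lower().split())))
    return scored[:k]
```

A pointed two-word query retrieves a single unit, while a multi-part question pulls a wider slice of memory, keeping context construction proportional to what the query actually needs.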
Paper, Tweet
8) Ministral 3 - Mistral AI releases Ministral 3, a family of compact language models (3B, 8B, 14B parameters) designed for compute and memory-constrained applications from mobile to edge deployments. Created through Cascade Distillation (iterative pruning with continued training), each size offers pretrained, instruction-finetuned, and reasoning variants with integrated image understanding, released under Apache 2.0. Paper, Tweet
9) UniversalRAG - UniversalRAG introduces a RAG system that handles knowledge retrieval from heterogeneous sources containing multiple data types (text, images, videos) with varying granularities. Rather than forcing diverse modalities into a single embedding space where embeddings cluster by modality rather than meaning, it uses modality-aware routing to dynamically select appropriate corpus and granularity for each query, outperforming both unimodal and unified multimodal RAG baselines across 10 benchmarks. Paper, Tweet
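Modality-aware routing can be sketched as a function from a query to a (corpus, granularity) pair. The keyword rules below are purely illustrative; UniversalRAG uses a trained or prompted router rather than anything hard-coded.

```python
def route(query: str):
    """Toy stand-in for UniversalRAG's router: pick a corpus (modality) and
    a granularity per query, instead of searching one unified embedding
    space where items cluster by modality rather than meaning."""
    q = query.lower()
    if any(w in q for w in ("show", "diagram", "photo", "picture")):
        corpus = "image"
    elif any(w in q for w in ("clip", "video", "scene", "footage")):
        corpus = "video"
    else:
        corpus = "text"
    # Coarse granularity for broad questions, fine-grained for pointed ones.
    granularity = "document" if len(q.split()) <= 5 else "passage"
    if corpus == "video":
        granularity = "clip" if granularity == "passage" else "full_video"
    return corpus, granularity
```

The retriever then searches only the selected corpus at the selected granularity, which is the mechanism that avoids modality clustering in a shared space.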
10) MemRL - MemRL enables LLM agents to improve continuously without retraining by separating a frozen model’s reasoning from an evolving memory system. A Two-Phase Retrieval mechanism filters candidates by semantic relevance, then ranks them using learned Q-values that improve through trial-and-error, outperforming existing methods on HLE, BigCodeBench, ALFWorld, and Lifelong Agent Bench. Paper, Tweet
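The Two-Phase Retrieval idea can be sketched as below; the overlap-based relevance score and the hand-set Q-value table stand in for MemRL's learned components (the Q-values would be updated from trial-and-error outcomes).

```python
def two_phase_retrieve(query, memory, q_values, k_filter=3, k_final=2):
    """Sketch of Two-Phase Retrieval: phase 1 filters candidates by (toy)
    semantic relevance; phase 2 re-ranks survivors by Q-values learned from
    trial-and-error, so utility, not just similarity, decides what is used."""
    words = set(query.lower().split())
    # Phase 1: keep the k_filter most semantically relevant entries.
    candidates = sorted(
        memory, key=lambda m: -len(words & set(m.lower().split())))[:k_filter]
    # Phase 2: rank candidates by learned utility.
    ranked = sorted(candidates, key=lambda m: -q_values.get(m, 0.0))
    return ranked[:k_final]
```

Separating the phases keeps the frozen model's reasoning untouched: only the Q-values evolve, which is what allows continual improvement without retraining.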

Top AI Papers of the Week (January 5 - January 11) - 2026

Paper Links
1) On the Slow Death of Scaling - This essay by Sara Hooker challenges the decade-long assumption that scaling compute always leads to better AI performance. It argues that the relationship between training compute and performance is highly uncertain and rapidly changing, with smaller models now routinely outperforming much larger ones.
● Diminishing returns of scale: Smaller models like Llama-3 8B and Aya 23 8B now outperform far larger models like Falcon 180B and BLOOM 176B despite having only a fraction of the parameters. This trend is systematic, not isolated.
● Algorithmic improvements matter more: Progress has been driven by instruction finetuning, model distillation, chain-of-thought reasoning, preference training, and retrieval augmented generation - techniques that add little training compute but yield significant performance gains.
● Scaling laws have limits: Scaling laws only reliably predict pre-training test loss, not downstream task performance. Many capabilities display irregular scaling curves, and small sample sizes make predictions statistically weak.
● New optimization spaces: Future progress will come from inference-time compute, malleable synthetic data that can be optimized on-the-fly, and better human-AI interfaces rather than simply adding more parameters.
● Cultural implications: The belief in scaling has marginalized academia, concentrated breakthroughs in wealthy regions, and led industry labs to stop publishing, reshaping the entire culture of AI research.
Paper, Tweet
2) Recursive Language Models - Recursive Language Models (RLMs) are a general inference strategy that allows LLMs to process arbitrarily long prompts by treating them as part of an external environment. Rather than feeding long contexts directly into the model, RLMs load the prompt as a variable in a Python REPL and let the LLM programmatically examine, decompose, and recursively call itself over snippets.
● Scaling beyond context windows: RLMs successfully handle inputs up to two orders of magnitude beyond model context windows, scaling to the 10M+ token regime while maintaining strong performance.
● Outperforming base models: On information-dense tasks like OOLONG-Pairs, GPT-5 achieves less than 0.1% F1 while RLM(GPT-5) reaches 58% F1. RLMs outperform base models and common long-context scaffolds by up to 2x on diverse benchmarks.
● Emergent decomposition patterns: Without explicit training, RLMs exhibit sophisticated behaviors including filtering context using regex queries based on model priors, chunking and recursive sub-calling, and answer verification through small-context sub-LM calls.
● Cost-effective inference: RLMs maintain comparable or lower costs than base model calls at median, with the ability to selectively view context rather than ingesting entire inputs like summarization approaches.
● Task complexity scaling: While base LLM performance degrades rapidly with both input length and task complexity, RLMs degrade at a much slower rate, maintaining effectiveness even on quadratically scaling tasks.
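The inference strategy above can be sketched in miniature. The `llm` function is a stand-in that just pattern-matches a needle, and the fixed halving strategy is an assumption for illustration; a real RLM writes its own decomposition code inside a Python REPL where the prompt lives as a variable.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real LLM call: 'answers' by spotting a needle pattern,
    so the recursive decomposition itself can be exercised without a model."""
    for line in prompt.splitlines():
        if "SECRET=" in line:
            return line.split("SECRET=", 1)[1].strip()
    return "NOT_FOUND"

def rlm(context: str, question: str, window: int = 200) -> str:
    """Recursive Language Model sketch: the long context stays in a variable,
    never fully inside one prompt. Short contexts go straight to the model;
    long ones are split and handled by recursive sub-calls."""
    if len(context) <= window:
        return llm(question + "\n" + context)
    lines = context.splitlines()
    if len(lines) <= 1:                      # cannot split further
        return llm(question + "\n" + context)
    mid = len(lines) // 2
    for half in ("\n".join(lines[:mid]), "\n".join(lines[mid:])):
        answer = rlm(half, question, window)
        if answer != "NOT_FOUND":
            return answer
    return "NOT_FOUND"
```

Note that no single model call ever sees more than `window` characters, which is how the approach scales far beyond the underlying context limit, and why cost stays bounded: irrelevant chunks are dismissed by cheap small-context sub-calls.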
Paper, Tweet
3) Adversarial Program Evolution with LLMs - Digital Red Queen (DRQ) introduces an algorithm where LLMs evolve assembly-like programs called “warriors” that compete for control of a virtual machine in the game of Core War. Rather than optimizing toward static objectives, DRQ embraces “Red Queen” dynamics where goals continually shift based on competition, demonstrating how adversarial self-play can drive the evolution of increasingly sophisticated programs.
● Core War as testbed: The classic programming game serves as an ideal environment for studying adversarial adaptation, where programs must simultaneously attack opponents and defend themselves in shared memory space.
● Emergent generalization: Evolved warriors become increasingly effective against unseen opponents, suggesting that competitive dynamics produce more robust solutions than static optimization objectives.
● Behavioral convergence: Despite independent evolutionary runs, warriors show paradoxical behavioral convergence, indicating that competitive pressure discovers similar successful strategies across different lineages.
● Dynamic objectives outperform static: The research demonstrates that continually shifting competitive objectives can outperform traditional static optimization for evolving general-purpose solutions.
● Broad applications: The approach has implications for cybersecurity (evolving attack/defense strategies), evolutionary biology (modeling arms races), and AI safety (understanding adversarial dynamics in multi-agent systems).
Paper, Tweet
4) Nemotron-Cascade - Nemotron-Cascade introduces cascaded domain-wise reinforcement learning (Cascade RL) to build general-purpose reasoning models capable of operating in both instruct and deep thinking modes. Rather than blending heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL stages that reduce engineering complexity while delivering state-of-the-art performance.
● Sequential domain-wise RL: The approach chains RLHF, instruction-following RL, math RL, code RL, and SWE RL in sequence. Subsequent stages rarely degrade earlier domain performance and may even improve it, avoiding catastrophic forgetting.
● RLHF as reasoning booster: RLHF for alignment, when used as a pre-step, boosts reasoning ability far beyond mere preference optimization, serving as a foundation for subsequent domain-specific RL stages.
● Strong competitive coding results: The 14B model outperforms its SFT teacher DeepSeek-R1-0528 on LiveCodeBench v5/v6/Pro and achieves silver-medal performance at the 2025 International Olympiad in Informatics (IOI).
● Cross-domain excellence: The 8B model achieves 71.1% on LiveCodeBench V6, 37.2% on SWE-bench Verified, and 80.1% on AIME 2025, outperforming larger models like Qwen3-8B and matching or exceeding frontier reasoning models.
● Transparent recipes: NVIDIA shares complete training and data recipes, including multi-stage SFT, reward modeling, and domain-specific RL configurations for reproducibility.
Paper, Tweet
5) GDPO - GDPO addresses a critical flaw in training language models with multiple competing objectives. The authors discover that when applying Group Relative Policy Optimization (GRPO) to multi-reward settings, normalizing distinct rollout reward combinations causes them to collapse into identical advantage values, degrading training signal quality and stability.
● Fundamental flaw identified: Standard GRPO normalizes rewards across all objectives together, which causes distinct reward combinations to collapse into nearly identical advantage values, destroying the nuanced signal needed for multi-objective optimization.
● Decoupled normalization: GDPO decouples the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization across competing objectives.
● Consistent improvements: GDPO demonstrated gains over GRPO across three domains: tool calling, mathematical reasoning, and code generation, improving both correctness metrics (accuracy, defect rates) and constraint adherence (format compliance, output length).
● Practical multi-objective training: The approach enables training models that must simultaneously optimize for multiple goals, such as being accurate while following format constraints, without the objectives interfering destructively.
● Drop-in replacement: GDPO can serve as a drop-in replacement for GRPO in multi-reward RL pipelines, requiring minimal changes to existing training infrastructure while providing more stable and effective optimization.
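The normalization difference is easy to see numerically. The toy group below has two reward dimensions on very different scales; the specific numbers are illustrative, but the collapse under joint normalization and its fix under decoupled normalization are the point.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style: sum rewards per rollout, then normalize across the group."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style: normalize each reward dimension separately, then combine,
    preserving each objective's relative differences."""
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

# Group of 4 rollouts, two objectives on very different scales:
# column 0 = correctness (0/1), column 1 = format bonus (0 or 0.01).
R = np.array([[1.0, 0.00],
              [1.0, 0.01],
              [0.0, 0.00],
              [0.0, 0.01]])
```

Under joint normalization the tiny-scale format reward is drowned out, so rollouts 0 and 1 receive nearly identical advantages despite differing on format; per-dimension normalization keeps both training signals alive.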
Paper, Tweet
6) Training AI Co-Scientists Using Rubric Rewards - This paper from Meta Superintelligence Labs presents a scalable method to train language models to generate better research plans without expensive human supervision or real-world execution. The approach automatically extracts research goals and goal-specific grading rubrics from scientific papers, then uses reinforcement learning with self-grading to improve plan generation.
● Automated data extraction: Research goals and grading rubrics are automatically extracted from papers across ML, medical, and arXiv domains. Human experts validated that 84% of rubric items capture necessary requirements for good research plans.
● Self-grading with privileged information: A frozen copy of the initial model acts as a grader, using extracted rubrics as privileged information to evaluate plans. This creates a generator-verifier gap that enables training without external supervision.
● Strong human validation: In a 225-hour study with ML experts, the finetuned Qwen3-30B model’s plans were preferred over the initial model for 70% of research goals, with experts rating them as sounder and more likely to lead to better outcomes.
● Cross-domain generalization: Models trained on one domain generalize significantly to others. The medical-finetuned model achieved 15% relative improvement on ML tasks and 17.5% on arXiv tasks, suggesting the approach learns generally desirable research plan qualities.
● Competitive with frontier models: The finetuned 30B model becomes competitive with Grok-4-Thinking, achieving 12-22% relative improvements across domains, though GPT-5-Thinking remains the top performer.
Paper, Tweet
7) Confucius Code Agent - Confucius Code Agent (CCA) is a software engineering agent designed to operate on large-scale codebases. Built on the Confucius SDK, it introduces a three-axis design philosophy separating Agent Experience (AX), User Experience (UX), and Developer Experience (DX) to enable robust multi-step reasoning and modular tool use.
● Hierarchical working memory: Uses adaptive context compression to maintain essential state during long-horizon reasoning without exceeding context limits. A planner agent summarizes earlier turns into structured plans, reducing prompt length by over 40% while preserving key reasoning chains.
● Persistent note-taking: A dedicated note-taking agent distills interaction trajectories into structured Markdown notes, capturing both successful strategies and failure cases for cross-session learning. This reduces token costs by approximately 11k tokens and improves resolve rates on repeated tasks.
● Modular extension system: Tool-use behaviors are factored into typed extensions that attach to the orchestrator, enabling reusable, auditable, and adaptable capabilities across different agents and tool stacks.
● Meta-agent automation: A meta-agent automates a build-test-improve loop that synthesizes, evaluates, and refines agent configurations, enabling rapid adaptation to new environments and tasks without manual prompt engineering.
● Strong benchmark results: On SWE-Bench-Pro, CCA achieves 54.3% Resolve@1 with Claude 4.5 Opus, exceeding prior research baselines. With Claude 4.5 Sonnet, CCA reaches 52.7%, outperforming Claude 4.5 Opus with Anthropic’s proprietary scaffold at 52.0%, demonstrating that scaffolding can outweigh raw model capability.
Paper, Tweet
8) SciSciGPT - SciSciGPT is an open-source AI collaborator that uses the science of science domain as a testbed for LLM-powered research tools. Its multi-agent architecture with five specialized modules automates complex research workflows and completes tasks in about 10% of the time required by experienced researchers while producing higher-quality outputs. Paper, Tweet
9) SWE-EVO - SWE-EVO introduces a benchmark for evaluating coding agents on long-horizon software evolution tasks that require multi-step modifications spanning an average of 21 files per task. The benchmark reveals significant limitations of current agents: GPT-5 with OpenHands achieves only 21% on SWE-EVO compared to 65% on SWE-Bench Verified, highlighting the gap between isolated bug fixes and realistic software development scenarios. Paper, Tweet
10) Deep Delta Learning - Deep Delta Learning introduces a novel “Delta Operator” that generalizes residual connections by modulating the identity shortcut with a learnable, data-dependent geometric transformation. This enables networks to dynamically interpolate between identity mapping, orthogonal projection, and geometric reflection, allowing selective forgetting of features rather than just accumulation. Paper, Tweet

Top AI Papers of the Week (December 29 - January 4) - 2026

Paper Links
1) End-to-End Test-Time Training for Long Context - This paper reframes long-context language modeling as a continual learning problem rather than architecture design. TTT-E2E uses a standard Transformer with sliding-window attention that continues learning at test time via next-token prediction, compressing context into its weights rather than storing all key-value pairs.
● Test-time training approach: The model learns at test time by predicting next tokens on the given context, compressing information into weights. This is combined with meta-learning at training time to prepare the model’s initialization for test-time learning.
● End-to-end in two ways: The inner loop directly optimizes next-token prediction loss, while the outer loop optimizes the final loss after TTT via gradients of gradients. This contrasts with prior TTT methods and dynamic evaluation approaches.
● Scaling with context length: For 3B models trained on 164B tokens, TTT-E2E scales with context length the same way as full attention Transformers, while alternatives like Mamba 2 and Gated DeltaNet do not maintain performance in longer contexts.
● Efficient inference: Similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7x faster than full attention for 128K context on H100 while achieving comparable or better loss.
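The inner loop above can be illustrated with a toy next-token model: SGD on the given context compresses it into the weights. The bigram logit table below stands in for the actual sliding-window Transformer, and the meta-learned initialization (the outer loop) is omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ttt_inner_loop(W, context, lr=0.5, epochs=3):
    """Test-time inner loop sketch: run SGD on next-token prediction over
    the context, storing what was read in the weights instead of a KV cache.
    W is a (vocab, vocab) bigram logit table; context is a token list."""
    W = W.copy()
    for _ in range(epochs):
        for prev, nxt in zip(context[:-1], context[1:]):
            p = softmax(W[prev])
            grad = p.copy()
            grad[nxt] -= 1.0            # d(cross-entropy)/d(logits)
            W[prev] -= lr * grad
    return W

def avg_nll(W, seq):
    losses = [-np.log(softmax(W[prev])[nxt] + 1e-12)
              for prev, nxt in zip(seq[:-1], seq[1:])]
    return float(np.mean(losses))
```

Because the context ends up in the weights, generation afterwards costs the same regardless of how long the context was, which is the source of the constant inference latency noted above.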
Paper, Tweet
2) Geometric Memory in Sequence Models - This paper identifies geometric memory, a markedly different account of how deep sequence models store factual information, contrasting with the traditional associative memory view. Models synthesize embeddings encoding global relationships between all entities, even ones that never co-occur in training.
● Two memory paradigms: Associative memory uses brute-force lookup of co-occurring entities with arbitrary embeddings. Geometric memory instead encodes global structure in carefully arranged embeddings where dot products capture multi-hop distances between entities.
● Powerful reasoning transformation: Geometric memory transforms hard reasoning tasks involving multi-step composition into easy-to-learn 1-step navigation tasks. Models succeed at path-finding on massive graphs when edges are memorized in the weights, even on tasks constructed so that brute-force lookup should fail.
● Unexplained emergence: The geometry is learned even when it is more complex than brute-force lookup, without global supervision, rank constraints, or obvious architectural pressures. This creates a fundamental memorization puzzle.
● Spectral bias explanation: By analyzing connections to Node2Vec, the researchers demonstrate that geometry stems from spectral bias arising naturally from cross-entropy loss minimization. Node2Vec models show more strongly geometric embeddings than Transformers, pointing to headroom for improvement.
Paper, Tweet
3) Universal Reasoning Model - This paper investigates why universal transformers excel at complex reasoning tasks like ARC-AGI. The key finding: performance gains come primarily from recurrent inductive bias and strong nonlinear components rather than elaborate architectural designs.
● Recurrent mechanism matters most: Through extensive ablation studies, the researchers show that reasoning capability beyond standard transformers comes from the recurrent mechanism of universal transformers, not from overly elaborate designs in prior work.
● ConvSwiGLU enhancement: The Universal Reasoning Model (URM) augments the standard SwiGLU feed-forward block with depthwise short convolution, injecting local contextual interactions into the gating mechanism without increasing sequence-level complexity.
● Truncated backpropagation: The approach uses truncated backpropagation through loops, enabling efficient training of the recurrent architecture while maintaining strong performance on reasoning tasks.
● State-of-the-art ARC-AGI results: URM achieves 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2, substantially outperforming prior UT-based models like TRM (40%) and HRM (32%) on ARC-AGI 1.
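A minimal NumPy sketch of the ConvSwiGLU idea, assuming a causal depthwise convolution applied to the gate path; the shapes and initialization are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def depthwise_causal_conv(x, kernel):
    """x: (seq, dim); kernel: (k, dim). Each channel is convolved along the
    sequence with its own short filter, looking only at past positions."""
    k, dim = kernel.shape
    pad = np.vstack([np.zeros((k - 1, dim)), x])
    return np.stack([(pad[t:t + k] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def conv_swiglu(x, W_gate, W_up, W_down, conv_kernel):
    """SwiGLU feed-forward block whose gate is passed through a depthwise
    short convolution, injecting local contextual interactions into the
    gating mechanism without any sequence-level attention."""
    gate = depthwise_causal_conv(x @ W_gate, conv_kernel)
    return (silu(gate) * (x @ W_up)) @ W_down
```

The convolution is per-channel and short (kernel size 3 here), so the extra cost is linear in sequence length, keeping the block's complexity unchanged.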
Paper, Tweet
4) AI Agents for Coding in 2025 - This study from UC San Diego and Cornell examines how experienced software developers (3+ years) actually use AI coding agents, through field observations (N=13) and surveys (N=99). The key finding: professional developers don’t “vibe code” - they carefully control agents through planning and supervision.
● Control over vibing: Unlike the “vibe coding” trend, where developers trust AI without reviewing code, experienced professionals maintain careful oversight. They plan before implementing and validate all agentic outputs to ensure software quality.
● Productivity with quality: Developers value agents as a productivity boost while still prioritizing software quality attributes. Some reported feeling their productivity increased tenfold, though they emphasized maintaining control over the process.
● Task suitability: Agents perform well on well-described, straightforward tasks but struggle with complex tasks. The study found agents suitable for code generation, debugging, and boilerplate, but less effective for architectural decisions.
● Positive sentiment with control: Developers generally enjoy using agents as long as they remain in control. A notable randomized trial found experienced maintainers were actually slowed by 19% when using AI, highlighting the importance of proper integration strategies.
Paper, Tweet
5) Manifold-Constrained Hyper-Connections - This DeepSeek paper proposes Manifold-Constrained Hyper-Connections (mHC), a framework that extends residual connections by expanding residual stream width while restoring training stability. The key insight: unconstrained Hyper-Connections compromise identity mapping, causing training instability at scale.
● Identity mapping restoration: mHC projects residual connection matrices onto the Birkhoff polytope using the Sinkhorn-Knopp algorithm, constraining them to doubly stochastic matrices. This preserves the feature mean during propagation and prevents vanishing or exploding signals.
● Stability at scale: Standard Hyper-Connections showed loss surges around 12k steps with gradient norm instability and Amax Gain Magnitude peaks of 3000. mHC maintains stable training by ensuring the composite mapping across layers preserves conservation properties.
● Efficient infrastructure: The approach uses kernel fusion with TileLang, selective recomputation to reduce memory footprint, and overlapped communication within the DualPipe schedule. This introduces only 6.7% additional time overhead at expansion rate n=4.
● Scalable performance: Experiments demonstrate mHC maintains the performance advantages of Hyper-Connections while enabling training at scale, offering a practical path for scaling via residual stream width rather than just model FLOPs or data size.
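The projection step can be sketched with plain Sinkhorn-Knopp iterations, under the assumption that the mixing matrix's entries have already been made strictly positive (mHC's exact parameterization and fused-kernel implementation differ):

```python
import numpy as np

def sinkhorn_knopp(M, iters=50):
    """Drive a strictly positive matrix toward the Birkhoff polytope by
    alternately normalizing rows and columns, yielding an (approximately)
    doubly stochastic matrix, the constraint mHC places on residual mixing."""
    M = np.asarray(M, dtype=float)
    assert (M > 0).all(), "Sinkhorn-Knopp needs strictly positive entries"
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)   # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M
```

Because every column of a doubly stochastic matrix sums to 1, mixing the residual streams with it preserves the feature mean, the conservation property that prevents the vanishing and exploding signals seen with unconstrained Hyper-Connections.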
Paper, Tweet
6) Spacing Effect for Generalization - Researchers from Tsinghua University investigate how the spacing effect - a well-documented learning principle where spaced intervals between training improve retention - can enhance generalization in both biological and artificial neural networks.
● Bio-inspired hypothesis: The spacing effect promotes integration of input and innate variations during learning, enabling better generalization to novel but related scenarios. The researchers test this by introducing bio-inspired spacing mechanisms into artificial neural networks.
● Spaced dropout implementation: The approach implements structured dropout whose probability varies periodically to introduce structured neuronal variability during training. Test accuracy follows an inverted-U trend, peaking at intermediate variation strengths.
● Cross-architecture validation: Performance gains from spaced dropout are demonstrated across different network architectures and benchmark datasets, showing the approach generalizes beyond specific model types.
● Flatter loss landscapes: Theoretical and empirical analyses show that spacing effect benefits stem from convergence to flatter loss landscapes during stochastic gradient descent, resulting in better real-world performance and stronger resistance to noisy data.
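A sketch of dropout with a periodically varying rate, alternating calm and high-variability phases in loose analogy to spaced learning intervals; the sinusoidal schedule and its constants are illustrative assumptions, not the paper's exact mechanism.

```python
import numpy as np

def spaced_dropout_mask(shape, step, period=100, p_low=0.05, p_high=0.5,
                        rng=None):
    """Return an inverted-dropout mask whose drop probability varies
    periodically with the training step, introducing structured neuronal
    variability rather than a constant rate."""
    rng = rng or np.random.default_rng()
    phase = (step % period) / period
    p = p_low + (p_high - p_low) * (np.sin(2 * np.pi * phase) + 1) / 2
    keep = rng.random(shape) >= p
    return keep / (1.0 - p)          # inverted-dropout scaling keeps E[x] fixed

# Usage: activations = activations * spaced_dropout_mask(activations.shape, step)
```

Sweeping the schedule's amplitude is where the inverted-U appears: too little variability behaves like plain dropout, too much destroys the signal, and intermediate strengths generalize best.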
Paper, Tweet
7) SAGA - SAGA (Scientific Autonomous Goal-evolving Agent) introduces a framework for automating objective function design in AI-driven scientific discovery. Rather than optimizing fixed objectives specified by scientists, SAGA dynamically reformulates research goals throughout the discovery process to avoid reward hacking issues.
● Bi-level architecture: SAGA employs an outer loop where LLM agents analyze optimization outcomes and propose refined objectives, while an inner loop performs solution optimization. This enables systematic exploration of objective trade-offs that remain invisible in traditional fixed-objective approaches.
● Three automation modes: The framework offers co-pilot (human collaboration on analysis and planning), semi-pilot (human feedback to analyzer only), and autopilot (fully automated) modes for flexible human-AI interaction.
● Diverse scientific applications: SAGA was validated across antibiotic design for K. pneumoniae, inorganic materials design (permanent magnets, superhard materials), functional DNA sequence design, and chemical process flowsheets.
● Strong performance: In antibiotic design, SAGA achieved drug-like molecules with high predicted activity while baselines either failed to optimize activity or produced chemically invalid structures. For materials, SAGA found 15 novel stable structures within 200 DFT calculations, outperforming MatterGen. In DNA design, SAGA improved MPRA specificity by at least 48% over baselines.
Paper, Tweet
8) Step-DeepResearch - Step-DeepResearch is a 32B parameter deep research agent that rivals OpenAI and Gemini DeepResearch through atomic capability training - decomposing research into planning, information gathering, cross-source verification, and report writing. Achieving 61.42 on Scale AI ResearchRubrics with a streamlined ReAct-style design, it outperforms larger models while being the most cost-effective deep research agent available. Paper, Tweet
9) MACI - This paper argues that LLMs are not fundamentally limited as pattern matchers - the real bottleneck is the lack of a System-2 coordination layer. The authors propose MACI, an architecture implementing three mechanisms: baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory) to enable goal-directed reasoning on top of LLM substrates. Paper, Tweet
10) AgentReuse - AgentReuse addresses latency bottlenecks in LLM-driven agents by caching and reusing plans for similar requests, observing that about 30% of agent requests are identical or similar. Using intent classification for semantic similarity rather than surface-level text comparison, the system achieves a 93% effective plan reuse rate and 93.12% latency reduction compared to systems without plan reuse. Paper, Tweet
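Plan reuse keyed on intent rather than surface text can be sketched as below; `classify_intent`'s keyword rules and the tool names are purely illustrative stand-ins for AgentReuse's actual intent classifier and planner.

```python
def classify_intent(request: str) -> str:
    """Stand-in for a learned intent classifier: map a request to a coarse
    intent label so paraphrases share a single cache entry."""
    q = request.lower()
    if "weather" in q or "forecast" in q:
        return "get_weather"
    if "book" in q and ("flight" in q or "ticket" in q):
        return "book_flight"
    return "other:" + q            # unknown intents fall back to exact match

class PlanCache:
    """Reuse previously generated plans for requests with the same intent,
    skipping the expensive LLM planning call on a cache hit."""

    def __init__(self, planner):
        self.planner = planner
        self.cache = {}
        self.hits = 0

    def get_plan(self, request: str):
        intent = classify_intent(request)
        if intent in self.cache:
            self.hits += 1
            return self.cache[intent]
        plan = self.planner(request)   # expensive LLM planning call
        self.cache[intent] = plan
        return plan
```

Because the key is the intent label, "weather in Paris today" and "tomorrow's forecast for Berlin" hit the same cached plan even though their text barely overlaps, which is the semantic-similarity matching the entry describes.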