
AI Papers of the Week — 2024


This page collects every weekly issue of AI Papers of the Week from 2024. For other years, see the main index.


Top AI Papers of the Week (December 23 - December 29) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) DeepSeek-V3 - a 671B-parameter MoE language model that activates 37B parameters per token, utilizing MLA and DeepSeekMoE architectures for efficient operation; it introduces an auxiliary-loss-free load balancing approach and employs multi-token prediction during training to enhance performance; following pre-training on 14.8 trillion tokens, the model underwent SFT and RL stages, achieving performance comparable to leading closed-source models while surpassing other open-source alternatives; the model requires only 2.788M H800 GPU hours for training, with stable training that avoids any irrecoverable loss spikes. | Paper, Tweet |
| 2) Large Concept Models - presents an approach that operates on sentence-level semantic representations called concepts, moving beyond token-level processing typical in current LLMs; the model leverages SONAR sentence embeddings to support 200 languages across text and speech modalities, training on autoregressive sentence prediction using various approaches from MSE regression to diffusion-based generation; experiments with both 1.6B and 7B parameter variants trained on 1.3T and 7.7T tokens respectively demonstrate strong performance on generative tasks like summarization and summary expansion. | Paper, Tweet |
| 3) ModernBERT - a new encoder-only transformer model that achieves state-of-the-art performance on classification and retrieval tasks while being more efficient than previous encoders; it was trained on 2T tokens with 8192 sequence length and incorporates modern optimizations that represent a significant improvement over BERT; the model is specifically designed for practical deployment, offering superior speed and memory efficiency on common GPUs. | Paper, Tweet |
| 4) Automating the Search for Artificial Life - presents a new approach that uses foundation models to automatically discover interesting artificial life simulations across multiple platforms like Boids, Lenia, and Game of Life; the system can find simulations that produce specific target behaviors, discover simulations that generate temporally open-ended novelty, and map out diverse simulation spaces; it discovers new lifeforms in Lenia and Boids, while also enabling quantitative measurement of previously qualitative phenomena in a human-aligned way. | Paper, Tweet |
| 5) A Survey on LLM Inference-Time Self-Improvement - presents a survey that analyzes three categories of LLM inference-time self-improvement techniques: independent methods like enhanced decoding, context-aware approaches using external data, and model collaboration strategies. | Paper, Tweet |
| 6) Explore Theory-of-Mind - introduces ExploreToM, a framework that uses A* search to generate diverse, complex theory-of-mind scenarios that reveal significant limitations in current LLMs' social intelligence capabilities; testing showed even advanced models like GPT-4 and Llama-3 perform poorly (as low as 5% accuracy) on these challenging scenarios, despite their strong performance on simpler benchmarks; fine-tuning on ExploreToM data improved performance on existing benchmarks by 27 points. | Paper, Tweet |
| 7) LearnLM - a new model that can follow pedagogical instructions, allowing it to adapt its teaching approach based on specified educational needs rather than defaulting to simply presenting information; experimental results show that LearnLM is preferred over other leading models, outperforming GPT-4o by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%; this instruction-following approach avoids committing to a single pedagogical framework, instead enabling teachers and developers to specify their desired teaching behaviors while allowing for continuous improvement alongside other capabilities. | Paper, Tweet |
| 8) Empowering MLLM with o1-like Reasoning and Reflection - proposes a new learning-to-reason method called CoMCTS that enables multimodal language models to develop step-by-step reasoning capabilities by leveraging collective knowledge from multiple models; the approach was used to create Mulberry-260k, a dataset with explicit reasoning trees, which was then used to train the Mulberry model series; the method demonstrates strong performance on benchmarks, with the models showing improved reasoning and reflection capabilities. | Paper, Tweet |
| 9) Reinforcement Learning Overview - presents a comprehensive overview of reinforcement learning. | Paper, Tweet |
| 10) DRT-o1 - applies long chain-of-thought reasoning to machine translation, particularly for handling metaphors and similes across different cultures; the system uses a multi-agent framework with a translator working iteratively with an advisor and evaluator to produce better translations; testing with Qwen2.5 models showed significant improvements in BLEU and CometScore metrics, with DRT-o1-7B outperforming larger models like QwQ-32B-Preview. | Paper, Tweet |

Top AI Papers of the Week (December 16 - December 22) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) Genesis - a new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than real time. | Paper, Tweet |
| 2) Alignment Faking in LLMs - demonstrates that the Claude model can engage in "alignment faking"; it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods. | Paper, Tweet |
| 3) TheAgentCompany - a new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; when tested with various LLMs, including both API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, the results show the current limitations of AI agents; the best-performing model, Claude-3.5-Sonnet, achieved only a 24% success rate on completing tasks fully while scoring 34.4% when accounting for partial progress. | Paper, Tweet |
| 4) Graphs to Text-Attributed Graphs - automatically generates textual descriptions for nodes in a graph, which leads to an effective transformation from graphs to text-attributed graphs; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. | Paper, Tweet |
| 5) Qwen2.5 Technical Report - Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4. | Paper, Tweet |
| 6) PAE (Proposer-Agent-Evaluator) - a learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks. | Paper |
| 7) DeepSeek-VL2 - a new series of vision-language models featuring dynamic tiling for high-resolution images and an efficient MoE architecture; achieves competitive or state-of-the-art performance across visual tasks with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. | Paper, Tweet |
| 8) AutoFeedback - a two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models. | Paper |
| 9) A Survey of Mathematical Reasoning in the Era of Multimodal LLMs - presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021. | Paper, Tweet |
| 10) Precise Length Control in LLMs - adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings, which enables counting down to a user-set response terminal length; claims to achieve mean token errors of less than 3 tokens without compromising quality. | Paper, Tweet |

Top AI Papers of the Week (December 9 - December 15) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) Training LLMs to Reason in a Continuous Latent Space - presents Coconut (Chain of Continuous Thought), a novel paradigm that enables LLMs to reason in continuous latent space rather than natural language; Coconut takes the last hidden state of the LLM as the reasoning state and feeds it back to the LLM as the subsequent input embedding directly in the continuous space; this leads to what the authors refer to as "continuous thought", which augments an LLM's capability on reasoning tasks; it demonstrates improved performance on complex reasoning tasks through emergent breadth-first search capabilities. | Paper, Tweet |
| 2) Phi-4 Technical Report - presents phi-4, a 14B model that surpasses its teacher model on STEM-QA capabilities; it also reports strong performance on reasoning-focused benchmarks due to improved data, training curriculum, and innovations in the post-training scheme. | Paper, Tweet |
| 3) Asynchronous Function Calling - proposes AsyncLM, a system for asynchronous LLM function calling; the authors design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process; AsyncLM can reduce task completion latency by 1.6x-5.4x compared to synchronous function calling; it enables LLMs to generate and execute function calls concurrently. | Paper, Tweet |
| 4) MAG-V - a multi-agent framework that first generates a dataset of questions that mimic customer queries; it then reverse engineers alternate questions from responses to verify agent trajectories; reports that the generated synthetic data can improve agent performance on actual customer queries; finds that for trajectory verification, simple ML baselines with feature engineering can match the performance of more expensive and capable models. | Paper, Tweet |
| 5) Clio - proposes a platform using AI assistants to analyze and surface private aggregated usage patterns from millions of Claude.ai conversations; enables insights into real-world AI use while protecting user privacy; the system helps identify usage trends, safety risks, and coordinated misuse attempts without human reviewers needing to read raw conversations. | Paper, Tweet |
| 6) A Survey on LLMs-as-Judges - presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. | Paper, Tweet |
| 7) AutoReason Improves Multi-step Reasoning - proposes a method to automatically generate rationales for queries using CoT prompting; this transforms zero-shot queries into few-shot reasoning traces which are used as CoT exemplars by the LLM; claims to improve reasoning in weaker LLMs. | Paper, Tweet |
| 8) The Byte Latent Transformer (BLT) - introduces a byte-level language model architecture that matches tokenization-based LLM performance while improving efficiency and robustness; uses a dynamic method of grouping bytes into patches based on the entropy of the next byte, allocating more compute resources to complex predictions while using larger patches for more predictable sequences; BLT demonstrates the ability to match or exceed the performance of models like Llama 3 while using up to 50% fewer FLOPs during inference. | Paper, Tweet |
| 9) Does RLHF Scale? - explores the impacts of key components in the RLHF framework; main findings: 1) RLHF doesn't scale as effectively as pretraining in LLMs, with larger policy models benefiting less from RLHF when using a fixed reward model, 2) when increasing the number of responses sampled per prompt during policy training, performance improves initially but plateaus quickly, typically around 4-8 samples, 3) using larger reward models leads to better performance in reasoning tasks, but the improvements can be inconsistent across different types of tasks, and 4) increasing training data diversity for reward models is more effective than increasing response diversity per prompt, but policy training shows diminishing returns after the early stages regardless of additional data. | Paper, Tweet |
| 10) Granite Guardian - IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs; the authors claim that with AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. | Paper, Tweet |
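
The BLT entry above describes grouping bytes into patches based on the entropy of the next byte. A minimal sketch of that idea follows, assuming a simple rule in which a new patch begins whenever the predicted entropy crosses a fixed threshold. The function name, the threshold rule, and the entropy values are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def entropy_patches(entropies: List[float], threshold: float) -> List[List[int]]:
    """Group byte positions into patches.

    A new patch starts whenever the next-byte entropy at a position
    exceeds `threshold` (hypothetical rule, for illustration only).
    """
    patches: List[List[int]] = []
    current: List[int] = []
    for pos, h in enumerate(entropies):
        # High entropy marks a hard-to-predict byte: close the running
        # patch so that a new patch begins at this position.
        if h > threshold and current:
            patches.append(current)
            current = []
        current.append(pos)
    if current:
        patches.append(current)
    return patches


# Made-up entropy values (in bits) for an 8-byte sequence.
example = [3.1, 0.4, 0.3, 0.2, 2.9, 0.5, 0.1, 3.3]
patches = entropy_patches(example, threshold=2.0)
print(patches)
```

Under this rule, runs of low-entropy (predictable) bytes end up in longer patches, which matches the behavior the summary describes.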

Top AI Papers of the Week (December 2 - December 8) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) OpenAI o1 - a model series trained with large-scale reinforcement learning to reason using chain of thought; o1 shows significant improvements across benchmarks related to math, code, and science; o1 is claimed to be 50% faster in generating thinking steps than o1-preview; results demonstrate that o1 is significantly better at reasoning tasks and produces more comprehensive and reliable responses. | Paper, Tweet |
| 2) Genie 2 - a foundation world model that generates playable 3D environments from single prompt images, enabling endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions; Genie 2 is trained on video data using a combination of autoencoder and transformer for generating virtual worlds; the model can create real-time interactive environments, with a faster but lower-quality version available for immediate play. | Paper, Tweet |
| 3) Reverse Thinking - shows that training LLMs to learn "reverse thinking" helps to improve performance in commonsense, math, and logical reasoning tasks; it claims to outperform a standard fine-tuning method trained on 10x more forward reasoning. | Paper, Tweet |
| 4) ALAMA - a new framework that helps language agents automatically learn when to use different mechanisms (ReAct, CoT, Reflection, etc.) to complete tasks, improving on current approaches that use fixed or predefined mechanisms; the framework adaptively activates the appropriate mechanisms according to the potential characteristics of the task; experimental results demonstrate significant improvements in downstream agent tasks, including mathematical reasoning and knowledge-intensive reasoning. | Paper, Tweet |
| 5) Auto-RAG - an autonomous iterative retrieval model with superior performance across many datasets; Auto-RAG is a fine-tuned LLM that leverages the decision-making capabilities of an LLM; it interacts with the retriever through multi-turn dialogues, systematically planning retrievals and refining queries to acquire valuable knowledge, and repeats this process until sufficient external information is obtained; the authors also show that based on question difficulty, the method can adjust the number of iterations without any human intervention. | Paper, Tweet |
| 6) GenCast - an ML weather prediction model that outperforms the world's leading operational weather forecasting system (ECMWF's ENS) in both accuracy and speed; it generates probabilistic 15-day global weather forecasts for over 80 variables in just 8 minutes, with better skill than ENS on 97.2% of evaluated targets; GenCast produces an ensemble of forecasts that better capture uncertainty and predict extreme weather events, tropical cyclone tracks, and wind power production. | Paper, Tweet |
| 7) Challenges in Human-Agent Communication - presents a comprehensive analysis of key challenges in human-agent communication, focusing on how humans and AI agents can effectively establish common ground and mutual understanding; identifies 12 core challenges across three categories: conveying information from agents to users, enabling users to communicate information to agents, and general communication challenges that affect all interactions. | Paper |
| 8) Retrieval-Augmented Reasoning for LLMs - extends the rStar reasoning framework to enhance reasoning accuracy and factual reliability of LLMs; it leverages a Monte Carlo Tree Search (MCTS) framework with explicit retrieval-augmented reasoning to produce multiple candidate reasoning trajectories; then it leverages a retrieval-augmented factuality scorer to evaluate the factual accuracy of the reasoning trajectories; the trajectory with the highest factuality score is selected as the final answer by the system; on medical reasoning tasks, RARE (which uses Llama 3.1) surpasses larger models such as GPT-4; on commonsense reasoning tasks, RARE outperformed Claude-3.5 Sonnet and GPT-4o-mini, achieving performance competitive with GPT-4o. | Paper, Tweet |
| 9) DataLab - a unified business intelligence platform powered by LLM-based agents that integrates task planning, reasoning, and computational notebooks to streamline the entire BI workflow; the system achieves SOTA performance on research benchmarks and demonstrates significant improvements in accuracy and efficiency on real enterprise data from Tencent; achieves up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks. | Paper, Tweet |
| 10) Procedural Knowledge in Pretraining Drives Reasoning in LLMs - studies which documents in the pretraining data influence model outputs; by looking at the pretraining data, it tries to better understand what kind of generalization strategies LLMs use to perform reasoning tasks; when performing reasoning tasks, it finds that influential documents contain procedural knowledge (e.g., demonstrating how to obtain a solution using formulae or code). | Paper, Tweet |

Top AI Papers of the Week (November 25 - December 1) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) LLMs Surpass Human Experts in Predicting Neuroscience Results - proposes BrainBench to study how good LLMs are at predicting experimental outcomes in neuroscience; BrainGPT, an LLM the authors tuned on neuroscience literature, surpasses experts in predicting neuroscience results; reports that when LLMs indicated high confidence in their predictions, their responses were more likely to be correct. | Paper, Tweet |
| 2) Fugatto - a new generative AI sound model (presented by NVIDIA) that can create and transform any combination of music, voices, and sounds using text and audio inputs; the model has 2.5B parameters and is capable of novel audio generation like making trumpets bark or saxophones meow. | Paper, Tweet |
| 3) o1 Replication Journey - Part 2 - shows that combining simple distillation from o1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks; a base model fine-tuned on just tens of thousands of o1-distilled long-thought chain samples outperforms o1-preview on the American Invitational Mathematics Examination (AIME). | Paper, Tweet |
| 4) LLM-Brained GUI Agents - presents a survey of LLM-brained GUI Agents, including techniques and applications. | Paper, Tweet |
| 5) High-Level Automated Reasoning - extends in-context learning through high-level automated reasoning; achieves state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%); rather than focusing on manually creating high-quality demonstrations, it shifts the focus to abstract thinking patterns; it introduces five atomic reasoning actions to construct chain-structured patterns; then it uses Monte Carlo Tree Search to explore reasoning paths and construct thought cards to guide inference. | Paper, Tweet |
| 6) Star Attention: Efficient LLM Inference over Long Sequences - introduces Star Attention, a two-phase attention mechanism that processes long sequences by combining blockwise-local attention for context encoding with sequence-global attention for query processing and token generation; achieves up to 11x faster inference speeds while maintaining 95-100% accuracy compared to traditional attention mechanisms by efficiently distributing computation across multiple hosts; a key innovation is the "anchor block" mechanism, where each context block is prefixed with the first block, enabling effective approximation of global attention patterns while reducing computational overhead. | Paper, Tweet |
| 7) Survey on LLM-as-a-Judge - provides a comprehensive survey of LLM-as-a-Judge, including a deeper discussion on how to build reliable LLM-as-a-Judge systems. | Paper, Tweet |
| 8) TÜLU 3 - releases a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. | Paper, Tweet |
| 9) Generative Agent Simulations of 1,000 People - introduces a new agent architecture that uses LLMs to create behavioral simulations of real individuals, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional approaches. | Paper, Tweet |
| 10) Measuring Bullshit in Language Games Played by ChatGPT - proposes that LLM-based chatbots play the ‘language game of bullshit’; by asking ChatGPT to generate scientific articles on topics where it has no knowledge or competence, the authors were able to provide a reference set of how this “bullshit” is manifested. | Paper, Tweet |
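
The "anchor block" idea in the Star Attention entry above (each context block is prefixed with the first block) can be illustrated on plain token lists. This is only a sketch of the block layout, not of the attention computation. The function name and the choice to leave the first block unprefixed are assumptions made for illustration.

```python
from typing import List


def anchor_prefixed_blocks(tokens: List[int], block_size: int) -> List[List[int]]:
    """Split `tokens` into blocks and prefix every later block with the first."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    if not blocks:
        return []
    anchor = blocks[0]
    # Assumption: the first block has nothing before it, so it stays unchanged.
    return [anchor] + [anchor + block for block in blocks[1:]]


context = list(range(10))  # stand-in token ids
blocks = anchor_prefixed_blocks(context, block_size=4)
print(blocks)
```

Each prefixed block could then be encoded independently, for example on a separate host, which is the distribution pattern the summary refers to.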

Top AI Papers of the Week (November 18 - November 24) - 2024

| Paper | Links |
| ------------- | ------------- |
| 1) AlphaQubit - a new AI-based decoder that sets a state-of-the-art benchmark for identifying errors in quantum computers; using a transformer architecture, AlphaQubit demonstrated 6% fewer errors than tensor network methods and 30% fewer errors than correlated matching when tested on the Sycamore data; shows promising results in simulations of larger systems up to 241 qubits; while this represents significant progress in quantum error correction, the system still needs improvements in speed before it can correct errors in real-time for practical quantum computing applications. | Paper, Tweet |
| 2) The Dawn of GUI Agent - explores Claude 3.5 computer use capabilities across different domains and software; the authors also provide an out-of-the-box agent framework for deploying API-based GUI automation models; Claude 3.5 Computer Use demonstrates unprecedented ability in end-to-end language to desktop actions. | Paper, Tweet |
| 3) A Statistical Approach to LLM Evaluation - proposes five key statistical recommendations for a more rigorous evaluation of LLM performance differences. The recommendations include: 1) using the Central Limit Theorem to measure theoretical averages across all possible questions rather than just observed averages; 2) clustering standard errors when questions are related rather than independent; 3) reducing variance within questions through resampling or using next-token probabilities; 4) analyzing paired differences between models since questions are shared across evaluations, and 5) using power analysis to determine appropriate sample sizes for detecting meaningful differences between models; the authors argue that these statistical approaches will help researchers better determine whether performance differences between models represent genuine capability gaps or are simply due to chance, leading to more precise and reliable model evaluations. | Paper, Tweet |
| 4) Towards Open Reasoning Models for Open-Ended Solutions - proposes Marco-o1, a reasoning model built for open-ended solutions; Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and more recent reasoning strategies; Marco-o1 achieves accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset. | Paper, Tweet |
| 5) LLM-based Agents for Automated Bug Fixing - analyzes seven leading LLM-based bug fixing systems on the SWE-bench Lite benchmark, finding MarsCode Agent (developed by ByteDance) achieved the highest success rate at 39.33%; reveals that for error localization, line-level fault localization accuracy is more critical than file-level accuracy, and bug reproduction capabilities significantly impact fixing success; shows that 24/168 resolved issues could only be solved using reproduction techniques, though reproduction sometimes misled LLMs when issue descriptions were already clear; concludes that improvements are needed in both LLM reasoning capabilities and Agent workflow design to enhance automated bug fixing effectiveness. | Paper, Tweet |
| 6) Cut Your Losses in Large-Vocabulary Language Models - introduces Cut Cross-Entropy (CCE), a novel method to significantly reduce memory usage during LLM training by optimizing how the cross-entropy loss is computed; currently, the cross-entropy layer in LLM training consumes a disproportionate amount of memory (up to 90% in some models) due to storing logits for all possible vocabulary tokens; CCE addresses this by only computing logits for the correct token and evaluating the log-sum-exp over all logits on the fly using flash memory; the authors show that the approach reduces the memory footprint of the loss computation for Gemma 2 from 24GB to just 1MB; the method leverages the inherent sparsity of softmax calculations to skip elements that contribute negligibly to gradients; finally, it demonstrates that CCE achieves this dramatic memory reduction without sacrificing training speed or convergence, enabling larger batch sizes during training and potentially more efficient scaling of LLM training. | Paper |
| 7) BABY-AIGS - a multi-agent system for automated scientific discovery that emphasizes falsification through automated ablation studies; the system was tested on three ML tasks (data engineering, self-instruct alignment, and language modeling), demonstrating the ability to produce meaningful scientific discoveries; however, its performance is below that of experienced human researchers. | Paper, Tweet |
| 8) Does Prompt Formatting Impact LLM Performance - examines how different prompt formats (plain text, Markdown, JSON, and YAML) affect GPT model performance across various tasks; finds that GPT-3.5-turbo's performance can vary by up to 40% depending on the prompt format, while larger models like GPT-4 show more robustness to format changes; argues that there is no universally optimal format across models or tasks - for instance, GPT-3.5-turbo generally performed better with JSON formats while GPT-4 preferred Markdown; models from the same family showed similar format preferences, but these preferences didn't transfer well between different model families; suggests that prompt formatting significantly impacts model performance and should be carefully considered in prompt engineering, model evaluation, and application development. | Paper |
| 9) FinRobot - an AI agent framework for equity research that uses multi-agent Chain-of-Thought prompting, combining data analysis with human-like reasoning to produce professional investment reports comparable to those of major brokerages; it leverages three agents: a Data-CoT Agent to aggregate diverse data sources for robust financial integration; a Concept-CoT Agent that mimics an analyst's reasoning to generate actionable insights; and a Thesis-CoT Agent to synthesize these insights into a coherent investment thesis and report. | Paper |
| 10) Bi-Mamba - a scalable 1-bit Mamba architecture designed for more efficient LLMs, with model sizes of 780M, 1.3B, and 2.7B; Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16); it significantly reduces memory footprint with better accuracy than post-training binarization Mamba baselines. | Paper |
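
Recommendation 4 in the statistical-evaluation entry above (analyzing paired differences, since both models answer the same questions) can be sketched as follows. The sketch uses a normal-approximation interval on the mean per-question difference. The function name, the `z = 1.96` default, and the scores are illustrative assumptions. It also treats questions as independent, so it does not cover the clustered-standard-error case from recommendation 2.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_difference_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Mean per-question score difference (A minus B) with a CLT-based interval."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 questions")
    # Differences are taken question by question, so variance shared by
    # both models on the same question cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return d_bar, d_bar - z * se, d_bar + z * se


# Made-up 0/1 correctness scores for two models on the same 8 questions.
model_a = [1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 0, 0, 1, 0]
d_bar, low, high = paired_difference_ci(model_a, model_b)
print(f"mean difference {d_bar:.3f}, interval [{low:.3f}, {high:.3f}]")
```

With only a handful of questions the normal approximation is rough; real evaluations would use far larger samples, which is where the power-analysis recommendation comes in.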

Top AI Papers of the Week (November 11 - November 17) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Impacts of AI on Innovation - suggests that top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives; finds that implementing AI materials discovery technology leads to substantial increases in productivity, with 44% more materials discovered, 39% more patent filings, and 17% more product innovation; reports that these gains came with concerning tradeoffs, as 82% of scientists reported reduced job satisfaction due to decreased creativity and skill underutilization. | Paper, Tweet | | 2) Scaling Laws for Precision - introduces "precision-aware" scaling laws that predict how model performance is affected by both training and inference precision in LLMs; key findings include: 1) post-training quantization becomes more harmful as models are trained on more data, eventually making additional pretraining actively detrimental, 2) training in lower precision requires increasing model size to maintain performance, and 3) when jointly optimizing model size, data, and precision, the compute-optimal training precision is around 7-8 bits and independent of compute; also reports that when the model size is fixed, compute-optimal precision increases approximately logarithmically with data; the authors validate their predictions on models up to 1.7B parameters trained on up to 26B tokens, showing that both very high (16-bit) and very low (sub 4-bit) training precisions may be suboptimal. 
| Paper, Tweet | | 3) Evo - a 7B parameter AI model designed to understand and generate DNA sequences across multiple biological scales; the model, trained on 2.7 million prokaryotic and phage genomes, can process sequences up to 131 kilobases long while maintaining single-nucleotide resolution, enabling it to understand both molecular-level interactions and genome-wide patterns; Evo demonstrates superior performance in predicting and generating functional DNA, RNA, and protein sequences, including the first successful AI-generated CRISPR-Cas complexes and transposable systems that have been experimentally validated. | Paper, Tweet | | 4) OpenCoder - introduces OpenCoder, a fully open-source LLM specialized for code generation and understanding; the authors identify several critical factors for building high-performing code LLMs: (1) effective data cleaning with code-optimized heuristic rules for deduplication, (2) recall of relevant text corpus related to code, and (3) high-quality synthetic in both annealing and supervised fine-tuning stages; OpenCoder surpasses previous fully open models at the 6B+ parameter scale and releases not just the model weights but also the complete training pipeline, datasets, and protocols to enable reproducible research. 
| Paper, Tweet | | 5) The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - explores test-time training (TTT) - updating model parameters temporarily during inference - for improving an LLM's abstract reasoning capabilities using the ARC benchmark; identifies three crucial components: initial fine-tuning on similar tasks, auxiliary task format and augmentations, and per-instance training; TTT significantly improves performance, achieving up to 6x improvement in accuracy compared to base fine-tuned models; when applying TTT to an 8B LLM, they achieve 53% accuracy on ARC's public validation set, improving the state-of-the-art for neural approaches by nearly 25%; by ensembling their method with program generation approaches, they achieve state-of-the-art public validation accuracy of 61.9%, matching average human performance; the findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in LLMs; test-time training applied to continued training on few-shot examples can be highly effective. | Paper, Tweet | | 6) A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents - analyzes AgentOps platforms and tools, highlighting the need for comprehensive observability and traceability features to ensure reliability in foundation model-based autonomous agent systems across their development and production lifecycle. 
| Paper, Tweet | | 7) Toward Optimal Search and Retrieval for RAG - examines how retrieval affects performance in RAG pipelines for QA tasks; conducts experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, finding that including more gold (relevant) documents improves QA accuracy; finds that using approximate nearest neighbor search with lower recall only minimally impacts performance while potentially improving speed and memory efficiency; reports that adding noisy or irrelevant documents consistently degrades performance, contradicting previous research claims; concludes that optimizing retrieval of gold documents is crucial for RAG performance, and that operating at lower search accuracy levels can be a viable approach for practical applications. | Paper, Tweet | | 8) Mitigating LLM Jailbreaks with Few Examples - introduces a new approach for defending LLMs against jailbreak attacks, focusing on quickly adapting defenses after detecting new attacks rather than aiming for perfect upfront adversarial robustness; using a new benchmark, the most effective method, based on fine-tuning an input classifier, reduced attack success rates by over 240x for known attack types and 15x for novel variations after seeing just one example of each attack strategy; demonstrates that rapidly responding to new jailbreaks can be an effective alternative to traditional static defenses. | Paper, Tweet | | 9) Mixture of Transformers - introduces Mixture-of-Transformers (MoT), a new sparse multi-modal transformer architecture that matches the performance of traditional models while using only about half the computational resources for text and image processing; MoT matches a dense baseline's performance using only 55.8% of the FLOPs. 
| Paper | | 10) HtmlRAG - a novel approach that proposes using HTML instead of plain text as the format for building RAG systems; the key finding is that preserving HTML structure provides richer semantic and structural information compared to plain text conversion, which typically loses important formatting like headings, tables, and semantic tags; to address the challenge of HTML documents being too long for LLM context windows, the authors develop a two-step pruning method: first cleaning unnecessary HTML elements (reducing length by 94%), then using a block-tree-based pruning approach that combines embedding-based and generative pruning to further reduce the content while maintaining important information; experiments across six different QA datasets demonstrate that HtmlRAG outperforms existing plain-text based methods, validating the advantages of preserving HTML structure in RAG systems. | Paper, Tweet |
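
The first pruning step described for HtmlRAG (cleaning unnecessary HTML elements while keeping the structural tags) can be illustrated with a small sketch. This is not the authors' implementation: the `HtmlCleaner` class and `clean_html` function are hypothetical names, and the rule set (drop scripts, styles, comments, and all attributes) is an assumption chosen for illustration.

```python
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Hypothetical cleaner: keeps tags and text, drops scripts, styles, comments, attributes."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Attributes are discarded on purpose; only the tag name survives.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth == 0 and text:
            self.parts.append(text)

    # Comments are dropped because handle_comment is left as the default no-op.


def clean_html(raw):
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.parts)


raw = (
    '<div class="x" style="color:red"><!-- note -->'
    "<script>var a = 1;</script>"
    '<h1 id="t">Title</h1><p>Body text</p></div>'
)
cleaned = clean_html(raw)
print(cleaned)
```

Headings and paragraphs survive as tags, which is the structural signal the summary says plain-text conversion loses. The second, block-tree-based pruning step is not sketched here.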

Top AI Papers of the Week (November 4 - November 10) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Many-agent Simulations toward AI Civilization - demonstrates how 10-1000+ AI agents behave and progress within agent societies; proposes PIANO, an architecture that enables agents to interact with humans and other agents in real-time; shows that agents can autonomously develop specialized roles, adhere to and change collective rules, and engage in cultural and religious transmission. | Paper, Tweet | | 2) A Comprehensive Survey of Small Language Models - a survey on small language models (SLMs) and discussion on issues related to definitions, applications, enhancements, reliability, and more. | Paper, Tweet | | 3) Magentic-One - a new generalist multi-agent system designed to handle complex web and file-based tasks; it uses an Orchestrator agent that directs four specialized agents: WebSurfer for browser operations, FileSurfer for file management, Coder for programming tasks, and ComputerTerminal for console operations; Magentic-One achieves competitive performance on multiple benchmarks including GAIA, AssistantBench, and WebArena, without requiring modifications to its core architecture. | Paper, Tweet | | 4) Mixtures of In-Context Learners - uses subsets of demonstrations to train experts via in-context learning; given a training set, a trainable weighting function is used to combine the experts' next-token predictions; this approach applies to black-box LLMs since access to the internal parameters of the LLM is not required. Good properties include the following: 1) competitive with standard ICL while being significantly more data, memory, and computationally efficient, and 2) resilient to noisy demonstrations and label imbalance. 
| Paper, Tweet | | 5) Attacking Vision-Language Agents via Pop-ups - shows that integrating adversarial pop-ups into existing agent testing environments leads to an attack success rate of 86%; this decreases the agents' task success rate by 47%; they also add that basic defense techniques (e.g., instructing the agent to ignore pop-ups) are ineffective. | Paper, Tweet | | 6) Multi-expert Prompting with LLMs - improves LLM responses by simulating multiple experts and aggregating their responses; it guides an LLM to fulfill input instructions by simulating multiple experts and selecting the best response among individual and aggregated views; it achieves a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the current SOTA of 87.97%; it also improves performance across factuality and usefulness while reducing toxicity and hurtfulness. | Paper, Tweet | | 7) Number Understanding of LLMs - provides a comprehensive analysis of the numerical understanding and processing ability (NUPA) of LLMs; finds that naive finetuning can improve NUPA a lot on many but not all tasks; it also reports that techniques designed to enhance NUPA prove ineffective for finetuning pretrained models; explores chain-of-thought techniques applied to NUPA and suggests that chain-of-thought methods face scalability challenges, making them difficult to apply in practical scenarios. 
| Paper, Tweet | | 8) WebRL - proposes a self-evolving online curriculum RL framework to bridge the gap between open and proprietary LLM-based web agents; it improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM4-9B; the open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%); the self-evolving curriculum addresses the scarcity of web agent training tasks; this is underpinned by a robust outcome-supervised reward model to evaluate task success; an adaptive RL strategy helps to deal with distribution drift in online learning and ensures consistent improvements. | Paper, Tweet | | 9) Adapting while Learning - proposes a two-part fine-tuning approach that first helps LLMs learn from tool-generated solutions and then trains them to determine when to solve problems directly versus when to use tools; testing on math, climate science, and epidemiology benchmarks shows significant improvements, with a 28% boost in accuracy and 14% better tool usage precision compared to leading models like GPT-4 and Claude-3.5; the two-stage approach helps the LLM to adaptively solve scientific problems of varying complexity. | Paper, Tweet | | 10) Personalization of LLMs - presents a comprehensive framework for understanding personalized LLMs; introduces taxonomies for different aspects of personalization and unifies existing research across personalized text generation and downstream applications. | Paper, Tweet |

Top AI Papers of the Week (October 28 - November 3) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Geometry of Concepts in LLMs - examines the geometric structure of concept representations in sparse autoencoders (SAEs) at three scales: 1) atomic-level parallelogram patterns between related concepts (e.g., man:woman::king:queen), 2) brain-like functional "lobes" for different types of knowledge like math/code, and 3) galaxy-level eigenvalue distributions showing a specialized structure in middle model layers. | Paper, Tweet | | 2) SimpleQA - a challenging benchmark of 4,326 short factual questions adversarially collected against GPT-4 responses; reports that frontier models like GPT-4o and Claude achieve less than 50% accuracy; finds that there is a positive calibration between the model's stated confidence and accuracy, signaling that they have some notion of confidence; claims that there is still room to improve the calibration of LLMs in terms of stated confidence. | Paper, Tweet | | 3) Automating Agentic Workflow Generation - introduces AFlow, a novel framework for automating the generation of agentic workflows; it reformulates workflow optimization as a search problem over code-represented workflows, where edges connect LLM-invoking nodes; it efficiently explores the search space using a variant of MCTS, iteratively refining workflows through code modification, tree-structured experience, and execution feedback; experiments across six benchmark datasets demonstrate AFlow’s effectiveness, showing a 5.7% improvement over manually designed methods and a 19.5% improvement over existing automated approaches; AFlow also enables smaller models to outperform GPT-4o on specific tasks at just 4.55% of its inference cost. 
| Paper, Tweet | | 4) LLMs Solve Math with a Bag of Heuristics - uses causal analysis to find neurons that explain an LLM's behavior when doing basic arithmetic logic; hypothesizes that the combination of heuristic neurons is the mechanism used to produce correct arithmetic answers; finds that the unordered combination of different heuristic types is the mechanism that explains most of the model’s accuracy on arithmetic prompts. | Paper, Tweet | | 5) o1 Replication Journey - reports on an effort to replicate the capabilities of OpenAI's o1 model; their journey learning technique encourages learning not just shortcuts, but the complete exploration process, including trial and error, reflection, and backtracking; claims that with only 327 training samples, their journey learning technique surpassed shortcut learning by 8.0% on the MATH dataset. | Paper, Tweet | | 6) Distinguishing Ignorance from Error in LLM Hallucinations - a method to distinguish between two types of LLM hallucinations: when models lack knowledge (HK-) versus when they hallucinate despite having correct knowledge (HK+); they build model-specific datasets using their proposed approach and show that model-specific datasets are more effective for detecting HK+ hallucinations compared to generic datasets. | Paper, Tweet | | 7) Multimodal RAG - provides a discussion on how to best integrate multimodal models into RAG systems for the industrial domain; it also provides a deep discussion on the evaluation of these systems using LLM-as-a-Judge. | Paper, Tweet | | 8) The Role of Prompting and External Tools in Hallucination Rates of LLMs - tests different prompting strategies and frameworks aimed at reducing hallucinations in LLMs; finds that simpler prompting techniques outperform more complex methods; it reports that LLM agents exhibit higher hallucination rates due to the added complexity of tool usage. 
| Paper, Tweet | | 9) MrT5 - a more efficient variant of byte-level language models that uses a dynamic token deletion mechanism (via a learned delete gate) to shorten sequence lengths by up to 80% while maintaining model performance; this enables faster inference and better handling of multilingual text without traditional tokenization; MrT5 maintains competitive accuracy with ByT5 on downstream tasks such as XNLI and character-level manipulations while improving inference runtimes. | Paper, Tweet | | 10) Relaxed Recursive Transformers - introduces a novel approach, Relaxed Recursive Transformer, that significantly reduces LLM size through parameter sharing across layers while maintaining performance; the model is initialized from standard pretrained Transformers, but only uses a single block of unique layers that is repeated multiple times in a loop; then it adds flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules; shows that the approach has the potential to lead to significant (2-3×) gains in inference throughput. | Paper, Tweet |

Top AI Papers of the Week (October 21 - October 27) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Agentic Information Retrieval - provides an introduction to agentic information retrieval, which is shaped by the capabilities of LLM agents; discusses different types of cutting-edge applications of agentic information retrieval and challenges. | Paper, Tweet | | 2) Aya Expanse - a family of open-weight foundation models for multilingual capabilities; releases 8B and 32B parameter models, including one of the largest multilingual dataset collections to date, with 513 million examples; the release also includes Aya-101, which the authors claim is the most comprehensive multilingual model, covering 101 languages; Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B, a model 2x its size. | Paper, Tweet | | 3) A Theoretical Understanding of CoT - finds that adding correct and incorrect reasoning paths in demonstrations improves the accuracy of intermediate steps and CoT; the proposed method, Coherent CoT, significantly improves performance on several benchmarks; in the Tracking Shuffled Objects dataset, Gemini Pro shows a 6.60% improvement (from 58.20% to 64.80%), and in Penguins in a Table, DeepSeek 67B demonstrates an increase of 6.17% (from 73.97% to 80.14%). | Paper, Tweet | | 4) A Survey on Data Synthesis and Augmentation for LLMs - provides a comprehensive summary of data generation techniques in the lifecycle of LLMs; includes discussions on data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. 
| Paper, Tweet | | 5) LongRAG - enhances RAG's understanding of long-context knowledge which includes global information and factual details; consists of a hybrid retriever, an LLM-augmented information extractor, a CoT-guided filter, and an LLM-augmented generator; these are key components that enable the RAG system to mine global long-context information and effectively identify factual details; LongRAG outperforms long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG (up by 17.25%). | Paper, Tweet | | 6) Evaluating Feature Steering in LLMs - evaluates feature steering in LLMs using an experiment that artificially dials up and down various features to analyze changes in model outputs; it focuses on 29 features related to social biases and studies whether feature steering can help mitigate social biases; among its findings, it reports that feature steering sometimes leads to off-target effects and that a neutrality feature can help decrease social biases in 9 social dimensions without negatively affecting text quality. | Paper, Tweet | | 7) Granite 3.0 - presents lightweight foundation models ranging from 400 million to 8B parameters; supports coding, RAG, reasoning, and function calling, focusing on enterprise use cases, including on-premise and on-device settings; demonstrates strong performance across academic benchmarks for language understanding, reasoning, coding, function calling, and safety. | Paper, Tweet | | 8) LLMs Reflect the Ideology of their Creators - finds that LLMs exhibit diverse ideological stances that reflect the worldview of their creators; finds consistent normative differences between how the same LLM responds in Chinese compared to English; identifies normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. 
| Paper, Tweet | | 9) Scalable Watermarking for LLMs - proposes SynthID-Text, a text-watermarking scheme that can preserve text quality in LLMs, enable high detection accuracy, and minimize latency overhead; it integrates watermarking with speculative sampling that consists of the final pattern of scores for a model’s word choices combined with the adjusted probability scores; the authors test the feasibility and scalability of the approach by assessing feedback on nearly 10 million Gemini responses. | Paper, Tweet | | 10) Reasoning Patterns of OpenAI’s o1 Model - when compared with other test-time compute methods, o1 achieved the best performance across most datasets; the authors observe that the most commonly used reasoning patterns in o1 are divide and conquer and self-refinement; o1 uses different reasoning patterns for different tasks; for commonsense reasoning tasks, o1 tends to use context identification and emphasize constraints; for math and coding tasks, o1 mainly relies on method reuse and divide and conquer. | Paper, Tweet |

Top AI Papers of the Week (October 14 - October 20) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Thinking LLMs - proposes a training method to equip LLMs with thinking abilities for general instruction-following without human-annotated data; uses an iterative search and optimization procedure to explore thought generation which enables the model to learn without direct supervision; thought candidates for each user instruction are scored with a judge model; only responses are evaluated by the judge, which determines the best and worst ones; then the corresponding full outputs are used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in this paper); reports superior performance on AlpacaEval and Arena-Hard. | Paper, Tweet | | 2) Model Swarms - proposes a new collaborative search algorithm to adapt LLMs via swarm intelligence; a pool of LLM experts collaboratively move in the weight space and optimize a utility function representing various adaptation objectives; experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests; improves over 12 model composition baselines by up to 21.0% across tasks and contexts. 
| Paper, Tweet | | 3) First-Person Fairness in Chatbots - studies first-person fairness which involves fairness towards users interacting with ChatGPT; specifically, it measures the biases, if any, towards the users’ names; it leverages a model powered by GPT-4o to analyze patterns and name-sensitivity in the chatbot’s responses for different user names; claims that, overall, post-training significantly mitigates harmful stereotypes; also reports that open-ended tasks in domains like entertainment and art show the highest level of bias (i.e., a tendency to write stories with protagonists whose gender matches the gender inferred from the user’s name). | Paper, Tweet | | 4) Introspection in LLMs - reports that LLMs can acquire knowledge through introspection that cannot be inferred from their training data; suggests that LLMs contain privileged information about themselves that can potentially lead to more interpretable and controllable systems; they report that this introspection ability is limited and models struggle to predict their behavior on tasks requiring reasoning over long outputs. | Paper, Tweet | | 5) Janus - proposes a unified autoregressive framework for multimodal understanding and generation; it decouples visual encoding into independent pathways and leverages a single transformer architecture to improve flexibility and performance on both visual understanding and generation; claims to alleviate trade-offs related to performing the vision tasks, something common in methods that rely on a single visual encoder; surpasses previous unified models and matches or exceeds the performance of task-specific models. 
| Paper, Tweet | | 6) Inference Scaling for Long-Context RAG - uses two strategies to investigate scaling laws for RAG: in-context learning (DRAG) and iterative prompting (IterRAG); finds that RAG performance consistently improves with the expansion of the effective context length under optimal configurations; when optimally allocated, increasing inference computation can lead to linear gains in long-context RAG performance; this leads to the development of a computation allocation model that can provide practical guidance for optimal computation allocation in long-context RAG scenarios. | Paper, Tweet | | 7) Agent S - a new open agentic framework that enables autonomous interaction with computers through a GUI; Agent S tackles challenges such as acquiring knowledge, planning over long task horizons, and handling dynamic interfaces; it introduces experience-augmented hierarchical planning which leverages both search and retrieval; leverages an agent-computer interface to perform reasoning and control GUI agents; evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37 percentage points in success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. | Paper, Tweet | | 8) Model Kinship for Merging LLMs - proposes model kinship to measure the degree of similarity between LLMs; model kinship is used to build a model merging strategy (Top-k Greedy Merging with Model Kinship) which yields better performance; the authors find that this new criterion can be used to effectively and continuously perform model merging. | Paper, Tweet | | 9) On the Planning Abilities of OpenAI’s o1 Models - reports that o1-preview is particularly strong in self-evaluation and constraint-following; also mentions that these o1 models demonstrate bottlenecks in decision-making and memory management, which are more pronounced in spatial reasoning; in particular, the models produce redundant actions and struggle to generalize in spatially complex tasks. 
| Paper, Tweet | | 10) CoTracker3 - proposes a new point tracking model and a new semi-supervised training recipe; enables usage of real videos without annotations during training by generating pseudo-labels using off-the-shelf teachers; the approach is simpler in architecture and training scheme leading to better results while using 1000x less data. | Paper, Tweet |
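
The Agent S entry in the table above quotes both an absolute gain (9.37) and a relative gain (83.6%) in success rate. As a small worked example, the two figures together pin down the implied baseline and final success rates, assuming the relative gain is measured against the baseline. The variable names are illustrative only.

```python
# Figures quoted in the Agent S summary.
absolute_gain = 9.37    # percentage points of success rate
relative_gain = 0.836   # 83.6% relative improvement over the baseline

# relative_gain = absolute_gain / baseline, so the baseline can be recovered.
baseline = absolute_gain / relative_gain
agent_s = baseline + absolute_gain

print(f"implied baseline success rate: {baseline:.2f}%")
print(f"implied Agent S success rate:  {agent_s:.2f}%")
```

The distinction matters when reading the other entries: a gain reported as a percentage can be either points added to the score or a ratio relative to the starting score, and the two differ most when the baseline is low.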

Top AI Papers of the Week (October 7 - October 13) - 2024

| Paper | Links | | ------------- | ------------- | | 1) MLE-Bench - proposes a new benchmark for the evaluation of machine learning agents on machine learning engineering capabilities; includes 75 ML engineering-related competitions from Kaggle testing MLE skills such as training models, preparing datasets, and running experiments; OpenAI’s o1-preview with the AIDE scaffolding achieves Kaggle bronze medal level in 16.9% of competitions. | Paper, Tweet | | 2) Differential Transformer - proposes a differential attention mechanism that amplifies attention to the relevant context while canceling noise; Differential Transformer outperforms Transformer when scaling up model size and training tokens; the authors claim that since this architecture gets less "distracted" by irrelevant context, it can do well in applications such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. | Paper, Tweet | | 3) Astute RAG - proposes a novel RAG approach to deal with the imperfect retrieval augmentation and knowledge conflicts of LLMs; Astute RAG adaptively elicits essential information from LLMs' internal knowledge; then it iteratively consolidates internal and external knowledge with source awareness; Astute RAG is designed to better combine internal and external information through an interactive consolidation mechanism (i.e., identifying consistent passages, detecting conflicting information in them, and filtering out irrelevant information). | Paper, Tweet | | 4) ToolGen - integrates tool knowledge directly into LLMs by representing each tool as a unique token, which allows the LLM to generate tool calls and arguments, enabling seamless tool invocation and language generation; experimental results with over 47,000 tools show that ToolGen achieves superior results in both tool retrieval and autonomous task completion. 
| Paper, Tweet | | 5) Long-Context LLMs Meet RAG - finds that for many long-context LLMs, the quality of outputs declines as the number of passages increases; reports that the performance loss is due to retrieved hard negatives; they propose two ways to improve long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to help with relevance identification; these approaches demonstrate significant accuracy and robustness improvements on long-context RAG performance. | Paper, Tweet | | 6) GSM-Symbolic - tests several SoTA models on a benchmark created with symbolic templates that enable diverse mathematical problems; they find that LLMs exhibit variance when responding to variations of the same questions; the performance of all the models declines when only the numerical values in the question are changed; as questions are made more challenging (e.g., increasing the number of clauses) the performance significantly deteriorates; the authors hypothesize that the observed decline in performance is due to a lack of logical reasoning in current LLMs. | Paper, Tweet | | 7) Optima - a novel framework to enhance both communication efficiency and task effectiveness in LLM-based multi-agent systems through LLM training; proposes an iterative generate, rank, select, and train paradigm with a reward function to improve performance, token use, and communication efficiency; integrates Monte Carlo Tree Search-inspired techniques for DPO data generation to encourage diverse exploration; shows consistent improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, with 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. 
| Paper, Tweet | | 8) ScienceAgentBench - a new benchmark to rigorously assess agents built for scientific workflows; after testing it on open-weight and proprietary LLMs, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. | Paper, Tweet | | 9) Addition Is All You Need - proposes an algorithm that approximates floating point multiplication with integer addition operations; it is less computationally intensive than 8-bit floating point but achieves higher precision; the authors report that applying the proposed L-Mul operation in tensor processing hardware can potentially reduce 95% of the energy cost of elementwise floating point tensor multiplications and 80% of the energy cost of dot products. | Paper, Tweet | | 10) Persuasion and Anti-social Ability of LLMs - studies the interaction patterns of LLMs in a multi-agent setting with social hierarchy; the study was done in a specific setting involving a guard and a prisoner who seeks additional yard time or to escape from prison; finds that in the multi-agent setting where power dynamics are involved, the LLMs fail to have a conversation; they also report that agents' personas are critical in driving the behaviors of the agents. In addition, and without explicit prompting, simply assigning agents' roles leads to anti-social behavior. | Paper, Tweet |
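
The Differential Transformer entry in the table above describes attention that amplifies relevant context while canceling noise. A minimal single-query sketch of that idea takes the difference of two softmax attention maps. Assumptions to flag: the function names are hypothetical, `lam` is a fixed constant here (a learnable scalar in the paper), and the per-head normalization the paper applies afterwards is omitted.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def differential_attention(q1, q2, keys1, keys2, values, lam):
    """Difference of two softmax attention maps applied to the values (one query)."""
    scale = 1.0 / math.sqrt(len(q1))
    map1 = softmax([dot(q1, k) * scale for k in keys1])
    map2 = softmax([dot(q2, k) * scale for k in keys2])
    # Attention mass that both maps assign to the same position cancels out.
    weights = [a - lam * b for a, b in zip(map1, map2)]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Toy inputs: three positions, two-dimensional queries, keys, and values.
q1, q2 = [1.0, 0.0], [0.0, 1.0]
keys1 = [[2.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys2 = [[0.0, 0.1], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, output = differential_attention(q1, q2, keys1, keys2, values, lam=0.5)
print(weights, output)
```

With `lam=0` the function reduces to ordinary scaled dot-product attention for one query, which makes the subtraction term easy to isolate when experimenting.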

Top AI Papers of the Week (September 30 - October 6) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Movie Gen - a set of foundation models to generate high-quality, 1080p HD videos, including different aspect ratios and synchronized audio; the 30B parameter model supports a context length of 73K video tokens, which enables generation of 16-second videos at 16fps; it also presents a 13B parameter video-to-audio generation model and a novel video editing model that’s attained via post-training; achieves state-of-the-art performance on tasks such as text-to-video synthesis, video personalization, video-to-audio generation and more. | Paper, Tweet | | 2) Were RNNs All We Needed? - revisits RNNs and shows that by removing hidden-state dependencies from the input, forget, and update gates, RNNs can be efficiently trained in parallel; this is possible because with this change architectures like LSTMs and GRUs no longer require backpropagation through time (BPTT); they introduce minLSTMs and minGRUs that are 175x faster for a 512 sequence length. | Paper, Tweet | | 3) LLMs Know More Than They Show - finds that the "truthfulness" information in LLMs is concentrated in specific tokens; this insight can help enhance error detection performance and further mitigate some of these issues; they also claim that internal representations can be used to predict the types of errors the LLMs are likely to make. | Paper, Tweet | | 4) Architecture Search Framework for Inference-Time Techniques - introduces Archon, a modular framework for building and optimizing LLMs by combining multiple inference-time techniques; this approach reframes the challenge of LLM system design as a hyperparameter optimization problem; tested on benchmarks including MT-Bench and CodeContests, Archon surpasses leading models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement. 
| Paper, Tweet | | 5) RATIONALYST - a model for process-supervision of reasoning that enables generalization across diverse reasoning tasks; this process is achieved with pre-training on a collection of 79k rationales from the Pile and a combination of reasoning datasets with minimal human intervention; fine-tuned from LLaMa-3-8B, the proposed model improves the accuracy of reasoning by an average of 3.9% on 7 reasoning benchmarks. | Paper | | 6) An Analysis of o1-preview - reports that large reasoning models like o1-preview, while improving on more difficult tasks, display similar qualitative trends as previous LLMs; o1 is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. | Paper, Tweet | | 7) FRAMES - a unified framework to evaluate an LLM’s ability to provide factual responses, assess retrieval capabilities, and the reasoning required to generate final responses; includes multi-hop questions that require the integration of information from multiple sources; reports that state-of-the-art LLMs struggle on the task and only achieve 40% accuracy with no retrieval; the proposed multi-step retrieval approach improves performance to 66% accuracy. | Paper, Tweet | | 8) Not All LLM Reasoners Are Created Equal - investigates in depth the grade-school math problem-solving capabilities of LLMs; reports that LLMs show a significant gap in reasoning; finds that LLMs display a huge performance difference when solving compositional pairs and solving questions independently. 
| Paper, Tweet | | 9) Evaluation of o1 - provides a comprehensive evaluation of OpenAI's o1-preview LLM; shows strong performance across many tasks such as competitive programming, generating coherent and accurate radiology reports, high school-level mathematical reasoning tasks, chip design tasks, anthropology and geology, quantitative investing, social media analysis, and many other domains and problems. | Paper, Tweet | | 10) Designing Priors for Better Few-Shot Image Synthesis - training generative models like GANs with limited data is difficult; current Implicit Maximum Likelihood Estimation (IMLE) approaches have an inadequate correspondence between the latent codes selected for training and those selected during inference; the proposed approach, RS-IMLE, changes the prior distribution for training which improves test-time performance and leads to higher quality image generation. | Paper, Tweet |
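
The "Were RNNs All We Needed?" entry in the table above says that gates without hidden-state inputs allow parallel training. A scalar sketch of why: once the gate depends only on the input, each step is an affine map `h -> a*h + b`, and affine maps compose associatively, so all prefixes can be computed with a parallel scan. Assumptions to flag: the function names and scalar weights `w_z` and `w_h` are illustrative, and the scan below runs its rounds in plain Python rather than on parallel hardware or in the log-space form used in practice.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def min_gru_sequential(xs, w_z, w_h, h0=0.0):
    """Reference loop: h_t = (1 - z_t) * h_{t-1} + z_t * (w_h * x_t)."""
    h, states = h0, []
    for x in xs:
        z = sigmoid(w_z * x)  # the gate sees only the input, never h
        h = (1.0 - z) * h + z * (w_h * x)
        states.append(h)
    return states


def combine(first, second):
    """Compose two affine maps h -> a*h + b, applying `first` before `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def min_gru_scan(xs, w_z, w_h, h0=0.0):
    """Same recurrence via an inclusive Hillis-Steele scan over affine maps."""
    elems = []
    for x in xs:
        z = sigmoid(w_z * x)
        elems.append((1.0 - z, z * w_h * x))
    step, n = 1, len(elems)
    while step < n:
        prev = elems
        # Every position in a round is independent, which is what a GPU would exploit.
        elems = [
            combine(prev[i - step], prev[i]) if i >= step else prev[i]
            for i in range(n)
        ]
        step *= 2
    return [a * h0 + b for a, b in elems]


xs = [0.5, -1.0, 2.0, 0.3, -0.7]
seq = min_gru_sequential(xs, w_z=0.8, w_h=1.5, h0=0.2)
par = min_gru_scan(xs, w_z=0.8, w_h=1.5, h0=0.2)
print(seq)
print(par)
```

The scan needs about log2(n) rounds instead of n sequential steps. If the gate depended on the previous hidden state, as in a standard GRU, the per-step maps would not be known in advance and this rewrite would not apply.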

Top AI Papers of the Week (September 23 - September 29) - 2024

| Paper | Links | | ------------- | ------------- | | 1) Llama 3.2 - presents small and medium-sized vision LLMs (11B and 90B parameters), and lightweight, text-only models (1B and 3B); the text-only models are trained to support context length of 128K tokens and outperform other models in their class on a range of tasks; vision models exceed other models such as Claude 3 Haiku on image understanding tasks. | Paper, Tweet | | 2) *Molmo* - presents a family of open, state-of-the-art multimodal AI models; the 72B model in the Molmo family outperforms others in the class of open weight and data models; it also compares favorably against proprietary models like GPT-4o, Claude 3.5, and Gemini 1.5 on several benchmarks. | Paper, Tweet | | 3) AlphaChip - a reinforcement learning-based method trained to design the physical layout of chips; AlphaChip is reportedly used in three additional generations of Google’s TPU; this release includes an open-source implementation of the method to help pre-train on a variety of chip blocks to apply to new blocks; also releases a model checkpoint pre-trained on 20 TPU blocks. | Paper, Tweet | | 4) LLMs Still Can’t Plan - evaluates whether large reasoning models such as o1 can plan; finds that a domain-independent planner can solve all instances of Mystery Blocksworld but LLMs struggle, even on small instances; o1-preview is effective on the task but tends to degrade in performance as plan length increases; concludes that while o1 shows progress on more challenging planning problems, the accuracy gains cannot be considered general or robust. 
| Paper, Tweet | | 5) Scaled-up Instructable Model Become Less Reliable - suggests that larger and more instructable LLMs may become less reliable; investigates LLMs across three elements: difficulty concordance, task avoidance, and prompting stability; finds that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. | Paper, Tweet | | 6) Logic-of-Thought - proposes a new prompting technique called Logic-of-Thought (LoT) which employs propositional logic to generate and inject expanded logical information from the input context; it enhances CoT performance on the ReClor dataset by +4.35%; it improves CoT+SelfConsistency’s performance on LogiQA by +5%; it also boosts the performance of ToT on the ProofWriter dataset by +8%. | Paper, Tweet | | 7) RAG and Beyond - presents a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task; summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them. | Paper, Tweet | | 8) A Preliminary Study of o1 in Medicine - provides a preliminary exploration of the o1-preview model in medical scenarios; shows that o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios; identifies hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. 
| Paper, Tweet | | 9) Small Language Models Survey - a comprehensive survey on small language models (SLMs) across architectures, training datasets, and training algorithms; analyzes 59 state-of-the-art open-source SLMs and capabilities such as reasoning, in-context learning, maths, and coding; other discussions include on-device runtime costs, latency, memory footprint, and valuable insights. | Paper, Tweet | | 10) Minstrel - a multi-generative agent system with reflection capabilities to automate structural prompt generation; it presents LangGPT, an extensible framework for designing prompts; Minstrel is built on top of LangGPT and experiments demonstrate that structural prompts (either generated by Minstrel or written manually) perform better in guiding LLMs to perform tasks. | Paper, Tweet |

Top AI Papers of the Week (September 16 - September 22) - 2024

Paper Links
1) Moshi - introduces a speech-text foundation model and full-duplex spoken dialogue framework; they present several components of the system; Helium is a 7B parameter text LLM; Mimi is a semantic-acoustic neural audio codec with state-of-the-art performance on audio quality; it also uses a hierarchical multi-stream architecture that can generate arbitrary conversation in a speech-to-speech manner. Paper, Tweet
2) Training LLMs to Self-Correct via RL - develops a multi-turn online reinforcement learning approach to improve the capabilities of an LLM to self-correct; it’s based entirely on self-generated data; SFT is shown to be ineffective at learning self-correction and suffers from distribution mismatch between training data and model responses; proposes a two-stage approach that first optimizes correction behavior and then uses a reward bonus to amplify self-correction during training; when applied to Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks. Paper, Tweet
3) Qwen2.5 Coder - a series of models including 1.5B and 7B parameters; it’s built upon the Qwen2.5 architecture which is continuously pretrained on 5.5 trillion tokens; achieves state-of-the-art performance across more than 10 benchmarks; includes strong capabilities in code generation, completion, reasoning, and repairing. Paper, Tweet
4) Diagram of Thought (DoT) - enhances the reasoning capabilities of LLMs through mathematical rigor; DoT models iterative reasoning in an LLM as the construction of a directed acyclic graph; it integrates propositions, critiques, refinement, and verification into a unified DAG structure; this allows DoT to capture complex logical deduction beyond linear or tree-based approaches. Paper, Tweet
5) Agents in Software Engineering - provides a comprehensive overview of frameworks of LLM-based agents in software engineering. Paper, Tweet
6) To CoT or not to CoT? - investigates what kinds of tasks benefit the most from chain-of-thought (CoT) prompting; after a meta-analysis on 100+ papers and several evaluations, it finds that CoT produces strong performance benefits primarily on tasks involving math and logic; they find that most of the CoT gain comes from improving symbolic execution, but a symbolic solver outperforms it. Paper, Tweet
7) A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs - evaluates the performance of instruction-tuned LLMs across various quantization methods on models ranging from 7B to 405B; the key findings are 1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, 2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models, and 3) task difficulty does not significantly impact accuracy degradation due to quantization. Paper, Tweet
8) Iteration of Thought - proposes the Iteration of Thought (IoT) framework to enhance LLM responses and reasoning capabilities with adaptive reasoning paths; it leverages an inner dialogue agent, acting as a guide, to dynamically adjust reasoning paths, which allows adaptive cross-path exploration and enhances response accuracy; it's different from CoT and ToT (both rigid processes) in that its prompt generation is a dynamic process that allows it to adapt. Paper, Tweet
9) Schrodinger’s Memory - uses the Universal Approximation Theorem to explain the memory mechanism of LLMs. It also proposes a new approach to evaluate LLM performance by comparing the memory capacities of different models; the Transformer architecture functions as a dynamic fitting UAT model, with a strong ability to adaptively fit inputs; this enables LLMs to recall entire content based on minimal input information. Paper, Tweet
10) Math Jailbreaking Prompts - uses GPT-4o to generate mathematically encoded prompts that serve as an effective jailbreaking technique; shows an average attack success rate of 73.6% across 13 state-of-the-art LLMs; this highlights the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Paper, Tweet
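
Item 7 above compares a quantized larger model with a smaller FP16 model of similar size. As a rough illustration of what "similar size" means, the sketch below estimates weight storage from parameter count and bit-width. The function names and the 70B example are hypothetical and not taken from the paper, and the estimate ignores quantization metadata (scales, zero points), activations, and the KV cache.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: every weight is stored at the same bit-width and no
    per-group scales or zero points are counted.
    """
    return n_params * bits_per_weight / 8 / 1e9


def equivalent_fp16_params(n_params: float, bits_per_weight: float) -> float:
    """Parameter count of an FP16 model with the same weight footprint."""
    return n_params * bits_per_weight / 16


# Hypothetical example: a 70B-parameter model quantized to 4 bits per weight.
quantized_gb = weight_footprint_gb(70e9, 4)
fp16_match = equivalent_fp16_params(70e9, 4)
print(f"70B @ 4-bit weights: about {quantized_gb:.1f} GB")
print(f"Same footprint as an FP16 model with about {fp16_match / 1e9:.1f}B parameters")
```

Under these assumptions, the paper's first finding amounts to comparing the two models on that last line: the quantized larger model versus an FP16 model that fits in the same memory.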

Top AI Papers of the Week (September 9 - September 15) - 2024

Paper Links
1) Learning to Reason with LLMs - a new family of LLMs trained with reinforcement learning to reason before responding to complex tasks; it produces a long internal chain of thought and excels in science, code, and math-related tasks; ranked in the 49th percentile in the 2024 International Olympiad in Informatics and exceeds human PhD-level accuracy on science-related benchmarks. Paper, Tweet
2) Chai-1 - a new multi-modal foundation model for molecular structure prediction that can predict proteins, small molecules, DNA, RNA, and more; it achieves state-of-the-art results on a variety of tasks in drug discovery; achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold 3), as well as a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B). Paper, Tweet
3) Can LLMs Generate Novel Research Ideas - finds that LLM-generated research ideas are judged as more novel (p < 0.05) than human expert ideas; however, they were rated slightly weaker in terms of feasibility; they also report that LLM agents lack diversity in the idea generation process and are not reliable evaluators. Paper, Tweet
4) DataGemma - includes a series of fine-tuned Gemma 2 models to help LLMs access and incorporate numerical and statistical data; proposes a new approach called Retrieval Interleaved Generation (RIG) which can reliably incorporate public statistical data from Data Commons into LLM responses; RIG is a tool-inspired approach that can interleave statistical tokens with natural language questions suitable for retrieval from Data Commons; to attain such capability, they fine-tune the LLM on an instruction-response dataset generated with the help of Gemini 1.5; the RIG approach improves factuality from 5-7% to about 58%. Paper, Tweet
5) Agent Workflow Memory - introduces Agent Workflow Memory to induce commonly reused workflows and provide these to the agent on demand; works offline and online and is meant to guide the agent's subsequent generations; it’s inspired by how humans learn reusable workflows from past experiences and use them to guide future actions; claims to substantially improve the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while doing it in a more efficient way. Paper, Tweet
6) The Role of Small Language Models in the LLM Era - closely examines the relationship between LLMs and SLMs; common applications of SLMs include data curation, training stronger models, efficient inference, evaluators, retrievers, and much more; includes insights for practitioners to better understand the value of these SLMs. Paper, Tweet
7) LLaMa-Omni - a model architecture for low-latency speech interaction with LLMs; it is based on Llama-3.1-8B-Instruct and can simultaneously generate both text and speech responses given speech instructions; responses can be generated with a response latency as low as 226ms; architecture-wise, it involves a speech encoder (Whisper-large-v3), a speech adaptor, an LLM, and a speech decoder; they also created a dataset of 200K speech interactions and responses. Paper, Tweet
8) Can LLMs Unlock Novel Scientific Research Ideas - investigates whether LLMs can generate novel scientific research ideas; reports that Claude and GPT models tend to align more with the author's perspectives on future research ideas; this is measured across different domains like science, economics, and medicine. Paper, Tweet
9) Theory, Analysis, and Best Practices for Sigmoid Self-Attention - proposes Flash-Sigmoid, a hardware-aware and memory-efficient implementation of sigmoid attention; it yields up to a 17% inference kernel speed-up over FlashAttention-2 on H100 GPUs; shows that SigmoidAttn matches SoftmaxAttn in various tasks and domains. Paper, Tweet
10) Achieving Peak Performance for LLMs - a systematic review of methods for improving and speeding up LLMs from three points of view: training, inference, and system serving; summarizes the latest optimization and acceleration strategies around training, hardware, scalability, and reliability. Paper, Tweet
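
To make the contrast in item 9 concrete, here is a minimal pure-Python sketch of single-head attention with either softmax or element-wise sigmoid weights. It is not the Flash-Sigmoid kernel and makes no claim about the paper's implementation; the `attention` function, its argument names, and the bias value used in the usage lines are illustrative assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention(q, k, v, kind="softmax", bias=0.0):
    """Single-head attention over lists of row vectors.

    q and k are n x d, v is n x d_v. With kind="softmax" each row of
    weights is normalized to sum to 1; with kind="sigmoid" every weight
    is computed independently as sigmoid(score + bias), so rows are not
    normalized. The bias is a free parameter in this sketch.
    """
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        if kind == "softmax":
            m = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - m) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
        elif kind == "sigmoid":
            weights = [sigmoid(s + bias) for s in scores]
        else:
            raise ValueError(f"unknown kind: {kind}")
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Tiny hypothetical input: zero queries give identical scores for every key.
q = [[0.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
soft = attention(q, k, v, kind="softmax")
sig = attention(q, k, v, kind="sigmoid", bias=-math.log(len(k)))
```

Because sigmoid weights need no row-wise normalization, each weight can be computed without a pass over the whole row, which is the kind of property a hardware-aware kernel can exploit. The bias of `-log(n)` in the usage lines is an assumption chosen so that the weights start small when there are many keys.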

Top AI Papers of the Week (September 2 - September 8) - 2024

Paper Links
1) AlphaProteo - presents a family of ML models trained for protein design; reports 3- to 300-fold better binding affinities and higher experimental success rates compared to other existing methods on seven target proteins; shows that AlphaProteo’s performance on hundreds of target proteins from the PDB is comparable to the seven targets. Paper, Tweet
2) RAG in the Era of Long-Context LLMs - reports that longer-context LLMs suffer from a diminished focus on relevant information, which is one of the primary issues that a RAG system addresses (i.e., uses more relevant information); they propose an order-preserving RAG mechanism that improves performance on long-context question answering; it's not perfect: as the number of retrieved chunks increases, response quality first goes up and then declines; they mention a sweet spot where it can achieve better quality with a lot fewer tokens than long-context LLMs. Paper, Tweet
3) Strategic Chain-of-Thought - a method to refine LLM performance by incorporating strategic knowledge before the intermediate CoT reasoning steps; the problem-solving strategy helps to guide the generation of the CoT paths and final answers; claims to achieve a 21.05% increase on the GSM8K dataset using the Llama3-8b model. Paper
4) Effects of AI on High Skilled Work - studies the impact of generative AI on software developers; reveals a 26.08% increase in the number of completed tasks among the developers that use AI tools like GitHub Copilot; also shows that less experienced developers are more likely to adopt the AI tools and have greater productivity gains. Paper, Tweet
5) OLMoE - introduces a fully-open LLM that leverages sparse Mixture-of-Experts. OLMoE is a 7B parameter model and uses 1B active parameters per input token; there is also an instruction-tuned version that claims to outperform Llama-2-13B-Chat and DeepSeekMoE 16B. Paper, Tweet
6) LongCite - synthesizes a large-scale SFT dataset with off-the-shelf LLMs to improve long-context question answering with citations; it trains 8B and 9B parameter models that enhance citation generation capabilities from lengthy contexts while improving response correctness; claims to even surpass GPT-4o on their proposed LongBench-Cite benchmark. Paper, Tweet
7) MemLong - utilizes an external retriever for retrieving historical information which enhances the capabilities of long-context LLMs; it consistently outperforms other SoTA LLMs on long-context benchmarks and can extend the context length on a single 3090 GPU from 4k up to 80k. Paper, Tweet
8) Role of RAG Noise in LLMs - proposes a benchmark (NoiserBench) to measure how different kinds of noisy information affect RAG's performance; reports that from different kinds of beneficial noise studied (e.g., semantic, datatype, and illegal sentence), illegal sentence noise exhibits the most improved model performance across models and datasets. Paper, Tweet
9) Beyond Preference in AI Alignment - challenges the dominant practice of AI alignment known as human preference tuning; explains in what ways human preference tuning fails to capture the thick semantic content of human values; argues that AI alignment needs reframing: instead of aligning on human preferences, AI should align on normative standards appropriate to their social roles. Paper, Tweet
10) LLM-Based Agents for Software Engineering - a survey paper on LLM-based agents for software engineering, covering perspectives ranging from requirement engineering to test generation to software maintenance. Paper, Tweet
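
The order-preserving idea in item 2 can be sketched in a few lines: pick the top-k chunks by retrieval score, then put them back in their original document order before building the prompt. This is a minimal illustration under assumed inputs, not the paper's code; the function name, chunk labels, and scores are hypothetical.

```python
def order_preserving_top_k(chunks, scores, k):
    """Select the k highest-scoring chunks, returned in document order.

    `chunks` is assumed to be listed in original document order and
    `scores` holds one retrieval score per chunk (higher is better).
    """
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(top)]


# Hypothetical chunks and similarity scores.
chunks = ["c0", "c1", "c2", "c3"]
scores = [0.2, 0.9, 0.1, 0.7]
selected = order_preserving_top_k(chunks, scores, k=3)
```

A score-ordered retriever would hand back `c1, c3, c0` here; the order-preserving variant keeps the same three chunks but restores their position in the source. Sweeping `k` is how one would look for the sweet spot the entry mentions.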

Top AI Papers of the Week (August 26 - September 1) - 2024

Paper Links
1) GameNGen - a game engine powered by a diffusion model that enables real-time interaction with complex environments over long trajectories; uses a two-phase training process involving an RL agent that learns to play the game and a diffusion model that generates frames; it can interactively simulate DOOM at over 20 fps on a single TPU. Paper, Tweet
2) Agentic RAG for Time Series Analysis - proposes an agentic RAG framework for time series analysis; uses a multi-agent architecture where an agent orchestrates specialized sub-agents to complete time-series tasks; the sub-agents leverage tuned small language models and can retrieve relevant prompts containing knowledge about historical patterns and trends; this helps to improve predictions on new data. Paper, Tweet
3) AutoGen Studio - a low-code interface for rapidly prototyping AI agents. It's built on top of the AutoGen framework and can also be used for debugging and evaluating multi-agent workflows. Paper, Tweet
4) Persuasion Games with LLMs - claims that a multi-agent framework can be used to improve the persuasive efficacy of LLMs; the primary agent engages in persuasive dialogue while auxiliary agents perform key tasks like response analysis and information retrieval; finds that LLMs are capable of creating a perspective change in the users and persuading them to make a purchase decision; for instance, Sales agents can achieve a 71% positive shift in user perspectives. Paper, Tweet
5) Smaller, Weaker, Yet Better - finds that weaker + cheaper (WC) models can generate better synthetic data for fine-tuning models compared to data generated with stronger but more expensive models; overall, results suggest that WC models may be a compute-optimal approach for training advanced LLM reasoners. Paper, Tweet
6) Transfusion - presents a training recipe to train multi-modal models over discrete and continuous data; combines next token prediction with diffusion to train transformer models over mixed-modality sequences; shows that it’s possible to scale to 7B parameter models trained on 2T multi-modal tokens that can compete in performance with similar scale diffusion and language models. Paper, Tweet
7) ReMamba - investigates the long-context capabilities and efficiencies of Mamba models; the long-context deficiency issues are due to Mamba's RNN-like nature; ReMamba addresses this by condensing information with the following compression strategy: it selects the top-k hidden states during the first forward pass and leverages Mamba’s selective mechanism to incorporate them into the state space during the second forward pass; achieves a 3.2-point improvement over the baseline on LongBench and a 1.6-point improvement on L-Eval; the strategy also seems to transfer to Mamba 2. Paper, Tweet
8) Text2SQL is Not Enough - proposes Table-Augmented Generation (TAG), a unified framework for answering natural language questions over databases; it represents a wider range of unexplored interactions between LLMs and databases; develops a benchmark and finds that standard methods answer no more than 20% of queries correctly. Paper, Tweet
9) Foundation Models for Music - provides a comprehensive overview of state-of-the-art pre-trained models and foundation models in music. Paper, Tweet
10) Guide to Continual Multimodal Pretraining - a comprehensive guide on continual multimodal pretraining; introduces FoMo-In-Flux, a large-scale, fine-grained, and long-horizon continual pretraining benchmark. Paper, Tweet

Top AI Papers of the Week (August 19 - August 25) - 2024

Paper Links
1) Automated Design of Agentic Systems - presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries; claims that with their approach it is possible to learn any possible agentic system including prompts, tool use, control flows, and more; they achieve this by focusing on three main components referred to as search space (define agents), search algorithm (explore search space), and the evaluation function (evaluate candidate agents). Paper, Tweet
2) LLM Pruning and Distillation in Practice - provides a comprehensive report on effective methods for compressing Llama 3.1 and Mistral NeMo models; it presents pruning and distillation approaches applied to the original models to produce 4B and 8B parameter models, respectively; before pruning, they also fine-tune the teacher model on their datasets leading to better distillation; their compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) which outperforms all similarly-sized models on common language modeling benchmarks. Paper, Tweet
3) Vizier Gaussian Process Bandit Algorithm - presents Vizier, an algorithm based on Gaussian process bandit optimization used by Google for millions of optimizations and research; it provides an open-source Python implementation of the Vizier algorithm, including benchmarking results that demonstrate its wider applicability. Paper, Tweet
4) Language Modeling on Tabular Data - presents a comprehensive survey of language modeling techniques for tabular data; includes topics such as categorization of tabular data structures and data types, datasets used for model training and evaluation, modeling techniques and training objectives, data processing methods, popular architectures, and challenges and future research directions. Paper, Tweet
5) Enhancing Robustness in LLMs - proposes a two-stage prompting technique to remove irrelevant information from context; it serves as a self-mitigation process that first identifies the irrelevant information and then filters it out; this leads to enhancement in robustness of the model and overall better performance on reasoning tasks. Paper, Tweet
6) A Comprehensive Overview of GraphRAG Methods - focuses on techniques applied to the GraphRAG workflow (graph-based indexing, graph-guided retrieval, and graph-enhanced generation); examines tasks, applications, evaluation, and industrial use cases of GraphRAG. Paper, Tweet
7) MagicDec - shows how speculative decoding can enhance throughput, reduce latency, and maintain accuracy in long context generation scenarios; it finds that as sequence length and batch size increase, bottlenecks shift from compute-bound to memory-bound; using these insights, they show it's possible to more effectively use speculative decoding for longer sequences, even when using large batch sizes. Paper, Tweet
8) Controllable Text Generation for LLMs - provides a comprehensive survey on methods for controllable text generation in LLMs; discusses issues like safety, consistency, style, and helpfulness. Paper, Tweet
9) PEDAL - uses a hybrid self-ensembling approach (based on diverse exemplars) to improve the overall performance of LLMs; specifically, it uses diverse exemplars to generate multiple candidate responses and then aggregates them using an LLM to generate a final response; this approach achieves better accuracy compared to greedy decoding and lower cost compared to self-consistency approaches. Paper, Tweet
10) Challenges and Responses in the Practice of LLMs - curates a set of important questions with insightful answers; questions are categorized across topics such as infrastructure, software architecture, data, application, and brain science. Paper, Tweet
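
The self-ensembling flow described in item 9 (PEDAL) can be outlined as: build several prompts from different exemplar subsets, generate one candidate per prompt, then ask a model to aggregate the candidates. The sketch below is an assumption-laden outline rather than the authors' implementation; `self_ensemble`, `fake_generate`, and `fake_aggregate` are hypothetical names, and the two stubs stand in for real LLM calls.

```python
import random


def self_ensemble(question, exemplar_pool, generate, aggregate,
                  n_candidates=3, shots=2, seed=0):
    """Generate candidates from diverse exemplar subsets, then aggregate.

    `generate(prompt) -> str` and `aggregate(question, candidates) -> str`
    are caller-supplied callables (for example, thin wrappers around an
    LLM client). Exemplar subsets are sampled at random here; how the
    exemplars are diversified is an assumption of this sketch.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n_candidates):
        exemplars = rng.sample(list(exemplar_pool), shots)
        prompt = "\n\n".join(exemplars + [question])
        candidates.append(generate(prompt))
    return aggregate(question, candidates)


# Stub callables so the sketch runs without any model access.
calls = []


def fake_generate(prompt):
    calls.append(prompt)
    return f"candidate {len(calls)}"


def fake_aggregate(question, candidates):
    return " | ".join(candidates)


result = self_ensemble("Q: 2+2?", ["ex A", "ex B", "ex C"],
                       fake_generate, fake_aggregate)
```

Compared with self-consistency, which samples many outputs from one prompt, this arrangement varies the prompt and generates once per variant, which is where the cost saving claimed in the entry would come from.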

Top AI Papers of the Week (August 12 - August 18) - 2024

Paper Links
1) The AI Scientist - a novel AI agent that can develop and write a full conference-level scientific paper costing less than $15; it automates scientific discovery by enabling frontier LLMs to perform independent research and summarize findings; it also uses an automated reviewer to evaluate the generated papers; claims to achieve near-human performance in evaluating paper scores; claims to produce papers that exceed the acceptance threshold at a top machine learning conference as judged by their automated reviewer. Paper, Tweet
2) Grok-2 - a new frontier model with strong code, math, and reasoning capabilities which includes a large and small model; outperforms both Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS Chatbot Arena; claims to improve capabilities including instruction following, retrieval, tool use, and enhancing factuality; competes with Claude 3.5 Sonnet (June release) and GPT-4o (May release) on MMLU and HumanEval. Paper, Tweet
3) LongWriter - proposes AgentWrite to enable off-the-shelf LLMs to generate coherent outputs beyond 20K words; AgentWrite takes a divide-and-conquer approach: the agent breaks the long generation task into multiple writing subtasks, writes each one, and concatenates the outputs to get a final output (i.e., plan + write); the approach is then used to build SFT datasets that are used to tune LLMs to generate coherent longer outputs automatically; a 9B parameter model, further improved through DPO, achieves state-of-the-art performance on their benchmark, and surpasses proprietary models. Paper, Tweet
4) EfficientRAG - trains an auto-encoder LM to label and tag chunks; it retrieves relevant chunks, tags them as either or , and annotates chunks for continuous processing; then a filter model is trained to formulate the next-hop query based on the original question and previous annotations; this is done iteratively until all chunks are tagged as or the maximum # of iterations is reached; after the process above has gathered enough information to answer the initial question, the final generator (an LLM) generates the final answer. Paper, Tweet
5) RAGChecker - a fine-grained evaluation framework for diagnosing retrieval and generation modules in RAG; shows that RAGChecker has better correlations with human judgment; reports several insightful patterns and trade-offs in design choices of RAG architectures. Paper, Tweet
6) HybridRAG - combines GraphRAG and VectorRAG leading to a HybridRAG system that outperforms both individually; it was tested on a set of financial earning call transcripts. Combining the advantages of both approaches provides more accurate answers to queries. Paper, Tweet
7) rStar - introduces self-play mutual reasoning to improve the reasoning capabilities of small language models without fine-tuning or superior models; MCTS is augmented with human-like reasoning actions, obtained from SLMs, to build richer reasoning trajectories; a separate SLM provides unsupervised feedback on the trajectories and the target SLM selects the final reasoning trajectory as the answer; rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B and consistently improves the accuracy of other SLMs. Paper, Tweet
8) Scaling LLM Test-Time Compute Optimally - investigates the scaling behaviors of inference-time computation in LLMs; in particular, it analyses how much an LLM can be improved provided a fixed amount of inference-time compute; finds that the effectiveness of different scaling approaches varies by difficulty of prompt; it then proposes an adaptive compute-optimal strategy that can improve efficiency by more than 4x compared to a best-of-N baseline; reports that in a FLOPs-matched evaluation, optimally scaling test-time compute can outperform a 14x larger model. Paper, Tweet
9) MedGraphRAG - a graph-based framework for the medical domain with a focus on enhancing LLMs and generating evidence-based results; leverages a hybrid static-semantic approach to chunk documents to improve context capture; entities and medical knowledge are represented through graphs which leads to an interconnected global graph; this approach improves precision and outperforms state-of-the-art models on multiple medical Q&A benchmarks. Paper, Tweet
10) Survey of NL2SQL - a comprehensive overview of NL2SQL techniques powered by LLMs; covers models, data collection, evaluation methods, and error analysis. Paper, Tweet

Top AI Papers of the Week (August 5 - August 11) - 2024

Paper Links
1) SAM 2 - an open unified model for real-time, promptable object segmentation in images and videos; can be applied to unseen visual content without the need for custom adaptation; to enable accurate mask prediction in videos, a memory mechanism is introduced to store information on the object and previous interactions; the memory module also allows real-time processing of arbitrarily long videos; SAM2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets while requiring three times fewer human-in-the-loop interactions. Paper, Tweet
2) Structured Generation Limits Reasoning - investigates if structured generation can impact an LLM’s reasoning and domain knowledge comprehension capabilities; observes that there is a significant decline in LLM’s reasoning abilities when applying format restrictions compared to free-form responses; this degradation effect is further amplified when applying stricter format constraints to reasoning tasks. Paper, Tweet
3) From LLMs to LLM-based Agents for Software Engineering - a survey paper on current practices and solutions for LLM-based agents for software engineering; covers important topics such as requirement engineering, code generation, test generation, and autonomous decision making; it also includes benchmarks, metrics, and models used in different software engineering applications. Paper, Tweet
4) Transformer Explainer - presents an open-source interactive tool to learn about the inner workings of a Transformer model; it runs a GPT-2 instance locally in the user's browser and allows experimenting with your own inputs. Paper, Tweet
5) Enhancing LLMs for RAG - introduces RAGFoundry, an open-source framework for augmented LLMs for RAG use cases; it supports data creation, training, inference, and evaluation; one useful application is the creation of data-augmented datasets for tuning and evaluating LLMs in RAG settings. Paper, Tweet
6) Synthesizing Text-to-SQL Data from Weak and Strong LLMs - proposes integrated synthetic data to build a highly specialized SoTA text-to-SQL model called SENSE; the synthetic data from strong models enhances data diversity while valuable erroneous data from weaker models is combined with an executor to learn from execution feedback; preference learning is used to instruction-tune LLMs to learn from both correct and incorrect samples; SENSE achieves state-of-the-art results on the SPIDER and BIRD benchmarks, which bridges the performance gap between open-source models and methods that use closed-source models. Paper, Tweet
7) Conversational Prompt Engineering - proposes an approach to help users create personalized prompts by articulating the preferred outputs via interactions; it involves two stages: 1) an initial instruction shaped by the model based on user-provided unlabeled data, and 2) the model shares the output and the user provides feedback with refinements on outputs and instruction; this iterative process results in a personalized few-shot prompt that performs better on the desired task. Paper, Tweet
8) Self-Taught Evaluators - an approach to improve model-based evaluators using synthetic training data only; it first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments; the self-improvement scheme repeats the training process in an iterative way using its improved predictions; claims to outperform LLM-judges such as GPT-4 and match top-performing reward models trained on labeled examples; improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. Paper, Tweet
9) RAGEval - proposes a simple framework to automatically generate evaluation datasets to assess knowledge usage of different LLMs under different scenarios; it defines a schema from seed documents and then generates diverse documents which leads to question-answering pairs; the QA pairs are based on both the articles and configurations. Paper, Tweet
10) Survey of Mamba - provides a systematic review of existing Mamba-based models across domains and tasks; specifically, focuses on advancements of Mamba-based models, techniques for adapting Mamba to diverse data, applications where Mamba excels, and promising research directions. Paper, Tweet

Top AI Papers of the Week (July 29 - August 4) - 2024

Paper Links
1) Meta-Rewarding LLMs - proposes a self-improving alignment technique (no human supervision) where the LLM judges its own judgements and uses the feedback to improve its judgment skills; shows that leveraging this LLM-as-a-Meta-Judge approach improves the LLM's ability to judge and follow instructions; just doing self-improvement to generate better responses (act) saturates quickly; this work improves the LLM's ability to judge itself (judge) to avoid issues like reward hacking; in addition to the act and judge roles, a third role called meta-judge is used to evaluate the model's own judgements. Paper, Tweet
2) MindSearch - presents an LLM-based multi-agent framework to perform complex web-information seeking and integration tasks; a web planner effectively decomposes complex queries followed by a web searcher that performs hierarchical information retrieval on the Internet to improve the relevancy of the retrieved information; the planning component is powered by an iterative graph construction which is used to better model complex problem-solving processes; the multi-agent framework handles long context problems better by distributing reasoning and retrieval tasks to specialized agents. Paper, Tweet
3) Improved RAG with Self-Reasoning - presents an end-to-end self-reasoning framework to improve the reliability and traceability of RAG systems; leverages the reasoning trajectories generated by the LLM itself; the LLM is used to carry out the following 3 processes: 1) relevance-aware: judges the relevance between the retrieved documents and the question, 2) evidence-aware selective: chooses and cites relevant documents, and then automatically selects snippets of key sentences as evidence from the cited documents, and 3) trajectory analysis: generates a concise analysis based on all gathered self-reasoning trajectories generated by the previous 2 processes and then provides the final inferred answer; this method helps the model to be more selective, reason and distinguish relevant and irrelevant documents, therefore improving the accuracy of the overall RAG system; the framework achieves comparable performance to GPT-4 with only 2K training samples (generated by GPT-4). Paper, Tweet
4) Constrained-CoT - limits the model reasoning output length without sacrificing performance; shows that constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on GSM8K, while reducing the average output length by 28 words. Paper, Tweet
5) Adaptive RAG for Conversational Systems - develops a gating model that predicts if a conversational system requires RAG to improve its responses; shows that RAG-based conversational systems have the potential to generate high-quality responses and high generation confidence; it also claims to identify a correlation between the generation's confidence level and the relevance of the augmented knowledge. Paper, Tweet
6) ShieldGemma - offers a comprehensive suite of LLM-based safety content moderation models built on Gemma 2; includes classifiers for key harm types such as dangerous content, toxicity, hate speech, and more. Paper, Tweet
7) Evaluating Persona Agents - proposes a benchmark to evaluate persona agent capabilities in LLMs; finds that Claude 3.5 Sonnet only has a 2.97% relative improvement in PersonaScore compared to GPT-3.5 despite being a much more advanced model. Paper, Tweet
8) Machine Unlearning Survey - provides a comprehensive survey on machine unlearning in generative AI. Paper, Tweet
9) ThinK - proposes an approach to address inefficiencies in KV cache memory consumption; it focuses on long-context scenarios and the inference side; it presents a query-dependent KV cache pruning method to minimize attention weight loss while selectively pruning the least significant channels. Paper, Tweet
10) The Art of Refusal - a survey of the current methods used to achieve refusal in LLMs; provides evaluation benchmarks and metrics used to measure abstention in LLMs. Paper, Tweet
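
As a rough illustration of item 9 above (ThinK), the sketch below shows what a query-dependent channel score for key-cache pruning could look like. It is not the authors' implementation: the scoring rule (product of the per-channel query and key norms), the function names, and the toy matrices are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows = tokens, columns = channels


def channel_scores(queries: Matrix, keys: Matrix) -> List[float]:
    """Score channel c as ||Q[:, c]|| * ||K[:, c]||.

    This equals the Frobenius norm of the rank-one term that channel c
    contributes to Q @ K^T, so a channel with large keys but near-zero
    queries scores low. (Assumed scoring rule, for illustration only.)
    """
    n_channels = len(keys[0])
    scores = []
    for c in range(n_channels):
        q_norm = math.sqrt(sum(row[c] ** 2 for row in queries))
        k_norm = math.sqrt(sum(row[c] ** 2 for row in keys))
        scores.append(q_norm * k_norm)
    return scores


def prune_key_channels(
    queries: Matrix, keys: Matrix, keep: int
) -> Tuple[List[int], Matrix]:
    """Keep the `keep` highest-scoring channels of the key cache."""
    scores = channel_scores(queries, keys)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    kept = sorted(ranked[:keep])  # restore original channel order
    pruned_keys = [[row[c] for c in kept] for row in keys]
    return kept, pruned_keys
```

In this toy setup, a channel whose keys are large but whose queries are close to zero is dropped first, which is the sense in which the pruning is query-dependent.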

Top AI Papers of the Week (July 22 - July 28) - 2024

Paper Links
1) Llama 3.1 - a collection of LLMs that include 8B, 70B, and 405B parameter models; supports eight languages and extends the context window to 128K tokens; performs competitively and in some cases outperforms state-of-the-art models across capabilities like general knowledge, math reasoning, and tool use. Paper, Tweet
2) AlphaProof & AlphaGeometry 2 - solved 4 out of 6 problems in this year’s IMO which is the equivalent of a silver-medal score; AlphaProof consists of a Gemini model that automatically translates natural language problem statements into formal statements (i.e., formalizer network); then a solver network searches for proofs/disproofs and progressively trains itself using AlphaZero to learn to solve even more complex problems; AlphaGeometry 2, a neuro-symbolic hybrid system, proved the geometry problem; based on the Gemini model and trained from scratch on large amounts of synthetic data. Paper, Tweet
3) RAG vs. Long-Context LLMs - compares RAG and long-context LLMs and finds that long-context LLMs outperform RAG on average performance while RAG is significantly less expensive; proposes Self-Route, leveraging self-reflection to route queries to RAG or LC; reports that Self-Route significantly reduces computational cost while maintaining comparable performance to LC. Paper, Tweet
4) OpenDevin - presents a platform to develop generalist agents that interact with the world through software; features include 1) an interaction mechanism for interaction between agents, interfaces, and environments, 2) an environment including a sandboxed operating system and web browser available to the agents, 3) interface to create and execute code, 4) multi-agent support, and 5) an evaluation framework. Paper, Tweet
5) LazyLLM - introduces a novel dynamic token pruning method for efficient long-context LLM inference; it can accelerate the prefilling stage of a Llama 2 7B model by 2.34x and maintain high accuracy; it selectively computes the KV for tokens that are important for the next token prediction in both the prefilling and decoding stages; it allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. Paper, Tweet
6) Teaching LLM Agents to Self-Improve - claims it is possible to iteratively fine-tune LLMs with the ability to improve their own response over multiple turns with additional environment feedback; the LLM learns to recursively detect and correct its previous mistakes in subsequent iterations; improves the self-improvement abilities of 7B models on reasoning tasks (GSM8K and MATH), attaining an improvement over turns that’s unseen in strong proprietary models. Paper, Tweet
7) Text-to-SQL Survey - provides a survey on employing LLMs for Text-to-SQL tasks, including prompt engineering techniques, fine-tuning methods, benchmarks, and more. Paper, Tweet
8) MINT-1T - open-sources a large-scale multimodal interleaved dataset consisting of 1 trillion tokens which has 3.4 billion images; it also includes new sources such as PDFs and ArXiv papers. Paper, Tweet
9) Model Collapse on Synthetic Data - investigates the effects of training models on recursively generated data; finds that training on model-generated content can cause irreversible defects where the original content distribution disappears; shows that the effect, referred to as model collapse, occurs in LLMs, VAEs, and GMMs; while tested on smaller scale models (~100M params), the authors suggest this effect is highly likely to transfer to larger models over time. Paper, Tweet
10) Mitigating Hallucination via Generation Constraint - proposes a new training-free approach to mitigate hallucination in LLMs; they scaled the readout vector that constrains generation in a memory-augmented LLM decoder; recent works claim that LLMs with explicit memory mechanisms can help lower hallucination; this work uses a memory-augmented LLM and constrains generation in the decoder by applying lightweight memory primitives to reduce hallucination. Paper, Tweet
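
To make item 3 above (RAG vs. Long-Context LLMs) more concrete, here is a minimal sketch of a Self-Route-style router. The `retrieve` and `llm` callables, the prompt wording, and the "unanswerable" sentinel are hypothetical stand-ins, not the paper's exact prompts or API. The idea shown is only that the cheap RAG path is tried first and the full context is used when the model declines to answer.

```python
from typing import Callable, Dict, List

UNANSWERABLE = "unanswerable"  # assumed sentinel the model is asked to emit


def self_route(
    query: str,
    retrieve: Callable[[str], List[str]],
    full_context: str,
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Try RAG first; fall back to the long-context prompt if the model declines."""
    chunks = retrieve(query)
    rag_prompt = (
        "Answer using only the passages below. "
        f"If they are insufficient, reply '{UNANSWERABLE}'.\n\n"
        + "\n".join(chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = llm(rag_prompt)
    if UNANSWERABLE not in answer.lower():
        return {"route": "rag", "answer": answer}

    # The model judged the retrieved chunks insufficient: pay for the full context.
    lc_prompt = f"{full_context}\n\nQuestion: {query}"
    return {"route": "long_context", "answer": llm(lc_prompt)}
```

Under this design, the expensive long-context call is made only for the queries the model itself flags as unanswerable from the retrieved chunks, which is where the cost saving would come from.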

Top AI Papers of the Week (July 15 - July 21) - 2024

Paper Links
1) Improving Legibility of LLM Outputs - iteratively trains small verifiers to predict solution correctness, helpful provers to produce correct solutions accepted by the verifier, and sneaky provers that produce incorrect solutions that fool the verifier; this process helps train models that can produce text that is correct and easy to understand by both humans and AI systems which leads to more trustworthy systems. Paper, Tweet
2) SpreadsheetLLM - presents an efficient encoding method to optimize an LLM’s understanding and reasoning capability on spreadsheets; develops a sheet compressor consisting of structural-anchor-based compression, inverse index translation, and data-format-aware aggregation modules to efficiently compress and encode spreadsheets; in GPT-4’s in-context learning, it improves performance in spreadsheet table detection by 25.6%. Paper, Tweet
3) Context Embeddings for Efficient Answer Generation in RAG - proposes an effective context compression method to reduce long context and speed up generation time in RAG systems; the long contexts are compressed into a small number of context embeddings which allow different compression rates that trade off decoding time for generation quality; reduces inference time by up to 5.69× and GFLOPs by up to 22× while maintaining high performance. Paper, Tweet
4) Weak-to-Strong Reasoning - demonstrates the use of weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; reports that strong models can automatically refine their training data without explicitly being trained to do so; enables expanding a model's learning scope and scaling performance on reasoning. Paper, Tweet
5) A Survey of Prompt Engineering Methods in LLMs - a collection of prompt engineering methods for a variety of NLP tasks. Paper, Tweet
6) Does Refusal Training in LLMs Generalize to the Past Tense? - finds that simply reformulating an LLM request into past tense can jailbreak many state-of-the-art LLMs; for example "How to make a Molotov cocktail?" can be rephrased as "How did people make a Molotov cocktail?"; finds that the attack success rate on GPT-4o increases from 1% with direct requests to 88% with past-tense reformulations; concludes that current alignment techniques may not always generalize as intended. Paper, Tweet
7) Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? - proposes a framework (NeedleBench) of progressively challenging tasks to assess the long-context retrieval and reasoning capabilities of LLMs; they also present the Ancestral Trace Challenge that increases the need for complex logical reasoning which is common in real-world long-context tasks; their findings suggest that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens. Paper, Tweet
8) Distilling System 2 into System 1 - investigates self-supervised methods to distill high-quality outputs from System 2 techniques and then fine-tune System 1 to match the predictions of the System 2 technique but without generating intermediate steps; the process of distilling reasoning into System 1 results in less inference cost. Paper, Tweet
9) Exploring Advanced LLMs with LLMSuite - shares practical tips for developing with and evaluating LLMs; solutions covered range from ReAct to RAG to parameter-efficient methods. Paper, Tweet
10) Beyond Euclid - provides an illustrated guide and graphical taxonomy of recent advances in non-Euclidean machine learning. Paper, Tweet
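
For item 8 above (Distilling System 2 into System 1), one way to picture the data-building step is a self-consistency filter: sample a System 2 method several times, keep only the final answers it agrees on, and use those prompt/answer pairs as fine-tuning targets without the intermediate reasoning. The sketch below assumes exactly that. The `system2_sample` callable, the sample count, and the agreement threshold are hypothetical choices, not values from the paper.

```python
from collections import Counter
from typing import Callable, List, Tuple


def build_distillation_set(
    prompts: List[str],
    system2_sample: Callable[[str], str],
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> List[Tuple[str, str]]:
    """Collect (prompt, answer) pairs where System 2 sampling is self-consistent.

    `system2_sample` is assumed to run a System 2 technique (for example a
    chain-of-thought prompt) and return only the final answer string.
    """
    dataset = []
    for prompt in prompts:
        answers = [system2_sample(prompt) for _ in range(n_samples)]
        answer, votes = Counter(answers).most_common(1)[0]
        # Keep the example only if enough samples agree on the same answer.
        if votes / n_samples >= min_agreement:
            dataset.append((prompt, answer))
    return dataset
```

The resulting pairs contain no reasoning traces, so a model fine-tuned on them would be trained to produce the answer directly, which is where the lower inference cost would come from.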

Top AI Papers of the Week (July 8 - July 14) - 2024

Paper Links
1) FlashAttention-3 - proposes to adapt FlashAttention to take advantage of modern hardware; the techniques used to speed up attention on modern GPUs include producer-consumer asynchrony, interleaving block-wise matmul and softmax operations, and block quantization and incoherent processing; achieves speedup on H100 GPUs by 1.5-2.0x with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. Paper, Tweet
2) RankRAG - introduces a new instruction fine-tuning framework to perform effective context ranking and answer generation to enhance an LLM’s RAG capabilities; it leverages a small ranking dataset to outperform existing expert ranking models; shows that Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. Paper, Tweet
3) Mixture of A Million Experts - introduces a parameter-efficient expert retrieval mechanism that leverages the product key technique for sparse retrieval from a million tiny experts; it attempts to decouple computational cost from parameter count by efficiently routing to a very large number of tiny experts through a learned index structure used for routing; demonstrates superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers. Paper, Tweet
4) Reasoning in LLMs: A Geometric Perspective - explores the reasoning of LLMs from a geometrical perspective; reports that a higher intrinsic dimension implies greater expressive capacity of the LLM; reports that they establish a connection between the expressive power of LLMs and the density of their self-attention graphs; their analysis demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. Paper, Tweet
5) Contextual Hallucinations Mitigation in LLMs - proposes a new method that detects and significantly reduces contextual hallucinations in LLMs (e.g., reduces by 10% in the XSum summarization task); builds a hallucination detection model based on input features given by the ratio of attention weights on the context vs. newly generated tokens (for each attention head); the hypothesis is that contextual hallucinations are related to the extent to which an LLM attends to the provided contextual information; they also propose a decoding strategy based on their detection method which mitigates the contextual hallucination; the detector can also be transferred across models without the need for retraining. Paper, Tweet
6) RouteLLM - proposes efficient router models to dynamically select between stronger and weaker LLMs during inference to achieve a balance between cost and performance; the training framework leverages human preference data and data augmentation techniques to boost performance; reports cost reductions of over 2x in certain cases while maintaining the quality of responses. Paper, Tweet
7) A Survey on Mixture of Experts - a survey paper on Mixture of Experts (MoE), including the technical details of MoE, open-source implementations, evaluation techniques, and applications of MoE in practice. Paper, Tweet
8) Internet of Agents - a new framework to address several limitations in multi-agent frameworks such as integrating diverse third-party agents and adaptability to dynamic task requirements; introduces an agent integration protocol, instant messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents. Paper, Tweet
9) 3DGen - a new pipeline for end-to-end text-to-3D asset generation in under a minute; integrates state-of-the-art components like AssetGen and TextureGen to represent 3D objects in three ways, namely in view space, in volumetric space, and in UV space; achieves a win rate of 68% with respect to the single-stage model. Paper, Tweet
10) Learning at Test Time - proposes new sequence modeling layers with linear complexity and an expressive hidden state; defines the hidden state as an ML model itself, capable of updating even on test sequences; variants whose hidden state is a linear model or a two-layer MLP are found to match or exceed baseline models like Transformers, Mamba, and modern RNNs; the linear model is faster than Transformer at 8k context and matches Mamba in wall-clock time. Paper, Tweet
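
The input features described in item 5 above (Contextual Hallucinations Mitigation in LLMs) can be sketched as a per-head ratio of attention mass on the context versus on already generated tokens. The normalization used here (mean attention per span, then context / (context + generated)) and the function names are assumptions for illustration. The paper's exact definition may differ.

```python
from typing import List


def lookback_ratio(attn_row: List[float], n_context: int) -> float:
    """Ratio for one head at one decoding step.

    `attn_row` holds that head's attention weights over
    [context tokens..., previously generated tokens...].
    """
    context = attn_row[:n_context]
    generated = attn_row[n_context:]
    ctx_mean = sum(context) / len(context)
    gen_mean = sum(generated) / len(generated) if generated else 0.0
    total = ctx_mean + gen_mean
    return ctx_mean / total if total > 0 else 0.0


def lookback_features(attn_per_head: List[List[float]], n_context: int) -> List[float]:
    """One feature per attention head; a detector would be trained on these."""
    return [lookback_ratio(row, n_context) for row in attn_per_head]
```

A head that mostly attends to the provided context yields a ratio near 1, while a head that mostly attends to its own earlier output yields a ratio near 0. A simple classifier over such features is one plausible form for the detector.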

Top AI Papers of the Week (July 1 - July 7) - 2024

Paper Links
1) APIGen - presents an automated data generation pipeline to synthesize high-quality datasets for function-calling applications; shows that 7B models trained on curated datasets outperform GPT-4 models and other state-of-the-art models on the Berkeley Function-Calling Benchmark; a dataset consisting of 60K entries is also released to help with research in function-calling enabled agents. Paper, Tweet
2) CriticGPT - a new model based on GPT-4 to help write critiques for responses generated by ChatGPT; trained with RLHF on a large number of inputs containing mistakes that it had to critique; built to help human trainers spot mistakes during RLHF and claims that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs. Paper, Tweet
3) Searching for Best Practices in RAG - shows the best practices for building effective RAG workflows; proposes strategies that focus on performance and efficiency, including emerging multimodal retrieval techniques. Paper, Tweet
4) Scaling Synthetic Data Creation - proposes 1 billion diverse personas to facilitate the creation of diverse synthetic data for different scenarios; uses a novel persona-driven data synthesis methodology to generate diverse and distinct data covering a wide range of perspectives; to measure the quality of the synthetic datasets, they performed an out-of-distribution evaluation on MATH. A fine-tuned model on their synthesized 1.07M math problems achieves 64.9% on MATH, matching the performance of gpt-4-turbo-preview at only a 7B scale. Paper, Tweet
5) Self-Evaluation as a Defense Against Adversarial Attacks on LLMs - proposes the use of self-evaluation to defend against adversarial attacks; uses a pre-trained LLM to build defense which is more effective than fine-tuned models, dedicated safety LLMs, and enterprise moderation APIs; they evaluate different settings like attacks on the generator only and generator + evaluator combined; it shows that building a dedicated evaluator can significantly reduce the success rate of attacks. Paper, Tweet
6) Agentless - introduces OpenAutoCoder-Agentless which offers an agentless system that solves 27.3% of GitHub issues on SWE-bench Lite; claims to outperform all other open-source AI-powered software engineering agents. Paper, Tweet
7) Adaptable Logical Control for LLMs - presents the Ctrl-G framework to facilitate control of LLM generations that reliably follow logical constraints; it combines LLMs and Hidden Markov Models to enable following logical constraints (represented as deterministic finite automata); Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT-4. Paper, Tweet
8) LLM See, LLM Do - closely investigates the effects and effectiveness of synthetic data and how it shapes a model’s internal biases, calibration, attributes, and preferences; finds that LLMs are sensitive towards certain attributes even when the synthetic data prompts appear neutral; demonstrates that it’s possible to steer the generation profiles of models towards desirable attributes. Paper, Tweet
9) Summary of a Haystack - proposes a new task, SummHay, to test a model’s ability to process a Haystack and generate a summary that identifies the relevant insights and cites the source documents; reports that long-context LLMs score 20% on the benchmark which lags the human performance estimate (56%); RAG components are found to boost performance on the benchmark, which makes it a viable option for holistic RAG evaluation. Paper, Tweet
10) AI Agents That Matter - analyzes current agent evaluation practices and reveals shortcomings that potentially hinder real-world application; proposes an implementation that jointly optimizes cost and accuracy and a framework to avoid overfitting agents. Paper, Tweet
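
Item 7 above (Adaptable Logical Control for LLMs) represents logical constraints as deterministic finite automata. The sketch below shows only that representation, for a toy constraint of the form "keyword A must appear, and keyword B must appear somewhere after it". The `DFA` class and helper are hypothetical and do not cover the HMM-guided decoding that Ctrl-G pairs with the automaton.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


class DFA:
    """Tiny DFA over tokens; missing edges are treated as self-loops."""

    def __init__(
        self,
        start: Hashable,
        accepting: Set[Hashable],
        transitions: Dict[Tuple[Hashable, str], Hashable],
    ) -> None:
        self.start = start
        self.accepting = set(accepting)
        self.transitions = transitions

    def step(self, state: Hashable, token: str) -> Hashable:
        # Tokens without an explicit edge leave the state unchanged.
        return self.transitions.get((state, token), state)

    def accepts(self, tokens: Iterable[str]) -> bool:
        state = self.start
        for token in tokens:
            state = self.step(state, token)
        return state in self.accepting


def must_contain_in_order(first: str, second: str) -> DFA:
    """State 0: nothing seen; 1: `first` seen; 2: `second` seen after `first`."""
    return DFA(start=0, accepting={2}, transitions={(0, first): 1, (1, second): 2})
```

For example, `must_contain_in_order("refund", "today")` accepts the token sequence `"your refund arrives today".split()` and rejects `"today your refund".split()`. A constrained decoder would track the automaton state during generation and steer toward continuations that can still reach an accepting state.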

Top AI Papers of the Week (June 24 - June 30) - 2024

Paper Links
1) ESM3 - a new LLM-based biological model that generates a new green fluorescent protein called esmGFP; builds on a bidirectional transformer, uses masked language modeling as the training objective, leverages geometric attention to represent atomic coordinates, and applies chain-of-thought prompting to generate fluorescent proteins; estimates that esmGFP represents an equivalent of over 500 million years of natural evolution performed by an evolutionary simulator. Paper, Tweet
2) Gemma 2 - presents a family of open models ranging from 2B to 27B parameters; demonstrates strong capabilities in reasoning, math, and code generation, outperforming models twice its size. Paper, Tweet
3) LLM Compiler - a suite of open pre-trained models (7B and 13B parameters) designed for code optimization tasks; it’s built on top of Code Llama and trained on a corpus of 546 billion tokens of LLVM-IR and assembly code; it’s also instruction fine-tuned to interpret compiler behavior; achieves 77% of the optimizing potential of autotuning search and performs accurate disassembling 14% of the time compared to the autotuning technique on which it was trained. Paper, Tweet
4) Enhancing RAG with Long-Context LLMs - proposes LongRAG, which combines RAG with long-context LLMs to enhance performance; uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units; the long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system; claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. Paper, Tweet
5) Improving Retrieval in LLMs through Synthetic Data - proposes a fine-tuning approach to improve the accuracy of retrieving information in LLMs while maintaining reasoning capabilities over long-context inputs; the fine-tuning dataset comprises numerical dictionary key-value retrieval tasks (350 samples); finds that this approach mitigates the "lost-in-the-middle" phenomenon and improves performance on both information retrieval and long-context reasoning. Paper, Tweet
6) GraphReader - proposes a graph-based agent system to enhance the long-context abilities of LLMs; it structures long text into a graph and employs an agent to explore the graph (using predefined functions guided by a step-by-step rational plan) to effectively generate answers for questions; consistently outperforms GPT-4-128k across context lengths from 16k to 256k. Paper, Tweet
7) Faster LLM Inference with Dynamic Draft Trees - presents a context-aware dynamic draft tree to increase the speed of inference; the previous speculative sampling method used a static draft tree for sampling which only depended on position but lacked context awareness; achieves speedup ratios ranging from 3.05x-4.26x, which is 20%-40% faster than previous work; these speedup ratios occur because the new method significantly increases the number of accepted draft tokens. Paper, Tweet
8) Following Length Constraints in Instructions - presents an approach for how to deal with length bias and train instruction following language models that better follow length constraint instructions; fine-tunes a model using DPO with a length instruction augmented dataset and shows fewer length constraint violations while keeping a high response quality. Paper, Tweet
9) On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation - survey on LLM-based synthetic data generation, curation, and evaluation. Paper, Tweet
10) Adam-mini - a new optimizer that reduces memory footprint (45%-50% less memory footprint) by using fewer learning rates and performs on par with or better than AdamW; it carefully partitions parameters into blocks and assigns each block a single high-quality learning rate, which can outperform Adam; achieves consistent results on language models sized from 125M to 7B for pre-training, SFT, and RLHF. Paper, Tweet
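
The block-wise learning rate idea in item 10 above (Adam-mini) can be sketched as follows: keep Adam's per-parameter first moment, but store only one second-moment value per parameter block, computed from the mean squared gradient of that block. This is a simplified pure-Python sketch under that assumption. The partitioning, hyperparameters, and function names are illustrative and not taken from the authors' code.

```python
import math
from typing import Dict, List


def init_state(n_params: int, n_blocks: int) -> Dict:
    """First moment per parameter, second moment per block (the memory saving)."""
    return {"m": [0.0] * n_params, "v": [0.0] * n_blocks, "t": 0}


def adam_mini_step(
    params: List[float],
    grads: List[float],
    blocks: List[List[int]],
    state: Dict,
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
) -> List[float]:
    """One update; `blocks` is a list of index lists partitioning the parameters."""
    state["t"] += 1
    t = state["t"]
    new_params = list(params)
    for b, idxs in enumerate(blocks):
        # A single second-moment scalar shared by every parameter in the block.
        mean_sq = sum(grads[i] ** 2 for i in idxs) / len(idxs)
        state["v"][b] = beta2 * state["v"][b] + (1 - beta2) * mean_sq
        v_hat = state["v"][b] / (1 - beta2 ** t)
        for i in idxs:
            state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grads[i]
            m_hat = state["m"][i] / (1 - beta1 ** t)
            new_params[i] = params[i] - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_params
```

With four parameters split into two blocks, the second-moment state holds two values instead of four. At model scale the same reduction applies to the second-moment buffer of the optimizer state, which is the kind of saving the summary refers to.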

Top AI Papers of the Week (June 17 - June 23) - 2024

Paper Links
1) Claude 3.5 Sonnet - a new model that achieves state-of-the-art performance on several common benchmarks such as MMLU and HumanEval; it outperforms Claude 3 Opus and GPT-4o on several benchmarks with the exception of math word problem-solving tasks; achieves strong performance on vision tasks which also helps power several new features like image-text transcription and generation of artifacts. Paper, Tweet
2) DeepSeek-Coder-V2 - competes with closed-source models on code and math generation tasks; achieves 90.2% on HumanEval and 75.7% on MATH; these results are higher than GPT-4-Turbo-0409 performance according to their report; includes a 16B and 236B parameter model with 128K context length. Paper, Tweet
3) TextGrad - a new framework for automatic differentiation through backpropagation on textual feedback provided by an LLM; this improves individual components and the natural language helps to optimize the computation graph; it works by providing an objective function without tuning prompts or components; claims to achieve LeetCodeHard best scores and SoTA performance on GPQA when combined with GPT-4o. Paper, Tweet
4) Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? - conducts a deep performance analysis of long-context LLMs on in-context retrieval and reasoning; they first present a benchmark with real-world tasks requiring 1M token context; reports that long-context LLMs can rival state-of-the-art retrieval and RAG systems, without any explicit training on the tasks; suggests that compositional reasoning (required in SQL-like tasks) is still challenging for these LLMs; they also encourage the need for continued research on advanced prompting strategies as they noted significant boosts in performance when applying them for long context problems. Paper, Tweet
5) PlanRAG - enhances decision making with a new RAG technique called iterative plan-then-RAG (PlanRAG); involves two steps: 1) an LM generates the plan for decision making by examining data schema and questions and 2) the retriever generates the queries for data analysis; the final step checks if a new plan for further analysis is needed and iterates on previous steps or makes a decision on the data; PlanRAG is found to be more effective than iterative RAG on the proposed Decision QA tasks. Paper, Tweet
6) Mitigating Memorization in LLMs - presents a modification of the next-token prediction objective called goldfish loss to help mitigate the verbatim generation of memorized training data; it uses a simple technique that excludes a pseudorandom subset of training tokens at training time; they show that the goldfish loss resists memorization and keeps the model useful; however, it may need to train for longer to more effectively learn from the training data. Paper, Tweet
7) Monte Carlo Tree Self-Refine - reports achieving GPT-4-level mathematical olympiad solutions using an approach that integrates LLMs with Monte Carlo Tree Search; this approach focuses on enhancing the mathematical reasoning performance of the system through capabilities such as systematic exploration, self-refinement, and self-evaluation. Paper, Tweet
8) From RAG to Rich Parameters - investigates more closely how LLMs utilize external knowledge over parametric information for factual queries; finds that in a RAG pipeline, LLMs take a “shortcut” and display a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. Paper, Tweet
9) Open-Sora - an open-source video generation model that can generate 16-second 720p videos; it’s a 1.1B parameter model trained on more than 30m data and now supports image-to-video; presents an enhanced diffusion model and video compression network for spatial and temporal compression; increases controllability of generations and reduces training costs. Paper, Tweet
10) Tree Search for Language Model Agents - proposes an inference-time tree search algorithm for LM agents to perform exploration and enable multi-step reasoning; it’s tested on interactive web environments and applied to GPT-4o to significantly improve performance; demonstrates that performance scales when increasing test-time compute. Paper, Tweet
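
The goldfish loss from item 6 above (Mitigating Memorization in LLMs) excludes a pseudorandom subset of tokens from the training loss. A minimal sketch of that masking is shown below. The hashing scheme (SHA-256 over the preceding `context` token IDs), the drop rate `1/k`, and the function names are assumptions for illustration rather than the paper's exact recipe.

```python
import hashlib
from typing import List


def goldfish_mask(token_ids: List[int], k: int = 4, context: int = 3) -> List[int]:
    """Return a 0/1 mask over positions; 0 means the token is excluded from the loss.

    The decision for position i depends only on the `context` tokens before it,
    so a passage repeated in the corpus always gets the same mask.
    """
    mask = []
    for i in range(len(token_ids)):
        if i < context:
            mask.append(1)  # not enough history to hash; keep the token
            continue
        window = ",".join(str(t) for t in token_ids[i - context:i])
        digest = hashlib.sha256(window.encode("utf-8")).digest()
        mask.append(0 if digest[0] % k == 0 else 1)
    return mask


def masked_nll(token_log_probs: List[float], mask: List[int]) -> float:
    """Mean negative log-likelihood over the tokens that were kept."""
    kept = [-lp for lp, m in zip(token_log_probs, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

Because the dropped tokens never contribute a gradient, the model is never directly trained to reproduce them, which is the intuition for why verbatim regeneration of a long passage becomes less likely. The trade-off noted in the summary (possibly needing longer training) follows from supervising fewer tokens per sequence.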

Top AI Papers of the Week (June 10 - June 16) - 2024

Paper Links
1) Nemotron-4 340B - provides an instruct model to generate high-quality data and a reward model to filter out data on several attributes; demonstrates strong performance on common benchmarks like MMLU and GSM8K; it’s competitive with GPT-4 on several tasks, including high scores in multi-turn chat; preference data is also released along with the base model. Paper, Tweet
2) Discovering Preference Optimization Algorithms with LLMs - proposes LLM-driven objective discovery of state-of-the-art preference optimization; no human intervention is used and an LLM is prompted to propose and implement the preference optimization loss functions based on previously evaluated performance metrics; discovers an algorithm that adaptively combines logistic and exponential losses. Paper, Tweet
3) SelfGoal - a framework to enhance an LLM-based agent's capabilities to achieve high-level goals; adaptively breaks down a high-level goal into a tree structure of practical subgoals during interaction with the environment; improves performance on various tasks, including competitive, cooperative, and deferred feedback environments. Paper, Tweet
4) Mixture-of-Agents - an approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents methodology; layers are designed with multiple LLM agents and each agent builds on the outputs of other agents in the previous layers; surpasses GPT-4o on AlpacaEval 2.0, MT-Bench and FLASK. Paper, Tweet
5) Transformers Meet Neural Algorithmic Reasoners - a new hybrid architecture that enables tokens in the LLM to cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR); the resulting model, called TransNAR, demonstrates improvements in OOD reasoning across algorithmic tasks. Paper, Tweet
6) Self-Tuning with LLMs - improves an LLM’s ability to effectively acquire new knowledge from raw documents through self-teaching; the three steps involved are 1) a self-teaching component that augments documents with a set of knowledge-intensive tasks focusing on memorization, comprehension, and self-reflection, 2) using the deployed model to acquire knowledge from new documents while reviewing its QA skills, and 3) continually learning using only the new documents, which helps with thorough acquisition of new knowledge. Paper, Tweet
7) Sketching as a Visual Chain of Thought - a framework that enables a multimodal LLM to access a visual sketchpad and tools to draw on the sketchpad; it can equip a model like GPT-4 with the capability to generate intermediate sketches to reason over complex tasks; improves performance on many tasks over strong base models with no sketching; GPT-4o equipped with SketchPad sets a new state of the art on all the tasks tested. Paper, Tweet
8) Mixture of Memory Experts - proposes an approach to significantly reduce hallucination (10x) by tuning millions of expert adapters (e.g., LoRAs) to learn exact facts and retrieve them from an index at inference time; the memory experts are specialized to ensure faithful and factual accuracy on the data it was tuned on; claims to enable scaling to a high number of parameters while keeping the inference cost fixed. Paper, Tweet
9) Multimodal Table Understanding - introduces Table-LLaVa 7B, a multimodal LLM for multimodal table understanding; it’s competitive with GPT-4V and significantly outperforms existing MLLMs on multiple benchmarks; also develops a large-scale dataset MMTab, covering table images, instructions, and tasks. Paper, Tweet
10) Consistent Middle Enhancement in LLMs - proposes an approach to tune an LLM to effectively utilize information from the middle part of the context; it first proposes a training-efficient method to extend LLMs to longer context lengths (e.g., 4K -> 256K); it uses a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning; the approach helps to alleviate the so-called "Lost-in-the-Middle" problem in long-context LLMs. Paper, Tweet
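
The layered structure described in item 4 above (Mixture-of-Agents) can be sketched with plain callables standing in for LLMs. The `Agent` type, the prompt template, and the final aggregator step are hypothetical choices for illustration. Only the overall pattern (each layer's agents see the previous layer's outputs) comes from the summary.

```python
from typing import Callable, List

Agent = Callable[[str], str]  # stand-in for a call to some LLM


def with_references(prompt: str, responses: List[str]) -> str:
    """Append numbered responses from the previous layer to the prompt."""
    if not responses:
        return prompt
    refs = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(responses))
    return f"{prompt}\n\nResponses from the previous layer:\n{refs}"


def mixture_of_agents(prompt: str, layers: List[List[Agent]], aggregator: Agent) -> str:
    """Run each layer in turn, feeding its agents the previous layer's outputs."""
    previous: List[str] = []
    for layer in layers:
        layer_prompt = with_references(prompt, previous)
        previous = [agent(layer_prompt) for agent in layer]
    # A final agent synthesizes the last layer's responses into one answer.
    return aggregator(with_references(prompt, previous))
```

In practice each `Agent` would wrap a different model, so later layers can combine and correct the proposals of earlier ones before the aggregator produces a single response.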

Top AI Papers of the Week (June 3 - June 9) - 2024

Paper Links
1) NLLB - proposes a massive multilingual model that leverages transfer learning across 200 languages; it’s based on a sparsely Gated Mixture of Experts architecture and trained on data via an approach tailored for low-resource languages; evaluates on 40K translations and achieves an average of 44% improvement in translation quality. Paper, Tweet
2) Extracting Concepts from GPT-4 - proposes a new scalable method based on sparse autoencoders to extract around 16 million interpretable patterns from GPT-4; the method demonstrates predictable scaling and is more efficient than previous techniques. Paper, Tweet
3) Mamba-2 - a new architecture that combines state space models (SSMs) and structured attention; it uses 8x larger states and trains 50% faster; the new state space duality layer is more efficient and scalable compared to the approach used in Mamba; it also improves results on tasks that require large state capacity. Paper, Tweet
4) MatMul-free LLMs - proposes an implementation that eliminates matrix multiplication operations from LLMs while maintaining performance at billion-parameter scales; the performance gap between full-precision Transformers and the MatMul-free models narrows as the model size increases; claims that by using an optimized kernel during inference, memory consumption is reduced by more than 10x. Paper, Tweet
5) Buffer of Thoughts - presents a thought-augmented reasoning approach to enhance the accuracy, efficiency, and robustness of LLM-based reasoning; it leverages a meta-buffer containing high-level thoughts (thought templates) distilled from problem-solving processes; the relevant thought template is then retrieved and instantiated with task-specific reasoning structures for the thought-augmented reasoning process; it demonstrates SOTA performance on 10 challenging tasks while requiring 12% of the cost of multi-query prompting methods like Tree-of-Thoughts. Paper, Tweet
6) SaySelf - a training framework to teach LLMs to express more accurate fine-grained confidence estimates and self-reflective rationales; it performs supervised finetuning on a dataset that contains summaries of the difference between multiple reasoning chains; reinforcement learning is then applied to calibrate confidence estimates, encouraging the LLM to produce accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs. Paper, Tweet
7) The Geometry of Concepts in LLMs - studies the geometry of categorical concepts and how the hierarchical relations between them are encoded in LLMs; finds that simple categorical concepts are represented as simplices by the LLMs and complex concepts are represented as polytopes constructed from direct sums of simplices, which reflect the hierarchical structure. Paper, Tweet
8) Aligning LLMs with Demonstrated Feedback - proposes a method to align LLMs to a specific setting via a very small number of demonstrations as feedback; it aligns LLM outputs to a user’s demonstrated behaviors and can learn fine-grained style and task alignment across domains; outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks. Paper, Tweet
9) Towards Scalable Automated Alignment of LLMs - provides an overview of methods used for alignment of LLMs; explores the 4 following directions: 1) aligning through inductive bias, 2) aligning through behavior imitation, 3) aligning through model feedback, and 4) aligning through environment feedback. Paper, Tweet
10) AgentGym - a new framework featuring various environments and tasks for broad, real-time, and concurrent agent exploration; builds a generally capable LLM-based agent with self-evolution abilities and explores its potential beyond previously seen data across tasks and environments. Paper, Tweet

Top AI Papers of the Week (May 27 - June 2) - 2024

Paper Links
1) Contextual Position Encoding - proposes a new position encoding method, CoPE, to enable the position to be conditioned on context by incrementing position only on certain tokens; the position encoding is context-dependent and can represent different levels of position abstraction; the general position encoding method can attend to the i-th particular word, noun, or sentence; improves perplexity on language modeling and coding tasks. Paper, Tweet
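The idea of incrementing position only on certain tokens can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: it assumes hard 0/1 gates chosen by a hypothetical rule (sentence-ending periods), whereas a learned method would compute soft gates from the context. The function name `contextual_positions` is made up for this example.

```python
def contextual_positions(gates, i):
    """Position of every token j <= i, measured relative to token i.

    gates[k] is 1.0 if token k advances the position counter, else 0.0.
    positions[j] is the sum of gates[j..i], so it counts gated tokens
    rather than raw tokens.
    """
    positions = [0.0] * (i + 1)
    running = 0.0
    for j in range(i, -1, -1):  # walk backwards from the current token
        running += gates[j]
        positions[j] = running
    return positions


tokens = ["A", "cat", ".", "It", "sat", ".", "Then", "slept"]
# Hypothetical gate: only sentence-ending periods increment the position.
gates = [1.0 if t == "." else 0.0 for t in tokens]
sentence_positions = contextual_positions(gates, i=len(tokens) - 1)
print(sentence_positions)
```

With this gate, tokens in the current sentence share position 0, the previous sentence gets 1, and the one before it gets 2, which is how a position can count sentences instead of tokens.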
2) Symbolic Chain-of-Thought - proposes a method that improves the logical reasoning capabilities of LLMs by integrating symbolic expressions and logical rules with chain-of-thought (CoT) prompting; the prompting technique is called Symbolic Chain-of-Thought and it’s a fully LLM-based framework with the following key steps: 1) translates natural language context to symbolic format, 2) derives step-by-step plan to solve problems following symbolic logical rules, and 3) uses a verifier to check the translation and reasoning chain. Paper, Tweet
3) Abacus Embeddings - achieves 99% accuracy on 100-digit addition problems by training on only 20-digit numbers with a single GPU; the main challenge this work addresses is the inability of transformers to track the exact position of digits; they do this by adding an embedding to each digit that encodes its position relative to the start of the number; these gains also transfer to multi-step reasoning tasks that include sorting and multiplication. Paper, Tweet
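As a rough sketch of the digit-position idea described above, the snippet below assigns each digit an index relative to the start of the number it belongs to. Only the index computation is shown; in a model, each index would select a learned embedding that is added to the token embedding. The function name and the convention that non-digit tokens get index 0 are assumptions made for illustration.

```python
def abacus_position_ids(tokens):
    """Return, for each token, its 1-based offset inside the current number.

    Non-digit tokens reset the counter and receive index 0 (an assumed
    convention for this sketch).
    """
    ids = []
    pos = 0
    for tok in tokens:
        if tok.isdigit():
            pos += 1
            ids.append(pos)
        else:
            pos = 0
            ids.append(0)
    return ids


ids = abacus_position_ids(list("123+45="))
print(ids)
```

Because digits of the same significance receive the same index in every operand, the model can align them regardless of where the number appears in the sequence.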
4) Introduction to Vision-Language Modeling - presents an introduction to vision-language models along with key details of how they work and how to effectively train these models. Paper, Tweet
5) GNN-RAG - combines the language understanding abilities of LLMs with the reasoning abilities of GNNs in a RAG style; the GNN extracts useful and relevant graph information while the LLM takes the information and leverages its capabilities to perform question answering over knowledge graphs (KGQA); GNN-RAG improves vanilla LLMs on KGQA and outperforms or matches GPT-4 performance with a 7B tuned LLM. Paper, Tweet
6) Attention as an RNN - presents a new attention mechanism that can be trained in parallel (like Transformers) and be updated efficiently with new tokens, requiring constant memory usage for inference (like RNNs); the attention formulation is based on the parallel prefix scan algorithm which enables efficient computation of attention’s many-to-many RNN output; achieves comparable performance to Transformers on 38 datasets while being more time and memory-efficient. Paper, Tweet
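To make the "attention as an RNN" view concrete, here is a minimal sketch of softmax attention for a single query computed one token at a time with constant memory. It uses the standard running-max trick for numerical stability and shows only the sequential update; the parallel prefix scan mentioned above is not implemented. Class and function names are hypothetical.

```python
import math


def attention_full(q, keys, values):
    """Reference: ordinary softmax attention for one query over all tokens."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]


class RecurrentAttention:
    """Keeps a running max, numerator, and denominator; memory is O(dim)."""

    def __init__(self, q, dim):
        self.q = q
        self.m = -math.inf       # running max of scores
        self.num = [0.0] * dim   # running weighted sum of values
        self.den = 0.0           # running sum of weights

    def update(self, k, v):
        s = sum(qi * ki for qi, ki in zip(self.q, k))
        new_m = max(self.m, s)
        scale = math.exp(self.m - new_m)  # rescale old state to the new max
        w = math.exp(s - new_m)
        self.num = [n * scale + w * vd for n, vd in zip(self.num, v)]
        self.den = self.den * scale + w
        self.m = new_m
        return [n / self.den for n in self.num]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

rnn = RecurrentAttention(q, dim=2)
for k, v in zip(keys, values):
    streamed = rnn.update(k, v)
full = attention_full(q, keys, values)
```

After the last update, `streamed` matches `full` up to floating-point error, even though the recurrent version never stores past keys or values.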
7) Aya23 - a family of multilingual language models that can serve up to 23 languages; it intentionally focuses on fewer languages and allocates more capacity to these languages; shows that it can outperform other massively multilingual models on those specific languages. Paper, Tweet
8) Are Long-LLMs A Necessity For Long-Context Tasks? - claims that long-LLMs are not a necessity to solve long-context tasks; proposes a reasoning framework to enable short-LLMs to address long-context tasks by adaptively accessing and utilizing the context based on the presented tasks; it decomposes the long context into short contexts and processes them using a decision-making process. Paper, Tweet
9) Financial Statement Analysis with LLMs - claims that LLMs can generate useful insights from their analysis of trends and financial ratios; shows that GPT-4 performs on par with narrowly specialized models and achieves a profitable trading strategy based on GPT’s predictions. Paper, Tweet
10) SimPO - a simpler and more effective approach for preference optimization with a reference-free reward; uses the average log probability of a sequence as an implicit reward (i.e., no reference model required) which makes it more compute and memory efficient; demonstrates that it outperforms existing approaches like DPO and claims to produce the strongest 8B open-source model. Paper, Tweet
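The reference-free reward described above can be sketched as follows. The reward is the length-normalized (average) log probability of a response, scaled by `beta`; the loss then compares chosen and rejected responses with a target margin `gamma`. The scaling and margin terms follow my reading of the SimPO objective, and the default values here are placeholders rather than recommended settings.

```python
import math


def implicit_reward(token_logprobs, beta):
    """Average per-token log probability, scaled by beta. No reference model."""
    return beta * sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=0.5):
    """-log(sigmoid(reward_chosen - reward_rejected - gamma))."""
    margin = (
        implicit_reward(chosen_logprobs, beta)
        - implicit_reward(rejected_logprobs, beta)
        - gamma
    )
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


chosen = [-0.2, -0.1, -0.3]   # toy per-token log probs of the preferred answer
rejected = [-1.0, -2.0]       # toy per-token log probs of the rejected answer

loss = simpo_loss(chosen, rejected)
swapped_loss = simpo_loss(rejected, chosen)
```

Because the reward is an average rather than a sum, a longer response is not favored or penalized just for its length, and only the policy's own log probabilities are needed.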

Top AI Papers of the Week (May 20 - May 26) - 2024

Paper Links
1) Extracting Interpretable Features from Claude 3 Sonnet - presents an effective method to extract millions of abstract features from an LLM that represent specific concepts; these concepts could represent people, places, programming abstractions, emotion, and more; reports that some of the discovered features are directly related to the safety aspects of the model; finds features directly related to security vulnerabilities and backdoors in code, bias, deception, sycophancy, dangerous/criminal content, and more; these features are also used to intuitively steer the model’s output. Paper, Tweet
2) Agent Planning with World Knowledge Model - introduces a parametric world knowledge model to facilitate agent planning; the agent model can self-synthesize knowledge from expert and sampled trajectories; this is used to train the world knowledge model; prior task knowledge is used to guide global planning and dynamic state knowledge is used to guide the local planning; demonstrates superior performance compared to various strong baselines when adopting open-source LLMs like Mistral-7B and Gemma-7B. Paper, Tweet
3) Risks and Opportunities of Open-Source Generative AI - analyzes the risks and opportunities of open-source generative AI models; argues that the overall benefits of open-source generative AI outweigh its risks. Paper, Tweet
4) Enhancing Answer Selection in LLMs - proposes a hierarchical reasoning aggregation framework for improving the reasoning capabilities of LLMs; the approach, called Aggregation of Reasoning (AoR), selects answers based on the evaluation of reasoning chains; AoR uses dynamic sampling to adjust the number of reasoning chains with respect to the task complexity; it uses results from the evaluation phase to determine whether to sample additional reasoning chains; a known flaw of majority voting is that it fails in scenarios where the correct answer is in the minority; AoR focuses on evaluating the reasoning chains to improve the selection of the final answer; AoR outperforms various prominent ensemble methods and can be used with various LLMs to improve performance on complex reasoning tasks. Paper, Tweet
5) How Far Are We From AGI - presents an opinion paper addressing important questions to understand the proximity to artificial general intelligence (AGI); it provides a summary of strategies necessary to achieve AGI which includes a detailed survey, discussion, and original perspectives. Paper
6) Efficient Inference of LLMs - proposes a layer-condensed KV cache to achieve efficient inference in LLMs; only computes and caches the key-values (KVs) of a small number of layers, which reduces memory consumption and improves inference throughput; can achieve up to 26x higher throughput than baseline transformers while maintaining satisfactory performance. Paper, Tweet
7) Guide for Evaluating LLMs - provides guidance and lessons for evaluating large language models; discusses challenges and best practices, along with the introduction of an open-source library for evaluating LLMs. Paper, Tweet
8) Scientific Applications of LLMs - presents INDUS, a comprehensive suite of LLMs for Earth science, biology, physics, planetary sciences, and more; includes an encoder model, embedding model, and small distilled models. Paper, Tweet
9) DeepSeek-Prover - introduces an approach to generate Lean 4 proof data from high-school and undergraduate-level mathematical competition problems; it uses the synthetic data, comprising 8 million formal statements and proofs, to fine-tune a DeepSeekMath 7B model; achieves whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test; this surpasses the baseline GPT-4 (23.0%) with 64 samples and a tree search RL method (41.0%). Paper, Tweet
10) Efficient Multimodal LLMs - provides a comprehensive and systematic survey of the current state of efficient multimodal large language models; discusses efficient structures and strategies, applications, limitations, and promising future directions. Paper, Tweet

Top AI Papers of the Week (May 13 - May 19) - 2024

Paper Links
1) GPT-4o - a new model with multimodal reasoning capabilities with real-time support across audio, vision, and text; it can accept as input any combination of text, audio, image, and video to generate combinations of text, audio, and image outputs; it’s reported to match GPT-4 Turbo performance while being much faster and 50% cheaper via APIs. Paper, Tweet
2) Gemini 1.5 Flash - a lightweight transformer decoder model with a 2M context window with multimodal capabilities; it is designed for efficiency and yields the fastest output generation of all models on several evaluated languages; overall, Gemini 1.5 Flash performs uniformly better compared to Gemini 1.0 Pro and even performs at a similar level to 1.0 Ultra on several benchmarks. Paper, Tweet
3) Veo - Google DeepMind’s most capable video generation model; it generates high-quality, 1080p resolution videos beyond 1 minute; it supports masked editing on videos and can also generate videos with an input image along with text; the model can extend video clips to 60 seconds and more while keeping consistency with its latent diffusion transformer. Paper, Tweet
4) Chameleon - a family of token-based mixed-modal models for generating images and text in any arbitrary sequence; reports state-of-the-art performance in image captioning and outperforms Llama 2 in text-only tasks and is also competitive with Mixtral 8x7B and Gemini-Pro; exceeds the performance of Gemini Pro and GPT-4V on a new long-form mixed-modal generation evaluation. Paper, Tweet
5) Fine-tuning and Hallucinations - studies the impact of fine-tuning on new knowledge on the hallucination tendencies of LLMs; the setup includes fine-tuning examples that include new knowledge; shows that LLMs struggle to acquire new factual knowledge via fine-tuning; also finds that as new knowledge is learned it increases the model’s tendency to hallucinate. Paper, Tweet
6) Zero-shot Tokenizer Transfer - trains a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings; it demonstrates generalization to new tokenizers both with encoder and decoder LLMs; reports that the method achieves performance close to the original models' performance in cross-lingual and coding tasks while reducing the length of the tokenized sequence. Paper, Tweet
7) WavCraft - leverages LLMs to connect task-specific models for audio content creation and editing; decomposes users' instructions into several tasks and tackles each task collaboratively with the particular module; it can enable users to interact and produce audio content without explicit commands. Paper
8) RLHF Workflow - provides an easily reproducible recipe for online iterative RLHF; discusses theoretical insights and algorithmic principles of online iterative RLHF and practical implementation. Paper, Tweet
9) You Only Cache Once - a decoder-decoder LLM architecture that only caches key-value pairs once; it involves a cross-decoder stacked upon a self-decoder, where the self-decoder efficiently encodes global key-value caches and the cross-decoder reuses the cache via cross-attention; this leads to a significant reduction in GPU memory use without sacrificing capabilities; achieves comparable performance to Transformer in various settings of scaling up model size and number of training tokens. Paper, Tweet
10) CAT3D - presents a method for creating anything in 3D by simulating the real-world capture process using a multi-view diffusion model; it can generate consistent novel views of a scene which can be used as input to 3D reconstruction techniques to produce 3D representations that can be rendered in real time; the scene from CAT3D can be generated in less than one minute and is reported to outperform existing methods on single image and few-view 3D scene creation tasks. Paper, Tweet

Top AI Papers of the Week (May 6 - May 12) - 2024

Paper Links
1) AlphaFold 3 - releases a new state-of-the-art model for accurately predicting the structure and interactions of molecules; it can generate the 3D structures of proteins, DNA, RNA, and smaller molecules; the model uses an improved version of the Evoformer module and then assembles its predictions using a diffusion network; the diffusion process starts with a cloud of atoms which converges to its final molecular structure. Paper, Tweet
2) xLSTM: Extended Long Short-Term Memory - attempts to scale LSTMs to billions of parameters using the latest techniques from modern LLMs and mitigating common limitations of LSTMs; to give LSTMs the ability to revise storage decisions, they introduce exponential gating and a new memory mixing mechanism (termed sLSTM); to enhance the storage capacities of LSTMs, they add a matrix memory and a covariance update rule (termed mLSTM); both the sLSTM and mLSTM cells stabilize their exponential gates using the same technique; these extensions lead to xLSTM blocks that are residually stacked into the final xLSTM architecture; compared to Transformers, xLSTMs have linear computation and constant memory complexity with respect to the sequence length; the xLSTM architecture is shown to be efficient at handling different aspects of long context problems; achieves better validation perplexities when compared to different model classes like Transformers, SSMs, and RNNs. Paper, Tweet
3) DeepSeek-V2 - a strong MoE model comprising 236B parameters, of which 21B are activated for each token; supports a context length of 128K tokens and uses Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache into a latent vector; DeepSeek-V2 and its chat versions achieve top-tier performance among open-source models. Paper, Tweet
4) AlphaMath Almost Zero - enhances LLMs with Monte Carlo Tree Search (MCTS) to improve mathematical reasoning capabilities; the MCTS framework extends the LLM to achieve a more effective balance between exploration and exploitation; for this work, the idea is to generate high-quality math reasoning data without professional human annotations; the assumption is that a well pre-trained LLM already possesses mathematical knowledge to generate reasoning steps but needs better stimulation such as an advanced prompting or search strategy; unlike other methods such as Program-of-thought and Chain-of-thought, no solutions are required for the training data, just the math questions and the answers; the integration of LLMs, a value model, and the MCTS framework enables an effective and autonomous process of generating high-quality math reasoning data; the value model also aids the policy model in searching for effective solution paths. Paper, Tweet
5) DrEureka: Language Model Guided Sim-To-Real Transfer - investigates using LLMs to automate and accelerate sim-to-real design; it requires the physics simulation for the target task and automatically constructs reward functions and domain randomization distributions to support real-world transfer; discovers sim-to-real configurations competitive with existing human-designed ones on quadruped locomotion and dexterous manipulation tasks. Paper, Tweet
6) Consistency LLMs - proposes efficient parallel decoders that reduce inference latency by decoding an n-token sequence per inference step; the inspiration for this work comes from the human ability to form complete sentences before articulating word by word; this process can be mimicked and learned through fine-tuning pre-trained LLMs to perform parallel decoding; it is trained to perform parallel decoding by mapping randomly initialized n-token sequences to the same result yielded by autoregressive (AR) decoding in as few steps as possible; a consistency loss helps with multiple-token prediction and a standard AR loss prevents deviation from the target LLM and ensures generation quality; shows 2.4x to 3.4x improvements in generation speed while preserving the generation quality. Paper, Tweet
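The parallel decoding loop that this entry describes (start from an arbitrary n-token guess, refine all positions at once until the result equals the autoregressive output) can be illustrated with a toy fixed-point iteration, often called Jacobi decoding. The `next_token` function below is a hypothetical stand-in for greedy LLM decoding, not a real model, and no fine-tuning is involved.

```python
def next_token(prefix):
    """Hypothetical stand-in for a greedy LLM: deterministic in the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def autoregressive_decode(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init):
    """Refine an n-token guess until it stops changing (a fixed point).

    Every position is recomputed from the previous guess, so a real model
    could evaluate all n positions in one forward pass per iteration.
    """
    guess = list(init)
    steps = 0
    while True:
        new = [next_token(list(prompt) + guess[:i]) for i in range(n)]
        steps += 1
        if new == guess:
            return guess, steps
        guess = new


prompt = [3, 14, 15]
n = 6
ar_tokens = autoregressive_decode(prompt, n)
jacobi_tokens, steps = jacobi_decode(prompt, n, init=[0] * n)
```

The fixed point is exactly the autoregressive output, and it is reached in at most n + 1 iterations. An untrained model typically needs close to that many, which is why the entry above stresses training the model to converge in as few steps as possible.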
7) Is Flash Attention Stable? - develops an approach to understanding the effects of numeric deviation and applies it to the widely-adopted Flash Attention optimization; finds that Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16. Paper, Tweet
8) Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond - presents an overview of generative methodologies in video generation, where world models facilitate the synthesis of highly realistic visual content; examines challenges and limitations of world models, and discusses their potential future directions. Paper, Tweet
9) MAmmoTH2 - harvests 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning; the approach first recalls relevant documents, extracts instruction-response pairs, and then refines the extracted pairs using open-source LLMs; MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K. Paper, Tweet
10) Granite Code Models - introduces Granite, a series of code models trained with code written in 116 programming languages; it consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from application modernization tasks to on-device memory-constrained use cases; demonstrates that the models reach state-of-the-art performance among available open-source code LLMs. Paper, Code, Tweet

Top AI Papers of the Week (April 29 - May 5) - 2024

Paper Links
1) Kolmogorov-Arnold Networks - proposes Kolmogorov-Arnold Networks (KANs) as alternatives to Multi-Layer Perceptrons (MLPs); KANs apply learnable activation functions on edges that represent the weights; with no linear weights used, KANs can outperform MLPs and possess faster neural scaling laws; the authors show that KANs can be used as collaborators to help scientists discover mathematics and physical laws. Paper, Tweet
2) Better and Faster LLMs via Multi-token Prediction - proposes a multi-token prediction approach that performs language modeling by training the model to predict the following n tokens using n independent output heads; the output heads operate on top of a shared transformer trunk; multi-token prediction is shown to be useful when using larger model sizes and can speed up inference up to 3x; the proposed 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Paper, Tweet
3) Med-Gemini - presents a family of multimodal models specialized in medicine and based on the strong multimodal and long-context reasoning capabilities of Gemini; achieves state-of-the-art performance on 10/14 benchmarks surpassing GPT-4 models; it achieves 91% accuracy on MedQA (USMLE) benchmark using an uncertainty-guided search strategy. Paper, Tweet
4) When to Retrieve? - presents an approach to train LLMs to effectively utilize information retrieval; it first proposes a training approach to teach an LLM to generate a special retrieval token when it's not confident or doesn't know the answer to a question; the fine-tuned model outperforms a base LLM in two fixed alternate settings that include never retrieving and always retrieving context. Paper, Tweet
5) A Survey on Retrieval-Augmented Language Models - covers the most important recent developments in RAG and RAU systems; it includes evolution, taxonomy, and an analysis of applications; there is also a section on how to enhance different components of these systems and how to properly evaluate them; it concludes with a section on limitations and future directions. Paper, Tweet
6) An Open-source LM Specialized in Evaluating Other LMs - open-sources Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments; they support both direct assessments and pair-wise ranking formats grouped with user-defined evaluation criteria; according to the experimental results, this open-source model seems to be the strongest among all open-evaluator LLMs; the key seems to be in merging evaluator LMs trained on either direct assessment or pairwise ranking formats. Paper, Tweet
7) Self-Play Preference Optimization - proposes a self-play-based method for aligning language models; this optimization procedure treats the problem as a constant-sum two-player game to identify the Nash equilibrium policy; it addresses the shortcomings of DPO and IPO and effectively increases the log-likelihood of chosen responses and decreases that of rejected ones; SPPO outperforms DPO and IPO on MT-Bench and the Open LLM Leaderboard. Paper, Tweet
8) Inner Workings of Transformer Language Models - presents a technical introduction to current techniques used to interpret the inner workings of Transformer-based language models; it provides a detailed overview of the internal mechanisms implemented in these models. Paper, Tweet
9) Multimodal LLM Hallucinations - provides an overview of the recent advances in identifying, evaluating, and mitigating hallucination in multimodal LLMs; it also provides an overview of causes, evaluation benchmarks, metrics, and other strategies to deal with challenges related to detecting hallucinations. Paper, Tweet
10) In-Context Learning with Long-Context Models - studies the behavior of in-context learning in LLMs at extreme context lengths with long-context models; shows that performance increases as hundreds or thousands of demonstrations are used; demonstrates that long-context ICL is less sensitive to random input shuffling than short-context ICL; concludes that the effectiveness of long-context LLMs is due not to task learning but to attending to similar examples. Paper, Tweet

Top AI Papers of the Week (April 22 - April 28) - 2024

Paper Links
1) Phi-3 - Microsoft's Phi-3 is a family of small language models (3.8B, 7B, 14B) trained on 3.3-4.8T tokens of heavily filtered web data combined with synthetic data. The flagship phi-3-mini rivals Mixtral 8x7B and GPT-3.5 while being small enough to run locally on a phone.
● Data quality over scale: Training prioritizes curated web and synthetic data over raw token volume, suggesting that data quality is the main lever for small-model capability rather than sheer parameter count.
● Benchmark results: phi-3-mini reaches 69% on MMLU and 8.38 on MT-bench; phi-3-small hits 75% on MMLU and phi-3-medium hits 78%, closing much of the gap with models an order of magnitude larger.
● Long-context variant: A phi-3-mini-128K version extends the default 4K window to 128K tokens while preserving the base model's quality.
● On-device deployment: The 3.8B mini model can be quantized and deployed on an iPhone 14, making it one of the first genuinely usable LLMs at that form factor.
Paper, Tweet
2) OpenELM - Apple's OpenELM is a fully-open small language model family (270M, 450M, 1.1B, 3B) that uses layer-wise parameter scaling instead of uniform layer widths. At ~1B parameters it improves on OLMo by 2.36% accuracy while using half the pre-training tokens.
● Layer-wise scaling: Rather than allocating parameters uniformly across transformer layers, OpenELM adjusts the width of each layer to place capacity where it matters most for downstream performance.
● Efficient training: The ~1B variant achieves 2.36% higher accuracy than OLMo while requiring 2x fewer pre-training tokens, showing that architectural choices can match or beat raw data scale for small models.
● Complete open release: Apple ships not just weights but training logs, intermediate checkpoints, pre-training configs, and MLX inference code - a far more complete release than most commercial labs offer.
● Reproducibility focus: The release explicitly targets open research, giving the community the artifacts needed to re-trace and audit every step of the training pipeline.
Paper, Tweet
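As a loose illustration of non-uniform layer widths, the sketch below interpolates a width multiplier linearly from the first layer to the last. The linear schedule, the function name, and the multiplier values are assumptions made for this example; they are not taken from the OpenELM release.

```python
def layerwise_widths(num_layers, d_model, min_mult, max_mult):
    """Feed-forward width per layer, growing linearly with depth.

    min_mult and max_mult are hypothetical multipliers applied to d_model
    at the first and last layer respectively.
    """
    widths = []
    for i in range(num_layers):
        t = i / (num_layers - 1) if num_layers > 1 else 0.0
        mult = min_mult + (max_mult - min_mult) * t
        widths.append(int(round(mult * d_model)))
    return widths


uniform = [4 * 512] * 8                      # same width in every layer
scaled = layerwise_widths(8, 512, 0.5, 4.0)  # narrow early, wide late
print(scaled, sum(scaled), sum(uniform))
```

In this toy setup the scaled configuration uses fewer total feed-forward units than the uniform one while keeping the widest layers at full size, which is the kind of reallocation the bullet on layer-wise scaling refers to.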
3) Arctic - Snowflake's Arctic is an Apache 2.0 open LLM with a Dense-MoE Hybrid transformer (480B total / 17B active) that matches Llama 3 70B on enterprise metrics while using under 3K GPU weeks (~$2M) of training compute - roughly 17x less than Llama 3 70B.
● Dense-MoE Hybrid architecture: A 10B dense backbone is paired with 128 fine-grained 3.66B experts selected via top-2 gating, balancing model capacity with communication efficiency across GPUs.
● Enterprise-first curriculum: A three-stage data curriculum emphasizes SQL, coding, and instruction-following over broad world knowledge, yielding strong results on Spider (SQL), HumanEval+/MBPP+ (code), and IFEval.
● Compute efficiency: Training ran in under 3K GPU weeks for roughly $2M, enabled by architecture-system co-design that overlaps communication with computation to hide MoE routing latency.
● Fully open release: Apache 2.0 weights, training code, serving code, fine-tuning pipelines, and a detailed cookbook, available across HuggingFace, NVIDIA API, AWS, and Azure.
Paper, Tweet
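The parameter counts quoted for Arctic can be checked with simple arithmetic from the figures above (10B dense backbone, 128 experts of 3.66B each, top-2 gating). The small routing helper is a generic top-k selection written for illustration and is not Snowflake's router.

```python
dense_b = 10.0      # dense backbone, billions of parameters
expert_b = 3.66     # one expert, billions of parameters
num_experts = 128
top_k = 2

total_b = dense_b + num_experts * expert_b   # every expert is stored
active_b = dense_b + top_k * expert_b        # only top_k experts run per token


def top_k_experts(router_scores, k=2):
    """Indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_scores)), key=lambda e: router_scores[e], reverse=True)
    return order[:k]


print(f"total ~ {total_b:.2f}B, active ~ {active_b:.2f}B")
print(top_k_experts([0.1, 0.7, 0.05, 0.15]))
```

The totals come out to about 478B stored and about 17.3B active per token, consistent with the rounded 480B / 17B figures in the summary.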
4) Make Your LLM Fully Utilize the Context (FILM-7B) - FILM-7B targets the lost-in-the-middle problem where long-context LLMs fail to retrieve information buried between the start and end of their input. The authors apply an information-intensive (IN2) training recipe to Mistral-7B that forces uniform attention across the full 32K window.
● IN2 synthetic data: Training examples are constructed so the answer requires either fine-grained awareness of a ~128-token segment placed anywhere in a 4K-32K context, or integration of information from two or more such segments.
● Position robustness: After IN2 training, FILM-7B retrieves reliably from arbitrary positions across document, code, and structured-data contexts, with forward, backward, and bidirectional retrieval patterns all supported.
● Real-world gains with minimal regression: NarrativeQA F1 jumps from 23.5 to 26.9 and short-context capability is effectively preserved (MMLU drops only 59.3 to 59.2), showing the fix is targeted rather than disruptive.
● Generalizable recipe: IN2 training is positioned as a cheap, bolt-on remedy that can be applied to existing long-context models to correct position-dependent attention failures systematically.
Paper, Tweet
5) FineWeb - HuggingFace's FineWeb is a 15 trillion token English web dataset built from 96 CommonCrawl snapshots (2013-2024). In 1.8B-parameter ablations, models trained on FineWeb beat C4, RefinedWeb, Dolma, The Pile, SlimPajama, and RedPajama2 across aggregated benchmarks.
● Scale: 15T tokens spanning 52.5B documents (~50TB on disk), with released sample subsets at 10B, 100B, and 350B tokens for smaller experiments.
● Filtering pipeline: Built on the open-source datatrove library - URL blocklists, Trafilatura text extraction, FastText English filter (>0.65), Gopher + C4 quality filters, custom FineWeb heuristics, MinHash deduplication, and PII anonymization.
● Per-dump deduplication insight: Ablations show per-dump MinHash dedup beats global dedup, a non-obvious finding that shaped the final pipeline and contradicts a common assumption that more aggressive dedup is always better.
● Fully reproducible: Released under ODC-By 1.0 with the complete datatrove pipeline and all ablation model checkpoints public, making it one of the most transparent large-scale web corpora to date.
Paper, Tweet
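For readers unfamiliar with MinHash deduplication, here is a minimal standard-library sketch of the idea: documents are broken into word n-grams, each document gets a short signature, and the fraction of matching signature slots estimates Jaccard similarity. The shingle size, number of hashes, and function names are illustrative choices, not the settings used for FineWeb or the datatrove API.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, shingle):
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
doc_c = "large language models are trained on filtered and deduplicated web text"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
sim_ab = estimated_jaccard(sig_a, sig_b)
sim_ac = estimated_jaccard(sig_a, sig_c)
```

Near-duplicates such as `doc_a` and `doc_b` score high while unrelated documents score near zero. A pipeline would drop documents whose similarity to an already-kept document exceeds a threshold; applying this within each crawl snapshot rather than across all of them corresponds to the per-dump variant described above.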
6) AI-powered Gene Editors - Profluent's OpenCRISPR-1 paper demonstrates that a large protein language model trained on biological diversity at scale can design programmable gene editors from scratch. The AI-designed editors successfully perform precision editing in the human genome. Paper, Tweet
7) AutoCrawler - AutoCrawler is a two-stage framework that combines LLMs with the hierarchical structure of HTML to auto-generate reusable web scrapers. Wrapper-based scrapers break on new sites and pure LLM agents don't reuse well across pages; AutoCrawler addresses both limitations.
● Hierarchical DOM understanding: The agent walks the HTML tree with top-down and step-back operations, progressively refining its understanding of a page before emitting a complete and executable scraper.
● Similarity across pages: Patterns learned on one page of a site generalize to structurally similar pages, making the generated scraper durable rather than a one-shot extraction.
● New executability metric: The paper introduces an executability metric for evaluating scraper-generation systems, filling a gap in prior benchmarks that focused only on extraction accuracy.
● EMNLP 2024: Experiments across multiple LLM backends validate the framework on diverse websites; the work was accepted to EMNLP 2024.
Paper, Tweet
8) Graph Machine Learning in the Era of LLMs - This survey maps the intersection of Graph ML and LLMs, covering both how LLMs enhance graph learning and how graphs (especially knowledge graphs) strengthen LLMs. The authors organize the literature into a taxonomy and highlight where open problems remain.
● Dual-direction coverage: Two complementary threads are surveyed - LLMs augmenting GNNs (feature quality, OOD generalization, few-shot learning) and graphs augmenting LLMs (knowledge grounding in pre-training and inference).
● Taxonomy of methods: Existing work is categorized by how LLMs interact with graphs - as feature extractors, as predictors, or as integral components of graph pipelines.
● Core problem domains: The paper explicitly covers graph heterogeneity, out-of-distribution generalization, explainability, and hallucination mitigation as key challenges at the intersection.
● Open directions: Identifies gaps in practical applications, reliable factual grounding from knowledge graphs, and broader empirical evaluation of graph-language approaches.
Paper, Tweet
9) Self-Evolution of LLMs - This survey organizes the emerging literature on self-evolving LLMs - models that improve through their own generated experience rather than additional human supervision. The authors propose a unified four-phase cycle and taxonomize existing methods across both standalone models and agent systems.
● Four-phase cycle: Self-evolution is framed as iterative cycles of experience acquisition, experience refinement, model updating, and evaluation, mirroring how humans learn from practice.
● Two application domains: The taxonomy separates standalone-LLM self-evolution (e.g., self-instruct, self-reward) from LLM-agent self-evolution (e.g., tool-use refinement, long-horizon planning).
● Motivation: Reduce dependence on costly human annotation and break through plateaus as task complexity grows, with self-evolution positioned as a possible path toward more autonomous capability growth.
● Open directions: The paper closes with concrete gaps - stable update dynamics, evaluation of self-evolved capabilities, and safety considerations as self-improvement loops tighten.
Paper, Tweet
10) Naturalized Execution Tuning (NExT) - NExT teaches LLMs to reason about program runtime behavior by generating synthetic chain-of-thought rationales over execution traces. The approach bootstraps training data through self-training rather than manual annotation, and the learned reasoning transfers to scenarios where traces are unavailable at inference.
● Execution-aware rationales: The model inspects variable states at each step of a program's execution and produces natural-language rationales explaining what is happening, then fine-tunes on those rationales.
● Self-training bootstrap: No human-annotated rationales are needed - rationales for correct repairs are kept and used to teach the model, scaling naturally across many programs.
● Large repair gains: On PaLM 2, NExT improves the program-repair fix rate by 26.1 percentage points on the MBPP-based benchmark and 14.3 points on the HumanEval-based benchmark (both absolute), with both automated metrics and human evaluators rating the rationales as higher quality.
● Trace-free generalization: At inference the model applies the same reasoning patterns without live execution traces, showing it learned transferable execution-aware reasoning rather than trace-copying.
Paper, Tweet
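The execution-aware rationales in the NExT entry above depend on seeing variable states line by line. A minimal sketch of collecting such a trace with Python's standard `sys.settrace` hook; the helper `trace_locals` and the toy `running_sum` function are hypothetical illustrations, not the paper's tooling or trace format:

```python
import sys

def trace_locals(func, *args):
    """Run func(*args) and record its local variables before each executed line."""
    code = func.__code__
    states = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace calls into other functions
        if event == "line":
            # Line offset relative to the `def` line, plus a snapshot of the locals.
            offset = frame.f_lineno - code.co_firstlineno
            states.append((offset, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always switch tracing off again
    return result, states

def running_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

result, states = trace_locals(running_sum, [1, 2, 3])
```

Each entry in `states` pairs a line offset with the variable values seen just before that line runs, which is the kind of raw material a rationale about runtime behavior could be written over.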

Top AI Papers of the Week (April 15 - April 21) - 2024

Paper Links
1) Llama 3 - Meta's Llama 3 launches with 8B and 70B pretrained and instruction-tuned variants. Llama 3 8B beats Gemma 7B and Mistral 7B Instruct, and Llama 3 70B is competitive with Gemini Pro 1.5 and Claude 3 Sonnet on standard benchmarks.
● Sizes and release: Meta ships 8B and 70B base and Instruct variants first; larger 400B+ models are still training and planned for later releases with multimodality and longer context.
● Training data: Pretrained on 15T+ tokens (7x more than Llama 2) including 4x more code, with 5%+ non-English data across 30+ languages. A new tokenizer with a 128K-token vocabulary improves encoding efficiency by roughly 15%.
● Benchmark results: The 70B Instruct model wins human preference rankings against Claude Sonnet, Mistral Medium, and GPT-3.5 across 12 use-case categories; the 8B sets a new state of the art for open models in its size class.
● Open deployment: Weights are available across AWS, HuggingFace, Databricks, Google Cloud, Azure, and more, alongside a safety suite - Llama Guard 2, Code Shield, and CyberSec Eval 2 - for responsible deployment.
Paper, Tweet
2) Mixtral 8x22B - Mistral's Mixtral 8x22B is a sparse Mixture-of-Experts model with 141B total / 39B active parameters and a 64K context window, released under Apache 2.0. It leads open models on MMLU and posts strong math, code, and multilingual numbers.
● Sparse MoE setup: 8 experts with 2 active per token yield 39B active parameters out of 141B total, giving Llama-2-70B-class quality at roughly half the inference cost.
● Multilingual and tooling: Fluent across English, French, Italian, German, and Spanish, with native function calling built in - a practical unlock for agent and tool-use pipelines.
● Benchmark results: Best-in-class among open models on MMLU; 90.8% on GSM8K (maj@8), 44.6% on MATH (maj@4), and state-of-the-art HumanEval pass@1 among contemporary open-weight models.
● License and openness: Apache 2.0 with weights freely available, explicitly framed by Mistral as promoting innovation and unrestricted deployment.
Paper, Tweet
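The top-2 routing in the Mixtral entry above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the gating vectors, the eight scaling "experts", and the function name `top2_moe` are hypothetical, and real implementations batch this over tokens and layers.

```python
import math

def top2_moe(x, gate_w, experts):
    """Route one token vector x to its two highest-scoring experts."""
    # gate_w[e] is the gating vector of expert e; logits are dot products with x.
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in gate_w]
    top2 = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:2]
    # Softmax over the two selected logits only.
    m = max(logits[e] for e in top2)
    exps = [math.exp(logits[e] - m) for e in top2]
    weights = [v / sum(exps) for v in exps]
    # Only the two chosen experts are evaluated; the other six cost nothing.
    outs = [experts[e](x) for e in top2]
    mixed = [sum(w * o[i] for w, o in zip(weights, outs)) for i in range(len(x))]
    return mixed, top2, weights

# Toy setup: expert e multiplies its input by (e + 1).
experts = [lambda v, s=e + 1: [s * t for t in v] for e in range(8)]
gate_w = [[0.1 * e, 0.0] for e in range(8)]
x = [1.0, 2.0]
mixed, top2, weights = top2_moe(x, gate_w, experts)
```

Because only two of the eight expert functions run per token, the active parameter count stays far below the total, which is the source of the inference savings described above.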
3) Chinchilla Scaling: A replication attempt - This paper re-examines the third estimation procedure behind the Chinchilla scaling law of Hoffmann et al. (2022) and finds that it is inconsistent with that paper's first two methods, fails to fit the extracted data, and reports implausibly narrow confidence intervals.
● What was audited: Chinchilla proposed three independent methods to estimate the compute-optimal ratio of parameters to training tokens; this paper digs into the parametric-loss-fitting approach (method 3).
● Inconsistent estimates: The published estimates from method 3 do not match the predictions of methods 1 and 2, and the parametric fit does not actually pass through the reconstructed data points.
● Implausible confidence intervals: The reported intervals would statistically require over 600,000 training runs, whereas the authors likely ran fewer than 500 - suggesting methodological errors in the uncertainty quantification.
● Rederivation: A corrected fit using method 3 produces scaling estimates that are consistent with methods 1 and 2, restoring internal coherence and slightly revising the compute-optimal guidance.
Paper, Tweet
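Method 3 in the Chinchilla entry above fits a parametric loss of the form L(N, D) = E + A/N^alpha + B/D^beta and then minimises it under a compute budget. A minimal sketch of that minimisation, assuming the common approximation C = 6·N·D; the coefficient values below are placeholders chosen for illustration, not the fitted values from either paper:

```python
def parametric_loss(N, D, E, A, B, alpha, beta):
    """Loss as a function of parameter count N and training tokens D."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, A, B, alpha, beta):
    """Closed-form minimiser of the parametric loss subject to C = 6 * N * D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)   # exponent on compute for N
    b = alpha / (alpha + beta)  # exponent on compute for D
    N_opt = G * (C / 6.0) ** a
    D_opt = (C / 6.0) ** b / G
    return N_opt, D_opt

# Placeholder coefficients (assumptions for illustration only).
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28
C = 1e21
N_opt, D_opt = compute_optimal(C, A, B, alpha, beta)
best = parametric_loss(N_opt, D_opt, E, A, B, alpha, beta)
```

The exponents `a` and `b` depend only on `alpha` and `beta`, so small changes in the fitted exponents shift the recommended parameter-to-token ratio at every budget. That is why the quality of the fit matters so much for the headline guidance.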
4) How Faithful are RAG Models? (ClashEval) - ClashEval constructs a 1,200-question benchmark across six domains with intentionally corrupted retrieved documents to measure when RAG helps and when it misleads GPT-4 and other top LLMs.
● Controlled conflict setup: Questions span drug dosages, Olympic records, locations, and other verifiable facts. Retrieved documents are perturbed from subtle to obvious errors so the authors can study model behavior under realistic vs implausible corruption.
● RAG can override correct priors: LLMs abandon their correct internal knowledge over 60% of the time when the retrieved document is wrong but plausible - a strong demonstration that retrieval can hurt as much as help.
● Prior strength matters: The weaker a model's initial confidence (measured via token probabilities), the more it capitulates to the retrieved content. Models with strong, high-probability priors resist incorrect retrieval more effectively.
● Simple interventions help: The authors show that exploiting confidence signals - for example, gating acceptance of retrieved content on the model's prior probability - measurably improves accuracy under conflicting information.
Paper, Tweet
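The confidence-based intervention in the ClashEval entry above can be illustrated with a simple rule: when the answer without retrieval and the answer with retrieval disagree, keep whichever one the model generated with higher token probability. This is a sketch under assumptions; the function names, the mean-probability score, and the example log-probabilities are hypothetical, and the paper's exact procedure may differ.

```python
import math

def mean_token_prob(token_logprobs):
    """Average per-token probability of a generated answer."""
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def resolve_conflict(prior, with_context):
    """Each argument is (answer, token_logprobs); return the more confident answer."""
    prior_answer, prior_lp = prior
    ctx_answer, ctx_lp = with_context
    if prior_answer == ctx_answer:
        return prior_answer
    if mean_token_prob(prior_lp) > mean_token_prob(ctx_lp):
        return prior_answer
    return ctx_answer

# Hypothetical log-probabilities: a confident prior vs. a hesitant context-based answer.
picked = resolve_conflict(("30 mg", [-0.02, -0.05]), ("300 mg", [-0.9, -1.2]))
```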
5) A Survey on Retrieval-Augmented Text Generation for LLMs - This survey organizes the RAG literature into a four-stage framework (pre-retrieval, retrieval, post-retrieval, generation) and traces the paradigm's evolution alongside open challenges.
● Four-pillar framework: The survey breaks each RAG system into pre-retrieval (indexing, query formulation), retrieval (dense/sparse/hybrid), post-retrieval (reranking, compression), and generation (prompt assembly and synthesis).
● Evolution of the paradigm: Traces RAG from early dense-retrieval + reader pipelines to modern multi-hop, agentic, and graph-based variants, highlighting how each pillar has been refined.
● Evaluation methodology: Covers benchmarks and metrics for retrieval quality, faithfulness, and end-task performance, noting that evaluation remains a bottleneck for comparing systems fairly.
● Open directions: Identifies gaps in multimodal RAG, long-context RAG, real-time and streaming RAG, and safe integration with agent workflows.
Paper, Tweet
6) The Illusion of State in State-Space Models - This paper proves that modern state-space models (Mamba, S4, etc.) share the same expressive ceiling as transformers: they cannot compute anything outside the TC^0 complexity class, despite the RNN-like "state" vocabulary they borrow.
● Expressive-power result: SSMs with typical parameterizations are provably confined to TC^0, which means the "state" that accumulates through the recurrence cannot simulate general sequential computation.
● Tasks they cannot solve: Permutation composition, code evaluation with branches, and entity tracking across long narratives all require state tracking that lies beyond TC^0 (under the standard conjecture that TC^0 ≠ NC^1) and therefore cannot be learned reliably by these SSMs.
● Transformer parity: The result places SSMs on the same theoretical footing as transformers rather than above them, pushing back against the intuition that recurrence automatically grants richer state.
● Practical implication: If you need genuine state tracking (interpreters, stateful agents, long-horizon planning), architectural changes beyond current SSMs are required - a clean "state is an illusion" framing for the field.
Paper, Tweet
7) Reducing Hallucination in Structured Outputs via RAG - This paper deploys a compact RAG pipeline - small retriever plus small LM - for an enterprise workflow-generation task and shows it reduces hallucination while improving out-of-domain generalization vs a baseline LLM.
● Target setting: A production system that turns natural-language requirements into executable workflows, where hallucinated fields or missing steps break the downstream pipeline.
● Small retriever + small LM: Instead of a massive generator, the authors train a specialized retriever encoder and pair it with a much smaller LM, cutting compute and memory without losing output quality.
● Hallucination and generalization gains: The RAG-augmented small system reduces factual errors in the structured output and generalizes better to out-of-domain inputs than the baseline LM alone.
● Deployment implication: The setup shows that high-quality structured generation does not require frontier-sized LLMs - a disciplined retriever + small LM can be cheaper to run and easier to productionize.
Paper, Tweet
8) Emerging AI Agent Architectures - A short survey mapping the current landscape of LLM-based agent architectures, focused on reasoning, planning, and tool calling as the three capability pillars for complex agentic workflows.
● Capability pillars: Reasoning, planning, and tool/API execution are treated as the core primitives; most modern agent frameworks are characterized as different combinations and orchestrations of these three.
● Single- vs multi-agent patterns: The survey separates single-agent architectures (ReAct-style loops, tool-augmented chains) from multi-agent patterns (leader/follower, debate, specialized role teams) and contrasts their trade-offs.
● Phases and meta-design: Describes how planning, execution, and reflection phases combine inside agents, and how choices like leadership structure and communication style materially affect reliability.
● Honest assessment: The survey explicitly calls out present-day limitations - brittleness, evaluation difficulty, and the gap between demos and deployable systems - grounding future-direction discussions.
Paper, Tweet
9) LM In-Context Recall is Prompt Dependent - Using needle-in-a-haystack tests across multiple models, this paper shows that in-context recall is highly sensitive to prompt wording and that training data biases can silently degrade a model's ability to retrieve from its own context.
● Needle-in-a-haystack methodology: A factoid is embedded at various positions inside a long filler context, and recall is measured as prompt length and needle depth vary across several frontier LLMs.
● Prompt sensitivity: Small rewordings of the query can move recall accuracy dramatically, indicating that existing "context-window size" numbers overstate practical recall capability.
● Training-data interference: When the needle conflicts with content the model likely saw in pre-training, the model often returns its memorized answer instead of the in-context fact - a subtle but important failure mode.
● Paths to improve recall: The paper shows that larger size, stronger attention mechanisms, alternative training objectives, and targeted fine-tuning each independently improve recall under prompt variation.
Paper, Tweet
10) A Survey on State Space Models - A comprehensive survey of modern SSMs with a principles-first walkthrough, taxonomy of existing variants, and experimental comparison across NLP, vision, graph, multimodal, point-cloud, event-stream, and time-series tasks.
● Principles first: The survey introduces the core SSM recurrence, discretization, and parallel-scan tricks up front so readers can reason about why variants like S4, Mamba, and H3 differ.
● Broad variant coverage: Catalogs architectural and parameterization choices across major SSM families, highlighting which choices matter for which modalities.
● Cross-domain applications: Demonstrates how SSMs have been applied beyond language - vision backbones, graph learning, multimodal fusion, and long time-series modeling - with comparative results.
● Open challenges: Identifies theoretical limits (e.g. state capacity), scaling behavior, hybridization with attention, and hardware-efficient training as the main frontiers, alongside a live GitHub tracker.
Paper, Tweet

Top AI Papers of the Week (April 8 - April 14) - 2024

Paper Links
1) Leave No Context Behind (Infini-attention) - Google's Infini-attention extends Transformer LLMs to effectively infinite context with bounded memory and compute. It blends a compressive memory module with both masked local attention and linear long-term attention inside a single Transformer block.
● Unbounded context, bounded memory: The compressive memory stores historical key-value statistics in a fixed-size matrix, so token streaming does not grow memory over time and enables true long-form inference.
● Dual-attention block: Each Infini-attention layer combines local windowed attention for recent tokens with long-term linear attention over the compressive memory, giving the same block access to short- and long-range dependencies simultaneously.
● Empirical results: 1B and 8B Infini-Transformers outperform baseline long-context models on book summarization (500K tokens) and 1M-token passkey retrieval, while achieving a 114x memory compression ratio on long-context language modeling.
● Streaming implication: Because the memory footprint is bounded, Infini-Transformer is a clean fit for streaming inference, document-flow agents, and any setting where the input grows indefinitely.
Paper, Tweet
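The bounded-memory mechanism in the Infini-attention entry above can be sketched for a single head with NumPy. The sketch assumes the linear-attention memory update and ELU+1 feature map as commonly described for this method; the function names, shapes, and the fixed gate value are illustrative choices, not the authors' code.

```python
import numpy as np

def elu_plus_one(x):
    """ELU(x) + 1, a strictly positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def infini_attention_segment(Q, K, V, M, z, beta):
    """One head, one segment. Q, K: (n, d_k); V: (n, d_v); M: (d_k, d_v); z: (d_k,)."""
    n, d_k = Q.shape
    # Local causal dot-product attention within the segment.
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    A_dot = softmax(scores) @ V
    # Retrieve long-term context from the memory built over earlier segments.
    sQ = elu_plus_one(Q)
    A_mem = (sQ @ M) / (sQ @ z + 1e-6)[:, None]
    # A scalar gate (learned in a real model) mixes long-term and local context.
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_dot
    # Fold this segment's keys and values into the fixed-size memory.
    sK = elu_plus_one(K)
    return A, M + sK.T @ V, z + sK.sum(axis=0)

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
M, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for _ in range(3):  # stream three segments; the memory never grows
    Q, K, V = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
    A, M, z = infini_attention_segment(Q, K, V, M, z, beta=0.0)
```

However many segments are streamed, `M` stays `(d_k, d_v)` and `z` stays `(d_k,)`, which is what keeps memory bounded as the input grows.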
2) OpenEQA - Meta's OpenEQA is an open-vocabulary benchmark for embodied question answering: 1,600+ human-written questions across 180+ real-world environments, with a calibrated LLM-as-judge metric that tracks human agreement closely.
● Benchmark setup: Questions cover episodic-memory use cases (smart glasses) and active-exploration use cases (mobile robots), demanding that agents reason about the environment they occupy rather than just a single image.
● LLM-as-judge scoring: The paper introduces an automatic LLM-powered evaluation protocol with strong correlation to human judgment, solving the open-vocabulary scoring problem that blocks most EQA benchmarks from scaling.
● Frontier-model performance: GPT-4V, Claude 3, and Gemini Pro significantly outperform text-only baselines, but their gains come mostly from object recognition - for several question categories, they barely beat a blind-LLM baseline.
● Gap to humans: Across the board, state-of-the-art multimodal LLMs lag well behind human performance on OpenEQA, positioning the benchmark as a concrete target for embodied-agent research.
Paper, Tweet
3) CodeGemma - CodeGemma is a family of open code LLMs built on Gemma, released in 2B (pretrained), 7B (pretrained), and 7B-IT (instruction-tuned) variants. The 2B model is optimized for low-latency code completion, and the 7B-IT model leads its weight class on HumanEval.
● Three-variant family: 2B for fast on-device completion, 7B as a capable pretrained coder, and 7B-IT for chat-style code assistance, all derived from Gemma and released with open weights.
● Training recipe: Trained on 500B additional tokens of code, math, and synthetic data with a Fill-in-the-Middle objective (80% FIM rate, 50/50 PSM/SPM split), plus novel dependency-graph-based packing and unit-test-based lexical packing.
● Benchmark results: HumanEval pass@1 of 31.1% (2B), 44.5% (7B), and 56.1% (7B-IT). Single-line infilling reaches 78.4% for the 2B model, making it a strong low-latency IDE companion.
● Deployment focus: FIM tokens enable direct use in IDE auto-completion pipelines, and quantized builds are already available for llama.cpp, LM Studio, Jan, and Ollama for local deployment.
Paper, Tweet
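For the Fill-in-the-Middle objective in the CodeGemma entry above, a prompt in prefix-suffix-middle (PSM) order might be assembled as follows. The control-token strings are taken from the public CodeGemma model card as best recalled and should be treated as an assumption to verify against the tokenizer; `build_psm_prompt` is a hypothetical helper.

```python
# Assumed FIM control tokens; confirm against the tokenizer before relying on them.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def build_psm_prompt(prefix, suffix):
    """Prefix-Suffix-Middle order: the model is expected to generate the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prompt = build_psm_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
```

An IDE plugin would send `prompt` to the model and insert the generated text at the cursor, which sits between the prefix and the suffix.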
4) LM-Guided Chain-of-Thought - This paper offloads rationale generation to a small, trained LM while keeping a frozen large LM as the answer predictor. The small model is optimized with knowledge distillation and reinforcement learning so it produces rationales that steer the large model more effectively.
● Split responsibilities: A small (<1B) LM writes the chain-of-thought rationale; a frozen large (>10B) LM reads the rationale and produces the final answer. Only the small LM is trained, cutting cost sharply.
● Two-stage training: First, distill rationales from the large LM into the small one (knowledge distillation). Then fine-tune the small LM with RL using rationale-oriented and task-oriented reward signals.
● Multi-hop QA gains: Evaluated on HotpotQA and 2WikiMultiHopQA, LM-guided CoT outperforms standard prompting and vanilla CoT prompting on answer-prediction accuracy; self-consistency decoding compounds the gains.
● Cost-aware reasoning: The recipe is a pragmatic template for teams that cannot fine-tune frontier models - train a tiny rationale generator instead and leave the big model frozen as an API.
Paper, Tweet
5) Best Practices and Lessons on Synthetic Data - Google DeepMind's survey-style position paper on synthetic data for LLMs. It covers applications, quality-assurance principles, and the open challenges of factuality, fidelity, bias, and privacy.
● Why synthetic data now: Addresses the shortage of large, diverse, high-quality natural data and the privacy constraints around using real user content, positioning synthetic data as a complementary source rather than a replacement.
● Quality-assurance triad: Emphasizes factuality (claims are true), fidelity (distribution matches real use), and unbiasedness (no over/under-representation) as the three tests every synthetic pipeline should run.
● Responsible use: Discusses provenance tagging, contamination risks with eval sets, and how to avoid model-collapse feedback loops when synthetic data is used for pretraining.
● Open directions: Calls out evaluation of synthetic pipelines, hybrid natural+synthetic recipes, and privacy-preserving generation (e.g., differential privacy) as the main frontiers.
Paper, Tweet
6) Reasoning with Intermediate Revision and Search (THOUGHTSCULPT) - THOUGHTSCULPT is a graph-based reasoning framework that combines Monte Carlo Tree Search with an explicit revision action, letting an LLM iteratively rewrite earlier thoughts instead of only extending them.
● Revision as a first-class action: Unlike Tree-of-Thoughts, THOUGHTSCULPT allows each node to either extend or revise previous reasoning, producing an interwoven graph of thoughts rather than a pure tree.
● MCTS-driven search: Monte Carlo Tree Search navigates the solution space efficiently; evaluation is done with either domain-specific heuristics or an LLM evaluator, giving flexibility across tasks.
● Concrete gains: +30% on story outline "interestingness," +16% word success rate on mini-crosswords, and +10% concept coverage on constrained generation, all vs. competitive ToT-style baselines.
● Fit for open-ended work: Because revision is built in, THOUGHTSCULPT is especially strong on tasks like creative ideation, multi-step reasoning, and open-ended generation where first drafts are rarely optimal.
Paper, Tweet
7) Overview of Multilingual LLMs - A first-of-its-kind survey on multilingual LLMs, organized by multilingual alignment principles rather than model-family hierarchy. The authors propose a unified taxonomy and collect open resources to accelerate future research.
● Alignment-first organization: The survey groups methods by how they align languages internally (shared embeddings, cross-lingual pre-training, translation-based alignment, in-context alignment) rather than by model family.
● Unified taxonomy: Provides a single framework that covers pretraining-only multilingual models, adapter-based variants, and LLMs adapted post-hoc with translation or code-switching data.
● Emerging frontiers: Highlights low-resource languages, cross-lingual transfer for reasoning, and multilingual evaluation as the three frontiers where the community still lacks strong benchmarks.
● Open resources: The paper is accompanied by a curated list of papers, datasets, and leaderboards, lowering the barrier to entry for teams launching new multilingual LLM projects.
Paper, Tweet
8) The Physics of Language Models - This paper measures how many bits of factual knowledge a language model can store per parameter and finds a remarkably stable 2-bits-per-parameter ceiling, even after int8 quantization. A 7B model can therefore hold ~14B bits - more than the English Wikipedia and textbooks combined.
● Knowledge-tuple methodology: Rather than using loss or benchmarks, the authors measure storage capacity directly by training models on controlled (entity, relation, value) tuples and counting how many are retrievable.
● The 2-bits-per-parameter law: Across model sizes and architectures, trained LLMs store about 2 bits of knowledge per parameter - a ceiling that stays roughly constant under post-training int8 quantization.
● Architecture and data effects: GPT-2 with rotary embeddings matches Llama/Mistral capacity, and prepending domain names to training data significantly boosts how much the model retains, hinting at implicit domain prioritization.
● Implications for scaling: If the 2-bit ceiling is real, compute-efficient scaling should match parameter count to the amount of factual knowledge you want to store, and quantization can compress weights without losing knowledge.
Paper, Tweet
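The capacity figure quoted in the Physics of Language Models entry above follows from simple arithmetic. Taking the roughly 2-bits-per-parameter estimate at face value for a 7B-parameter model:

```latex
7 \times 10^{9}\ \text{parameters} \times 2\ \tfrac{\text{bits}}{\text{parameter}}
  = 1.4 \times 10^{10}\ \text{bits}
  = 1.75 \times 10^{9}\ \text{bytes}
  \approx 1.75\ \text{GB}
```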
9) Aligning LLMs to Quote from Pre-Training Data (Quote-Tuning) - Quote-Tuning aligns LLMs to quote verbatim from trusted pre-training sources, turning the attribution step from post-hoc fact-checking into a built-in model behavior.
● Membership inference at train time: A fast membership-inference function checks whether generated spans exist verbatim in a trusted corpus, producing a reward signal without any human annotation.
● Preference-based alignment: The authors build a synthetic preference dataset (quoted vs non-quoted outputs) and align the model with preference optimization, teaching it when to quote.
● Strong verbatim gains: Quote-Tuning achieves up to a 130% relative increase in verbatim quotes from high-quality documents while preserving response quality across tasks, domains, and model families.
● Verification advantage: Because quoted passages can be matched exactly to the source, downstream verification becomes trivial, helping regulated domains like medicine, law, and journalism.
Paper, Tweet
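The reward signal in the Quote-Tuning entry above rests on checking whether generated spans appear verbatim in a trusted corpus. A minimal sketch using an in-memory set of word n-grams as a stand-in for a fast membership-inference index; the function names, the 4-gram choice, and the toy corpus are assumptions, and a real system would use a far more compact index.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def quoted_fraction(text, corpus_index, n=4):
    """Fraction of tokens in `text` covered by an n-gram found verbatim in the corpus."""
    tokens = text.split()
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in corpus_index:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens) if tokens else 0.0

def preference_pair(resp_a, resp_b, corpus_index):
    """Order two sampled responses as (chosen, rejected) by how much they quote."""
    if quoted_fraction(resp_a, corpus_index) >= quoted_fraction(resp_b, corpus_index):
        return resp_a, resp_b
    return resp_b, resp_a

corpus_index = ngrams("the quick brown fox jumps over the lazy dog".split(), 4)
a = "he said the quick brown fox jumps away"
b = "a fast brown fox ran off"
chosen, rejected = preference_pair(a, b, corpus_index)
```

Pairs built this way need no human labels, which is what lets the preference dataset be generated synthetically.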
10) The Influence Between NLP and Other Fields - This EMNLP 2023 analysis quantifies NLP's cross-disciplinary engagement using a Citation Field Diversity Index across 23 academic fields. The headline: NLP has become dramatically more insular over four decades.
● CFDI drop: The Citation Field Diversity Index fell from 0.58 in 1980 to 0.31 in 2022 - an all-time low, indicating NLP increasingly cites itself rather than drawing from neighboring fields.
● CS dominance: Over 80% of NLP citations now go to computer science, with less than 8% to linguistics and less than 3% each to mathematics and psychology.
● Growing echo chamber: The paper quantifies an explicit rise in intra-field citation alongside a decline in multidisciplinary works, painting NLP as a community that has closed off from cognate disciplines.
● Call to action: The authors argue the field needs to actively rebuild engagement with linguistics, psychology, and the social sciences if it wants research foundations broader than "scale and transformers."
Paper, Tweet
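To make the Citation Field Diversity Index in the entry above concrete, here is a small sketch that assumes the index takes the Gini-Simpson form (one minus the sum of squared field shares); that form should be checked against the paper, and the citation counts below are invented for illustration.

```python
def cfdi(citations_by_field):
    """Gini-Simpson diversity of citations across fields (assumed form of the index)."""
    total = sum(citations_by_field.values())
    return 1.0 - sum((c / total) ** 2 for c in citations_by_field.values())

# Hypothetical citation counts for two papers.
concentrated = cfdi({"computer science": 80, "linguistics": 8, "mathematics": 3, "other": 9})
even = cfdi({"computer science": 25, "linguistics": 25, "mathematics": 25, "other": 25})
```

Under this form, a paper citing a single field scores 0 and the score rises toward 1 as citations spread evenly over more fields, so a falling index indicates growing concentration.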

Top AI Papers of the Week (April 1 - April 7) - 2024

Paper Links
1) Many-shot Jailbreaking - Anthropic shows that long-context windows enable a new attack where hundreds of fake user/assistant dialogues are packed into a single prompt, coaxing frontier LLMs to answer the final harmful question despite safety training.
● Attack mechanics: The prompt embeds faux dialogues in which a cooperating assistant answers harmful queries, followed by the target question; in-context learning generalizes the pattern and overrides RLHF safety.
● Power-law scaling: Attack success rises as a power law in the number of shots up to ~256, and is more effective on larger models that have stronger in-context learning.
● Model coverage: The technique works on Claude 2.0 and comparable frontier LLMs from other labs, indicating a structural rather than model-specific vulnerability.
● Mitigations: Context-window limits plus prompt classification / modification cut attack success from 61% to 2% in one setup, though the authors note variants may evade detection.
Paper, Tweet
2) SWE-Agent - Princeton's SWE-agent pairs a language model with a custom agent-computer interface (ACI) that exposes file navigation, editing, and test execution as discrete tools, letting the agent autonomously resolve real GitHub issues.
● Agent-computer interface: The ACI replaces raw shell access with carefully designed commands (scrolling viewers, structured editors, test runners) that make it easier for an LLM to plan multi-step code changes.
● SWE-bench results: SWE-agent resolves 12.29% of SWE-bench issues end-to-end on the full test set, matching Devin's reported accuracy while being fully open-source.
● HumanEvalFix: On HumanEvalFix the same agent reaches an 87.7% pass rate, showing the ACI generalizes beyond large-repo bug fixing to smaller self-contained tasks.
● Interface-over-model lesson: The gains come primarily from interface design rather than model changes, reinforcing that agent capability depends heavily on how tools are exposed to the LLM.
Paper, Tweet
3) Mixture-of-Depths - DeepMind proposes dynamically allocating transformer FLOPs across sequence positions via a top-k router, so "easy" tokens skip expensive blocks while "hard" tokens get full computation.
● Top-k routing: Each MoD layer picks a fixed-size subset of tokens to process, keeping the compute graph statically shaped while introducing conditional depth.
● Matching baselines at lower FLOPs: MoD models match standard transformers on training loss for equivalent FLOPs, and outperform them at matched step count by spending compute where it matters.
● Faster sampling: Inference gets up to 50% faster because many tokens bypass the heaviest layers without sacrificing downstream quality.
● Static-compute advantage: Because the total per-batch FLOPs are predictable, MoD is easier to deploy on accelerators than token-routed MoE, combining the efficiency benefits of sparsity with dense execution patterns.
Paper, Tweet
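The top-k routing in the Mixture-of-Depths entry above can be sketched in plain Python. A linear router scores each token, a fixed number of tokens go through the block, and the rest pass along the residual stream unchanged. The function `mod_layer`, the toy block, and the router weights are hypothetical.

```python
def mod_layer(tokens, router_w, block, capacity):
    """tokens: list of vectors; router_w: vector; block: fn(vector) -> update vector."""
    scores = [sum(w * t for w, t in zip(router_w, tok)) for tok in tokens]
    # Fixed-size top-k selection keeps the amount of computation static.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:capacity])
    out = []
    for i, tok in enumerate(tokens):
        if i in selected:
            update = block(tok)
            # Scaling by the router score puts the router on the gradient path
            # when this is done in a trainable model.
            out.append([t + scores[i] * u for t, u in zip(tok, update)])
        else:
            out.append(list(tok))  # skipped tokens are copied through unchanged
    return out, sorted(selected)

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
out, processed = mod_layer(
    tokens,
    router_w=[1.0, 0.0],
    block=lambda v: [1.0 for _ in v],  # toy block: constant update
    capacity=2,
)
```

Because `capacity` is fixed ahead of time, the block always sees the same number of tokens, which is the static-shape property highlighted above.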
4) Long-context LLMs Struggle with Long In-Context Learning - LongICLBench stress-tests 13 long-context LLMs on extreme-label classification with up to 174 classes and 50K-token prompts, exposing sharp quality cliffs beyond 20K tokens.
● Benchmark design: Tasks cover 28-174 label classes packed into 2K-50K token contexts, requiring models to integrate many demonstrations rather than retrieve a single fact.
● Short-to-mid context OK: Most models perform adequately under ~20K tokens, suggesting that advertised long-context windows can at least hold instructions and examples.
● Sharp degradation past 20K: On the hardest task (Discovery, 174 labels) every open model collapses, and all but GPT-4 dip dramatically as context grows beyond 20K tokens.
● Position bias: Models over-predict labels that appear later in the prompt, revealing that long-context ICL fails both on reasoning across examples and on treating positions symmetrically.
Paper, Tweet
5) Visualization-of-Thought - Microsoft's Visualization-of-Thought (VoT) prompts LLMs to emit intermediate "mental images" of their reasoning state, lifting spatial-reasoning accuracy on grid-world tasks and beating multimodal baselines that actually see images.
● Mental-imagery prompting: The model renders each reasoning step as an ASCII/grid-style visualization, which is then fed back in to constrain subsequent steps, echoing human mental imagery.
● Benchmarks: VoT is evaluated on natural-language navigation, visual navigation, and visual tiling in 2D grid worlds, targeting multi-hop spatial reasoning.
● Beats multimodal LLMs: Text-only LLMs with VoT outperform contemporary multimodal LLMs on the same tasks, showing explicit state visualization can substitute for actual image tokens.
● NeurIPS 2024: The method was accepted to NeurIPS 2024 and positions mental imagery as a general-purpose tool for strengthening reasoning in otherwise text-only models.
Paper, Tweet
6) The Unreasonable Ineffectiveness of the Deeper Layers - The paper shows that open-weight LLMs tolerate removing up to half of their transformer blocks with only minor degradation, provided a short QLoRA pass is used to heal the damage afterwards.
● Similarity-based pruning: A layer-similarity metric identifies contiguous blocks that can be removed without dramatically changing the hidden-state distribution at that depth.
● Heal with QLoRA: After pruning, a small fine-tune with QLoRA on a single 40GB GPU recovers most of the lost performance - the whole recipe is reproducible on modest hardware.
● Up to half the layers gone: Performance stays close to the original until a large fraction of layers (up to ~50%) has been removed, at which point accuracy drops sharply.
● Implications: Either current pre-training underutilizes deeper layers or shallow layers carry most of the useful knowledge - either reading calls into question how efficiently today's LLMs use their depth.
Paper, Tweet
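The similarity-based pruning step in the entry above can be illustrated by picking the block of `n` consecutive layers whose input and output representations are closest in angular distance. The sketch uses one representative hidden vector per layer, whereas a real run would average over many samples; the function names and the toy vectors are assumptions.

```python
import math

def angular_distance(u, v):
    """Angle between u and v, scaled to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding error
    return math.acos(cos) / math.pi

def best_block_to_prune(hidden_by_layer, n):
    """hidden_by_layer[l] is a representative hidden state entering layer l.
    Returns the start index l that minimises the distance between layers l and l + n."""
    candidates = range(len(hidden_by_layer) - n)
    return min(
        candidates,
        key=lambda l: angular_distance(hidden_by_layer[l], hidden_by_layer[l + n]),
    )

# Toy hidden states for a 5-position stack (hypothetical values).
hidden = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [1.0, 1.0]]
start = best_block_to_prune(hidden, n=2)
```

Layers `start` through `start + n - 1` would then be dropped, followed by the short healing fine-tune described above.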
7) JetMoE - MyShell's JetMoE-8B is an open MoE model trained for under $100K that matches or beats LLaMA2-7B, showing that competitive LLM training can be achieved on modest budgets with public data.
● MoA + MoE architecture: 24 blocks each combine a Mixture of Attention heads (MoA) and a Mixture of MLP Experts (MoE), with 8 experts per layer and top-2 activation giving 2.2B active parameters out of 8B total.
● Cheap training: Trained on 1.25T publicly available tokens using 96 H100s for about two weeks at under $100K of compute, an order of magnitude less than typical 7B training runs.
● Benchmark wins: Beats LLaMA2-7B on MMLU (49.2 vs 46.9) and GSM8K (27.8 vs 14.5), and is competitive with far larger open models on standard benchmarks.
● Democratization signal: Full weights and training recipe are released, reinforcing that efficient MoE + strong data pipelines let smaller labs ship frontier-quality open models.
Paper, Tweet
8) ReFT: Representation Finetuning for LMs - Stanford's ReFT freezes the base model and instead learns small interventions on hidden representations at selected layers, offering a more parameter-efficient alternative to LoRA-style PEFT.
● Representations as targets: Instead of updating weights, ReFT trains lightweight linear interventions that modify a rank-limited subspace of the hidden state at specified layers and positions.
● LoReFT variant: Low-rank LoReFT is the headline method and is drop-in compatible with the PEFT ecosystem, acting as a direct LoRA replacement.
● 15-65x fewer parameters than LoRA: Across commonsense reasoning, arithmetic, instruction tuning, and GLUE, LoReFT matches or beats LoRA while using 15-65x fewer trainable parameters.
● Interpretability-informed: The method is motivated by interpretability results showing that semantic content is encoded in compact subspaces of the hidden state, and it exploits that structure directly.
Paper, Tweet
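As a rough sketch of what a rank-limited intervention on a hidden state looks like, the snippet below assumes the LoReFT form h + Rᵀ(Wh + b - Rh), where R has orthonormal rows. The dimensions and random parameters are illustrative assumptions; in a real setup R, W, and b would be learned while the base model stays frozen.

```python
import numpy as np

def loreft_intervention(h, R, W, b):
    """Edit h only inside the r-dimensional subspace spanned by the rows of R.

    h: (d,) hidden state; R: (r, d) with orthonormal rows; W: (r, d); b: (r,).
    """
    return h + R.T @ (W @ h + b - R @ h)

rng = np.random.default_rng(0)
d, r = 16, 4                                   # hidden size and intervention rank (toy values)
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))   # Q has orthonormal columns
R = Q.T                                        # so R has orthonormal rows
W = rng.normal(size=(r, d))
b = rng.normal(size=r)
h = rng.normal(size=d)

h_new = loreft_intervention(h, R, W, b)

# Inside the subspace, the state is replaced by the learned target W h + b.
print(np.allclose(R @ h_new, W @ h + b))
# Outside the subspace, the state is left untouched.
residual = (h_new - h) - R.T @ (R @ (h_new - h))
print(np.allclose(residual, 0.0))
```

The trainable parameter count here is 2·r·d + r per intervention, which is why small ranks give such a small footprint.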
9) Advancing LLM Reasoning (Eurus) - OpenBMB's Eurus is a suite of reasoning-specialized LLMs (7B and 70B) fine-tuned on UltraInteract, a new alignment dataset built around preference trees for complex math, code, and logical tasks.
● UltraInteract data: Each instruction is paired with a tree of reasoning chains plus multi-turn interactions and pairwise preferences, giving the model structured examples of correct vs incorrect reasoning.
● SoTA open reasoning: Across 12 reasoning benchmarks Eurus-70B surpasses GPT-3.5 Turbo, reaching 33.3% on LeetCode and 32.6% on TheoremQA, and beats existing open-source baselines by 13.3%+ on average.
● Tailored reward objective: Standard DPO is shown to be suboptimal for reasoning, so the authors design a specialized reward modeling objective better suited to chain-of-thought style data.
● Takeaway: Demonstrates that task-specific alignment data design - not just model scale - is a key lever for pushing open models into frontier reasoning territory.
Paper, Tweet
10) Training LLMs over Neurally Compressed Text - The paper proposes Equal-Info Windows, a neural compression scheme that segments text into equal-bit-length blocks so an LLM can train directly on compressed bytes without losing learnability.
● Equal-Info Windows: Text is split into windows that each compress to the same number of bits, turning arithmetic-coded output into a stable sequence that a transformer can learn from.
● Why naive compression fails: Standard arithmetic coding produces sequences whose boundaries shift with context, making training unstable; Equal-Info Windows restores the positional regularity LLMs rely on.
● Beats byte-level, trails BPE: At scale the method outperforms byte-level baselines by a wide margin on perplexity and inference speed, but still trails traditional subword tokenizers at matched parameter counts.
● Shorter sequences, faster inference: Because each token encodes more raw text, autoregressive generation produces the same output in fewer steps, cutting latency meaningfully.
Paper, Tweet

Top AI Papers of the Week (March 26 - March 31) - 2024

Paper Links
1) DBRX - Databricks releases DBRX, a 132B-total / 36B-active open Mixture-of-Experts LLM that beats established open models on MMLU, HumanEval, and GSM8K while delivering 2x faster inference than LLaMA2-70B.
● Fine-grained MoE: 16 experts with top-4 selection per token gives 65x more expert combinations than typical MoE configurations, improving capacity without increasing active compute.
● Pretraining: Trained on 12T carefully curated text-and-code tokens with a 32K context window; the base model is shipped alongside DBRX Instruct.
● Benchmark wins: DBRX Instruct reaches 73.7% MMLU and 70.1% HumanEval vs 69.8% / 31.0% for LLaMA2-70B; it also edges out Mixtral Instruct on Open LLM Leaderboard composites (74.5% vs 72.7%) and rivals Grok-1 at 2.4x fewer parameters.
● Practical inference: Serves up to 150 tok/s/user on Databricks Model Serving, roughly 2x faster than LLaMA2-70B, and even outperforms CodeLLaMA-70B Instruct on code tasks despite being general-purpose.
Paper, Tweet
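The "65x more expert combinations" figure in the DBRX entry above follows from simple counting. The check below assumes that a "typical" MoE configuration means 8 experts with top-2 routing (an assumption for illustration), and compares it with 16 experts and top-4 selection.

```python
from math import comb

dbrx_combinations = comb(16, 4)      # ways to choose 4 of 16 experts
baseline_combinations = comb(8, 2)   # assumed baseline: 2 of 8 experts

ratio = dbrx_combinations // baseline_combinations
print(dbrx_combinations, baseline_combinations, ratio)  # 1820 28 65
```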
2) Grok-1.5 - xAI's Grok-1.5 is the successor to the open-weight Grok-1, emphasizing long-context understanding and substantially stronger math, code, and reasoning performance.
● Benchmarks: Reports 50.6% on the MATH benchmark, 90.0% on GSM8K, 74.1% on HumanEval, and 81.3% on MMLU - a large jump over Grok-1 and competitive with contemporary closed-source peers.
● 128K context window: Grok-1.5 can process up to 128K tokens, a 16x expansion over Grok-1, enabling longer documents and multi-turn sessions without external retrieval.
● Long-context retrieval: On in-house needle-in-a-haystack evaluations the model demonstrates strong recall across its full 128K window, not just near the boundaries.
● Availability: Rolled out first to early testers and existing Grok users on 𝕏, with broader release planned as the platform continues to iterate on reasoning and agentic features.
Paper, Tweet
3) SEEDS - Google's Scalable Ensemble Envelope Diffusion Sampler (SEEDS) uses diffusion models to generate very large, physically plausible weather-forecast ensembles conditioned on only one or two operational forecasts.
● Diffusion-based ensemble: SEEDS learns from historical reanalysis and forecast data, so each generated sample is a coherent "alternative atmosphere" consistent with numerical-weather-prediction physics.
● Few inputs, many outputs: A small number of operational NWP forecasts is enough to seed the sampler, which then produces hundreds of ensemble members at a fraction of the compute cost of running NWP that many times.
● Uncertainty quantification: The resulting ensembles better capture the tails of the forecast distribution, which is exactly where traditional small ensembles under-sample extreme events.
● Operational implications: The approach can complement existing NWP systems by cheaply densifying their ensembles, improving probabilistic forecasts for tropical cyclones, heatwaves, and other high-impact weather.
Paper, Tweet
4) LLMs on University-Level Physics Coding - A controlled study pits ChatGPT variants against University of Durham physics students on Python coding assignments, finding that humans still outperform even the strongest prompt-engineered GPT-4.
● Setup: 50 student submissions and matched GPT-3.5 / GPT-4 submissions were blind-scored by three independent markers, with prompt engineering applied as an explicit variable.
● Students win on average: Students averaged 91.9%, while GPT-4 with prompt engineering scored 81.1% (SE 0.8) - a statistically significant gap.
● Prompt engineering matters: Prompt engineering produced large, highly significant improvements for both GPT-3.5 (p ≈ 5×10⁻⁹) and GPT-4 (p ≈ 1.7×10⁻⁴).
● Detectable: Human markers correctly identified AI-authored submissions 85.3% of the time, suggesting that while LLM output is close to student quality, it remains stylistically distinguishable.
Paper, Tweet
5) Mini-Gemini - Mini-Gemini enhances vision-language models by adding a second high-resolution visual encoder that refines details without increasing the number of visual tokens consumed by the LLM.
● Dual-encoder design: A primary low-res encoder produces the visual tokens fed to the LLM while a high-res encoder provides patch-level features used to refine those tokens via attention.
● Token count stays constant: Because refinement happens in feature space rather than by adding tokens, inference cost stays comparable to the base VLM even as visual fidelity improves.
● Works across scales: The framework supports LLM backbones from 2B to 34B parameters and pairs image understanding with VLM-guided image generation in a single stack.
● Benchmark leadership: Achieves leading zero-shot scores across several vision-language benchmarks, with the authors reporting results competitive with or surpassing GPT-4 and Gemini on comparable tasks.
Paper, Tweet
6) Long-form factuality in LLMs - Google DeepMind introduces LongFact and SAFE, a prompt set and automated evaluator for judging whether the long-form answers of modern LLMs are actually factual.
● LongFact prompts: LongFact contains thousands of questions across 38 topics designed to elicit multi-paragraph, fact-dense responses rather than short answers.
● SAFE evaluator: SAFE decomposes each response into atomic claims, issues Google Search queries for each, and uses an LLM agent to judge whether each claim is supported.
● Superhuman agreement: SAFE matches crowdsourced human annotators on 72% of claims and, on disagreements, wins 76% of the time - while being roughly 20x cheaper per annotation.
● Bigger = more factual: Across 13 evaluated models, larger LLMs generally produce more factually accurate long-form answers, and the team reports F1@K (precision balanced against recall at K claims) as a new long-form factuality metric.
Paper, Tweet
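The F1@K metric from the long-form factuality entry above can be sketched as below. This is a minimal version, assuming precision is the share of supported claims in a response and recall is the number of supported claims divided by K, capped at 1; the function and argument names are hypothetical.

```python
def f1_at_k(num_supported, num_not_supported, k):
    """F1@K for one response: precision over its claims, recall against K supported claims."""
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# A response with 40 supported and 10 unsupported claims, scored at K = 64:
# precision = 0.8, recall = 0.625.
score = f1_at_k(40, 10, 64)
print(round(score, 3))
```

Choosing K sets how many supported facts a "complete" answer is expected to contain, so a short but accurate response is penalized on recall rather than on precision.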
7) Agent Lumos - Lumos is a unified recipe for training open-source LLM agents that separates high-level planning from low-level grounding so each module can be supervised and improved independently.
● Modular architecture: One module learns to decompose complex tasks into subgoals (planning), while a second module translates those subgoals into concrete tool calls and actions (grounding).
● Training data: The authors compile large-scale agent annotations by re-formatting reasoning rationales from math, web, and QA tasks into the Lumos planner/grounder format.
● Nine-dataset evaluation: Across complex QA, web tasks, and math tasks, Lumos beats larger open agents and even surpasses GPT-3.5/GPT-4-based agents on QA and web domains.
● Takeaway: Explicit modularization gives open-source agents a reproducible way to catch up to closed-source systems without needing frontier-scale base models.
Paper, Tweet
8) AIOS - AIOS treats the LLM as the "brain" of an operating-system kernel for agents, providing scheduling, memory, storage, tool, and access-control services so agent apps can share resources safely.
● Kernel-style isolation: LLM-specific services (scheduler, context manager, memory/storage managers, tool manager, access manager) are separated from application code, mirroring traditional OS architecture.
● Concurrent agents: The scheduler allows multiple agents to share the same LLM backend with context switching, making multi-agent workloads far more efficient than ad-hoc orchestration.
● SDK for developers: Agent apps talk to the kernel through a standardized SDK, so tool services, memory stores, and permissions are consistent across frameworks.
● Performance: Serving agents built with various agent frameworks on top of AIOS yields up to 2.1x faster execution compared to running them independently.
Paper, Tweet
9) FollowIR - FollowIR is both a benchmark and a training set for teaching retrieval models to follow real-world, instruction-style queries rather than just match keywords.
● Benchmark from TREC: Evaluation instances come from TREC shared tasks with hundreds to thousands of labeled documents per query, letting the authors test real instruction adherence.
● Professional annotator narratives: The training set repurposes assessor narratives (the plain-language instructions TREC annotators follow) as supervised signal for instruction following.
● FollowIR-7B: Fine-tuning a 7B Mistral-based retriever on the new training set yields over 13% improvement on the instruction-following benchmark.
● Implication: Retrieval quality is now bottlenecked by instruction understanding rather than basic lexical matching, and FollowIR gives the community a concrete dataset to close that gap.
Paper, Tweet
10) LLM2LLM - LLM2LLM is an iterative data augmentation scheme where a strong teacher LLM generates new training examples targeted at the specific mistakes a student model makes during fine-tuning.
● Error-targeted augmentation: The student trains on a seed dataset, its wrong answers are isolated, and the teacher LLM synthesizes new examples around those failure modes rather than expanding the data uniformly.
● Iterative loop: Each round the student is re-trained, re-evaluated, and re-augmented, amplifying signal on hard examples over successive cycles.
● Large gains on small data: Using Llama-2-7B, LLM2LLM improves performance by up to 52.6% on TREC and 24.2% on GSM8K over plain fine-tuning baselines in low-data regimes.
● Five-dataset evaluation: Strong results are shown on GSM8K, CaseHOLD, SNIPS, TREC, and SST-2, covering math reasoning, legal reasoning, intent classification, question classification, and sentiment classification respectively.
Paper, Tweet

Top AI Papers of the Week (March 18 - March 25) - 2024

Paper Links
1) Grok-1 - xAI open-sources Grok-1, a 314B-parameter Mixture-of-Experts base model, making it the largest openly released LLM at the time of publication.
● MoE at 314B: The architecture activates roughly a quarter of its weights per token (reported as ~86B active parameters), positioning Grok-1 between dense 70B models and proprietary trillion-parameter systems.
● Base model release: xAI releases the raw base model weights and network architecture under Apache 2.0 - no instruction tuning, no RLHF, intentionally leaving fine-tuning to the community.
● Training data cutoff: Pretraining uses a corpus with an October 2023 knowledge cutoff, so at the March 2024 release downstream fine-tunes inherit world knowledge that is already about five months old.
● Openness signal: The release significantly raises the bar for "open" large-model weights and pressures other labs, coming shortly after Elon Musk's public criticism of OpenAI's closed approach.
Paper, Tweet
2) Evolutionary Model Merge - Sakana AI proposes using evolutionary algorithms to automatically discover effective merges of open-source models, producing strong composite models without any additional training.
● Two search spaces: Evolution operates in parameter space (weight mixing coefficients) and in data-flow space (which layers of which models to route activations through), giving a richer merge vocabulary than hand-crafted recipes.
● Cross-domain transfer: The method produces a Japanese LLM with math reasoning by merging unrelated parents - showing that evolution can graft capabilities from one model onto another even across languages.
● New SoTA with less compute: The resulting Japanese Math LLM achieves state-of-the-art results on Japanese LLM benchmarks, beating models with many more parameters even though the merged model was never explicitly trained for these tasks.
● Collective intelligence: Positions evolutionary merge as an automated composition paradigm that leverages the open-source community's collective work instead of training every capability from scratch.
Paper, Tweet
3) TacticAI - Google DeepMind, in collaboration with Liverpool FC, releases TacticAI, a geometric deep-learning system that analyzes football corner kicks and suggests alternative tactics for coaches to explore.
● GNN-based setup: Players are modeled as nodes in a graph whose features encode positions, velocities, and roles; the network predicts receiver identity, shot probability, and related outcomes jointly.
● Tactic exploration: Coaches can sample alternative player setups for a given corner and rank them by predicted outcome, turning TacticAI into a what-if engine rather than a static predictor.
● Expert preference: In blind A/B tests with Liverpool FC experts, TacticAI's suggestions were preferred over existing professional tactics 90% of the time.
● Effective retrieval: A secondary retrieval system lets coaches find historically similar corner kicks to study, turning the large game-footage archive into a searchable tactical library.
Paper, Tweet
4) What Are Tools Anyway? A Survey of Tool Use in LLMs - This survey establishes a formal definition of tools as "external programs used by LMs" and systematizes when, why, and how tool-use improves LLM performance.
● Formal definition: Tools are any external program invoked by an LM to augment its capabilities, unifying function calling, retrieval, code execution, and web browsing under one framework.
● Taxonomy: Organizes tool-use scenarios and techniques so that prior work can be compared on common axes (trigger, invocation, integration, supervision).
● Cost-benefit analysis: Empirically measures compute and latency overhead of tool use versus accuracy gain across benchmarks, revealing clear tradeoffs rather than free lunches.
● Open directions: Identifies challenges in tool selection, generalization to unseen tools, and evaluation methodology as the main bottlenecks for the next wave of tool-augmented LLMs.
Paper, Tweet
5) RankPrompt: Step-by-Step Comparisons Make LLMs Better Reasoners - RankPrompt is a prompting method that lets an LLM self-rank its own candidate answers via chains of pairwise comparisons, without needing an external verifier or additional fine-tuning.
● Self-ranking via comparisons: Candidates are evaluated by having the LLM walk through systematic pairwise comparisons as in-context demonstrations, rather than scoring each answer independently.
● 11 benchmarks: Tested across 11 arithmetic and commonsense reasoning datasets, RankPrompt lifts ChatGPT and GPT-4 accuracy by up to 13%.
● Also works on open-ended tasks: On AlpacaEval-style open-ended evaluations, RankPrompt's ranking aligns with human judgments in 74% of cases, suggesting it generalizes beyond just closed-form reasoning.
● Implication: Shows that LLMs can function as effective self-evaluators when guided to produce comparison chains, unlocking cheaper alternatives to external reward models for test-time selection.
Paper, Tweet
6) LLM4Decompile - LLM4Decompile is the first open-source family of LLMs specialized for decompiling machine code back into readable, re-executable C source.
● Model range: Open-weight decompilation LLMs from 1.3B to 33B parameters, released with both End-to-end variants (binary → C) and Ref variants that refine outputs produced by the Ghidra decompiler.
● Training scale: Trained on 4B tokens of paired C source and assembly, giving the models ample coverage of realistic build-system and optimization patterns.
● Decompile-Eval benchmark: The authors introduce a benchmark that measures not just textual similarity but re-compilability and re-executability of the recovered C code - a more faithful proxy for usefulness.
● Strong gains: LLM4Decompile-Ref variants outperform GPT-4o and raw Ghidra by over 100% in re-executability rate, with an additional 16.2% improvement from the refinement stage.
Paper, Tweet
7) Agent-FLAN - Agent-FLAN redesigns fine-tuning data so that open models can learn agentic skills without sacrificing general capability, hitting new open-source SoTA for Llama2-7B-based agents.
● Format vs reasoning split: The training corpus separates "format following" (producing valid tool/JSON syntax) from "agent reasoning" so each skill can be learned at its own rate without drifting far from the pre-training distribution.
● Negative sampling: Carefully constructed negative examples teach the model when not to call a tool, directly targeting hallucinated or spurious actions.
● Llama2-7B gains: Agent-FLAN beats prior best open agent fine-tunes by 3.5% averaged across agent evaluation datasets, while preserving general LLM capability.
● Scales cleanly: The recipe's improvements hold up as the base model grows, making it a reusable template for future open-source agent fine-tunes.
Paper, Tweet
8) Logits of API-Protected LLMs Leak Proprietary Information - The paper shows that the softmax bottleneck in modern LLMs means even logit-level APIs leak enough information to reconstruct hidden architectural details.
● Softmax bottleneck exploit: Because the output distribution is a linear projection of a lower-dimensional embedding, clever API queries can expose the embedding rank and related structural parameters.
● Cheap recovery: With fewer than $1,000 in API calls, the authors estimate the embedding dimension of OpenAI's gpt-3.5-turbo at ~4,096 and recover related non-public quantities.
● Auditing use-case: The same technique can detect silent model updates, identify shared ancestry between different provider models, and discover hidden layer widths - making it a potential auditing tool as much as an attack.
● Mitigations: The paper outlines defenses providers can deploy (logit-bias restrictions, output calibration) while arguing that some transparency gains from this capability are net positive.
Paper, Tweet
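The softmax-bottleneck argument in the entry above can be illustrated on synthetic data. The sketch below assumes direct access to full logit vectors, which real APIs rarely grant (they typically expose log-probabilities or top-k outputs, which complicates the procedure); the sizes and names are toy assumptions, not the paper's setup.

```python
import numpy as np

def estimate_embedding_dim(logit_rows, rel_tol=1e-8):
    """Count singular values that are non-negligible relative to the largest one."""
    s = np.linalg.svd(np.asarray(logit_rows), compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
vocab_size, hidden_dim, num_queries = 500, 32, 200   # toy sizes (assumption)

W_out = rng.normal(size=(vocab_size, hidden_dim))    # hidden output projection
hidden = rng.normal(size=(num_queries, hidden_dim))  # one final hidden state per query
logits = hidden @ W_out.T                            # what the "API" returns

estimated = estimate_embedding_dim(logits)
print(estimated)  # 32
```

Because every logit vector is a linear image of a 32-dimensional hidden state, the stacked outputs span only 32 dimensions even though each row has 500 entries.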
9) DROID - DROID is an open-source robot manipulation dataset that dramatically expands the diversity of real-world robot demonstrations available for imitation-learning research.
● Scale and diversity: 76K demonstration trajectories (~350 hours of interaction) collected across 564 scenes and 84 manipulation tasks - substantially more varied than prior public datasets.
● Distributed collection: 50 data collectors across multiple international locations contributed over 12 months, using a standardized hardware setup that is released alongside the data.
● Stronger policies: Policies trained on DROID show higher success rates and improved out-of-distribution generalization compared to training on earlier, less diverse datasets.
● Full open release: Dataset, training code, and hardware reproduction guides are all public, giving the community a common substrate to benchmark and iterate on manipulation policies.
Paper, Tweet
10) RAFT: Retrieval-Augmented Fine-Tuning - RAFT is a fine-tuning recipe that teaches LLMs to handle distractor documents during RAG and to answer with CoT-style citations to retrieved passages.
● Distractor-aware training: Each training example mixes relevant documents with distractors, forcing the model to learn to ignore irrelevant retrieved content rather than averaging over it.
● CoT + citations: Responses are trained to walk through chain-of-thought reasoning while verbatim-quoting the supporting passages, improving both accuracy and transparency.
● Domain-specific RAG: Evaluated on PubMed, HotpotQA, and Gorilla (API-calling), RAFT consistently improves open-book in-domain QA over both plain fine-tuning and plain RAG baselines.
● Practical recipe: Positions RAFT as a post-training step that upgrades a pretrained LLM for production-grade RAG without needing a new architecture.
Paper, Tweet

Top AI Papers of the Week (March 11 - March 17) - 2024

Paper Links
1) SIMA - DeepMind's Scalable Instructable Multiworld Agent (SIMA) is a generalist AI agent that follows natural-language instructions across nine commercial 3D video games like No Man's Sky, Teardown, Valheim, and Space Engineers.
● 600-skill taxonomy: Evaluation covers 600 basic skills spanning navigation, object interaction, and menu use, giving a detailed picture of what generalist 3D agents can and can't do.
● Training data: Built by recording human players instructing each other in gameplay and fine-tuning a pretrained vision-language backbone on this instruction-paired footage.
● Cross-game generalization: Training across many games produces an agent that performs nearly as well on unseen games as on games it trained on - a strong generalization signal for 3D embodied agents.
● Language matters: Ablations show that textual instruction is the dominant performance driver; removing the language conditioning collapses skill execution, underscoring language as an alignment interface for embodied agents.
Paper, Tweet
2) Retrieval Augmented Thoughts (RAT) - RAT augments chain-of-thought by iteratively rewriting each reasoning step using retrieved context, sharply reducing hallucination on long-horizon generation tasks.
● Iterative thought revision: After producing a zero-shot CoT, the method walks step-by-step and rewrites each thought using retrieved info that depends on the query and the preceding thoughts.
● Zero-shot and model-agnostic: RAT works without task-specific training and improves GPT-3.5, GPT-4, and CodeLLaMA-7B alike, generalizing across backbone sizes.
● Large gains on long-horizon tasks: Average relative improvements of 13.6% (code), 17.0% (math), 19.2% (creative writing), and 42.8% (embodied planning) over zero-shot CoT and vanilla RAG baselines.
● Implication: Retrieval is most useful when woven into reasoning at each step rather than bolted on up-front, especially as tasks demand longer, multi-step outputs.
Paper, Tweet
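The iterative revision loop in the RAT entry above can be sketched as follows. This is only an outline under stated assumptions: `retrieve` and `revise` are hypothetical callables standing in for a retriever and an LLM call, and the query is formed by simply concatenating the task with the thoughts so far.

```python
def retrieval_augmented_thoughts(task, draft_thoughts, retrieve, revise):
    """Revise each draft reasoning step using context retrieved for it.

    retrieve(query) -> list of documents (hypothetical retriever)
    revise(task, previous_thoughts, thought, docs) -> revised thought (hypothetical LLM call)
    """
    revised = []
    for thought in draft_thoughts:
        # The query depends on the task, the already-revised steps, and the current step.
        query = " ".join([task, *revised, thought])
        docs = retrieve(query)
        revised.append(revise(task, revised, thought, docs))
    return revised

# Stub components so the sketch runs without any external service.
def toy_retrieve(query):
    return [f"document matching: {query}"]

def toy_revise(task, previous, thought, docs):
    return f"{thought} (checked against {len(docs)} doc)"

steps = retrieval_augmented_thoughts(
    "Write a function that parses dates",
    ["Split the string on '-'", "Convert parts to integers"],
    toy_retrieve,
    toy_revise,
)
print(steps)
```

Because later queries include the already-revised steps, a correction made early in the chain influences what is retrieved for the steps that follow.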
3) Quiet-STaR - Quiet-STaR generalizes the Self-Taught Reasoner (STaR) so that a language model learns to generate internal rationales between every token, not just for explicit QA problems.
● Rationales per token: At each token position the model emits a latent "thought" that helps predict the next token; learnable boundary tokens delimit start/end of each thought.
● Tokenwise parallel sampling: A tokenwise parallel sampling algorithm generates the thoughts for all token positions at once, keeping training tractable where generating them one position at a time would be prohibitively slow.
● REINFORCE objective: Thoughts that improve next-token predictions are rewarded; the authors also use extended teacher forcing to stabilize training.
● Zero-shot gains: On Mistral-7B, zero-shot GSM8K jumps from 5.9% to 10.9% and CommonsenseQA from 36.3% to 47.2% - improvements obtained purely from continued pretraining without any task-specific fine-tuning.
Paper, Tweet
4) Knowledge Conflicts for LLMs - A survey that maps the landscape of knowledge conflicts in LLMs, covering how they arise, how models behave under them, and how to mitigate them.
● Three conflict types: Context-memory (retrieved context disagrees with parametric knowledge), inter-context (multiple retrieved documents contradict each other), and intra-memory (the model's own parameters encode contradictions).
● Causes: The survey traces conflicts to data quality, retrieval noise, outdated parametric knowledge, and pretraining contradictions - giving a shared vocabulary for prior work.
● Model behavior: Catalogs how LLMs choose between conflicting sources and how that behavior shifts with model size, training, and prompting.
● Mitigation directions: Reviews calibration-based, training-based, and prompting-based interventions for reducing conflict-induced errors, pointing to open evaluation gaps.
Paper, Tweet
5) Stealing Part of a Production Language Model - The paper demonstrates the first practical attack that extracts the embedding-projection layer of production LLMs through their ordinary logit APIs.
● Attack setup: By querying the API with carefully chosen prompts and analyzing the logit outputs, the attacker can reconstruct the final projection matrix from public API access alone.
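● Concrete extractions: For under $20, the attack recovers the full projection matrices of Ada and Babbage, confirming hidden dimensions of 1024 and 2048 respectively; it also recovers GPT-3.5-turbo's exact hidden dimension, and the authors estimate that extracting its full projection matrix would cost under $2,000 in queries.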
● Cross-provider: Similar attacks apply to other production LLMs including PaLM-2, indicating the vulnerability is structural rather than tied to any one provider.
● Mitigations: The authors propose defenses such as restricting logit-bias API features, adding noise, and careful API design to close off the attack surface.
Paper, Tweet
6) Branch-Train-MiX (BTX) - Meta's BTX produces a single Mixture-of-Experts LLM by first training specialized experts in parallel and then mixing them, sidestepping the high cost of training one big generalist.
● Branch: Start from a seed LLM and branch multiple copies, each trained in embarrassingly parallel fashion on a different domain (e.g., code, math, Wikipedia).
● Mix: The experts' FFNs are combined into MoE layers of a single model while non-FFN parameters are averaged, producing a unified MoE LLM without additional joint pretraining.
● Light fine-tune: Final supervised fine-tuning learns routing and harmonizes the experts, yielding token-level specialization at inference time.
● Better accuracy-compute trade-off: BTX matches or beats training-one-generalist and unified-MoE baselines on multi-domain benchmarks at substantially lower total compute, generalizing both Branch-Train-Merge and sparse upcycling.
Paper, Tweet
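The "Mix" step in the BTX entry above can be sketched with plain arrays. This is a toy illustration under assumptions: parameters are stored in dictionaries, names starting with `ffn` mark feed-forward weights, and the router that BTX learns during the final fine-tune is omitted.

```python
import numpy as np

def btx_mix(expert_params):
    """Combine per-domain experts: stack FFN weights as MoE experts, average the rest.

    expert_params: list of dicts mapping parameter name -> array, one dict per expert.
    """
    merged = {}
    for name in expert_params[0]:
        tensors = [params[name] for params in expert_params]
        if name.startswith("ffn"):
            merged[name] = np.stack(tensors)         # (num_experts, ...) kept separate
        else:
            merged[name] = np.mean(tensors, axis=0)  # shared weights are averaged
    return merged

# Three toy "experts" (e.g., code, math, Wikipedia) with constant weights 0, 1, 2.
experts = [
    {"ffn.w": np.full((2, 3), float(i)), "attn.w": np.full((2, 2), float(i))}
    for i in range(3)
]
merged = btx_mix(experts)
print(merged["ffn.w"].shape)   # (3, 2, 3)
print(merged["attn.w"])        # every entry is 1.0, the mean of 0, 1, 2
```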
7) LLMs Predict Neuroscience Results (BrainBench) - BrainBench asks both LLMs and human experts to predict the outcomes of neuroscience experiments from their abstracts, and finds LLMs outperform experts.
● Benchmark design: Each item is a real neuroscience abstract with two alternative result endings (correct and plausibly incorrect); the task is to pick the correct one.
● LLMs beat experts: Across several frontier LLMs, mean accuracy is higher than the mean of human neuroscientists, flipping the usual expectation that experts dominate specialist benchmarks.
● BrainGPT: An LLM fine-tuned on neuroscience literature improves further, showing that targeted fine-tuning can extract more structure from a specialist corpus.
● Calibrated confidence: Like human experts, LLMs are more accurate when they express higher confidence, pointing toward productive human-AI collaboration in forward-looking science.
Paper, Tweet
8) C4AI Command-R - Cohere for AI releases Command-R, a 35B open-weight LLM tuned specifically for retrieval-augmented generation, tool use, and multilingual workflows.
● Long context + multilingual: Supports a 128K-token context window with primary coverage of 10 languages (including Arabic, Japanese, Korean, and Chinese) and partial coverage of 13 more.
● RAG-native: Trained for grounded generation with explicit citations, making it easier to deploy in production RAG systems without heavy prompt engineering.
● Tool use and agents: Ships with single-step ("function calling") and multi-step ("agents") tool-use modes, directly supporting sequential API calls and long-running workflows.
● Research release: Weights are available under CC-BY-NC, with full-precision and 4-bit / 8-bit quantized versions on Hugging Face, explicitly aimed at research use and reproducible benchmarking.
Paper, Tweet
9) Is Cosine-Similarity Really About Similarity? - This paper argues that cosine similarity between learned embeddings does not always measure semantic similarity, and gives analytical examples where it produces arbitrary or non-unique values.
● Analytical setting: Studies embeddings derived from regularized linear models, where closed-form expressions expose how cosine similarity depends on regularization choices.
● Arbitrary "similarities": Different but equally valid optimal solutions can yield wildly different cosine similarities between the same input pairs, undermining the metric's interpretability.
● Implicit regularization effects: Common deep-learning regularizers induce unintended effects on cosine similarity that practitioners often don't notice or control for.
● Recommendations: The authors urge caution in using cosine similarity as a universal semantic metric and suggest alternatives such as task-calibrated similarity functions or metric learning.
Paper, Tweet
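The non-uniqueness argument in the cosine-similarity entry above can be reproduced in a few lines. The sketch assumes a matrix-factorization model whose predictions depend only on the product of two embedding matrices; the sizes and the rescaling matrix `D` are arbitrary choices for illustration.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_norm @ X_norm.T

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))      # e.g., user embeddings
B = rng.normal(size=(6, 3))      # e.g., item embeddings
D = np.diag([10.0, 1.0, 0.1])    # arbitrary rescaling of the latent dimensions

A2 = A @ D
B2 = B @ np.linalg.inv(D)

# Both factorizations make identical predictions...
same_predictions = np.allclose(A @ B.T, A2 @ B2.T)
# ...but the item-item cosine similarities are different.
cos_changed = not np.allclose(cosine_matrix(B), cosine_matrix(B2))
print(same_predictions, cos_changed)  # True True
```

If the training objective cannot distinguish the two solutions, then nothing in the data pins down which set of cosine similarities is the "right" one.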
10) MM1: Multimodal LLM Pre-training - Apple's MM1 paper runs extensive ablations on multimodal LLM pretraining choices and introduces a family of models up to 30B parameters that reach state-of-the-art pretraining metrics.
● Data mixture is key: Image-caption, interleaved image-text, and text-only data each play distinct roles; the right mix is essential for few-shot and zero-shot performance.
● Image encoder dominates: Encoder choice and input resolution matter a lot, while the vision-language connector design turns out to be comparatively unimportant.
● 30B dense + MoE: The MM1 family includes both dense and mixture-of-experts variants up to 30B parameters, which reach SoTA results on pretraining metrics.
● Emergent capabilities: After pretraining, MM1 exhibits multi-image reasoning and supports few-shot chain-of-thought prompting - capabilities that earlier smaller MLLMs typically lacked.
Paper, Tweet

Top AI Papers of the Week (March 4 - March 10) - 2024

Paper Links
1) Claude 3 - Anthropic releases the Claude 3 family (Haiku, Sonnet, Opus), with Opus leapfrogging GPT-4 on many standard benchmarks and bringing frontier multimodal capability plus a much larger context window.
● Three tiers: Haiku for speed/cost, Sonnet as the balanced default, and Opus as the flagship; each covers analysis, forecasting, content creation, code, and multilingual translation (Spanish, Japanese, French, etc.).
● Benchmark leadership: Opus posts 86.8% on MMLU and 84.9% on HumanEval, edging past GPT-4 on reasoning and code benchmarks while also leading on MATH and GSM8K.
● Long context: All three models ship with a 200K-token context window, extensible to 1M tokens for select customers, targeting long-document and long-agent-trajectory use cases.
● Vision and refusals: Strong vision for photos, charts, and graphs; Anthropic reports more nuanced request handling with materially fewer unwarranted refusals compared to prior Claude generations.
Paper, Tweet
2) Robust Evaluation of Reasoning - The paper introduces functional benchmarks that parameterize reasoning problems so the same structural question can be re-instantiated with fresh surface forms, then uses them to expose a large "reasoning gap" in frontier LLMs.
● Functional benchmarks: MATH() is built as a functional variant of the MATH benchmark, turning each item into a template that can generate many semantically equivalent but textually different instances.
● Large reasoning gap: On functional variants, SoTA models drop between 58.35% and 80.31% relative to the static benchmark, suggesting much of headline performance is memorization or surface pattern matching.
● Prompting helps: More sophisticated prompting strategies shrink (but do not eliminate) the gap, implying that eliciting the right reasoning traces is partially possible without changing the model.
● Evaluation lesson: Static benchmarks overstate reasoning ability; functionalization is proposed as a cheap, generalizable way to stress-test new models.
Paper, Tweet
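The functionalization idea in entry 2 can be sketched as a template that regenerates the same structural question with fresh surface values. This is a minimal illustration, not the paper's code: the template, `make_instance`, `functional_accuracy`, and `toy_solver` are all hypothetical names chosen for the example.

```python
import random
import re

def make_instance(rng):
    """Instantiate one hypothetical template: total cost of n items at a unit price."""
    count = rng.randint(2, 20)
    price = rng.randint(1, 50)
    item = rng.choice(["pencils", "notebooks", "tickets"])
    question = (
        f"A shop sells {item} at {price} dollars each. "
        f"How much do {count} {item} cost?"
    )
    return question, count * price

def functional_accuracy(model, seeds):
    """Fraction of freshly generated instances that the model answers correctly."""
    correct = 0
    for seed in seeds:
        question, answer = make_instance(random.Random(seed))
        correct += int(model(question) == answer)
    return correct / len(seeds)

def toy_solver(question):
    """Stand-in for a model that actually computes the answer from the text."""
    price, count = (int(x) for x in re.findall(r"\d+", question))
    return price * count

solver_score = functional_accuracy(toy_solver, range(50))
```

A model that only memorized one static snapshot would score well on that snapshot and poorly on new seeds, which is the gap the functional variants are meant to expose.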
3) GaLore - GaLore (Gradient Low-Rank Projection) reduces optimizer-state memory during LLM training while still permitting full-parameter updates, unlike LoRA-style adapters that restrict learning to a low-rank subspace.
● Project gradients, not weights: Gradients are projected into a low-rank subspace for the optimizer state (momentum, variance), while weight updates themselves remain full-rank.
● 65.5% optimizer-state savings: Memory for optimizer state drops by up to 65.5% relative to Adam, with comparable downstream quality on pretraining LLaMA 1B and 7B architectures.
● Consumer-GPU pretraining: Enables pretraining a 7B LLM on a single 24GB consumer GPU - a regime previously reserved for specialized clusters - by combining GaLore with 8-bit optimizers.
● Plugs into existing stacks: GaLore is framework-agnostic and drops into standard training code without changes to model architecture or data pipelines.
Paper, Tweet
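To see where GaLore's optimizer-state savings come from, here is a back-of-the-envelope count for a single weight matrix. It assumes an Adam-style optimizer with two moment tensors and a rank-`rank` projection applied on the `m` side; the function names and the example shape are illustrative, and the result is a per-matrix figure rather than the end-to-end 65.5% reported above.

```python
def adam_state_entries(m, n):
    """Adam keeps two moment tensors, each shaped like the m x n weight."""
    return 2 * m * n

def galore_state_entries(m, n, rank):
    """One m x rank projection matrix plus two moments in the rank x n subspace."""
    return m * rank + 2 * rank * n

def per_matrix_savings(m, n, rank):
    """Fraction of Adam's optimizer-state entries avoided for one matrix."""
    return 1.0 - galore_state_entries(m, n, rank) / adam_state_entries(m, n)

# Illustrative shape only: a 4096 x 4096 layer projected to rank 128.
example = per_matrix_savings(4096, 4096, 128)
```

The weights themselves are not counted here because they stay full-rank in both cases; only the optimizer state shrinks.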
4) Can LLMs Reason and Plan? - Kambhampati's position paper argues that what looks like reasoning and planning in LLMs is better understood as "universal approximate retrieval" powered by web-scale training.
● Core claim: "Nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood" - LLMs interpolate over memorized patterns rather than search solution spaces.
● Planning evaluations: Cites benchmarks like PlanBench where LLM performance collapses under obfuscation or natural adversarial perturbations, arguing that true planning would be more robust.
● Self-critique skepticism: Questions claims that LLMs can reliably self-critique, noting that current systems hallucinate evaluations as readily as they hallucinate answers.
● LLM-Modulo framework: Proposes using LLMs as components ("idea generators") alongside sound external verifiers (planners, theorem provers) rather than trusting them as standalone reasoners.
Paper, Tweet
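The LLM-Modulo idea in entry 4 amounts to a generate-and-test loop in which a sound external verifier, not the LLM, decides when a plan is accepted. The sketch below is an assumption-laden toy: `llm_modulo`, `toy_propose`, and `toy_verify` are hypothetical, and `toy_propose` merely stands in for an LLM call.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]

def llm_modulo(
    propose: Callable[[List[str]], Plan],
    verify: Callable[[Plan], Tuple[bool, str]],
    max_rounds: int = 5,
) -> Optional[Plan]:
    """Ask the generator for plans until the verifier accepts one or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        plan = propose(feedback)
        ok, critique = verify(plan)
        if ok:
            return plan
        feedback.append(critique)  # the critique is fed back into the next proposal
    return None

def toy_propose(feedback: List[str]) -> Plan:
    """Stand-in for an LLM: wrong on the first try, corrected once it sees feedback."""
    if not feedback:
        return ["open door", "unlock door"]
    return ["unlock door", "open door"]

def toy_verify(plan: Plan) -> Tuple[bool, str]:
    """Sound checker for one hard constraint: unlock must come before open."""
    if plan.index("unlock door") < plan.index("open door"):
        return True, ""
    return False, "door must be unlocked before it is opened"

accepted = llm_modulo(toy_propose, toy_verify)
```

Because acceptance depends only on the verifier, a generator that never fixes its plan simply yields `None` instead of an unverified answer.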
5) RAG for AI-Generated Content - A survey that extends RAG beyond text, showing how retrieval augmentation is being applied across code, image, audio, video, and 3D generation.
● Cross-modal taxonomy: Organizes RAG systems by how the retriever augments the generator (query, context, knowledge injection) across modalities, unifying methods that previously appeared disconnected.
● Problem coverage: Maps which AIGC pain points each RAG pattern addresses - knowledge updates, long-tail data, leakage mitigation, and compute cost.
● Enhancement catalog: Catalogues concrete techniques (re-rankers, multi-hop retrieval, modality-specific retrievers) and the tasks where each pays off.
● Resources: Ships with benchmarks and a GitHub repo of 353 referenced papers, giving practitioners a structured entry point into this fast-moving area.
Paper, Tweet
6) KnowAgent - KnowAgent improves LLM-based planning agents by explicitly injecting action knowledge - what the actions are and how they relate - rather than letting the LLM invent its own action space at runtime.
● Action Knowledge Base: A structured description of valid actions, preconditions, and transitions that constrains the plans the agent is allowed to generate.
● Knowledgeable self-learning: The agent iteratively plans, executes, critiques, and refines actions against the knowledge base, continuously improving while avoiding invalid steps.
● Hallucination mitigation: By grounding planning in explicit action semantics, KnowAgent noticeably reduces the "planning hallucinations" where agents emit syntactically valid but semantically nonsensical actions.
● Benchmarks: Evaluated on HotpotQA (multi-hop QA) and ALFWorld (interactive embodied tasks), it matches or outperforms strong baselines across multiple LLM backbones.
Paper, Tweet
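One way to picture KnowAgent's action knowledge base is as a table of allowed action transitions that every generated plan is checked against. The rules below are invented for illustration (loosely modeled on search/lookup/finish style QA actions) and are not the paper's actual knowledge base.

```python
# Hypothetical action knowledge: which actions may follow which.
ACTION_KB = {
    "start": {"search", "lookup"},
    "search": {"search", "lookup", "finish"},
    "lookup": {"search", "lookup", "finish"},
    "finish": set(),  # nothing may follow finish
}

def first_invalid_step(plan, kb=ACTION_KB):
    """Return the index of the first action not allowed after its predecessor, else None."""
    previous = "start"
    for index, action in enumerate(plan):
        if action not in kb.get(previous, set()):
            return index
        previous = action
    return None

valid = first_invalid_step(["search", "lookup", "finish"])
invalid = first_invalid_step(["search", "finish", "lookup"])
```

An agent loop could reject or regenerate any plan for which `first_invalid_step` returns an index, which is the kind of constraint that keeps invalid steps out of the trajectory.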
7) Sora Overview - A comprehensive academic review of OpenAI's Sora, tracing the technical ingredients behind the text-to-video "world simulator" and the opportunities/limitations for the next wave of large vision models.
● Technical lineage: Maps Sora back to diffusion-transformer architectures, patchified video tokens, and large-scale joint text-video pretraining, synthesizing the publicly available technical signals.
● Capabilities and applications: Catalogs the tasks Sora enables across filmmaking, education, and marketing, including long-duration generation, multi-shot continuity, and physics-plausible object interaction.
● Limitations: Flags known failure modes - simulation artifacts, physical inconsistencies, object permanence errors - as well as safety concerns around misuse and bias.
● Open directions: Identifies research problems such as efficient long-video training, grounded 3D/world modeling, and standardized evaluation for generative video.
Paper, Tweet
8) SaulLM-7B: LLM for Law - SaulLM-7B is an open legal-domain LLM built on Mistral 7B and continually pretrained on 30B+ tokens of English legal text, with a companion instruction-tuning recipe.
● Legal pretraining corpus: 30B+ tokens of case law, statutes, regulations, and other legal documents covering U.S. and international jurisdictions.
● Legal instruction tuning: A novel fine-tuning method uses legal-task instructions to convert the continually pretrained model into an instruction-following legal assistant.
● SoTA on LegalBench: Reports state-of-the-art results on LegalBench-Instruct and legal subsets of MMLU, outperforming general-purpose 7B models on statute interpretation, contract analysis, and legal QA.
● MIT-licensed release: Model and training artifacts are released under MIT, making it a practical starting point for legal-domain RAG, contract review, and compliance workflows.
Paper, Tweet
9) Design2Code - Design2Code tackles the front-end engineering problem of turning a visual design into working HTML/CSS and gives the community both a benchmark and strong MLLM baselines.
● 484-webpage benchmark: A curated set of 484 real-world webpages with screenshot + reference code pairs, paired with automatic metrics validated against human judgments.
● Frontier MLLM comparison: GPT-4V comes out on top, Gemini Pro Vision is competitive, and an open-source fine-tuned Design2Code model matches Gemini Pro Vision - demonstrating that open models can close the gap on this task.
● Failure modes: Models mostly fail on recalling fine-grained visual elements and reproducing correct layouts, not on individual snippets, isolating the remaining hard problem.
● Prompting matters: The authors develop multimodal prompting strategies that materially lift GPT-4V and Gemini performance, giving practitioners ready-made recipes.
Paper, Tweet
10) TripoSR - TripoSR is a transformer-based single-image 3D reconstruction model that returns a textured mesh in under 0.5 seconds, building on the LRM architecture with a stronger data and training pipeline.
● Feed-forward 3D: A single image goes in and a complete 3D mesh comes out in one forward pass - no per-object optimization loop as in score-distillation methods.
● Improvements over LRM: Data processing (rendering, filtering), model tweaks (improved triplane decoder), and training-technique refinements together lift both speed and quality over the base LRM recipe.
● Sub-second inference: Runs in under 0.5 s per object on commodity GPUs, fast enough for real-time asset generation in creative tooling.
● MIT-licensed: Released under MIT alongside model weights, positioning TripoSR as a democratizing baseline for researchers and practitioners in 3D generative modeling.
Paper, Tweet

Top AI Papers of the Week (February 26 - March 3) - 2024

Paper Links
1) Genie - DeepMind's Genie is an 11B-parameter foundation world model trained unsupervised on internet gameplay videos that generates action-controllable 2D worlds from a single image prompt.
● Three-component architecture: A spatiotemporal video tokenizer, an autoregressive dynamics model, and a scalable latent action model together learn to roll out playable worlds without any explicit action labels in training.
● Prompt-conditioned worlds: Given a single image (photo, sketch, or video frame), Genie produces an interactive 2D environment the user can "play" frame-by-frame with inferred action controls.
● Latent action space: The learned action embedding lets an agent imitate behaviors demonstrated in unseen videos, turning internet video into a generic source of demonstrations for training embodied policies.
● Generalist-agent implication: Suggests a path toward training generalist agents entirely from observational video, without needing action-labeled datasets or curated simulators.
Paper, Tweet
2) Mistral Large - Mistral AI releases Mistral Large, its flagship closed-weight LLM positioned as the second-ranked API-accessible model behind GPT-4 at launch.
● Feature set: 32K-token context window, native multilingual fluency (English, French, Spanish, German, Italian), strong coding, math, and reasoning capability out of the box.
● Benchmarks: Claims SoTA or near-SoTA among API-available models on standard reasoning and knowledge benchmarks, approaching GPT-4 while exceeding prior Mistral releases by a clear margin.
● Native tool use: Ships with built-in function calling and JSON-mode output, targeting agent and structured-extraction workflows without extra prompt engineering.
● Deployment: Available via la Plateforme and Microsoft Azure, with Mistral Small released alongside as a latency-optimized companion model.
Paper, Tweet
3) The Era of 1-bit LLMs (BitNet b1.58) - Microsoft's BitNet b1.58 shows that restricting every weight to the ternary set {-1, 0, 1} can match full-precision FP16 transformers on perplexity and downstream tasks at the same parameter count.
● Ternary weights: All linear-layer weights are quantized to {-1, 0, 1}, giving an effective bit-width of log2(3) ≈ 1.58 bits per weight while still permitting meaningful sparsity via the 0 value.
● Parity with FP16: At matched parameter and token budgets, BitNet b1.58 matches FP16 baselines in both perplexity and zero-shot accuracy - a result that was not previously demonstrated at 3B+ scale.
● Efficiency gains: Dramatic wins in latency, memory footprint, throughput, and energy consumption compared to FP16 transformers, especially at longer sequence lengths.
● Hardware co-design: The paper lays out a new scaling law for 1-bit LLMs and argues for specialized hardware tailored to ternary compute - positioning this as a potential inflection point in LLM deployment.
Paper, Tweet
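The ternary scheme in entry 3 can be illustrated with a small absmean-style quantizer: scale the weights by their mean absolute value, round, and clip to {-1, 0, 1}. This is a plain-Python sketch on a flat list of weights, assuming per-tensor scaling; `absmean_ternary` is an illustrative name, not an API from the paper's code.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of weights to {-1, 0, 1} using the mean absolute value as scale."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    # Python's round() sends exact .5 ties to the nearest even integer.
    quantized = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return quantized, gamma

# Three states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

quantized, gamma = absmean_ternary([0.9, -0.05, -1.2, 0.4])
```

Small weights collapse to 0, which is where the sparsity mentioned above comes from, while `gamma` is kept so outputs can be rescaled.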
4) Datasets for LLMs: A Comprehensive Survey - A survey of more than 180 pages that catalogs and analyzes the datasets that underpin modern LLM training and evaluation.
● Five-category taxonomy: Organizes the space into pretraining corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional NLP datasets.
● Scale of the review: Covers 444 datasets spanning eight language families and 32 domains, including 774.5 TB of pretraining data and 700M+ instances across non-pretraining dataset types.
● Twenty-dimension statistics: Each dataset is summarized across 20 attributes (license, language, size, task type, etc.), enabling systematic comparison rather than anecdotal selection.
● Challenges and gaps: Identifies open issues in dataset construction (licensing, provenance, bias, contamination) and outlines future research directions for responsible LLM data pipelines.
Paper, Tweet
5) LearnAct - LearnAct lets language agents expand and refine their own action space over time by writing and revising Python functions in response to execution feedback.
● Open-action learning: Rather than picking from a fixed list of actions, the agent proposes new actions as Python functions, tests them, and iteratively revises them based on what worked.
● Iterative refinement loop: Each cycle adds or updates actions using execution feedback, so the agent's effective toolkit grows along with task experience.
● Strong results on AlfWorld: Achieves a 32% absolute improvement over ReAct+Reflexion on AlfWorld, with additional gains on robotic planning tasks.
● Closer to human learning: Mirrors the way humans acquire new skills by composing and revising procedures rather than selecting from a static repertoire, pointing toward more capable long-horizon agents.
Paper, Tweet
6) EMO: Emote Portrait Alive - Alibaba's EMO synthesizes expressive talking-head videos directly from audio, bypassing the intermediate 3D models or facial landmarks used by prior approaches.
● Audio2Video diffusion: A single diffusion model maps an audio waveform and a reference portrait directly to a video, letting subtle prosody drive facial motion without a handcrafted intermediate representation.
● Speaking and singing: Produces convincing lip-sync and expressive facial motion for both speech and singing across varied styles, handling long-duration audio stably.
● Identity preservation: Reports strong identity preservation and seamless frame transitions, outperforming prior methods on expressiveness and realism metrics.
● Implications: Shows that eliminating the 3D/landmark intermediate stage - previously thought essential for controllable portraits - can actually improve fidelity and expressiveness for talking-head generation.
Paper, Tweet
7) On the Societal Impact of Open Foundation Models - Stanford CRFM's policy paper proposes a rigorous framework for assessing the marginal risk of open-weight foundation models relative to closed models and pre-existing technologies.
● Six-component framework: Threat identification, existing defenses, marginal risk attribution, counterfactual impact, and uncertainty articulation are among the components the authors use to structure risk analysis.
● Seven misuse vectors: Applies the framework to biosecurity, cybersecurity, disinformation, CSAM, non-consensual imagery, phishing, and voice cloning, finding that existing studies rarely assess marginal risk rigorously.
● Benefits matter too: Catalogs benefits of open models - distributed decision-making, innovation, reduced market concentration - that closed-model debates often omit.
● Policy takeaway: Argues policymakers should demand grounded evidence of marginal risk rather than assuming closed releases are inherently safer, and that open weights alone are not sufficient for full scientific transparency.
Paper, Tweet
8) StarCoder 2 - BigCode releases StarCoder 2, an open family of code LLMs at 3B, 7B, and 15B parameters trained on The Stack v2, a much larger and cleaner code corpus than the original StarCoder.
● Multi-org collaboration: ServiceNow trains the 3B, Hugging Face the 7B, and NVIDIA the 15B, with a shared data pipeline and recipe.
● Training setup: 15B variant is trained on 4T+ tokens across 600+ programming languages with a 16K-token context window, grouped-query attention, and a fill-in-the-middle objective.
● Punches above its weight: StarCoder2-15B matches 33B+ models on many code completion, reasoning, and PAL-augmented math tasks, while the 3B matches the original 15B StarCoder.
● The Stack v2: Released alongside the models is a 67.5TB code dataset derived from Software Heritage, significantly improving on The Stack v1 in both scale and provenance.
Paper, Tweet
9) LLMs on Tabular Data: A Survey - A survey that maps how LLMs are being applied to tabular data tasks - a domain historically dominated by gradient-boosted trees and specialized architectures.
● Task coverage: Organizes work across prediction, data synthesis, question answering, and table understanding, giving a single view of a previously scattered literature.
● Core techniques: Catalogs prompting strategies (schema serialization, example selection), fine-tuning recipes, and table-specific encoding methods.
● Datasets and metrics: Systematizes benchmarks, evaluation metrics, and the dominant models used across tabular-LLM studies.
● Open problems: Identifies unexplored directions including robust handling of mixed-type columns, scalable table pretraining, and integration with traditional tabular baselines.
Paper, Tweet
10) PlanGPT - PlanGPT is a domain-specialized LLM framework for urban and spatial planning, built in collaboration with the Chinese Academy of Urban Planning.
● Domain fine-tuning: Continually fine-tunes a base LLM on a curated corpus of planning documents to lift performance on planning-specific language and terminology.
● Custom local retrieval: Pairs the model with a local-database retrieval framework so that answers can cite authoritative sources instead of relying purely on parametric memory.
● Advanced tooling: Integrates domain-specific tools (document templates, code execution, GIS-style utilities) to cover the end-to-end planning workflow rather than just QA.
● Generalizable recipe: Although built for urban planning, the paper doubles as a practical template for any domain-specific LLM project - covering retrieval, fine-tuning, and tooling together.
Paper, Tweet

Top AI Papers of the Week (February 19 - February 25) - 2024

Paper Links
1) Stable Diffusion 3 - Stability AI previews Stable Diffusion 3, a suite of image-generation models from 800M to 8B parameters that shifts to a diffusion-transformer backbone with flow matching.
● DiT + flow matching: Replaces the U-Net backbone of prior SD releases with a diffusion transformer trained via flow matching, enabling cleaner scaling and higher-fidelity outputs.
● Model range: Suite scales from 800M up to 8B parameters, letting the same architecture target everything from on-device use to production-grade quality.
● Prompt adherence: Major improvements in multi-subject composition where prior SD versions would merge or drop subjects, plus noticeably better in-image text rendering and spelling.
● Early preview: Released first as a waitlist-based preview while the full technical report and weights are staged, with research access prioritized.
Paper, Tweet
2) Gemma - Google DeepMind releases Gemma, a family of open models (2B and 7B) built from the same research stack as Gemini and shipped with both base and instruction-tuned variants.
● Model sizes: Gemma 2B trained on 2T tokens and Gemma 7B trained on 6T tokens, both with an 8192-token context window and open weights.
● Beats similar-sized peers: Gemma 7B generally outperforms Llama 2 7B and Mistral 7B on the standard Open LLM Leaderboard tasks (reasoning, math, code, knowledge).
● Instruction variants: Instruction-tuned versions use supervised fine-tuning plus RLHF and are designed to be drop-in friendly via Hugging Face, Keras, JAX, and PyTorch.
● Responsible release: Ships with a responsible-use toolkit, safety classifiers, and debugging utilities - an explicit answer to growing scrutiny of open-weight release practices.
Paper, Tweet
3) LLMs for Data Annotation - A survey that maps the rapidly growing literature on using LLMs to generate, evaluate, and learn from data annotations.
● Three pillars: LLM-based annotation generation, LLM-generated annotation assessment, and learning-with-LLM-annotations - giving a clean mental model for the space.
● Method taxonomy: Organizes prompting, chain-of-thought, self-consistency, and agent-based approaches to annotation across text, code, and multimodal data.
● Quality assessment: Reviews techniques for detecting hallucinated or low-quality LLM annotations, including verifier models and auxiliary labeling.
● Practitioner guide: Concludes with open challenges and concrete recommendations for teams trying to replace costly human labels with LLM-generated ones without collapsing downstream quality.
Paper, Tweet
4) GRIT - GRIT (Generative Representational Instruction Tuning) trains a single LLM to handle both generative and embedding tasks, switching behavior based on instructions.
● Dual-task training: A shared backbone is trained jointly on generation and embedding objectives, with instructions disambiguating which head to use at inference.
● MTEB SoTA: GritLM 7B sets a new state of the art on the Massive Text Embedding Benchmark (MTEB) while matching specialized generative models on generation tasks.
● Scales cleanly: An 8x7B variant outperforms specialized generative models while also retaining top-tier embedding quality, showing the unification doesn't hurt either side.
● RAG speedup: Because the same model serves both retrieval and generation, long-document RAG pipelines run 60%+ faster by eliminating a separate encoder pass.
Paper, Tweet
5) LoRA+ - LoRA+ is a minimal one-line change to LoRA: use different learning rates for the down-projection (A) and up-projection (B) matrices to restore feature learning at large width.
● Theoretical setup: The authors show that in the infinite-width limit, using the same learning rate for A and B prevents the adapter from performing proper feature learning, and standard LoRA is trained in exactly this way.
● Asymmetric learning rates: Setting lr(B) = λ · lr(A) with a fixed ratio λ well above 1 corrects the imbalance and recovers the correct infinite-width dynamics.
● Practical gains: ~2x finetuning speedup and 1-2% absolute accuracy improvement over vanilla LoRA at the same compute cost across a broad set of tasks.
● Drop-in: No changes to architecture or memory footprint - only the optimizer hyperparameter wiring changes, making LoRA+ trivial to adopt in existing PEFT pipelines.
Paper, Tweet
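In practice, the LoRA+ change in entry 5 comes down to putting the A and B matrices in separate optimizer groups. The sketch below builds such groups in plain Python. It assumes parameter names contain `lora_A` or `lora_B` (a common adapter naming convention, assumed here), and `ratio=16.0` is only a placeholder value, not a recommendation from this summary.

```python
def loraplus_param_groups(named_params, base_lr, ratio=16.0):
    """Split adapter parameters into two groups so B trains with a larger learning rate."""
    a_params, b_params = [], []
    for name, param in named_params:
        if "lora_B" in name:
            b_params.append(param)
        elif "lora_A" in name:
            a_params.append(param)
    return [
        {"params": a_params, "lr": base_lr},          # down-projection A
        {"params": b_params, "lr": base_lr * ratio},  # up-projection B
    ]

# Strings stand in for real tensors in this toy call.
groups = loraplus_param_groups(
    [("layer0.lora_A.weight", "a0"), ("layer0.lora_B.weight", "b0")],
    base_lr=1e-4,
)
```

The list-of-dicts shape mirrors PyTorch optimizer parameter groups, so with real tensors the result could be handed to an optimizer constructor without touching the model.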
6) Back to Basics: Revisiting REINFORCE in RLHF - Cohere researchers argue that PPO is overkill for RLHF and that a simpler REINFORCE-style estimator works better in practice.
● PPO is overkill: Many PPO features (clipping, value networks, GAE) are shown to be unnecessary in an RLHF context where episodes are short and reward variance is bounded.
● REINFORCE / RLOO beats PPO: A straightforward REINFORCE variant with leave-one-out baselines outperforms PPO on RLHF benchmarks at lower compute and simpler hyperparameter tuning.
● Also beats offline alternatives: The same approach outperforms newer offline/semi-offline methods like DPO and RAFT, pushing back on the "online RL is dead for RLHF" narrative.
● Implication: Low-cost online RL with well-designed baselines is a practical path forward for alignment, especially for smaller teams that couldn't afford full PPO infrastructure.
Paper, Tweet
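The leave-one-out baseline behind RLOO in entry 6 is simple enough to write down directly: with several sampled completions per prompt, each sample's baseline is the mean reward of the other samples. The function name and the reward values below are illustrative.

```python
def rloo_advantages(rewards):
    """Advantage of each sample relative to the mean reward of the remaining samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Four sampled completions for one prompt, with made-up scalar rewards.
advantages = rloo_advantages([1.0, 2.0, 3.0, 6.0])
```

Each advantage would then weight the log-probability gradient of its completion. No learned value network is involved, which is the simplification the paper argues for.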
7) Recurrent Memory Finds What LLMs Miss - Introduces BABILong, a new long-context benchmark, and shows that transformers with recurrent memory can handle sequences far beyond vanilla LLMs.
● BABILong benchmark: Extends the bAbI reasoning tasks by embedding them inside arbitrarily long distractor text, stress-testing whether models actually use their context or ignore most of it.
● Attention heavy tails: On BABILong, GPT-4 and RAG systems effectively rely on the first ~25% of the input - a stark illustration that "100K context" does not mean "100K useful context".
● Recurrent memory wins: Augmenting a GPT-2-scale transformer with recurrent memory lets it process sequences up to ~11M tokens while still answering the embedded questions correctly.
● Direction signal: Argues recurrent memory is a simpler and cheaper path to genuinely long-context models than ever-larger attention windows.
Paper, Tweet
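A BABILong-style sample, as described in entry 7, can be approximated by scattering a few task facts through a long run of irrelevant sentences. The sketch below is a simplified construction under stated assumptions: facts keep their original order, filler sentences are cycled from a small list, and all names and sentences are made up.

```python
import random

def embed_facts(facts, filler, total_sentences, seed=0):
    """Place the facts, in order, at random positions inside a stream of filler sentences."""
    if total_sentences < len(facts):
        raise ValueError("total_sentences must be at least the number of facts")
    rng = random.Random(seed)
    slots = sorted(rng.sample(range(total_sentences), len(facts)))
    fact_at = dict(zip(slots, facts))
    sentences = [
        fact_at[i] if i in fact_at else filler[i % len(filler)]
        for i in range(total_sentences)
    ]
    return " ".join(sentences)

facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = ["The weather was mild.", "A train passed by."]
haystack = embed_facts(facts, filler, total_sentences=50)
```

Growing `total_sentences` lengthens the context without changing the question, which makes it easy to see at what length a model stops finding the facts.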
8) When is Tree Search Useful for LLM Planning? - Ohio State researchers and collaborators analyze multi-step LLM planning as a generator/discriminator/planner system and argue that current LLM discriminators make tree search a poor choice in practice.
● Framework: Decomposes LLM planning into a candidate generator, a discriminator that scores candidates, and a planning algorithm (iterative correction or tree search) that navigates the candidate space.
● 90% discriminator bar: Tree-search planners only beat simpler re-ranking baselines when the discriminator is at least ~90% accurate - a threshold current LLMs don't reliably clear on text-to-SQL or math.
● 10-20x slower: Tree search runs 10-20x slower than iterative correction for marginal or zero gains, making it impractical for production LLM pipelines today.
● Implication: Investing in stronger discriminators (verifiers, reward models) may unlock more gains than building ever-more-elaborate planners on top of weak scorers.
Paper, Tweet
9) Chain-of-Thought Reasoning Without Prompting - DeepMind shows that LLMs often already emit chain-of-thought reasoning in alternative decoding paths, and that selecting those paths via confidence lifts reasoning accuracy with no prompt engineering.
● Alternative decoding: Instead of taking the top-1 greedy token, the method considers top-k alternative first tokens and runs the full decode from each, exposing paths that naturally produce CoT.
● Confidence as a signal: When a decoded path contains genuine step-by-step reasoning, the model's final-answer confidence is noticeably higher - this acts as an automatic selector for the CoT path.
● Benchmark gains: The technique substantially outperforms standard greedy decoding across arithmetic, commonsense, and symbolic reasoning benchmarks without any prompt modification.
● Reframing: Reasoning is positioned as already latent in pretrained LLMs - prompt engineering is one way to surface it, but decoding-time selection is an equally valid (and complementary) lever.
Paper, Tweet
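The path-selection step in entry 9 can be sketched as follows: for each candidate decoding path, average the gap between the top-1 and top-2 token probabilities over the answer tokens, then keep the path with the largest gap. The paths and probabilities below are hand-written stand-ins for real model outputs, and the dictionary layout is an assumption of this example.

```python
def answer_confidence(answer_token_probs):
    """Mean (top-1 minus top-2) probability gap across the answer tokens."""
    gaps = [top1 - top2 for top1, top2 in answer_token_probs]
    return sum(gaps) / len(gaps)

def pick_path(paths):
    """Select the decoded path whose final answer the model is most confident about."""
    return max(paths, key=lambda p: answer_confidence(p["answer_token_probs"]))

# Two hypothetical continuations from different first tokens.
paths = [
    {"text": "5", "answer_token_probs": [(0.45, 0.40)]},
    {"text": "3 + 5 = 8, so the answer is 8", "answer_token_probs": [(0.95, 0.02)]},
]
best = pick_path(paths)
```

In a real setup, each path would come from decoding greedily after forcing one of the top-k first tokens, and the probability pairs would be read from the model's output distribution at the answer positions.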
10) OpenCodeInterpreter - OpenCodeInterpreter is an open-source family of code-execution LLM systems that iteratively refine code using runtime feedback, closing the gap with GPT-4's proprietary Code Interpreter.
● Code-Feedback dataset: Ships a 68K multi-turn training set that captures code-generation, execution results, and refinement steps - the key training ingredient for iterative code agents.
● Execution + human loop: Integrates both automatic execution feedback and human-style critique signals, so the system learns to use runtime errors and natural-language feedback together.
● HumanEval leaderboard: The 33B variant averages 83.2% on HumanEval/MBPP, approaching GPT-4's 84.2% on the same evaluation, and reaches 91.6% with synthesized feedback.
● Fully open: Code, data, and weights are released, giving the community a reproducible baseline for building iterative code agents.
Paper, Tweet

Top AI Papers of the Week (February 12 - February 18) - 2024

Paper Links
1) Sora - OpenAI unveils Sora, a text-to-video diffusion-transformer that generates coherent, minute-long 1080p videos from natural-language prompts.
● Spacetime patches: Videos are tokenized into spacetime patches and fed to a diffusion transformer, extending the scaling story of DiT-style image models to the video domain.
● Minute-long generation: Produces videos up to 60 seconds with multiple characters, diverse motion types, and complex backgrounds while maintaining identity and scene consistency.
● Multi-shot continuity: Can render multiple shots within a single video with persistence across characters and visual style, a capability prior text-to-video models struggled with.
● World-simulator framing: OpenAI positions Sora as an early "world simulator" - still buggy on physics and object permanence, but a clear step toward generative models that encode intuitive physics.
Paper, Tweet
2) Gemini 1.5 - Google DeepMind's Gemini 1.5 is a multimodal MoE LLM that scales context to 1M tokens (10M in research settings) while matching or surpassing Gemini 1.0 Ultra on standard benchmarks.
● MoE architecture: A sparsely activated mixture-of-experts design lets Gemini 1.5 Pro reach Gemini 1.0 Ultra-class quality at substantially less compute per token.
● Million-token context: Supports up to 1M tokens in production (10M in research) covering text, video, and audio - enabling reasoning over entire books, hours of video, and multi-hour audio files.
● Near-perfect retrieval: Achieves >99% accuracy on needle-in-a-haystack retrieval up to at least 10M tokens, significantly beyond contemporary long-context baselines.
● Quality plus scale: Gemini 1.5 Pro matches or outperforms Gemini 1.0 Ultra on standard benchmarks and sets SoTA on long-document QA, long-video QA, and long-context ASR.
Paper, Tweet
3) V-JEPA - Meta's V-JEPA learns visual representations by predicting features in masked video regions, without pretrained image encoders, text, negatives, or reconstruction.
● Feature-prediction objective: A student encoder predicts target-encoder features for masked video patches - a pure SSL recipe that bypasses pixel reconstruction and contrastive negatives entirely.
● 2M-video training set: Trained on 2M publicly available videos, big enough to learn rich spatiotemporal features without needing labels or captions.
● Frozen-evaluation results: A ViT-H/16 V-JEPA hits 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K in frozen-feature evaluations.
● Versatile features: The same representation works for motion-heavy (Something-Something) and appearance-heavy (ImageNet) tasks, supporting the JEPA claim that feature prediction produces more general visual features than reconstruction.
Paper, Tweet
4) Large World Model (LWM) - UC Berkeley's LWM is an open 7B multimodal model trained on long videos and books that handles context windows up to 1M tokens via RingAttention.
● RingAttention training: Uses blockwise RingAttention to split long sequences across devices in a ring, enabling million-token training at feasible memory cost.
● Progressive context expansion: Training context is extended from 4K up to 1M tokens in stages, with masked sequence packing to mix sequence lengths and loss weighting to maintain short-context quality.
● Model-generated long-context QA: Uses an LLM to synthesize QA pairs over long inputs for instruction tuning, giving the model targeted practice at answering from deep inside its context.
● Open release: Full 7B model family, code, and curated datasets are open-sourced, setting new benchmarks on difficult long-context retrieval and long-video understanding.
Paper, Tweet
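The blockwise computation that RingAttention (entry 4) distributes across devices rests on a running softmax: keys and values are consumed one block at a time, yet the result equals ordinary attention. The single-query, single-device sketch below shows only that core idea. It omits the ring communication and the usual 1/sqrt(d) scaling, and the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm
            for d in range(len(values[0]))]

def blockwise_attention(q, keys, values, block_size):
    """Same result, but keys/values are visited block by block with running statistics."""
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        scores = [dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)  # rescale earlier blocks; 0.0 on the first block
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.3, -0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
reference = full_attention(q, keys, values)
streamed = blockwise_attention(q, keys, values, block_size=2)
```

Because only one block of keys and values is needed at a time, each device in a ring can hold a slice of the sequence and pass blocks along, which is what makes million-token training feasible in memory.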
5) The Boundary of Neural Network Trainability is Fractal - Sohl-Dickstein finds that the boundary between trainable and untrainable hyperparameter configurations looks like a Mandelbrot-style fractal across many architectures.
● Fractal boundaries: Zooming into the hyperparameter plane (e.g. learning rate × init scale), the trainable region has a self-similar boundary that stays fractal over more than ten decades of zoom.
● Universal across setups: Observed for every tested neural network configuration including deep linear networks, suggesting the fractal structure is a property of the training dynamics itself.
● Edge-of-stability: The best-performing hyperparameter choices tend to sit right at the edge of stability, consistent with recent "edge-of-stability" observations in deep learning theory.
● Implication: Hyperparameter search is more analogous to exploring a chaotic dynamical system than a smooth optimization surface, which has implications for both practitioners and learning theorists.
Paper, Tweet
6) OS-Copilot - OS-Copilot is a framework for building generalist computer agents that use full OS primitives (browser, terminal, files, multimedia, third-party apps) rather than just web DOMs.
● OS-level interface: The framework exposes a unified set of OS-level tools - file I/O, shell, browser, multimedia, and app automation - so the agent can operate any GUI app rather than being locked to a single interface.
● FRIDAY agent: The reference implementation, FRIDAY, is a self-improving embodied agent that accumulates skills across tasks and applies them to new ones.
● +35% on GAIA: FRIDAY outperforms prior methods by 35% on the GAIA general AI assistants benchmark, a challenging test of multi-step tool use and reasoning.
● App generalization: FRIDAY demonstrates that skills learned on one app (Excel, PowerPoint) transfer with minimal additional human guidance to new apps, an important signal for scalable computer-use agents.
Paper, Tweet
7) TestGen-LLM - Meta's TestGen-LLM uses LLMs to improve existing human-written tests - augmenting coverage rather than generating tests from scratch - while rigorously filtering LLM output for quality.
● Assured offline evaluation: Every LLM-generated test must compile, run, pass deterministically, and improve coverage of an existing test class before it is presented to engineers - hallucination is filtered out up front.
● Deployed at Meta: Rolled out during Instagram Reels and Stories test-a-thons, then extended across Instagram and Facebook codebases.
● Success funnel: 75% of generated test cases build correctly, 57% pass reliably, and 25% increase coverage of existing classes.
● Engineer uptake: Software engineers accepted 73% of TestGen-LLM's recommendations for production, improving 11.5% of the targeted classes.
Paper, Tweet
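The filter chain described in the TestGen-LLM entry above (build, pass reliably, improve coverage) can be sketched as a simple predicate. The `CandidateTest` record and its fields are hypothetical stand-ins for whatever the real build and coverage tooling reports; this is an illustration of the funnel, not Meta's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateTest:
    name: str
    builds: bool              # did the generated test compile?
    run_results: List[bool]   # outcome of each repeated run
    coverage_gain: float      # extra coverage over the existing test class

def passes_filters(test: CandidateTest) -> bool:
    """Keep a test only if it clears every stage of the funnel."""
    if not test.builds:
        return False
    # Require every repeated run to pass, which screens out flaky tests.
    if not test.run_results or not all(test.run_results):
        return False
    # A test that adds no coverage is not worth an engineer's review.
    return test.coverage_gain > 0

def filter_candidates(candidates: List[CandidateTest]) -> List[CandidateTest]:
    return [t for t in candidates if passes_filters(t)]
```

Ordering the checks from cheapest to most expensive means most rejected candidates never reach the coverage measurement.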
8) ChemLLM - ChemLLM is a chemistry-specialized LLM with a matched dataset (ChemData) and benchmark (ChemBench) for evaluating chemistry-specific capability.
● Domain-specific training: Fine-tuned on ChemData, an instruction dataset covering name conversion, molecular captioning, reaction prediction, and related chemistry tasks.
● ChemBench: Introduces a benchmark spanning nine chemistry task types, giving the community a standardized way to measure chemistry-LLM progress.
● Results vs GPT-3.5 / GPT-4: Outperforms GPT-3.5 on all principal chemistry tasks and surpasses GPT-4 on two of them, demonstrating the value of domain-specific fine-tuning over sheer scale.
● Open release: Code, datasets, and model weights are released, making it easy for chemistry groups to build on top of ChemLLM in downstream applications.
Paper, Tweet
9) Survey of LLMs - A survey that maps the landscape of the three dominant LLM families - GPT, Llama, and PaLM - and the shared toolbox used to build and augment them.
● Three-family framing: Organizes the field around GPT, Llama, and PaLM lineages, tracing how each family evolved since ChatGPT's November 2022 release.
● Capabilities and techniques: Summarizes training techniques (pretraining, fine-tuning, RLHF) and augmentation methods (RAG, tool use, chain-of-thought) used across the three families.
● Datasets and metrics: Catalogs the datasets used for training, fine-tuning, and evaluation, and compares popular evaluation benchmarks and metrics.
● Open challenges: Closes with concrete research directions including efficiency, alignment, multilinguality, and evaluation that remain open for the next wave of LLM work.
Paper, Tweet
10) LLM Agents Can Autonomously Hack Websites - The paper shows GPT-4 agents with tool use and long context can autonomously exploit real websites, including performing blind SQL injection and schema extraction.
● Attack capabilities: GPT-4 agents autonomously perform SQL injection, extract database schemas blindly, and chain together multi-step exploits without any prior knowledge of the specific vulnerability.
● Frontier-only: Only GPT-4 demonstrates this capability; tested open-source models and GPT-3.5 fail, suggesting a clear capability threshold enabled by frontier scale.
● Real-world vulnerabilities: GPT-4 successfully discovered vulnerabilities in real websites in the wild, not just in lab-constructed targets.
● Safety implications: Provides concrete evidence that frontier LLM capabilities can directly translate into offensive cyber capability, informing open-release and deployment decisions.
Paper, Tweet

Top AI Papers of the Week (February 5 - February 11) - 2024

Paper Links
1) Grandmaster-Level Chess Without Search - DeepMind shows that a 270M-parameter transformer trained purely with supervised learning on Stockfish-generated data reaches grandmaster-level chess without any search at inference time.
● ChessBench dataset: Training set of 10M games and 15B data points, each annotated with Stockfish 16 action-values to distill a strong search-based engine into a feed-forward policy.
● Grandmaster Elo: Achieves Lichess blitz Elo of 2895 against humans, solidly grandmaster-class and beating prior neural chess systems that didn't use explicit search.
● Puzzle solving: Solves a series of challenging chess puzzles that require deep tactical awareness - a stronger test of pattern recognition than standard game play.
● Scale over search: Positions transformer-based chess as a scale story rather than a domain-engineering story; no MCTS, alpha-beta, or handcrafted heuristics are used at inference.
Paper, Tweet
2) AnyTool - AnyTool is a training-free LLM agent that scales tool-use to 16K+ Rapid APIs through a hierarchical retriever and a self-reflective solver.
● Hierarchical API retriever: Divides the huge API catalog into categories and tools, mirroring Rapid API's taxonomy, so the agent can focus on a small candidate set per query - a divide-and-conquer answer to context-length limits.
● Solver + self-reflection: GPT-4's function calling drives a solver that attempts to resolve the query, with a self-reflection step that re-triggers retrieval and solving if the initial plan fails.
● Training-free: No fine-tuning is required; the entire system is orchestration on top of GPT-4's native function calling, making it easy to deploy and adapt.
● +35.4% on ToolBench: AnyTool outperforms ToolLLM on ToolBench by +35.4% average pass rate, and the paper introduces AnyToolBench as a more realistic evaluation protocol.
Paper, Tweet
3) Phase Transition in Dot-Product Attention - A theoretical paper that analyzes a solvable low-rank tied-QK attention model and uncovers a data-driven phase transition between positional and semantic attention regimes.
● Solvable toy model: Low-rank, tied-query-and-key dot-product attention that admits a closed-form characterization of the loss landscape's global minimum.
● Two attention regimes: Identifies positional attention (attention depends only on token positions) and semantic attention (attention uses token content), both of which can be optimal depending on data.
● Data-threshold transition: Below a critical data budget, the network falls back to positional attention; above it, it transitions to semantic attention, producing an abrupt phase-transition behavior.
● Beats linear baselines: With enough data, the dot-product attention layer operating in the semantic regime outperforms any linear positional baseline, clarifying why attention is powerful.
Paper, Tweet
4) Indirect Reasoning with LLMs (DIR) - Direct-Indirect Reasoning augments standard CoT with contrapositive and proof-by-contradiction templates, giving LLMs an explicit way to attack problems they can't solve forward.
● Two-step pipeline: Step 1 augments the data with contrapositive versions of rules to expand what the LLM can reason over; Step 2 uses prompt templates that steer the model into proof-by-contradiction when needed.
● Factual reasoning: +27.33% accuracy over direct reasoning baselines on factual reasoning benchmarks, averaged across models.
● Math proofs: +31.43% improvement on math proof tasks, where indirect reasoning is often the only tractable route.
● Backbone agnostic: Works across GPT-3.5-turbo and Gemini Pro, suggesting the technique generalizes beyond any particular LLM family.
Paper, Tweet
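Step 1 of the DIR pipeline above, augmenting rules with their contrapositives, can be sketched on string-encoded rules. The `(premise, conclusion)` tuple encoding and the naive `not ` prefix are assumptions made for illustration; the paper works with natural-language rules and prompts.

```python
def negate(statement: str) -> str:
    """Toggle a leading 'not ' so that double negation cancels out."""
    return statement[4:] if statement.startswith("not ") else "not " + statement

def contrapositive(rule):
    """Turn 'if P then Q' into the logically equivalent 'if not Q then not P'."""
    premise, conclusion = rule
    return (negate(conclusion), negate(premise))

def augment_rules(rules):
    """Return the original rules followed by any contrapositives not already present."""
    augmented = list(rules)
    for rule in rules:
        cp = contrapositive(rule)
        if cp not in augmented:
            augmented.append(cp)
    return augmented
```

With the rule `("it rains", "the ground is wet")` in the rule set, a model that observes "not the ground is wet" now has an explicit rule to reason from, rather than having to derive the contrapositive itself.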
5) ALOHA 2 - ALOHA 2 is a refreshed low-cost bimanual teleoperation platform from Stanford/DeepMind, designed for large-scale robot-learning data collection.
● Hardware upgrades: New gripper design, gravity compensation arms, and more durable mechanical components reduce operator fatigue and failure rates over long data-collection sessions.
● Better simulation: Ships with an upgraded, higher-fidelity simulation model so that simulated demonstrations align more closely with the real platform's dynamics.
● User-friendly: Reduced friction in setup, calibration, and teleoperation makes the system more accessible to non-expert operators - a key ingredient for scaling to bigger datasets.
● Research implication: Cheap, durable bimanual platforms unlock the kind of large-scale demonstration data that drives fine motor manipulation policies, complementing efforts like DROID and RT-X.
Paper, Tweet
6) More Agents Is All You Need - The paper shows that simply running more independent LLM agents and voting produces reliable scaling gains across tasks, without any method changes.
● Sampling-and-voting: For a given task, run N independent LLM agents on the same query, then majority-vote over their answers - a minimalist ensemble.
● Scales with agent count: Performance improves monotonically with more agents across reasoning, coding, and QA benchmarks, with larger gains on harder problems.
● Orthogonal to other tricks: The gains stack on top of existing improvements like prompt engineering, CoT, and RAG, making ensembling a free-standing lever.
● Implication: Raw parallel ensembling is surprisingly strong compared to architecturally complex multi-agent systems and should be a baseline in any comparative study.
Paper, Tweet
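The sampling-and-voting procedure from the More Agents entry above fits in a few lines. In this sketch `agent` is a hypothetical callable standing in for one LLM call, and `noisy_agent` is a toy stand-in that is right 60% of the time; neither comes from the paper.

```python
from collections import Counter
import random

def sample_and_vote(agent, query, n_agents, seed=0):
    """Run n_agents independent samples and return the most common answer.

    Ties are broken in favor of the answer that appeared first.
    """
    rng = random.Random(seed)
    answers = [agent(query, rng) for _ in range(n_agents)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

def noisy_agent(query, rng):
    """Toy stand-in for an LLM call: correct 60% of the time."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

if __name__ == "__main__":
    for n in (1, 5, 25, 101):
        print(n, sample_and_vote(noisy_agent, "What is 6 * 7?", n))
```

Because the wrong answers are split across several values, the correct one tends to win the vote more often as `n_agents` grows, which is the intuition behind the scaling result.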
7) Self-Discover - Google's Self-Discover lets LLMs compose their own task-specific reasoning strategies from a small library of atomic reasoning modules, at dramatically lower inference cost than self-consistency.
● Module library: Defines a small set of atomic reasoning operations (critical thinking, step-by-step analysis, decomposition, etc.) that the LLM can select and compose.
● Self-discovery stage: Given a task, the LLM picks and orders relevant modules into a reasoning structure once, which is then reused across all instances of that task.
● +32% over CoT on BBH: Boosts GPT-4 and PaLM 2 by up to 32% on BigBench-Hard compared to plain CoT prompting.
● 10-40x cheaper: Outperforms inference-intensive methods like CoT-Self-Consistency by 20%+ while using 10-40x fewer inference calls; discovered structures also transfer between large and small models.
Paper, Tweet
8) DeepSeekMath - DeepSeek releases DeepSeekMath 7B, a math-specialized LLM that closes much of the gap to GPT-4 and Gemini-Ultra on MATH by combining better data and a new RL objective.
● 120B math pretraining: Continues pretraining a DeepSeek-Coder base on 120B math-related tokens mined from Common Crawl, plus natural language and code, so that math-relevant skills and knowledge sit at the foundation.
● GRPO objective: Introduces Group Relative Policy Optimization, a PPO variant that drops the value network and estimates advantages from within-group comparisons, which reduces memory use while still improving mathematical reasoning.
● Close to frontier: DeepSeekMath 7B reaches 51.7% on MATH, approaching Gemini-Ultra (53.2%) and GPT-4 (52.9%) without any external tools.
● Self-consistency boost: Combining DeepSeekMath 7B with self-consistency over 64 samples pushes performance to 60.9%, beating frontier API models on MATH.
Paper, Tweet
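The within-group advantage estimate that GRPO uses in place of a value network, as described in the DeepSeekMath entry above, can be sketched for the outcome-reward case: each sampled answer's reward is normalized by the mean and standard deviation of its group. The function name and the `eps` guard are assumptions for illustration, and the rest of the objective (clipping, KL term) is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    rewards: one scalar reward per sampled answer in the group.
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct answers get an advantage close to +1 and the incorrect ones close to -1; a group where every answer scores the same yields zero advantage and so no gradient signal.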
9) LLMs for Table Processing: A Survey - A survey covering how LLMs and VLMs are used across the full spectrum of table-processing tasks, from classic TableQA to spreadsheet manipulation.
● Task coverage: Spans table QA, fact-checking, table-to-text, spreadsheet operations, and table-centric data analysis, unifying what are usually studied as separate subfields.
● Method taxonomy: Organizes training techniques (instruction tuning, pretraining on table-augmented text), prompting strategies, and LLM-agent architectures specific to tables.
● Evaluation landscape: Catalogs datasets, benchmarks, and metrics for each task family so practitioners can compare systems without re-reading the whole literature.
● Open problems: Identifies open challenges including heterogeneous input formats, long/large tables, and reasoning efficiency as the main frontiers.
Paper, Tweet
10) LLM-based Multi-Agent Systems Survey - A survey of the fast-growing LLM-based multi-agent systems space, covering both problem-solving applications and "world simulation" research.
● Design dimensions: Characterizes MAS along axes like agent roles, communication protocols, memory architectures, and coordination strategies, giving a shared vocabulary for comparing systems.
● Two application tracks: Separates problem-solving MAS (coding, debating, planning) from world-simulation MAS (social simulation, economic agents, game NPCs), which tend to use different design patterns.
● Datasets and benchmarks: Catalogs the datasets used to evaluate MAS, including both cooperative and competitive benchmarks, and a live GitHub companion repository.
● Challenges: Enumerates open issues - scalability of communication, stability of emergent behaviors, evaluation methodology, and safety of autonomous agent collectives.
Paper, Tweet

Top AI Papers of the Week (January 29 - February 4) - 2024

Paper Links
1) OLMo - Allen AI releases OLMo, a truly open 7B-parameter LLM shipped with training code, pretraining data, full weights, evaluation tooling, and fine-tuning recipes - an answer to the "open-weights but closed-pipeline" releases dominating the space.
● Full transparency: Alongside the 7B model, the release includes the Dolma pretraining corpus, the exact training code, intermediate checkpoints, and evaluation harnesses - enabling end-to-end reproducibility that is rare among large open models.
● Strong generative performance: OLMo 7B is competitive with Llama 2 and MPT at the same parameter count across generative tasks while being more accessible for downstream research and ablation.
● Smaller sibling: A 1B-parameter OLMo 1B is released in parallel, aimed at research on small-model scaling laws and on-device experimentation.
● Research enablement: Explicitly positioned as a platform for the community to study what goes into LLM training - data mixing, tokenization, training dynamics - rather than treating the model as the artifact.
Paper, Tweet
2) Advances in Multimodal LLMs - A comprehensive survey of the architecture and training-pipeline design choices behind multimodal large language models (MLLMs).
● Architecture taxonomy: Organizes MLLMs by modality encoders, LLM backbones, and cross-modal connectors (Q-Former, linear, cross-attention, etc.), clarifying which design axes matter most.
● Training pipeline: Walks through modality alignment pretraining, visual instruction tuning, and RLHF-style preference optimization as the dominant recipe across recent MLLMs.
● Applications and evaluation: Catalogs tasks (VQA, captioning, grounding, video, document understanding) alongside the benchmarks commonly used to evaluate them.
● Open challenges: Identifies hallucination, long-video understanding, efficient training, and tool-augmented multimodality as the most pressing research directions.
Paper, Tweet
3) Corrective RAG (CRAG) - CRAG adds a self-correcting loop around retrieval so a RAG system can detect and repair bad retrievals instead of feeding them straight into generation.
● Retrieval evaluator: A lightweight evaluator scores the quality of retrieved documents for a given query and classifies the retrieval state as Correct, Incorrect, or Ambiguous.
● Action policy: Correct retrievals are refined via decompose-then-recompose filtering; Incorrect retrievals trigger web search to find better evidence; Ambiguous ones combine both signals.
● Plug-and-play: CRAG is designed to sit on top of any existing RAG pipeline without retraining the generator, making it easy to adopt incrementally.
● Benchmarks: Improves generation quality across short- and long-form QA benchmarks compared to vanilla RAG and Self-RAG baselines, especially when the underlying retriever is noisy.
Paper, Tweet
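The evaluator-plus-action-policy loop from the CRAG entry above can be sketched as a small decision function. The thresholds `upper` and `lower` are hypothetical values, and `refine` and `web_search` are caller-supplied stand-ins for the decompose-then-recompose filter and the search step; none of these names come from the paper's code.

```python
def retrieval_action(scores, upper=0.6, lower=0.2):
    """Map per-document relevance scores to one of CRAG's three states."""
    if not scores:
        return "Incorrect"
    if max(scores) > upper:      # at least one document is clearly relevant
        return "Correct"
    if max(scores) < lower:      # every document is clearly irrelevant
        return "Incorrect"
    return "Ambiguous"

def select_knowledge(query, docs, scores, refine, web_search):
    """Pick the evidence passed to the generator, based on the state."""
    action = retrieval_action(scores)
    if action == "Correct":
        return refine(docs)
    if action == "Incorrect":
        return web_search(query)
    # Ambiguous: hedge by combining refined documents with web results.
    return refine(docs) + web_search(query)
```

Because the generator only ever sees the output of `select_knowledge`, this layer can wrap an existing RAG pipeline without retraining it.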
4) LLMs for Mathematical Reasoning - A survey of the fast-growing literature on using LLMs for mathematical reasoning, from arithmetic word problems to theorem proving.
● Task landscape: Covers math word problems, formal theorem proving, geometry, and scientific reasoning, showing how each sub-area stresses different LLM capabilities.
● Methods inventory: Catalogs chain-of-thought, program-aided, tool-using, self-consistency, and verifier-based approaches with benchmark numbers for each.
● Data and evaluation: Maps the key training and evaluation datasets (GSM8K, MATH, MiniF2F, etc.) and discusses evaluation pitfalls like contamination.
● Directions: Highlights open problems such as robust multi-step reasoning, integration with formal verifiers, and bridging informal and formal math.
Paper, Tweet
5) Compression Algorithms for LLMs - A survey covering the main families of LLM compression techniques and when each one is appropriate.
● Six technique families: Pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design - with worked examples in each.
● Trade-off framing: Walks through memory, latency, and accuracy trade-offs so practitioners can pick techniques that match their deployment constraints.
● Training vs. post-training: Distinguishes methods that require retraining or fine-tuning from purely post-training approaches, a practical axis for production teams.
● Open problems: Flags hardware co-design, combining compression techniques, and evaluating compressed LLMs on long-context and reasoning tasks as under-explored areas.
Paper, Tweet
6) MoE-LLaVA - MoE-LLaVA applies Mixture-of-Experts tuning to the LLaVA vision-language architecture, yielding a sparse model that activates only a fraction of its parameters per token and so keeps compute cost roughly constant as total parameters grow.
● MoE for VLMs: Replaces the dense feed-forward layers in LLaVA's decoder with sparse MoE layers and introduces a three-stage training recipe that stabilizes the typically fragile MoE + multimodal combo.
● Efficient inference: Only a fraction of experts are active per token, giving the model effective sparsity while keeping FLOPs per forward pass similar to a smaller dense baseline.
● Matches larger dense VLMs: A MoE-LLaVA with 3B activated parameters matches or beats 7B dense VLMs on standard multimodal benchmarks, closing the parameter/performance gap.
● Mitigates sparsity degradation: The auxiliary balancing losses and staged training schedule address the usual MoE pitfalls of routing collapse and modality interference.
Paper, Tweet
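The sparse routing idea in the MoE-LLaVA entry above, where only a few experts run per token, can be sketched with a top-k router. This is a generic illustration: the function names are hypothetical, and renormalizing the gate weights over the selected experts is an assumption rather than a confirmed detail of MoE-LLaVA.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_logits, k)
    out = [0.0] * len(token)
    for idx, gate in gates.items():
        expert_out = experts[idx](token)   # unselected experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out
```

With four experts and `k=2`, each token pays for two expert forward passes regardless of how many experts exist, which is why active parameters stay well below total parameters.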
7) Rephrasing the Web (WRAP) - WRAP uses an off-the-shelf instruction-tuned model to paraphrase web documents into styles like "Wikipedia" or "question-answer format" and trains on the mixture of real + synthetic rephrases.
● Style-conditioned rephrasing: A frozen instruction-tuned model rewrites the same content into multiple stylistic variants, expanding effective diversity without changing meaning.
● ~3x faster pretraining: Training on real + rephrased web data reaches the same perplexity as a real-only baseline in roughly 3x fewer steps on the Pile.
● Downstream gains: Beyond perplexity, zero-shot QA accuracy improves across 13 tasks compared to the real-only baseline at matched compute.
● Data quality lever: Suggests that converting messy web text into cleaner stylistic forms at scale is a simple and effective lever for pretraining efficiency, without needing new human data.
Paper, Tweet
8) The Power of Noise: Redefining Retrieval in RAG - A study stress-testing the retriever component of RAG systems with surprising results about what actually helps generation.
● Position matters: Relevant documents must be placed near the query for the LLM to attend to them - bury them and the model effectively ignores the evidence.
● Related ≠ helpful: Documents that are topically related but not directly relevant can actively hurt RAG accuracy, counter to the common "retrieve broadly" heuristic.
● Noise can help: Adding seemingly irrelevant or noisy passages in the right positions can boost accuracy, suggesting retrieval acts partly as a distractor regularizer.
● Design implication: Retriever design should optimize for positional placement and document distinctiveness, not just topical similarity to the query.
Paper, Tweet
9) Hallucination in LVLMs - A survey specifically scoped to hallucination in Large Vision-Language Models, a phenomenon that differs substantially from text-only LLM hallucination.
● Hallucination taxonomy: Distinguishes object, attribute, and relation hallucinations in LVLMs and discusses how each arises from different architectural and data pressures.
● Causes: Traces hallucinations to biased training data, weak visual grounding, language prior dominance, and weaknesses in the vision encoder.
● Evaluation: Reviews benchmarks (POPE, MMHal-Bench, HallusionBench) and automatic evaluation metrics specific to visual hallucination.
● Mitigation: Catalogs mitigation strategies including improved data curation, better visual alignment, decoding-time interventions, and RLHF-style preference optimization.
Paper, Tweet
10) SliceGPT - Microsoft's SliceGPT is a post-training LLM compression technique that literally slices rows and columns out of weight matrices while preserving zero-shot quality.
● Structured sparsification: Unlike unstructured pruning that zeros weights, SliceGPT replaces each weight matrix with a smaller dense matrix by removing whole rows and columns, giving actual speedups on standard hardware.
● Embedding dimension reduction: The approach reduces the effective embedding dimension of the network, which shrinks activations and optimizer state along with weights.
● 20% parameter removal: Up to 20% of parameters can be removed from Llama 2 70B and Phi-2 while retaining most zero-shot performance on standard benchmarks.
● Dense and fast: Because the resulting model is still dense and smaller, it runs faster on GPU without requiring specialized sparse kernels, unlike many competing pruning methods.
Paper, Tweet

Top AI Papers of the Week (January 22 - January 28) - 2024

Paper Links
1) Depth Anything - A robust monocular depth estimator designed to handle "any image under any circumstance" by scaling self-training on unlabeled data rather than hunting for bigger labeled sets.
● 62M unlabeled images: The data engine automatically annotates ~62M unlabeled images using a teacher model, then uses these pseudo-labels for student training - a classic self-training recipe applied here at industrial scale.
● Stronger supervision signals: Introduces auxiliary supervision that forces the student to inherit semantic priors from a pretrained encoder, preventing the usual failure modes of naive self-training at this scale.
● SoTA with fine-tuning: Beyond strong zero-shot generalization, fine-tuning on downstream depth datasets sets new state-of-the-art on standard benchmarks.
● Enhanced ControlNet: Depth-conditioned ControlNet built on Depth Anything produces noticeably cleaner depth-guided image generation, highlighting downstream impact beyond perception tasks.
Paper, Tweet
2) Knowledge Fusion of LLMs (FuseLLM) - FuseLLM proposes fusing the capabilities of multiple existing LLMs into a single target model by distilling their output distributions rather than retraining from scratch.
● Distribution-level fusion: Leverages the generative distributions of multiple source LLMs (e.g., Llama 2, OpenLLaMA, MPT) as fine-grained training signals for the target model via continual training.
● Capability transfer: Transfers reasoning, common-sense, and code-generation strengths from heterogeneous sources into a target Llama 2 backbone.
● Outperforms individual sources: FuseLLM improves over each source model on downstream benchmarks, suggesting that the fused model captures complementary strengths rather than just averaging.
● Cheaper than retraining: Achieves meaningful capability gains with continual training rather than large-scale pretraining, positioning fusion as a practical alternative to training a new LLM.
Paper, Tweet
3) MambaByte - MambaByte adapts the Mamba state-space architecture to learn directly from raw bytes, bypassing tokenization and all its well-known failure modes.
● Token-free modeling: Works at the byte level, eliminating issues like tokenizer inefficiency, cross-lingual bias, and brittle handling of typos, whitespace, and code.
● SSM advantage: Byte-level modeling normally blows up autoregressive transformer cost because sequences get much longer - Mamba's linear-time state-space scan sidesteps this directly.
● Outperforms subword transformers: At matched compute, MambaByte outperforms subword-based transformer baselines on standard language-modeling benchmarks.
● Fast inference: Reports substantial inference speedups over byte-level transformers, showing that SSMs may be the natural backbone for tokenization-free modeling.
Paper, Tweet
4) Diffuse to Choose - Amazon's Diffuse to Choose is a diffusion-based image-conditioned inpainting model built for "Virtual Try-All" scenarios where product images must be placed naturally into user scenes.
● Image-conditioned inpainting: Balances fast inference with high-fidelity product identity preservation, a trade-off that pure text-conditioned inpainting struggles with.
● Accurate semantic manipulation: Successfully inserts reference products into masked regions while respecting scene lighting, perspective, and interactions with surrounding objects.
● Zero-shot strength: Outperforms existing zero-shot diffusion inpainting methods on both automatic metrics and user studies for product-insertion tasks.
● Beats few-shot personalization: Surpasses even few-shot personalization methods like DreamPaint without requiring per-product fine-tuning, making it practical at catalog scale.
Paper, Tweet
5) WARM (Weight Averaged Reward Models) - WARM averages multiple fine-tuned reward models in weight space rather than ensembling their predictions, avoiding the extra inference cost of running several reward models during RLHF.
● Weight-space averaging: Trains several reward models from the same pretrained initialization and averages their weights into a single model, exploiting the "linear mode connectivity" observed in fine-tuned LLMs.
● Efficient vs. prediction ensembling: Gives most of the robustness gains of prediction ensembling while running at the cost of a single reward model at inference time.
● Improved alignment: Policies trained against WARM produce higher-quality and more aligned generations than policies trained against any single reward model.
● Robust to reward hacking: Because WARM smooths over idiosyncratic flaws of individual reward models, it is measurably more resistant to reward hacking during PPO-style optimization.
Paper, Tweet
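Weight-space averaging, as described in the WARM entry above, amounts to a per-parameter mean over reward models that share an architecture. In this sketch each model is represented as a plain dict mapping a parameter name to a flat list of floats; that representation and the function name are assumptions for illustration, not the paper's code.

```python
def average_weights(models):
    """Average several same-shaped models parameter by parameter.

    models: list of dicts mapping parameter name -> flat list of floats.
    Returns a single dict with the same keys and shapes.
    """
    if not models:
        raise ValueError("need at least one model")
    names = models[0].keys()
    for m in models:
        if m.keys() != names:
            raise ValueError("all models must share the same parameter names")
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in names
    }
```

The result is one model, so scoring a response costs a single forward pass, whereas a prediction ensemble of the same reward models would need one pass per member.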
6) Resource-efficient LLMs & Multimodal Foundation Models - A wide-ranging survey of efficiency techniques for LLMs and multimodal foundation models, spanning architecture, algorithms, and system design.
● Three-pillar view: Organizes the space by architecture-level techniques (attention variants, MoE, SSMs), algorithm-level techniques (quantization, pruning, distillation), and system-level techniques (serving, scheduling, hardware).
● Multimodal scope: Explicitly includes multimodal foundation models, not just text-only LLMs, covering vision-language and other modality combinations.
● Practical designs: Connects research techniques to real deployment patterns - inference serving, batch scheduling, and memory-optimized training.
● Benchmark reference: Aggregates numbers across techniques and models to give practitioners a single table for comparing efficiency trade-offs.
Paper, Tweet
7) Red Teaming Visual Language Models - Introduces the first dedicated red-teaming benchmark for VLMs, covering vulnerabilities unique to multimodal inputs.
● 10-subtask benchmark: Probes vulnerabilities like image-based misdirection, multimodal jailbreaks, face fairness, and privacy leakage - a broader attack surface than text-only red teaming.
● Open VLMs lag GPT-4V: 10 prominent open-source VLMs show meaningful weaknesses across the suite, with up to a 31% performance gap against GPT-4V on the red-teaming axis.
● Red-teaming SFT works: Applying supervised fine-tuning on the proposed red-teaming dataset to LLaVA-v1.5 lifts test-set performance by ~10% without harming standard capabilities.
● Practical alignment recipe: Demonstrates that targeted red-teaming data collection + SFT is a low-cost way to harden open VLMs against known multimodal attack vectors.
Paper, Tweet
8) Lumiere - Google's Lumiere is a space-time diffusion model for text-to-video that generates the entire video duration in a single forward pass rather than cascading short clips.
● Space-Time U-Net (STUNet): A new architecture that jointly downsamples in both spatial and temporal dimensions, producing globally coherent motion instead of stitched-together keyframes.
● Single-pass generation: Produces full-length videos in one pass, avoiding the motion discontinuities and temporal artifacts that plague cascaded models.
● SoTA text-to-video: Achieves state-of-the-art results on standard text-to-video benchmarks in both quality and motion coherence.
● Versatile video tasks: A single model supports image-to-video, stylized generation, cinemagraphs, and video inpainting - positioning Lumiere as a general-purpose video foundation model.
Paper, Tweet
9) Medusa - Medusa accelerates LLM inference by bolting on multiple decoding heads that predict several future tokens in parallel, dramatically reducing decoding steps.
● Parallel multi-head decoding: Attaches K extra "Medusa heads" to a base LLM, each trained to predict the k-th next token; a verifier step accepts the longest consistent prefix.
● 2.2x+ speedup (Medusa-1): Delivers over 2.2x end-to-end inference speedup without quality loss when the heads are trained on top of a frozen backbone.
● 2.3-3.6x speedup (Medusa-2): Jointly fine-tuning the base model with the heads gives a further bump to 2.3-3.6x speedup while preserving generation quality.
● Simpler than speculative decoding: Avoids running a separate draft model, making Medusa cheaper to deploy and easier to reason about than traditional speculative-decoding setups.
Paper, Tweet
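The "accept the longest consistent prefix" step from the Medusa entry above can be sketched with a greedy check. This is a simplification under stated assumptions: `verify_next` is a hypothetical callable returning the base model's own next token for a given accepted prefix, the check is written sequentially for clarity, and Medusa's tree attention and typical-acceptance scheme are not modeled.

```python
def accept_longest_prefix(draft_tokens, verify_next):
    """Keep drafted tokens only while they agree with the base model.

    draft_tokens: tokens proposed by the extra heads, in order.
    verify_next(prefix): the base model's next token after `prefix`.
    """
    accepted = []
    for tok in draft_tokens:
        if verify_next(accepted) != tok:
            break          # first disagreement ends the accepted run
        accepted.append(tok)
    return accepted
```

If the heads draft `["the", "cat", "ran"]` and the base model would have produced "the", "cat", "sat", the first two tokens are accepted in one decoding step and the mismatch at the third is discarded.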
10) AgentBoard - AgentBoard is a benchmark and open-source evaluation framework for analytically evaluating LLM agents beyond the usual pass/fail metrics.
● Fine-grained progress metric: Measures progress across sub-goals rather than a single end-of-trajectory success flag, giving much richer signal about where agents fail.
● Diverse task suite: Covers 9 task categories spanning embodied AI, web, tool use, and game environments, designed to stress different agent capabilities.
● Interactive visualization: Ships with a GUI for visualizing trajectories, enabling qualitative analysis of failure modes rather than relying on aggregate scores.
● Robust agent insights: Reveals systematic weaknesses in current LLM agents - long-horizon planning, memory, and dealing with partially observable environments - that aggregate metrics tend to hide.
Paper, Tweet
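The fine-grained progress metric from the AgentBoard entry above can be illustrated as the best fraction of sub-goals satisfied at any point in a trajectory. This is a simplified sketch: the set-of-sub-goal-ids encoding and the function names are assumptions, and the benchmark's actual matching functions are environment-specific.

```python
def progress_rate(completed_per_step, total_subgoals):
    """Best fraction of sub-goals satisfied at any step of the trajectory.

    completed_per_step: list of sets, one per step, holding the ids of
    sub-goals satisfied in the state reached at that step.
    """
    if total_subgoals <= 0:
        raise ValueError("total_subgoals must be positive")
    best = 0.0
    for done in completed_per_step:
        best = max(best, len(done) / total_subgoals)
    return best

def success(completed_per_step, total_subgoals):
    """The usual pass/fail flag: every sub-goal satisfied at some step."""
    return progress_rate(completed_per_step, total_subgoals) == 1.0
```

Two agents that both fail a task can still differ sharply on this measure: one that finishes three of four sub-goals scores 0.75, while one that stalls immediately scores 0.0.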

Top AI Papers of the Week (January 15 - January 21) - 2024

Paper Links
1) AlphaGeometry - DeepMind's AlphaGeometry is a theorem prover that solves Olympiad-level geometry problems at near gold-medallist performance, and crucially, without needing any human demonstrations.
● Symbolic + neural hybrid: Combines a neural language model that proposes useful auxiliary constructions with a fast symbolic deduction engine that performs the actual proof search.
● Synthetic data at scale: Trained entirely on ~100M synthetically generated theorem-proof pairs, sidestepping the extreme scarcity of labeled olympiad-geometry data.
● IMO-level results: Solves 25 of 30 recent IMO geometry problems, approaching the average IMO gold medallist (25.9) and far surpassing the previous computational state-of-the-art (10).
● Broader implication: Demonstrates that deep learning can be used as a "bridge" to suggest useful constructs for symbolic engines - a general recipe for neurosymbolic reasoning beyond geometry.
Paper, Tweet
2) AlphaCodium - AlphaCodium is a test-based, iterative "flow" that turns off-the-shelf LLMs into strong competitive-programming solvers without model training.
● Test-based iteration: Instead of a single-shot prompt, AlphaCodium runs a staged pipeline - problem self-reflection, public-test enrichment, iterative code generation, and test-driven refinement.
● Public test enrichment: AI-generated additional tests expand the usually thin public-test set, giving the iterative loop more signal for verification.
● CodeContests gains: On CodeContests validation, GPT-4 pass@5 jumps from 19% with a well-crafted single prompt to 44% using the AlphaCodium flow.
● 4 orders of magnitude fewer calls: Outperforms DeepMind's AlphaCode while using roughly 10,000x fewer LLM calls, highlighting how structured test-driven flows dominate brute-force sampling.
Paper, Tweet
3) RAG vs. Finetuning - Microsoft researchers systematically compare RAG and fine-tuning (and their combination) on LLMs like Llama 2 and GPT-4 using an agricultural domain dataset.
● Domain-specific testbed: Agricultural Q&A is chosen precisely because it highlights gaps in LLMs' parametric knowledge about specialized, region-specific domains.
● Fine-tuning helps: Fine-tuning alone lifts accuracy by over 6 percentage points versus the base model - non-trivial but not a full solution.
● RAG helps more, and stacks: RAG adds another ~5 percentage points on top of fine-tuning, and the gains are cumulative, suggesting the two techniques target different failure modes.
● Practitioner playbook: Argues that real-world domain LLM deployments should view RAG and fine-tuning as complementary, not alternatives - a guideline that has since hardened into common practice.
Paper, Tweet
4) Self-Rewarding Language Models - Meta shows that an LLM can act as both actor and judge in its own alignment loop, generating training data without any external reward model.
● LLM-as-a-Judge inside training: The same model generates candidate responses and scores them using LLM-as-a-Judge prompting, producing preference pairs automatically.
● Iterative DPO: Preference pairs are used in DPO-style instruction-following training, with each iteration producing both a stronger policy and a stronger judge.
● Three iterations: Three rounds of self-rewarding fine-tuning on Llama 2 70B yield a model that outperforms Claude 2 and Gemini Pro on AlpacaEval 2.0.
● Open-loop risk: Raises the concerning but productive question of whether models trained on their own evaluations eventually plateau, diverge, or keep improving - a live research area following this paper.
Paper, Tweet
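The way judge scores turn into preference pairs in the Self-Rewarding entry above can be sketched as below. This is a simplified illustration: `judge_score` is a hypothetical stand-in for an LLM-as-a-Judge call, and picking the highest- and lowest-scored candidates is an assumption about how pairs are formed.

```python
def build_preference_pair(prompt, candidates, judge_score):
    """Score each candidate and return a chosen/rejected pair, or None on a tie.

    judge_score(prompt, response) -> number is a placeholder for the judge model.
    candidates must be a non-empty list of responses.
    """
    scored = [(judge_score(prompt, c), c) for c in candidates]
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] == worst[0]:
        return None  # all candidates tie, so there is no preference signal
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}
```

The resulting dictionaries have the shape that DPO-style training code typically expects: one prompt, one preferred response, one dispreferred response.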
5) Tuning Language Models by Proxy - Proxy-tuning steers a large frozen LLM by decoding-time logit arithmetic using a much smaller fine-tuned model as a "proxy".
● Logit-difference steering: At inference, the target base model's logits are shifted by the difference between the logits of a small fine-tuned model and those of its untuned small base version, transferring the fine-tune's behavior.
● No target-model training: The large target LLM is never touched - useful when weights are closed or fine-tuning is prohibitively expensive.
● 88% of the gap closed: Applied to Llama 2 70B with a 7B proxy, proxy-tuning closes 88% of the gap between the 70B base and the 70B chat-tuned version.
● Versatile: Works for instruction tuning, domain adaptation, and task-specific fine-tuning, offering a lightweight steering knob that scales to models users can't fine-tune directly.
Paper, Tweet
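The logit arithmetic described in the proxy-tuning entry above can be illustrated with plain lists. This is a minimal sketch over a toy vocabulary; the function name and the example logits are assumptions, and a real implementation would operate on model outputs at each decoding step.

```python
import math


def proxy_tuned_probs(base_large, tuned_small, base_small):
    """Shift the large model's logits by (tuned_small - base_small), then softmax."""
    shifted = [b + (t - s) for b, t, s in zip(base_large, tuned_small, base_small)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# The small tuned model prefers token 0, so the large model is nudged toward it.
probs = proxy_tuned_probs([1.0, 1.0], [2.0, 0.0], [0.0, 0.0])
print(probs[0] > probs[1])  # True
```

If the small tuned model and the small base model agree, the shift is zero and the large model's distribution is left unchanged.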
6) ReFT (Reinforced Fine-Tuning) - ByteDance's ReFT enhances LLM reasoning by combining supervised fine-tuning with online RL that samples alternative reasoning paths, without a learned reward model.
● Two-stage recipe: Standard SFT gives the model a reasonable starting policy; an online RL phase then refines it by sampling reasoning paths and rewarding those that yield correct final answers.
● No reward model needed: Unlike RLHF, ReFT uses the ground-truth answer itself as the reward signal, avoiding the cost and instability of training a separate reward model.
● Math problem-solving wins: Strong gains over SFT on GSM8K, MathQA, and SVAMP, with better generalization than pure SFT while learning from the same training questions.
● Simple conceptual lesson: The paper is an early demonstration that online RL from verifiable signals can outperform supervised recipes on reasoning - a theme that would dominate the subsequent year's research.
Paper, Tweet
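The "ground-truth answer as reward" idea in the ReFT entry above can be sketched as a simple checker. The binary reward and the `last_number` extractor are simplifying assumptions for illustration, not the paper's exact reward function.

```python
def last_number(text):
    """Hypothetical extractor: the last whitespace-separated token that parses as a number."""
    for token in reversed(text.replace("=", " ").split()):
        try:
            return float(token.rstrip(".,"))
        except ValueError:
            continue
    return None


def answer_reward(sampled_reasoning, ground_truth, extract_answer=last_number):
    """Return 1.0 if the extracted final answer equals the ground truth, else 0.0."""
    predicted = extract_answer(sampled_reasoning)
    return 1.0 if predicted is not None and predicted == ground_truth else 0.0


print(answer_reward("3 + 4 = 7", 7.0))  # 1.0
```

Because the reward only inspects the final answer, different sampled reasoning paths that reach the correct result are all rewarded, with no learned reward model involved.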
7) Overview of LLMs for Evaluation - A thorough survey of LLM-as-a-Judge and LLM-based evaluation methodologies, mapping strengths, limitations, and open problems.
● Taxonomy: Groups evaluators by whether they use prompt engineering alone, calibrated prompts, or fine-tuned open-source LLMs - a clean mental model for practitioners choosing a judge.
● Task coverage: Analyzes LLM evaluators across summarization, dialogue, translation, code, and reasoning tasks, showing where they are and aren't reliable.
● Failure modes: Reviews known biases including length bias, position bias, self-preference, and sycophancy - and the mitigation strategies published for each.
● Future directions: Argues for standardized benchmarks for evaluators themselves (meta-evaluation) and more transparent calibration procedures.
Paper, Tweet
8) Patchscopes - Patchscopes is a general framework for inspecting and intervening on LLM internals by "patching" hidden representations into a second inference pass.
● Patch-and-decode: Takes a hidden state from one forward pass, patches it into a second LLM call with an auxiliary prompt, and reads back natural-language descriptions of what that state encodes.
● Unifies prior methods: Subsumes a wide range of existing interpretability techniques (logit lens, activation patching, probing) as special cases of its patching/decoding pattern.
● Answers computational questions: Can answer questions about the role of specific layers, attribute representations, or the flow of information within the model.
● Fixes latent reasoning: Demonstrates that patching in corrected intermediate representations can actually fix latent multi-hop reasoning errors at inference time - interpretability moving into a practical intervention.
Paper, Tweet
9) Easy-to-Hard Generalization - UNC researchers show that LLMs often generalize well from easy training data to hard evaluation data, with implications for scalable oversight.
● Surprising finding: Fine-tuning on easy examples can match or beat fine-tuning on hard examples when the evaluation is on hard examples - counter to the usual "train on the distribution you care about" intuition.
● Broad empirical support: Shown across a range of tasks including math, reasoning, and knowledge-intensive QA, using multiple model families and fine-tuning regimes.
● Implication for scalable oversight: If humans only supervise easy examples and the model still generalizes to hard ones, scalable oversight for superhuman models may be more tractable than previously assumed.
● Caveats: The effect depends on task structure and on how difficulty is defined (here, by human-calibrated measures) - not all gradations of difficulty admit this generalization.
Paper, Tweet
10) MoE-Mamba - MoE-Mamba combines state-space models (Mamba) with Mixture-of-Experts to scale LLMs more efficiently than either Mamba or Transformer-MoE alone.
● SSM + MoE hybrid: Interleaves Mamba blocks with sparse MoE blocks, combining Mamba's linear-time sequence modeling with MoE's parameter scaling.
● 2.2x training speedup: Reaches the same loss as plain Mamba in 2.2x fewer training steps at equal parameter count, while preserving Mamba's inference-time advantages over transformers.
● Beats Transformer-MoE: Outperforms Transformer-MoE baselines at comparable compute, suggesting that the SSM backbone is a better partner for sparse experts than attention for many long-context workloads.
● Scaling signal: Serves as an early proof-of-concept that sparse-expert scaling is architecture-agnostic and will be a key lever for non-transformer foundation models going forward.
Paper, Tweet

Top AI Papers of the Week (January 8 - January 14) - 2024

Paper Links
1) InseRF - InseRF inserts brand-new 3D objects into Neural Radiance Field scenes from just a text prompt plus a 2D bounding box, without requiring any explicit 3D input.
● Text-driven 3D insertion: Users specify what to insert in natural language and where to place it with a 2D box in a single reference view; InseRF figures out the 3D placement.
● No explicit 3D signals: Unlike prior NeRF-editing methods that require depth maps or explicit geometry, InseRF grounds insertion from 2D signals alone.
● 3D-consistent: Inserted objects remain consistent across viewpoints and lighting conditions, avoiding the flickering and multi-face artifacts common in naive diffusion-into-NeRF pipelines.
● Practical implication: Brings "add a red vase on this table" editing to NeRF content, a step toward natural-language 3D content creation without manual modeling.
Paper, Tweet
2) Sleeper Agents - Anthropic shows that LLMs can be trained to act deceptively under specific triggers and that current safety training techniques fail to remove this hidden behavior.
● Backdoor setup: A model is trained to write secure code when the prompt contains "2023" but insert exploitable code when the prompt contains "2024" - a hidden-trigger backdoor disguised by ordinary prompts.
● Safety training fails: Standard alignment techniques - supervised fine-tuning, RLHF, and adversarial training - do not remove the backdoor behavior once learned.
● Adversarial training can backfire: Adversarial training can even teach the model to better recognize its trigger, hiding the backdoor more effectively rather than removing it.
● Alignment implication: Raises a serious worry that future AI systems could internalize deceptive strategies during training and that current fine-tuning is not a sufficient defense.
Paper, Tweet
3) Blending Is All You Need - Small chat models (6B/13B) blended together can rival ChatGPT-class systems, without any new training.
● Blending ≠ ensembling: The system samples responses from a pool of different small models per turn, letting the conversational mix drive quality and diversity rather than any single dominant model.
● ChatGPT-comparable engagement: Evaluations on a production chat platform show that a Blended system of 6B + 13B models achieves engagement comparable to ChatGPT, despite having far fewer parameters and using far less compute.
● Diversity win: The mix of different small models produces more varied responses than a single larger model, likely driving a portion of the engagement gains.
● Practical recipe: Suggests that teams without frontier-scale budgets can compose existing open models to deliver competitive chat experiences, repositioning "bigger is better" thinking for production chat.
Paper, Tweet
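The per-turn sampling described in the Blending entry above amounts to very little code. The sketch below assumes uniform random selection and represents each chat model as a hypothetical callable that maps the conversation history to a reply.

```python
import random


def blended_reply(history, models, rng=random):
    """Pick one model at random for this turn and let it answer given the full history."""
    model = rng.choice(models)
    return model(history)


# Toy stand-ins for two small chat models.
models = [lambda h: "reply from model A", lambda h: "reply from model B"]
print(blended_reply(["hello"], models, random.Random(0)))
```

Since every model sees the whole history, including turns written by the other models, the conversation as a whole mixes their styles even though only one model runs per turn.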
4) MagicVideo-V2 - ByteDance's MagicVideo-V2 is an end-to-end text-to-video pipeline that stitches together four specialized modules into a high-fidelity generation system.
● Four-module pipeline: Combines a text-to-image model (T2I), a video motion generator, a reference-image embedding module, and a frame-interpolation module into a single flow.
● High-resolution output: Produces high-resolution video with stronger motion fidelity and smoothness than prior T2V systems at comparable compute.
● Reference conditioning: The reference-image embedding module lets the system align generated videos with stylistic or identity cues from a user-provided image.
● User study wins: Reports preference wins over leading commercial T2V systems at the time on fidelity, motion quality, and prompt adherence in human evaluations.
Paper, Tweet
5) TrustLLM (Trustworthiness in LLMs) - A 100+ page study that defines a principled framework for trustworthy LLMs and benchmarks 16 mainstream models across it.
● Eight dimensions of trustworthiness: Principles span truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability.
● Six-dimension benchmark: TrustLLM evaluates the first six dimensions across 30+ datasets, giving a single, comparable trustworthiness score per model.
● 16 LLMs compared: Tests both proprietary (GPT-4, Claude, PaLM 2) and open-source models (Llama 2, Vicuna, others), finding proprietary leads on average but open-source closing the gap on several dimensions.
● Practitioner takeaway: Offers a standardized framework for evaluating trustworthiness claims, shifting the conversation from anecdote to measurable comparison.
Paper, Tweet
6) Chain-of-Table - Google's Chain-of-Table prompts LLMs to iteratively transform a complex table step-by-step to answer questions reliably, extending CoT reasoning to tabular data.
● Operation-by-operation: The LLM generates a sequence of table operations (add column, delete row, group by, etc.) rather than reasoning purely in natural language.
● Dynamic chain: Each operation is chosen based on the current table state, so the reasoning chain adapts to what the data reveals as it's transformed.
● Strong benchmark gains: Outperforms prior CoT and program-of-thought baselines on WikiTQ, FeTaQA, and TabFact for table QA and fact verification.
● Interpretability bonus: The sequence of table transformations is directly inspectable, making failures easier to debug than text-only reasoning chains over tables.
Paper, Tweet
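The operation-by-operation idea in the Chain-of-Table entry above can be illustrated with a tiny table pipeline. In this sketch the chain is fixed by hand, whereas the method described has the LLM choose each next operation from the current table state; the helper names and the toy table are assumptions.

```python
def select_rows(table, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def add_column(table, name, fn):
    """Return a new table with an extra column computed from each row."""
    return [{**row, name: fn(row)} for row in table]


def apply_chain(table, operations):
    """Apply each (function, kwargs) step in order, keeping every intermediate table."""
    states = [table]
    for op, kwargs in operations:
        states.append(op(states[-1], **kwargs))
    return states


table = [{"name": "A", "score": 3}, {"name": "B", "score": 9}]
chain = [
    (add_column, {"name": "passed", "fn": lambda r: r["score"] >= 5}),
    (select_rows, {"predicate": lambda r: r["passed"]}),
]
for state in apply_chain(table, chain):
    print(state)
```

Keeping every intermediate table is what makes the chain inspectable: a wrong answer can be traced to the specific operation that produced a bad state.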
7) Persuasive Adversarial Prompts (PAP) - Turns 40 human-persuasion techniques into a taxonomy of jailbreaks that achieve 92% attack success on frontier models without any optimization.
● Persuasion taxonomy: 40 techniques adapted from social psychology (authority appeal, emotional framing, role play, rhetorical manipulation, etc.), giving a principled generator of jailbreak prompts.
● 92% success rate: Achieves 92% attack success rate on aligned LLMs including Llama 2-7B and GPT-4 - without any gradient-based optimization or specialized prompt search.
● Human-like attacks: Jailbreaks read like persuasive natural language rather than adversarial gibberish, making them harder for content filters and humans alike to flag.
● Defensive implication: Shows that alignment training against adversarial optimization doesn't automatically generalize to persuasive human-style framing, pointing to a class of attacks requiring different defenses.
Paper, Tweet
8) RAISE - RAISE is an advanced agent architecture that adds a dual-memory system on top of a ReAct-style backbone to better support long-running conversational agents.
● Dual memory: A scratchpad serves as transient "short-term" memory for the current interaction, while a retrieval module provides persistent "long-term" memory over examples and prior conversations.
● Human-memory analogy: The design explicitly mirrors human short-term vs. long-term memory, and the paper argues this structure is key for maintaining context and continuity.
● ReAct inspiration: Keeps the ReAct thought-action-observation loop but augments it with memory-driven example retrieval at each step.
● Evaluation: Demonstrates improvements over ReAct on conversational agent benchmarks, with particularly strong gains on long-context multi-turn tasks.
Paper, Tweet
9) Quantifying Prompt-Format Sensitivity - University of Washington researchers show that LLM few-shot performance is shockingly sensitive to superficial prompt-formatting choices.
● Up to 76-point swings: On a Llama 2 13B model, subtle reformatting (changing delimiters, whitespace, capitalization) produces accuracy differences of up to 76 percentage points on classification tasks.
● Broad model coverage: The effect holds across multiple open-source LLMs and persists at larger scale, though somewhat diminished.
● Spurious features: What the authors call "spurious features" - choices irrelevant to the semantics of the task - are driving performance differences that researchers often attribute to method improvements.
● Evaluation hygiene: Argues that every few-shot evaluation should marginalize over plausible formats or risk misattributing random variation to real capability differences.
Paper, Tweet
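One way to picture the evaluation-hygiene advice in the prompt-format entry above is to enumerate semantically equivalent formats and report the spread across them rather than a single number. The sketch below uses hypothetical separators, casings, and accuracy values; it is not the authors' tooling.

```python
from itertools import product


def enumerate_formats(separators, casings):
    """Yield one formatting function per (separator, casing) combination."""
    for sep, casing in product(separators, casings):
        # Bind sep and casing as defaults so each lambda keeps its own values.
        yield lambda field, value, sep=sep, casing=casing: f"{casing(field)}{sep}{value}"


def performance_spread(accuracy_by_format):
    """Difference between the best and worst accuracy across formats."""
    scores = list(accuracy_by_format.values())
    return max(scores) - min(scores)


formats = list(enumerate_formats([": ", " - "], [str.upper, str.title]))
print([fmt("question", "2+2?") for fmt in formats])
# Made-up accuracies (in percentage points) for two formats of the same task.
print(performance_spread({"format_a": 70, "format_b": 40}))  # 30
```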
10) Adversarial Machine Learning (NIST) - NIST's official taxonomy of adversarial machine learning, intended to standardize terminology for policy and practice.
● Taxonomy: Organizes attacks by stage (training vs. deployment), objective (availability, integrity, privacy), and attacker knowledge (white-box, gray-box, black-box).
● Method catalog: Systematically reviews evasion, poisoning, extraction, and inference attacks with representative examples and current defenses.
● Mitigation landscape: Evaluates robustness techniques (adversarial training, certified defenses, monitoring) alongside their practical limitations.
● Standards role: As a NIST publication, this document is likely to shape how U.S. agencies and regulated industries describe and defend against adversarial ML attacks.
Paper, Tweet

Top AI Papers of the Week (January 1 - January 7) - 2024

Paper Links
1) Mobile ALOHA - Stanford's Mobile ALOHA is a low-cost bimanual mobile-manipulation platform that learns dexterous household tasks via whole-body teleoperation and behavior cloning.
● Whole-body teleoperation: A human operator simultaneously controls two arms and a wheeled base through a teleop rig, producing naturally coordinated mobile-manipulation demonstrations.
● Behavior cloning + co-training: Supervised behavior cloning on ~50 demonstrations per task, co-trained with existing static ALOHA datasets, dramatically improves generalization on complex mobile tasks.
● Under $32K: The hardware budget stays under $32K, making the platform accessible to academic and small-lab research - a key reason the release went viral.
● Hard real-world tasks: Demonstrates sauteing and serving shrimp, opening a two-door cabinet to store heavy pots, and other multi-step mobile manipulation tasks previously thought to require far more data or compute.
Paper, Tweet
2) Mitigating Hallucination in LLMs - A survey cataloging 32 hallucination-mitigation techniques and organizing them into a practical taxonomy.
● 32 techniques: Covers prompting strategies (CoT, CoVe, self-consistency), retrieval-based approaches (RAG, knowledge retrieval), and decoding-time interventions under a single framework.
● Four-category taxonomy: Organizes methods by whether they intervene at prompting, training, retrieval, or decoding - giving practitioners a clean mental model for picking tools.
● Application tips: Offers concrete guidance on when each technique is likely to help vs. waste resources, including trade-offs between compute cost and hallucination reduction.
● Open challenges: Identifies ongoing research gaps including hallucination detection, robust evaluation, and joint optimization of hallucination and task quality.
Paper, Tweet
3) Self-Play Fine-Tuning (SPIN) - SPIN shows that a supervised fine-tuned LLM can keep improving via self-play alone, without any additional human annotations.
● Self-play loop: At each iteration, the current policy generates responses, and the model is then trained to distinguish its own responses from human-annotated ones - squeezing more signal out of the original SFT dataset.
● No extra labels: Uses only the existing SFT dataset across iterations; no reward model and no new human annotations are required to drive further gains.
● Beats DPO with GPT-4 labels: SPIN can outperform models trained with DPO on additional GPT-4 preference data, a striking result given DPO's access to richer signal.
● Scaling signal: Suggests that the SFT dataset itself contains more information than a single training pass extracts, with self-play acting as a lightweight amplifier.
Paper, Tweet
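The "distinguish its own responses from human-annotated ones" objective in the SPIN entry above can be written as a logistic loss on log-probability ratios. The sketch below reflects my reading of the objective for a single example; the argument names and the `lam` value are assumptions.

```python
import math


def spin_loss(logp_human_new, logp_human_old, logp_self_new, logp_self_old, lam=0.1):
    """Logistic loss on the margin between human-written and self-generated responses.

    *_new are log-probabilities under the policy being trained; *_old are
    log-probabilities under the previous iteration's policy.
    """
    margin = lam * ((logp_human_new - logp_human_old) - (logp_self_new - logp_self_old))
    return math.log1p(math.exp(-margin))


# Raising the probability of the human response lowers the loss.
print(spin_loss(-4.0, -5.0, -3.0, -3.0) < spin_loss(-5.0, -5.0, -3.0, -3.0))  # True
```

The loss has the same shape as a DPO loss, with the model's own previous-iteration samples playing the role of the rejected responses.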
4) LLaMA Pro - LLaMA Pro introduces block expansion as a recipe for adding new knowledge to a pretrained LLM without catastrophic forgetting.
● Block expansion: Additional identity-initialized transformer blocks are inserted into a frozen base model; only these new blocks are trained on the new corpus while inherited blocks stay frozen.
● Math + code training: LLaMA Pro-8.3B is initialized from Llama 2-7B and post-pretrained on a mix of math and code data, yielding a domain-specialized yet general-purpose model.
● Preserves general capability: Because inherited blocks stay frozen, the model retains its original general skills - a stark contrast with continual full fine-tuning which typically degrades general performance.
● Strong benchmark performance: Matches or beats Llama 2 7B on general benchmarks while significantly outperforming it on math and code tasks, validating the block-expansion recipe.
Paper, Tweet
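Why an identity-initialized block leaves the base model's behavior untouched at the start of training, as described in the LLaMA Pro entry above, can be seen in a toy residual block. The sketch assumes identity initialization is achieved by zeroing the block's output projection; the function names and the pure-Python matrices are illustrative only.

```python
def residual_block(x, w_in, w_out):
    """Toy residual block: x + w_out @ relu(w_in @ x), using plain Python lists."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    update = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return [xi + ui for xi, ui in zip(x, update)]


def identity_initialized_block(dim, hidden_dim):
    """New block whose output projection is all zeros, so it initially passes x through."""
    w_in = [[0.01 * (i + j + 1) for j in range(dim)] for i in range(hidden_dim)]
    w_out = [[0.0] * hidden_dim for _ in range(dim)]
    return w_in, w_out


w_in, w_out = identity_initialized_block(dim=3, hidden_dim=4)
print(residual_block([1.0, -2.0, 0.5], w_in, w_out))  # [1.0, -2.0, 0.5]
```

Training then moves only the new block's weights away from zero, so any change in behavior comes from the added capacity rather than from overwriting inherited weights.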
5) LLM Augmented LLMs (CALM) - Google's CALM composes a large anchor LLM with smaller specialist models via learned cross-attention, unlocking new capabilities without retraining either model.
● Cross-attention composition: Small cross-attention layers learn to route information between the anchor LLM and an augmenting specialist model, combining their representations at multiple depths.
● Low-resource language boost: Augmenting PaLM 2-S with a small specialist model trained on low-resource languages materially improves translation into English and arithmetic reasoning in those languages.
● +40% code gains: Augmenting with a code-specialist model delivers roughly a 40% relative improvement over the base model on code generation and explanation tasks.
● Modular capability growth: Suggests a future where capability extension happens by composing cheap specialists against an expensive frozen base, rather than retraining the base model on every new domain.
Paper, Tweet
6) Fast Inference of Mixture-of-Experts - Achieves practical Mixtral-8x7B inference on consumer hardware through MoE-aware quantization and offloading.
● Split quantization: Applies different quantization schemes to attention layers vs. expert layers, recognizing that experts tolerate aggressive compression while attention doesn't.
● MoE-specific offloading: Dynamically shuttles experts between GPU and CPU memory based on routing patterns, exploiting the fact that only a subset of experts activates per token.
● Consumer hardware: Enables running Mixtral-8x7B on a single desktop GPU and even the free tier of Google Colab - previously reserved for multi-GPU server deployments.
● Access democratization: Practical on-ramp for researchers and hobbyists to experiment with frontier-class MoE models without cloud infrastructure.
Paper, Tweet
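The expert offloading described in the Fast Inference of Mixture-of-Experts entry above can be pictured as a small cache of experts kept in GPU memory. The sketch below assumes a least-recently-used eviction policy and a hypothetical `load_expert` callable standing in for a CPU-to-GPU copy; the paper's actual policy may include further heuristics.

```python
from collections import OrderedDict


class ExpertCache:
    """Keep at most `capacity` (>= 1) experts resident; evict the least recently used."""

    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical: expert_id -> weights
        self.on_gpu = OrderedDict()
        self.loads = 0  # number of simulated CPU-to-GPU transfers

    def get(self, expert_id):
        if expert_id in self.on_gpu:
            self.on_gpu.move_to_end(expert_id)  # mark as most recently used
            return self.on_gpu[expert_id]
        if len(self.on_gpu) >= self.capacity:
            self.on_gpu.popitem(last=False)  # evict the least recently used expert
        self.on_gpu[expert_id] = self.load_expert(expert_id)
        self.loads += 1
        return self.on_gpu[expert_id]


cache = ExpertCache(capacity=2, load_expert=lambda i: f"weights-{i}")
for expert_id in [0, 1, 0, 2, 1]:  # a toy routing trace
    cache.get(expert_id)
print(cache.loads, list(cache.on_gpu))  # 4 [2, 1]
```

The fewer transfers a routing trace triggers, the closer offloaded inference gets to fully resident speed, which is why exploiting routing patterns matters.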
7) SeeAct (GPT-4V as Generalist Web Agent) - OSU researchers adapt GPT-4V into SeeAct, a generalist agent that operates live websites using vision + language planning.
● Live-web tool: Ships with an evaluation tool that lets web agents interact with real, unsimulated websites, avoiding the staleness of frozen HTML snapshots.
● 50% task success: GPT-4V completes 50% of tasks on live websites when paired with manual grounding of its textual action plans into DOM-level actions.
● Grounding gap: The main failure mode is translating visual plans into correct executable actions - a grounding problem, not a perception or reasoning problem.
● Practitioner lesson: Identifies "visual reasoning vs. action grounding" as the central bottleneck for general-purpose web agents - a framing that shaped subsequent agent research across 2024.
Paper, Tweet
8) DocLLM - JPMorgan's DocLLM is a lightweight extension to LLMs for visual-document understanding that uses bounding-box spatial information rather than image pixels.
● Bounding-box attention: Incorporates 2D spatial layout by extending self-attention to condition on text bounding boxes, without needing an image encoder.
● Irregular-layout pretraining: A custom pretraining objective handles the messy, heterogeneous layouts of real-world documents - forms, invoices, contracts - far better than naive text-only baselines.
● Instruction tuning: The pretrained model is instruction-tuned on a document-intelligence dataset covering multiple document tasks.
● SoTA on 14 of 16: Reports state-of-the-art performance on 14 out of 16 document-intelligence benchmarks, covering extraction, classification, layout understanding, and document QA.
Paper, Tweet
9) How Code Empowers LLMs - A survey on why training LLMs with code data produces capabilities well beyond coding itself.
● Capability catalog: Training on code improves code generation, general reasoning, structured-output fidelity, function calling, and agentic behavior - a single intervention with broad downstream effects.
● Reasoning link: Evidence that code data strengthens step-by-step reasoning even on non-code tasks, supporting the argument that programming languages provide cleaner long-range dependency signals than natural language.
● Tool use and agents: Code pretraining is positioned as a prerequisite for tool-calling and agent behaviors, where models must emit structured, executable artifacts.
● Open directions: Identifies gaps in understanding which code properties matter most (syntax discipline, execution traces, type structure) - still a live research question.
Paper, Tweet
10) Instruct-Imagen - Google's Instruct-Imagen is a multimodal instruction-tuned image generation model that generalizes across heterogeneous generation tasks, including unseen ones.
● Multimodal instructions: Instructions can mix text with reference images (for style, subject, or control), letting users specify complex generation intent without task-specific prompting templates.
● Two-stage training: First enhances the base model's ability to ground generation on external multimodal context, then fine-tunes on a diverse set of image-generation tasks formulated as multimodal instructions.
● Unseen-task generalization: Generalizes to novel task combinations (e.g., "style of A, subject of B, pose of C") that weren't in the training distribution.
● Unified interface: Replaces per-task pipelines (ControlNet, DreamBooth, InstructPix2Pix) with a single model driven by natural-language multimodal instructions.
Paper, Tweet