This page collects every weekly issue of AI Papers of the Week from 2024. For other years, see the main index.

| Paper | Links |
| ------------- | ------------- |
| 1) DeepSeek-V3 - a 671B-parameter MoE language model that activates 37B parameters per token, using the MLA and DeepSeekMoE architectures for efficient operation; it introduces an auxiliary-loss-free load balancing approach and employs multi-token prediction during training to enhance performance; following pre-training on 14.8 trillion tokens, the model underwent SFT and RL stages, achieving performance comparable to leading closed-source models while surpassing other open-source alternatives; the model requires only 2.788M H800 GPU hours for training, with stable training that avoids any irrecoverable loss spikes. | Paper, Tweet |
| 2) Large Concept Models - presents an approach that operates on sentence-level semantic representations called concepts, moving beyond the token-level processing typical of current LLMs; the model leverages SONAR sentence embeddings to support 200 languages across text and speech modalities, training on autoregressive sentence prediction using various approaches from MSE regression to diffusion-based generation; experiments with both 1.6B and 7B parameter variants trained on 1.3T and 7.7T tokens respectively demonstrate strong performance on generative tasks like summarization and summary expansion. | Paper, Tweet |
| 3) ModernBERT - a new encoder-only transformer model that achieves state-of-the-art performance on classification and retrieval tasks while being more efficient than previous encoders; it was trained on 2T tokens with 8192 sequence length and incorporates modern optimizations that represent a significant improvement over BERT; the model is specifically designed for practical deployment, offering superior speed and memory efficiency on common GPUs. | Paper, Tweet |
| 4) Automating the Search for Artificial Life - presents a new approach that uses foundation models to automatically discover interesting artificial life simulations across multiple platforms like Boids, Lenia, and Game of Life; the system can find simulations that produce specific target behaviors, discover simulations that generate temporally open-ended novelty, and map out diverse simulation spaces; it discovers new lifeforms in Lenia and Boids, while also enabling quantitative measurement of previously qualitative phenomena in a human-aligned way. | Paper, Tweet |
| 5) A Survey on LLM Inference-Time Self-Improvement - presents a survey that analyzes three categories of LLM inference-time self-improvement techniques: independent methods like enhanced decoding, context-aware approaches using external data, and model collaboration strategies. | Paper, Tweet |
| 6) Explore Theory-of-Mind - introduces ExploreToM, a framework that uses A* search to generate diverse, complex theory-of-mind scenarios that reveal significant limitations in current LLMs' social intelligence capabilities; testing showed even advanced models like GPT-4 and Llama-3 perform poorly (as low as 5% accuracy) on these challenging scenarios, despite their strong performance on simpler benchmarks; fine-tuning on ExploreToM data improved performance on existing benchmarks by 27 points. | Paper, Tweet |
| 7) LearnLM - a new model that can follow pedagogical instructions, allowing it to adapt its teaching approach based on specified educational needs rather than defaulting to simply presenting information; experimental results show that LearnLM is preferred over other leading models, outperforming GPT-4o by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%; this instruction-following approach avoids committing to a single pedagogical framework, instead enabling teachers and developers to specify their desired teaching behaviors while allowing for continuous improvement alongside other capabilities. | Paper, Tweet |
| 8) Empowering MLLM with o1-like Reasoning and Reflection - proposes a new learning-to-reason method called CoMCTS that enables multimodal language models to develop step-by-step reasoning capabilities by leveraging collective knowledge from multiple models; the approach was used to create Mulberry-260k, a dataset with explicit reasoning trees, which was then used to train the Mulberry model series; the method demonstrates strong performance on benchmarks, with the models showing improved reasoning and reflection capabilities. | Paper, Tweet |
| 9) Reinforcement Learning Overview - presents a comprehensive overview of reinforcement learning. | Paper, Tweet |
| 10) DRT-o1 - applies long chain-of-thought reasoning to machine translation, particularly for handling metaphors and similes across different cultures; the system uses a multi-agent framework with a translator working iteratively with an advisor and evaluator to produce better translations; testing with Qwen2.5 models showed significant improvements in BLEU and CometScore metrics, with DRT-o1-7B outperforming larger models like QwQ-32B-Preview. | Paper, Tweet |
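
The DeepSeek-V3 entry above describes a sparse MoE model in which only a fraction of the parameters is active per token. The sketch below illustrates the general idea with plain top-k routing; it is not DeepSeek-V3's actual router (the auxiliary-loss-free balancing is not modeled), and `top_k_route` and `active_fraction` are hypothetical helper names introduced only for illustration.

```python
import math


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    `scores` are per-expert router logits for one token (illustrative values).
    Returns a list of (expert_index, weight) pairs, highest score first.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(active_params, total_params):
    """Share of parameters used per token in a sparse model."""
    return active_params / total_params


routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
fraction = active_fraction(37e9, 671e9)  # figures quoted in the entry above
```

With the figures from the summary, roughly 5.5% of the parameters are active for any given token.
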

| Paper | Links |
| ------------- | ------------- |
| 1) Genesis - a new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than real time. | Paper, Tweet |
| 2) Alignment Faking in LLMs - demonstrates that the Claude model can engage in "alignment faking"; it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods. | Paper, Tweet |
| 3) TheAgentCompany - a new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; when tested with various LLMs, including both API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, the results show the current limitations of AI agents. The best-performing model, Claude-3.5-Sonnet, achieved only a 24% success rate on completing tasks fully while scoring 34.4% when accounting for partial progress. | Paper, Tweet |
| 4) Graphs to Text-Attributed Graphs - automatically generates textual descriptions for nodes in a graph, which leads to effective graph to text-attributed graph transformation; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. | Paper, Tweet |
| 5) Qwen2.5 Technical Report - Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4. | Paper, Tweet |
| 6) PAE (Proposer-Agent-Evaluator) - a learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks. | Paper |
| 7) DeepSeek-VL2 - a new series of vision-language models featuring dynamic tiling for high-resolution images and an efficient MoE architecture; achieves competitive or state-of-the-art performance across visual tasks with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. | Paper, Tweet |
| 8) AutoFeedback - a two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models. | Paper |
| 9) A Survey of Mathematical Reasoning in the Era of Multimodal LLMs - presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021. | Paper, Tweet |
| 10) Precise Length Control in LLMs - adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings, which enables counting down to a user-set response terminal length; claims to achieve mean token errors of less than 3 tokens without compromising quality. | Paper, Tweet |
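
The countdown idea in the Precise Length Control entry can be pictured as follows. This is a minimal sketch, assuming a standard sinusoidal encoding of the remaining-token count; the paper's exact encoding is not given in the summary, and `countdown_positions` and `sinusoidal_encoding` are hypothetical names used only for illustration.

```python
import math


def countdown_positions(target_length):
    """Remaining-token count at each generation step: target, target-1, ..., 1."""
    return [target_length - step for step in range(target_length)]


def sinusoidal_encoding(value, dim):
    """Encode a scalar as interleaved sin/cos features (assumed scheme)."""
    features = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        features.append(math.sin(value * freq))
        if i + 1 < dim:
            features.append(math.cos(value * freq))
    return features


remaining = countdown_positions(4)
# One extra feature vector per step; in the described method this kind of
# signal is combined with the ordinary input embeddings.
length_features = [sinusoidal_encoding(r, dim=8) for r in remaining]
```

Because the encoded value reaches its final count exactly when the requested length is used up, the model has an explicit signal for when to terminate.
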

| Paper | Links |
| ------------- | ------------- |
| 1) Training LLMs to Reason in a Continuous Latent Space - presents Coconut (Chain of Continuous Thought), a novel paradigm that enables LLMs to reason in continuous latent space rather than natural language; Coconut takes the last hidden state of the LLM as the reasoning state and feeds it back to the LLM as the subsequent input embedding directly in the continuous space; this leads to what the authors refer to as "continuous thought", which augments an LLM's capability on reasoning tasks; it demonstrates improved performance on complex reasoning tasks through emergent breadth-first search capabilities. | Paper, Tweet |
| 2) Phi-4 Technical Report - presents phi-4, a 14B model that surpasses its teacher model on STEM-QA capabilities. It also reports strong performance on reasoning-focused benchmarks due to improved data, training curriculum, and innovations in the post-training scheme. | Paper, Tweet |
| 3) Asynchronous Function Calling - proposes AsyncLM, a system for asynchronous LLM function calling; the authors design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process; AsyncLM can reduce task completion latency by 1.6x-5.4x compared to synchronous function calling; it enables LLMs to generate and execute function calls concurrently. | Paper, Tweet |
| 4) MAG-V - a multi-agent framework that first generates a dataset of questions that mimic customer queries; it then reverse engineers alternate questions from responses to verify agent trajectories; reports that the generated synthetic data can improve agent performance on actual customer queries; finds that for trajectory verification simple ML baselines with feature engineering can match the performance of more expensive and capable models. | Paper, Tweet |
| 5) Clio - proposes a platform using AI assistants to analyze and surface private aggregated usage patterns from millions of Claude.ai conversations; enables insights into real-world AI use while protecting user privacy; the system helps identify usage trends, safety risks, and coordinated misuse attempts without human reviewers needing to read raw conversations. | Paper, Tweet |
| 6) A Survey on LLMs-as-Judges - presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. | Paper, Tweet |
| 7) AutoReason Improves Multi-step Reasoning - proposes a method to automatically generate rationales for queries using CoT prompting; this transforms zero-shot queries into few-shot reasoning traces which are used as CoT exemplars by the LLM; claims to improve reasoning in weaker LLMs. | Paper, Tweet |
| 8) The Byte Latent Transformer (BLT) - introduces a byte-level language model architecture that matches tokenization-based LLM performance while improving efficiency and robustness; uses a dynamic method of grouping bytes into patches based on the entropy of the next byte, allocating more compute resources to complex predictions while using larger patches for more predictable sequences; BLT demonstrates the ability to match or exceed the performance of models like Llama 3 while using up to 50% fewer FLOPs during inference. | Paper, Tweet |
| 9) Does RLHF Scale? - explores the impacts of key components in the RLHF framework. Summary of main findings: 1) RLHF doesn't scale as effectively as pretraining in LLMs, with larger policy models benefiting less from RLHF when using a fixed reward model, 2) when increasing the number of responses sampled per prompt during policy training, performance improves initially but plateaus quickly, typically around 4-8 samples, 3) using larger reward models leads to better performance in reasoning tasks, but the improvements can be inconsistent across different types of tasks, and 4) increasing training data diversity for reward models is more effective than increasing response diversity per prompt, but policy training shows diminishing returns after the early stages regardless of additional data. | Paper, Tweet |
| 10) Granite Guardian - IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs; the authors claim that, with AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. | Paper, Tweet |
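
The entropy-based patching described in the BLT entry can be illustrated with a toy example. This is a minimal sketch, assuming a simple global threshold rule and entropy values supplied by some small byte-level model; the threshold, the example entropies, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_patches(entropies, threshold):
    """Group byte positions into patches.

    A new patch starts whenever predicting the byte at that position is
    'hard' (entropy above the threshold), so predictable runs end up in
    long patches and uncertain regions in short ones.
    """
    patches, current = [], []
    for position, h in enumerate(entropies):
        if current and h > threshold:
            patches.append(current)
            current = []
        current.append(position)
    if current:
        patches.append(current)
    return patches


# Illustrative per-position entropies (in bits).
patches = entropy_patches([3.0, 0.2, 0.1, 2.5, 0.3], threshold=1.0)
```

Here positions 0-2 form one patch and positions 3-4 another, because the jump in entropy at position 3 opens a new patch.
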

| Paper | Links |
| ------------- | ------------- |
| 1) OpenAI o1 - a model series trained with large-scale reinforcement learning to reason using chain of thought; o1 shows significant improvements across benchmarks related to math, code, and science; o1 is claimed to be 50% faster in generating thinking steps than o1-preview; results demonstrate that o1 is significantly better at reasoning tasks and produces more comprehensive and reliable responses. | Paper, Tweet |
| 2) Genie 2 - a foundation world model that generates playable 3D environments from single prompt images, enabling endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions; Genie 2 is trained on video data using a combination of an autoencoder and a transformer for generating virtual worlds; the model can create real-time interactive environments, with a faster but lower-quality version available for immediate play. | Paper, Tweet |
| 3) Reverse Thinking - shows that training LLMs to learn "reverse thinking" helps to improve performance in commonsense, math, and logical reasoning tasks. It claims to outperform a standard fine-tuning method trained on 10x more forward reasoning. | Paper, Tweet |
| 4) ALAMA - a new framework that helps language agents automatically learn when to use different mechanisms (ReAct, CoT, Reflection, etc.) for completing tasks, improving on current approaches that use fixed or predefined mechanisms; the framework adaptively activates the appropriate mechanisms according to the potential characteristics of the task; experimental results demonstrate significant improvements in downstream agent tasks, including mathematical reasoning and knowledge-intensive reasoning. | Paper, Tweet |
| 5) Auto-RAG - an autonomous iterative retrieval model with superior performance across many datasets; Auto-RAG is a fine-tuned LLM that leverages the decision-making capabilities of an LLM; it interacts with the retriever through multiturn dialogues, systematically planning retrievals and refining queries to acquire valuable knowledge, and it repeats this process until sufficient external information is obtained; the authors also show that, based on question difficulty, the method can adjust the number of iterations without any human intervention. | Paper, Tweet |
| 6) GenCast - an ML weather prediction model that outperforms the world's leading operational weather forecasting system (ECMWF's ENS) in both accuracy and speed; it generates probabilistic 15-day global weather forecasts for over 80 variables in just 8 minutes, with better skill than ENS on 97.2% of evaluated targets; GenCast produces an ensemble of forecasts that better capture uncertainty and predict extreme weather events, tropical cyclone tracks, and wind power production. | Paper, Tweet |
| 7) Challenges in Human-Agent Communication - presents a comprehensive analysis of key challenges in human-agent communication, focusing on how humans and AI agents can effectively establish common ground and mutual understanding; identifies 12 core challenges across three categories: conveying information from agents to users, enabling users to communicate information to agents, and general communication challenges that affect all interactions. | Paper |
| 8) Retrieval-Augmented Reasoning for LLMs - extends the rStar reasoning framework to enhance reasoning accuracy and factual reliability of LLMs; it leverages a Monte Carlo Tree Search (MCTS) framework with explicit retrieval-augmented reasoning to produce multiple candidate reasoning trajectories; then it leverages a retrieval-augmented factuality scorer to evaluate the factual accuracy of the reasoning trajectories; the trajectory with the highest factuality score is selected as the final answer by the system; on medical reasoning tasks, RARE (which uses Llama 3.1) surpasses larger models such as GPT-4; on commonsense reasoning tasks, RARE outperformed Claude-3.5 Sonnet and GPT-4o-mini, achieving performance competitive with GPT-4o. | Paper, Tweet |
| 9) DataLab - a unified business intelligence platform powered by LLM-based agents that integrates task planning, reasoning, and computational notebooks to streamline the entire BI workflow; the system achieves SOTA performance on research benchmarks and demonstrates significant improvements in accuracy and efficiency on real enterprise data from Tencent; achieves up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks. | Paper, Tweet |
| 10) Procedural Knowledge in Pretraining Drives Reasoning in LLMs - studies which documents in the pretraining data influence model outputs; by looking at the pretraining data, it tries to understand better what kind of generalization strategies LLMs use to perform reasoning tasks; when performing reasoning tasks, it finds that influential documents contain procedural knowledge (e.g., demonstrating how to obtain a solution using formulae or code). | Paper, Tweet |
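
The final selection step in the Retrieval-Augmented Reasoning entry (pick the trajectory with the highest factuality score) can be sketched as below. The scoring rule here, the fraction of statements marked as supported by retrieved evidence, is an assumption made for illustration; the summary does not say how the scorer computes its value, and both function names are hypothetical.

```python
def factuality_score(support_flags):
    """Fraction of a trajectory's statements judged supported (assumed rule)."""
    if not support_flags:
        return 0.0
    return sum(1 for flag in support_flags if flag) / len(support_flags)


def select_trajectory(candidates):
    """Return the answer of the candidate with the highest factuality score.

    `candidates` is a list of (answer, support_flags) pairs, where each flag
    says whether one reasoning statement was supported by retrieved evidence.
    """
    best_answer, _ = max(candidates, key=lambda c: factuality_score(c[1]))
    return best_answer


candidates = [
    ("A", [True, False]),
    ("B", [True, True, True]),
    ("C", [False]),
]
chosen = select_trajectory(candidates)
```

In a real system the support flags would come from a retrieval-backed verifier rather than being supplied by hand.
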

| Paper | Links |
| ------------- | ------------- |
| 1) LLMs Surpass Human Experts in Predicting Neuroscience Results - proposes BrainBench to study how good LLMs are at predicting experimental outcomes in neuroscience; the authors tuned an LLM, BrainGPT, on neuroscience literature, and it surpasses experts in predicting neuroscience results; reports that when LLMs indicated high confidence in their predictions, their responses were more likely to be correct. | Paper, Tweet |
| 2) Fugatto - a new generative AI sound model (presented by NVIDIA) that can create and transform any combination of music, voices, and sounds using text and audio inputs; the model has 2.5B parameters and is capable of novel audio generation like making trumpets bark or saxophones meow. | Paper, Tweet |
| 3) o1 Replication Journey - Part 2 - shows that combining simple distillation from o1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks; a base model fine-tuned on just tens of thousands of o1-distilled long-thought chain samples outperforms o1-preview on the American Invitational Mathematics Examination (AIME). | Paper, Tweet |
| 4) LLM-Brained GUI Agents - presents a survey of LLM-brained GUI Agents, including techniques and applications. | Paper, Tweet |
| 5) High-Level Automated Reasoning - extends in-context learning through high-level automated reasoning; achieves state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%); rather than focusing on manually creating high-quality demonstrations, it shifts the focus to abstract thinking patterns; it introduces five atomic reasoning actions to construct chain-structured patterns; then it uses Monte Carlo Tree Search to explore reasoning paths and construct thought cards to guide inference. | Paper, Tweet |
| 6) Star Attention: Efficient LLM Inference over Long Sequences - introduces Star Attention, a two-phase attention mechanism that processes long sequences by combining blockwise-local attention for context encoding with sequence-global attention for query processing and token generation; achieves up to 11x faster inference speeds while maintaining 95-100% accuracy compared to traditional attention mechanisms by efficiently distributing computation across multiple hosts; a key innovation is the "anchor block" mechanism, where each context block is prefixed with the first block, enabling effective approximation of global attention patterns while reducing computational overhead. | Paper, Tweet |
| 7) Survey on LLM-as-a-Judge - provides a comprehensive survey of LLM-as-a-Judge, including a deeper discussion on how to build reliable LLM-as-a-Judge systems. | Paper, Tweet |
| 8) TÜLU 3 - releases a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. | Paper, Tweet |
| 9) Generative Agent Simulations of 1,000 People - introduces a new agent architecture that uses LLMs to create behavioral simulations of real individuals, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional approaches. | Paper, Tweet |
| 10) Measuring Bullshit in Language Games Played by ChatGPT - proposes that LLM-based chatbots play the ‘language game of bullshit’; by asking ChatGPT to generate scientific articles on topics where it has no knowledge or competence, the authors were able to provide a reference set of how this “bullshit” is manifested. | Paper, Tweet |
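
The "anchor block" layout from the Star Attention entry can be shown at the level of token lists. This is a minimal sketch of the data layout only (each later context block is prefixed with the first block); it does not implement the attention computation or the multi-host distribution, and `anchor_blocks` is a hypothetical helper name.

```python
def anchor_blocks(tokens, block_size):
    """Split a context into blocks and prefix every later block with block 0.

    Each returned block is what one host would attend over locally during
    context encoding, under the layout described in the summary.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    if not blocks:
        return []
    anchor = blocks[0]
    return [anchor] + [anchor + block for block in blocks[1:]]


layout = anchor_blocks(list(range(6)), block_size=2)
```

The first block appears unchanged, while every other block carries a copy of it, which is what lets local attention approximate the global pattern.
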

| Paper | Links |
| ------------- | ------------- |
| 1) AlphaQubit - a new AI-based decoder that sets a state-of-the-art benchmark for identifying errors in quantum computers; using a transformer architecture, AlphaQubit demonstrated 6% fewer errors than tensor network methods and 30% fewer errors than correlated matching when tested on the Sycamore data; shows promising results in simulations of larger systems up to 241 qubits; while this represents significant progress in quantum error correction, the system still needs improvements in speed before it can correct errors in real time for practical quantum computing applications. | Paper, Tweet |
| 2) The Dawn of GUI Agent - explores Claude 3.5 computer use capabilities across different domains and software; the authors also provide an out-of-the-box agent framework for deploying API-based GUI automation models; Claude 3.5 Computer Use demonstrates unprecedented ability in end-to-end language to desktop actions. | Paper, Tweet |
| 3) A Statistical Approach to LLM Evaluation - proposes five key statistical recommendations for a more rigorous evaluation of LLM performance differences. The recommendations include: 1) using the Central Limit Theorem to measure theoretical averages across all possible questions rather than just observed averages; 2) clustering standard errors when questions are related rather than independent; 3) reducing variance within questions through resampling or using next-token probabilities; 4) analyzing paired differences between models since questions are shared across evaluations; and 5) using power analysis to determine appropriate sample sizes for detecting meaningful differences between models. The authors argue that these statistical approaches will help researchers better determine whether performance differences between models represent genuine capability gaps or are simply due to chance, leading to more precise and reliable model evaluations. | Paper, Tweet |
| 4) Towards Open Reasoning Models for Open-Ended Solutions - proposes Marco-o1, a reasoning model built for open-ended solutions; Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and more recent reasoning strategies; Marco-o1 achieves accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset. | Paper, Tweet |
| 5) LLM-based Agents for Automated Bug Fixing - analyzes seven leading LLM-based bug fixing systems on the SWE-bench Lite benchmark, finding MarsCode Agent (developed by ByteDance) achieved the highest success rate at 39.33%; reveals that for error localization, line-level fault localization accuracy is more critical than file-level accuracy, and bug reproduction capabilities significantly impact fixing success; shows that 24/168 resolved issues could only be solved using reproduction techniques, though reproduction sometimes misled LLMs when issue descriptions were already clear; concludes that improvements are needed in both LLM reasoning capabilities and agent workflow design to enhance automated bug fixing effectiveness. | Paper, Tweet |
| 6) Cut Your Losses in Large-Vocabulary Language Models - introduces Cut Cross-Entropy (CCE), a novel method to significantly reduce memory usage during LLM training by optimizing how the cross-entropy loss is computed; currently, the cross-entropy layer in LLM training consumes a disproportionate amount of memory (up to 90% in some models) due to storing logits for all possible vocabulary tokens. CCE addresses this by only computing logits for the correct token and evaluating the log-sum-exp over all logits on the fly using flash memory; the authors show that the approach reduces the memory footprint of Gemma 2 from 24GB to just 1MB; the method leverages the inherent sparsity of softmax calculations to skip elements that contribute negligibly to gradients; finally, it demonstrates that CCE achieves this dramatic memory reduction without sacrificing training speed or convergence, enabling larger batch sizes during training and potentially more efficient scaling of LLM training. | Paper |
| 7) BABY-AIGS - a multi-agent system for automated scientific discovery that emphasizes falsification through automated ablation studies. The system was tested on three ML tasks (data engineering, self-instruct alignment, and language modeling), demonstrating the ability to produce meaningful scientific discoveries. However, its performance is below that of experienced human researchers. | Paper, Tweet |
| 8) Does Prompt Formatting Impact LLM Performance - examines how different prompt formats (plain text, Markdown, JSON, and YAML) affect GPT model performance across various tasks; finds that GPT-3.5-turbo's performance can vary by up to 40% depending on the prompt format, while larger models like GPT-4 show more robustness to format changes; argues that there is no universally optimal format across models or tasks - for instance, GPT-3.5-turbo generally performed better with JSON formats while GPT-4 preferred Markdown; models from the same family showed similar format preferences, but these preferences didn't transfer well between different model families; suggests that prompt formatting significantly impacts model performance and should be carefully considered in prompt engineering, model evaluation, and application development. | Paper |
| 9) FinRobot - an AI agent framework for equity research that uses multi-agent Chain-of-Thought prompting, combining data analysis with human-like reasoning to produce professional investment reports comparable to those of major brokerages; it leverages three agents: a Data-CoT Agent to aggregate diverse data sources for robust financial integration; a Concept-CoT Agent that mimics an analyst's reasoning to generate actionable insights; and a Thesis-CoT Agent to synthesize these insights into a coherent investment thesis and report. | Paper |
| 10) Bi-Mamba - a scalable 1-bit Mamba architecture designed for more efficient LLMs, available in multiple sizes (780M, 1.3B, and 2.7B); Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16); it significantly reduces memory footprint with better accuracy than post-training binarization Mamba baselines. | Paper |
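
The core identity behind the Cut Cross-Entropy entry is that the loss for one token equals the log-sum-exp over all logits minus the logit of the correct token, so the full logit vector never has to be stored at once. The sketch below shows that idea with a streaming log-sum-exp over chunks in plain Python; it is not the paper's kernel, it assumes finite logits and at least one value, and the function names are hypothetical.

```python
import math


def streaming_logsumexp(chunks):
    """Compute log(sum(exp(x))) over chunks while keeping only two scalars."""
    running_max = -math.inf
    running_sum = 0.0
    for chunk in chunks:
        for x in chunk:
            if x > running_max:
                # Rescale the accumulated sum to the new maximum.
                running_sum = running_sum * math.exp(running_max - x) + 1.0
                running_max = x
            else:
                running_sum += math.exp(x - running_max)
    return running_max + math.log(running_sum)


def cross_entropy(logit_chunks, correct_logit):
    """Loss for one token: log-sum-exp of all logits minus the correct logit."""
    return streaming_logsumexp(logit_chunks) - correct_logit


logit_chunks = [[1.0, 2.0], [3.0, 0.5]]  # vocabulary split into two chunks
loss = cross_entropy(logit_chunks, correct_logit=3.0)
```

Only the running maximum and running sum persist between chunks, which is why the memory cost stays constant regardless of vocabulary size.
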
| Paper | Links | | ------------- | ------------- | | 1) Impacts of AI on Innovation - suggests that top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives; finds that implementing AI materials discovery technology leads to substantial increases in productivity, with 44% more materials discovered, 39% more patent filings, and 17% more product innovation; reports that these gains came with concerning tradeoffs, as 82% of scientists reported reduced job satisfaction due to decreased creativity and skill underutilization. | Paper, Tweet | | 2) Scaling Laws for Precision - introduces "precision-aware" scaling laws that predict how model performance is affected by both training and inference precision in LLMs; key findings include: 1) post-training quantization becomes more harmful as models are trained on more data, eventually making additional pretraining actively detrimental, 2) training in lower precision requires increasing model size to maintain performance, and 3) when jointly optimizing model size, data, and precision, the compute-optimal training precision is around 7-8 bits and independent of compute; also reports that when the model size is fixed, compute-optimal precision increases approximately logarithmically with data; the authors validate their predictions on models up to 1.7B parameters trained on up to 26B tokens, showing that both very high (16-bit) and very low (sub 4-bit) training precisions may be suboptimal. 
| Paper, Tweet | | 3) Evo - a 7B parameter AI model designed to understand and generate DNA sequences across multiple biological scales; the model, trained on 2.7 million prokaryotic and phage genomes, can process sequences up to 131 kilobases long while maintaining single-nucleotide resolution, enabling it to understand both molecular-level interactions and genome-wide patterns; Evo demonstrates superior performance in predicting and generating functional DNA, RNA, and protein sequences, including the first successful AI-generated CRISPR-Cas complexes and transposable systems that have been experimentally validated. | Paper, Tweet | | 4) OpenCoder - introduces OpenCoder, a fully open-source LLM specialized for code generation and understanding; the authors identify several critical factors for building high-performing code LLMs: (1) effective data cleaning with code-optimized heuristic rules for deduplication, (2) recall of relevant text corpus related to code, and (3) high-quality synthetic in both annealing and supervised fine-tuning stages; OpenCoder surpasses previous fully open models at the 6B+ parameter scale and releases not just the model weights but also the complete training pipeline, datasets, and protocols to enable reproducible research. 
| Paper, Tweet | | 5) The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - explores test-time training (TTT) - updating model parameters temporarily during inference - for improving an LLM's abstract reasoning capabilities using the ARC benchmark; identifies three crucial components: initial fine-tuning on similar tasks, auxiliary task format and augmentations, and per-instance training; TTT significantly improves performance, achieving up to 6x improvement in accuracy compared to base fine-tuned models; when applying TTT to an 8B LLM, they achieve 53% accuracy on ARC's public validation set, improving the state-of-the-art for neural approaches by nearly 25%; by ensembling their method with program generation approaches, they achieve state-of-the-art public validation accuracy of 61.9%, matching average human performance; the findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in LLMs; test-time training applied to continued training on few-shot examples can be highly effective. | Paper, Tweet | | 6) A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents - analyzes AgentOps platforms and tools, highlighting the need for comprehensive observability and traceability features to ensure reliability in foundation model-based autonomous agent systems across their development and production lifecycle. 
| Paper, Tweet | | 7) Toward Optimal Search and Retrieval for RAG - examines how retrieval affects performance in RAG pipelines for QA tasks; conducts experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, finding that including more gold (relevant) documents improves QA accuracy; finds that using approximate nearest neighbor search with lower recall only minimally impacts performance while potentially improving speed and memory efficiency; reports that adding noisy or irrelevant documents consistently degrades performance, contradicting previous research claims; concludes that optimizing retrieval of gold documents is crucial for RAG performance, and that operating at lower search accuracy levels can be a viable approach for practical applications. | Paper, Tweet | | 8) Mitigating LLM Jailbreaks with Few Examples - introduces a new approach for defending LLMs against jailbreak attacks, focusing on quickly adapting defenses after detecting new attacks rather than aiming for perfect upfront adversarial robustness; using a new benchmark, the most effective method, based on fine-tuning an input classifier, reduced attack success rates by over 240x for known attack types and 15x for novel variations after seeing just one example of each attack strategy; demonstrates that rapidly responding to new jailbreaks can be an effective alternative to traditional static defenses. | Paper, Tweet | | 9) Mixture of Transformers - introduces Mixture-of-Transformers (MoT), a new sparse multi-modal transformer architecture that matches the performance of traditional models while using only about half the computational resources for text and image processing; MoT matches a dense baseline's performance using only 55.8% of the FLOPs. 
| Paper | | 10) HtmlRAG - a novel approach that proposes using HTML instead of plain text as the format for building RAG systems; the key finding is that preserving HTML structure provides richer semantic and structural information compared to plain text conversion, which typically loses important formatting like headings, tables, and semantic tags; to address the challenge of HTML documents being too long for LLM context windows, the authors develop a two-step pruning method: first cleaning unnecessary HTML elements (reducing length by 94%), then using a block-tree-based pruning approach that combines embedding-based and generative pruning to further reduce the content while maintaining important information; experiments across six different QA datasets demonstrate that HtmlRAG outperforms existing plain-text based methods, validating the advantages of preserving HTML structure in RAG systems. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Many-agent Simulations toward AI Civilization - demonstrates how 10-1000+ AI agents behave and progress within agent societies; proposes PIANO, an architecture that enables agents to interact with humans and other agents in real-time; shows that agents can autonomously develop specialized roles, adhere to and change collective rules, and engage in cultural and religious transmissions. | Paper, Tweet | | 2) A Comprehensive Survey of Small Language Models - a survey on small language models (SLMs) and discussion on issues related to definitions, applications, enhancements, reliability, and more. | Paper, Tweet | | 3) Magentic-One - a new generalist multi-agent system designed to handle complex web and file-based tasks; it uses an Orchestrator agent that directs four specialized agents: WebSurfer for browser operations, FileSurfer for file management, Coder for programming tasks, and ComputerTerminal for console operations; Magentic-One achieves competitive performance on multiple benchmarks including GAIA, AssistantBench, and WebArena, without requiring modifications to its core architecture. | Paper, Tweet | | 4) Mixtures of In-Context Learners - uses subsets of demonstrations to train experts via in-context learning; given a training set, a trainable weighting function is used to combine the experts' next-token predictions; this approach applies to black-box LLMs since access to the internal parameters of the LLM is not required. Good properties include the following: 1) competitive with standard ICL while being significantly more data, memory, and computationally efficient, and 2) resilient to noisy demonstrations and label imbalance. 
| Paper, Tweet | | 5) Attacking Vision-Language Agents via Pop-ups - shows that integrating adversarial pop-ups into existing agent testing environments leads to an attack success rate of 86%; this decreases the agents' task success rate by 47%; they also add that basic defense techniques (e.g., instructing the agent to ignore pop-ups) are ineffective. | Paper, Tweet | | 6) Multi-expert Prompting with LLMs - improves LLM responses by simulating multiple experts and aggregating their responses; it guides an LLM to fulfill input instructions by simulating multiple experts and selecting the best response among individual and aggregated views; it achieves a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the current SOTA of 87.97%; it also improves performance across factuality and usefulness while reducing toxicity and hurtfulness. | Paper, Tweet | | 7) Number Understanding of LLMs - provides a comprehensive analysis of the numerical understanding and processing ability (NUPA) of LLMs; finds that naive finetuning can improve NUPA a lot on many but not all tasks; it also reports that techniques designed to enhance NUPA prove ineffective for finetuning pretrained models; explores chain-of-thought techniques applied to NUPA and suggests that chain-of-thought methods face scalability challenges, making them difficult to apply in practical scenarios. 
| Paper, Tweet | | 8) WebRL - proposes a self-evolving online curriculum RL framework to bridge the gap between open and proprietary LLM-based web agents; it improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM4-9B; the open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%); the self-evolving curriculum addresses the scarcity of web agent training tasks; this is underpinned by a robust outcome-supervised reward model to evaluate task success; an adaptive RL strategy helps to deal with distribution drift in online learning and ensures consistent improvements. | Paper, Tweet | | 9) Adapting while Learning - proposes a two-part fine-tuning approach that first helps LLMs learn from tool-generated solutions and then trains them to determine when to solve problems directly versus when to use tools; testing on math, climate science, and epidemiology benchmarks shows significant improvements, with a 28% boost in accuracy and 14% better tool usage precision compared to leading models like GPT-4 and Claude-3.5; the two-stage approach helps the LLM to adaptively solve scientific problems of varying complexity. | Paper, Tweet | | 10) Personalization of LLMs - presents a comprehensive framework for understanding personalized LLMs; introduces taxonomies for different aspects of personalization and unifies existing research across personalized text generation and downstream applications. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Geometry of Concepts in LLMs - examines the geometric structure of concept representations in sparse autoencoders (SAEs) at three scales: 1) atomic-level parallelogram patterns between related concepts (e.g., man:woman::king:queen), 2) brain-like functional "lobes" for different types of knowledge like math/code, and 3) galaxy-level eigenvalue distributions showing a specialized structure in middle model layers. | Paper, Tweet | | 2) SimpleQA - a challenging benchmark of 4,326 short factual questions adversarially collected against GPT-4 responses; reports that frontier models like GPT-4o and Claude achieve less than 50% accuracy; finds that there is a positive calibration between the model's stated confidence and accuracy, signaling that they have some notion of confidence; claims that there is still room to improve the calibration of LLMs in terms of stated confidence. | Paper, Tweet | | 3) Automating Agentic Workflow Generation - a novel framework for automating the generation of agentic workflows; it reformulates workflow optimization as a search problem over code-represented workflows, where edges connect LLM-invoking nodes; it efficiently explores the search space using a variant of MCTS, iteratively refining workflows through code modification, tree-structured experience, and execution feedback; experiments across six benchmark datasets demonstrate AFlow’s effectiveness, showing a 5.7% improvement over manually designed methods and a 19.5% improvement over existing automated approaches; AFlow also enables smaller models to outperform GPT-4o on specific tasks at just 4.55% of its inference cost. 
| Paper, Tweet | | 4) LLMs Solve Math with a Bag of Heuristics - uses causal analysis to find neurons that explain an LLM's behavior when doing basic arithmetic logic; discovers and hypothesizes that the combination of heuristic neurons is the mechanism used to produce correct arithmetic answers; finds that the unordered combination of different heuristic types is the mechanism that explains most of the model’s accuracy on arithmetic prompts. | Paper, Tweet | | 5) o1 Replication Journey - reports on an effort to replicate the capabilities of OpenAI's o1 model; their journey learning technique encourages learning not just shortcuts, but the complete exploration process, including trial and error, reflection, and backtracking; claims that with only 327 training samples, their journey learning technique surpassed shortcut learning by 8.0% on the MATH dataset. | Paper, Tweet | | 6) Distinguishing Ignorance from Error in LLM Hallucinations - a method to distinguish between two types of LLM hallucinations: when models lack knowledge (HK-) versus when they hallucinate despite having correct knowledge (HK+); they build model-specific datasets using their proposed approach and show that model-specific datasets are more effective for detecting HK+ hallucinations compared to generic datasets. | Paper, Tweet | | 7) Multimodal RAG - provides a discussion on how to best integrate multimodal models into RAG systems for the industrial domain; it also provides a deep discussion on the evaluation of these systems using LLM-as-a-Judge. | Paper, Tweet | | 8) The Role of Prompting and External Tools in Hallucination Rates of LLMs - tests different prompting strategies and frameworks aimed at reducing hallucinations in LLMs; finds that simpler prompting techniques outperform more complex methods; it reports that LLM agents exhibit higher hallucination rates due to the added complexity of tool usage. 
| Paper, Tweet | | 9) MrT5 - a more efficient variant of byte-level language models that uses a dynamic token deletion mechanism (via a learned delete gate) to shorten sequence lengths by up to 80% while maintaining model performance; this enables faster inference and better handling of multilingual text without traditional tokenization; MrT5 maintains competitive accuracy with ByT5 on downstream tasks such as XNLI and character-level manipulations while improving inference runtimes. | Paper, Tweet | | 10) Relaxed Recursive Transformers - introduces a novel approach, Relaxed Recursive Transformer, that significantly reduces LLM size through parameter sharing across layers while maintaining performance; the model is initialized from standard pretrained Transformers, but only uses a single block of unique layers that is repeated multiple times in a loop; then it adds flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules; shows that the approach has the potential to lead to significant (2-3×) gains in inference throughput. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Agentic Information Retrieval - provides an introduction to agentic information retrieval, which is shaped by the capabilities of LLM agents; discusses different types of cutting-edge applications of agentic information retrieval and challenges. | Paper, Tweet | | 2) Aya Expanse - a family of open-weight foundation models for multilingual capabilities; releases 8B and 32B parameter models, including one of the largest multilingual dataset collections to date, with 513 million examples; the release also includes Aya-101, which the authors claim is the most comprehensive multilingual model, covering 101 languages; Aya Expanse 32B outperforms Gemma 2 27B, Mistral 8x22B, and Llama 3.1 70B, a model 2x its size. | Paper, Tweet | | 3) A Theoretical Understanding of CoT - finds that adding correct and incorrect reasoning paths in demonstrations improves the accuracy of intermediate steps and CoT; the proposed method, Coherent CoT, significantly improves performance on several benchmarks; in the Tracking Shuffled Objects dataset, Gemini Pro shows a 6.60% improvement (from 58.20% to 64.80%), and in Penguins in a Table, DeepSeek 67B demonstrates an increase of 6.17% (from 73.97% to 80.14%). | Paper, Tweet | | 4) A Survey on Data Synthesis and Augmentation for LLMs - provides a comprehensive summary of data generation techniques in the lifecycle of LLMs; includes discussions on data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. 
| Paper, Tweet | | 5) LongRAG - enhances RAG's understanding of long-context knowledge which includes global information and factual details; consists of a hybrid retriever, an LLM-augmented information extractor, a CoT-guided filter, and an LLM-augmented generator; these are key components that enable the RAG system to mine global long-context information and effectively identify factual details; LongRAG outperforms long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG (up by 17.25%). | Paper, Tweet | | 6) Evaluation Feature Steering in LLMs - evaluates feature steering in LLMs using an experiment that artificially dials up and down various features to analyze changes in model outputs; it focuses on 29 features related to social biases and studies whether feature steering can help mitigate social biases; among its findings, it reports that feature steering sometimes leads to off-target effects and that a neutrality feature can help decrease social biases in 9 social dimensions without negatively affecting text quality. | Paper, Tweet | | 7) Granite 3.0 - presents lightweight foundation models ranging from 400 million to 8B parameters; supports coding, RAG, reasoning, and function calling, focusing on enterprise use cases, including on-premise and on-device settings; demonstrates strong performance across academic benchmarks for language understanding, reasoning, coding, function calling, and safety. | Paper, Tweet | | 8) LLMs Reflect the Ideology of their Creators - finds that LLMs exhibit diverse ideological stances that reflect the worldview of their creators; finds consistent normative differences between how the same LLM responds in Chinese compared to English; identifies normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts. 
| Paper, Tweet | | 9) Scalable Watermarking for LLMs - proposes SynthID-Text, a text-watermarking scheme that can preserve text quality in LLMs, enable high detection accuracy, and minimize latency overhead; it integrates watermarking with speculative sampling that consists of the final pattern of scores for a model’s word choices combined with the adjusted probability scores; the authors test the feasibility and scalability of the approach by assessing feedback on nearly 10 million Gemini responses. | Paper, Tweet | | 10) Reasoning Patterns of OpenAI’s o1 Model - when compared with other test-time compute methods, o1 achieved the best performance across most datasets; the authors observe that the most commonly used reasoning patterns in o1 are divide and conquer and self-refinement; o1 uses different reasoning patterns for different tasks; for commonsense reasoning tasks, o1 tends to use context identification and emphasize constraints; for math and coding tasks, o1 mainly relies on method reuse and divide and conquer. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Thinking LLMs - proposes a training method to equip LLMs with thinking abilities for general instruction-following without human-annotated data; uses an iterative search and optimization procedure to explore thought generation which enables the model to learn without direct supervision; thought candidates for each user instruction are scored with a judge model; only responses are evaluated by the Judge which determines the best and worst ones; then the corresponding full outputs are used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in this paper); reports superior performance on AlpacaEval and Arena-Hard. | Paper, Tweet | | 2) Model Swarms - proposes a new collaborative search algorithm to adapt LLMs via swarm intelligence; a pool of LLM experts collaboratively move in the weight space and optimize a utility function representing various adaptation objectives; experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests; improves over 12 model composition baselines by up to 21.0% across tasks and contexts. 
| Paper, Tweet | | 3) First-Person Fairness in Chatbots - studies first-person fairness which involves fairness towards users interacting with ChatGPT; specifically, it measures the biases, if any, towards the users’ names; it leverages a model powered by GPT-4o to analyze patterns and name-sensitivity in the chatbot’s responses for different user names; claims that, overall, post-training significantly mitigates harmful stereotypes; also reports that domains like entertainment and art, with open-ended tasks, demonstrate the highest level of bias (i.e., tendency to write stories with protagonists whose gender matches gender inferred from the user’s name). | Paper, Tweet | | 4) Introspection in LLMs - reports that LLMs can acquire knowledge through introspection that cannot be inferred from their training data; suggests that LLMs contain privileged information about themselves that can potentially lead to more interpretable and controllable systems; they report that this introspection ability is limited and models struggle to predict their behavior on tasks requiring reasoning over long outputs. | Paper, Tweet | | 5) Janus - proposes a unified autoregressive framework for multimodal understanding and generation; it decouples visual encoding into independent pathways and leverages a single transformer architecture to improve flexibility and performance on both visual understanding and generation; claims to alleviate trade-offs related to performing the vision tasks, something common in methods that rely on a single visual encoder; surpasses previous unified models and matches or exceeds the performance of task-specific models. 
| Paper, Tweet | | 6) Inference Scaling for Long-Context RAG - uses two strategies to investigate scaling laws for RAG: in-context learning (DRAG) and iterative prompting (IterRAG); finds that RAG performance consistently improves with the expansion of the effective context length under optimal configurations; when optimally allocated, increasing inference computation can lead to linear gains in long-context RAG performance; this leads to the development of a computation allocation model that can provide practical guidance for optimal computation allocation in long-context RAG scenarios. | Paper, Tweet | | 7) Agent S - a new open agentic framework that enables autonomous interaction with computers through a GUI; Agent S tackles challenges such as acquiring knowledge, planning over long task horizons, and handling dynamic interfaces; it introduces experience-augmented hierarchical planning which leverages both search and retrieval; leverages an agent-computer interface to perform reasoning and control GUI agents; evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% in success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. | Paper, Tweet | | 8) Model Kinship for Merging LLMs - proposes model kinship to measure the degree of similarity between LLMs; model kinship is used to build a model merging strategy (Top-k Greedy Merging with Model Kinship) which yields better performance; the authors find that this new criterion can be used to effectively and continuously perform model merging. | Paper, Tweet | | 9) On the Planning Abilities of OpenAI’s o1 Models - reports that o1-preview is particularly strong in self-evaluation and constraint-following; also mentions that these o1 models demonstrate bottlenecks in decision-making and memory management, which are more pronounced in spatial reasoning; in particular, the models produce redundant actions and struggle to generalize in spatially complex tasks. 
| Paper, Tweet | | 10) CoTracker3 - proposes a new point tracking model and a new semi-supervised training recipe; enables usage of real videos without annotations during training by generating pseudo-labels using off-the-shelf teachers; the approach is simpler in architecture and training scheme leading to better results while using 1000x less data. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) MLE-Bench - proposes a new benchmark for the evaluation of machine learning agents on machine learning engineering capabilities; includes 75 ML engineering-related competitions from Kaggle testing on MLE skills such as training models, preparing datasets, and running experiments; OpenAI’s o1-preview with the AIDE scaffolding achieves Kaggle bronze medal level in 16.9% of competitions. | Paper, Tweet | | 2) Differential Transformer - proposes a differential attention mechanism that amplifies attention to the relevant context while canceling noise; Differential Transformer outperforms Transformer when scaling up model size and training tokens; the authors claim that since this architecture gets less "distracted" by irrelevant context, it can do well in applications such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. | Paper, Tweet | | 3) Astute RAG - proposes a novel RAG approach to deal with the imperfect retrieval augmentation and knowledge conflicts of LLMs; Astute RAG adaptively elicits essential information from LLMs' internal knowledge; then it iteratively consolidates internal and external knowledge with source awareness; Astute RAG is designed to better combine internal and external information through an interactive consolidation mechanism (i.e., identifying consistent passages, detecting conflicting information in them, and filtering out irrelevant information). | Paper, Tweet | | 4) ToolGen - integrates tool knowledge directly into LLMs by representing each tool as a unique token which allows the LLM to generate tool calls and arguments, enabling seamless tool invocation and language generation; experimental results with over 47,000 tools show that ToolGen achieves superior results in both tool retrieval and autonomous task completion. 
| Paper, Tweet | | 5) Long-Context LLMs Meet RAG - finds that for many long-context LLMs, the quality of outputs declines as the number of passages increases; reports that the performance loss is due to retrieved hard negatives; they propose two ways to improve long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to help with relevance identification; the approaches demonstrate significant accuracy and robustness improvements on long-context RAG performance. | Paper, Tweet | | 6) GSM-Symbolic - tests several SoTA models on a benchmark created with symbolic templates that enable diverse mathematical problems; they find that LLMs exhibit variance when responding to variations of the same questions; the performance of all the models declines when the numerical values in the question are altered; as questions are made more challenging (e.g., increasing the number of clauses) the performance significantly deteriorates; the authors hypothesize that the observed decline in performance is due to a lack of logical reasoning in current LLMs. | Paper, Tweet | | 7) Optima - a novel framework to enhance both communication efficiency and task effectiveness in LLM-based multi-agent systems through LLM training; proposes an iterative generate, rank, select, and train paradigm with a reward function to improve performance, token use, and communication efficiency; integrates Monte Carlo Tree Search-inspired techniques for DPO data generation to encourage diverse exploration; shows consistent improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, with 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. 
| Paper, Tweet | | 8) ScienceAgentBench - a new benchmark to rigorously assess agents built for scientific workflows; after testing it on open-weight and proprietary LLMs, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. | Paper, Tweet | | 9) Addition Is All You Need - proposes an algorithm that approximates floating point multiplication with integer addition operations; it is less computationally intensive than 8-bit floating point but achieves higher precision; the authors report that applying the proposed L-Mul operation in tensor processing hardware can potentially reduce the energy cost of elementwise floating point tensor multiplications by 95% and the energy cost of dot products by 80%. | Paper, Tweet | | 10) Persuasion and Anti-social Ability of LLMs - studies the interaction patterns of LLMs in a multi-agent setting with social hierarchy; the study was done in a specific setting involving a guard and a prisoner who seeks additional yard time or to escape from prison; finds that in the multi-agent setting where power dynamics are involved, the LLMs fail to have a conversation; they also report that agents' personas are critical in driving the behaviors of the agents. In addition, and without explicit prompting, simply assigning agents' roles leads to anti-social behavior. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Movie Gen - a set of foundation models to generate high-quality, 1080p HD videos, including different aspect ratios and synchronized audio; the 30B parameter model supports a context length of 73K video tokens, which enables generation of 16-second videos at 16fps; it also presents a 13B parameter video-to-audio generation model and a novel video editing model that’s attained via post-training; achieves state-of-the-art performance on tasks such as text-to-video synthesis, video personalization, video-to-audio generation and more. | Paper, Tweet | | 2) Were RNNs All We Needed? - revisits RNNs and shows that by removing the hidden states from input, forget, and update gates RNNs can be efficiently trained in parallel; this is possible because with this change architectures like LSTMs and GRUs no longer require backpropagation through time (BPTT); they introduce minLSTMs and minGRUs that are 175x faster for a sequence length of 512. | Paper, Tweet | | 3) LLMs Know More Than They Show - finds that the "truthfulness" information in LLMs is concentrated in specific tokens; this insight can help enhance error detection performance and further mitigate some of these issues; they also claim that internal representations can be used to predict the types of errors the LLMs are likely to make. | Paper, Tweet | | 4) Architecture Search Framework for Inference-Time Techniques - introduces a modular framework for building and optimizing LLMs by combining multiple inference-time techniques; this approach reframes the challenge of LLM system design as a hyperparameter optimization problem; tested on benchmarks including MT-Bench and CodeContests, Archon surpasses leading models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement. 
| Paper, Tweet | | 5) RATIONALYST - a model for process-supervision of reasoning that enables generalization across diverse reasoning tasks; this process is achieved with pre-training on a collection of 79k rationales from the Pile and a combination of reasoning datasets with minimal human intervention; fine-tuned from LLaMa-3-8B, the proposed model improves the accuracy of reasoning by an average of 3.9% on 7 reasoning benchmarks. | Paper | | 6) An Analysis of o1-preview - reports that large reasoning models like o1-preview, while improving on more difficult tasks, display similar qualitative trends as previous LLMs; o1 is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones. | Paper, Tweet | | 7) FRAMES - a unified framework to evaluate an LLM’s ability to provide factual responses, assess retrieval capabilities, and the reasoning required to generate final responses; includes multi-hop questions that require the integration of information from multiple sources; reports that state-of-the-art LLMs struggle on the task and only achieve 40% accuracy with no retrieval; the proposed multi-step retrieval approach improves performance to 66% accuracy. | Paper, Tweet | | 8) Not All LLM Reasoners Are Created Equal - investigates in depth the grade-school math problem-solving capabilities of LLMs; reports that LLMs show a significant gap in reasoning; finds that LLMs display a huge performance difference when solving compositional pairs and solving questions independently. 
| Paper, Tweet | | 9) Evaluation of o1 - provides a comprehensive evaluation of OpenAI's o1-preview LLM; shows strong performance across many tasks such as competitive programming, generating coherent and accurate radiology reports, high school-level mathematical reasoning tasks, chip design tasks, anthropology and geology, quantitative investing, social media analysis, and many other domains and problems. | Paper, Tweet | | 10) Designing Priors for Better Few-Shot Image Synthesis - training generative models like GANs with limited data is difficult; current Implicit Maximum Likelihood Estimation (IMLE) approaches have an inadequate correspondence between the latent codes selected for training and those selected during inference; the proposed approach, RS-IMLE, changes the prior distribution for training which improves test-time performance and leads to higher quality image generation. | Paper, Tweet |
| Paper | Links | | ------------- | ------------- | | 1) Llama 3.2 - presents small and medium-sized vision LLMs (11B and 90B parameters), and lightweight, text-only models (1B and 3B); the text-only models are trained to support context length of 128K tokens and outperform other models in their class on a range of tasks; vision models exceed other models such as Claude 3 Haiku on image understanding tasks. | Paper, Tweet | | 2) *Molmo* - presents a family of open, state-of-the-art multimodal AI models; the 72B model in the Molmo family outperforms others in the class of open weight and data models; it also compares favorably against proprietary models like GPT-4o, Claude 3.5, and Gemini 1.5 on several benchmarks. | Paper, Tweet | | 3) AlphaChip - a reinforcement learning-based method trained to design the physical layout of chips; AlphaChip is reportedly used in three additional generations of Google’s TPU; this release includes an open-source implementation of the method to help pre-train on a variety of chip blocks to apply to new blocks; also releases a model checkpoint pre-trained on 20 TPU blocks. | Paper, Tweet | | 4) LLMs Still Can’t Plan - evaluates whether large reasoning models such as o1 can plan; finds that a domain-independent planner can solve all instances of Mystery Blocksworld but LLMs struggle, even on small instances; o1-preview is effective on the task but tends to degrade in performance as plan length increases; concludes that while o1 shows progress on more challenging planning problems, the accuracy gains cannot be considered general or robust. 
| Paper, Tweet | | 5) Scaled-up Instructable Model Become Less Reliable - suggests that larger and more instructable LLMs may become less reliable; investigates LLMs across three elements: difficulty concordance, task avoidance, and prompting stability; finds that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. | Paper, Tweet | | 6) Logic-of-Thought - proposes a new prompting technique called Logic-of-Thought (LoT) which employs propositional logic to generate and inject expanded logical information from the input context; it enhances CoT performance on the ReClor dataset by +4.35%; it improves CoT+SelfConsistency’s performance on LogiQA by +5%; it also boosts the performance of ToT on the ProofWriter dataset by +8%. | Paper, Tweet | | 7) RAG and Beyond - presents a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task; summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them. | Paper, Tweet | | 8) A Preliminary Study of o1 in Medicine - provides a preliminary exploration of the o1-preview model in medical scenarios; shows that o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios; identifies hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. 
| Paper, Tweet | | 9) Small Language Models Survey - a comprehensive survey on small language models (SLMs) across architectures, training datasets, and training algorithms; analyzes 59 state-of-the-art open-source SLMs and capabilities such as reasoning, in-context learning, maths, and coding; other discussions include on-device runtime costs, latency, memory footprint, and valuable insights. | Paper, Tweet | | 10) Minstrel - a multi-generative agent system with reflection capabilities to automate structural prompt generation; it presents LangGPT, an extensible framework for designing prompts; Minstrel is built on top of LangGPT and experiments demonstrate that structural prompts (either generated by Minstrel or written manually) perform better in guiding LLMs to perform tasks. | Paper, Tweet |
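
The Logic-of-Thought entry in the table above describes expanding the logical information found in a context using propositional logic. As a rough illustration only, the sketch below assumes the implications have already been extracted as `(premise, conclusion)` pairs and applies just two laws, contraposition and transitivity. The function names and the `~` negation convention are hypothetical and are not taken from the paper.

```python
def negate(p: str) -> str:
    """Negate a proposition symbol, collapsing double negation (~~A -> A)."""
    return p[1:] if p.startswith("~") else "~" + p


def expand_implications(implications):
    """Close a set of implications (A, B), read as A -> B, under two laws:
    contraposition (A -> B gives ~B -> ~A) and
    transitivity (A -> B and B -> C give A -> C).
    Returns only the implications that were newly derived."""
    original = set(implications)
    known = set(original)
    changed = True
    while changed:
        changed = False
        # Contraposition over everything known so far.
        derived = {(negate(b), negate(a)) for a, b in known}
        # Transitivity, skipping trivial A -> A results.
        derived |= {(a, d) for a, b in known for c, d in known if b == c and a != d}
        new = derived - known
        if new:
            known |= new
            changed = True
    return known - original


# Illustrative input: "A implies B" and "B implies C".
extra = expand_implications({("A", "B"), ("B", "C")})
for premise, conclusion in sorted(extra):
    print(f"{premise} -> {conclusion}")
```

In a prompting pipeline, the derived implications would then be translated back into natural language and appended to the original context; that translation step is left out here.
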
| Paper | Links |
|---|---|
| 1) Moshi - introduces a speech-text foundation model and full-duplex spoken dialogue framework; they present several components of the system; Helium is a 7B parameter text LLM; Mimi is a semantic-acoustic neural audio codec with state-of-the-art performance on audio quality; a hierarchical multi-stream architecture generates arbitrary conversation in a speech-to-speech manner. | Paper, Tweet |
| 2) Training LLMs to Self-Correct via RL - develops a multi-turn online reinforcement learning approach to improve the capabilities of an LLM to self-correct; it’s based entirely on self-generated data; SFT is shown to be ineffective at learning self-correction and suffers from distribution mismatch between training data and model responses; proposes a two-stage approach that first optimizes correction behavior and then uses a reward bonus to amplify self-correction during training; when applied to Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks. | Paper, Tweet |
| 3) Qwen2.5 Coder - a series of models including 1.5B and 7B parameter variants; it’s built upon the Qwen2.5 architecture and continually pretrained on 5.5 trillion tokens; achieves state-of-the-art performance across more than 10 benchmarks; includes strong capabilities in code generation, completion, reasoning, and repair. | Paper, Tweet |
| 4) Diagram of Thought (DoT) - enhances the reasoning capabilities of LLMs through mathematical rigor; DoT models iterative reasoning in an LLM as the construction of a directed acyclic graph; it integrates propositions, critiques, refinement, and verification into a unified DAG structure; this allows DoT to capture complex logical deduction beyond linear or tree-based approaches. | Paper, Tweet |
| 5) Agents in Software Engineering - provides a comprehensive overview of frameworks of LLM-based agents in software engineering. | Paper, Tweet |
| 6) To CoT or not to CoT? - investigates what kinds of tasks benefit the most from chain-of-thought (CoT) prompting; after a meta-analysis on 100+ papers and several evaluations, it finds that CoT produces strong performance benefits primarily on tasks involving math and logic; they find that most of the CoT gain comes from improving symbolic execution, but a symbolic solver outperforms it. | Paper, Tweet |
| 7) A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs - evaluates the performance of instruction-tuned LLMs across various quantization methods on models ranging from 7B to 405B; the key findings are 1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, 2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models, and 3) task difficulty does not significantly impact accuracy degradation due to quantization. | Paper, Tweet |
| 8) Iteration of Thought - proposes the Iteration of Thought (IoT) framework to enhance LLM responses and reasoning capabilities with adaptive reasoning paths; it leverages an inner dialogue agent, acting as a guide, to dynamically adjust reasoning paths, which allows adaptive cross-path exploration and enhances response accuracy; it's different from CoT and ToT (both rigid processes) in that its prompt generation is a dynamic process that allows it to adapt. | Paper, Tweet |
| 9) Schrodinger’s Memory - uses the Universal Approximation Theorem to explain the memory mechanism of LLMs. It also proposes a new approach to evaluate LLM performance by comparing the memory capacities of different models; the Transformer architecture functions as a dynamic fitting UAT model, with a strong ability to adaptively fit inputs; this enables LLMs to recall entire content based on minimal input information. | Paper, Tweet |
| 10) Math Jailbreaking Prompts - uses GPT-4o to generate mathematically encoded prompts that serve as an effective jailbreaking technique; shows an average attack success rate of 73.6% across 13 state-of-the-art LLMs; this highlights the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. | Paper, Tweet |
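
The Diagram of Thought entry in the table above describes reasoning as the construction of a directed acyclic graph that combines propositions, critiques, refinements, and verification. The sketch below only illustrates that data structure using Python's standard `graphlib` module. The `ReasoningDAG` class, the node ids, and the role names are hypothetical and do not reproduce the paper's method.

```python
from graphlib import TopologicalSorter

ROLES = {"proposition", "critique", "refinement", "verification"}


class ReasoningDAG:
    """Toy container for reasoning steps; names and roles are hypothetical."""

    def __init__(self):
        self.roles = {}    # node id -> role
        self.parents = {}  # node id -> ids of the steps it builds on

    def add(self, node, role, depends_on=()):
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if node in self.roles:
            raise ValueError(f"duplicate node: {node}")
        missing = [d for d in depends_on if d not in self.roles]
        if missing:
            raise KeyError(f"unknown dependencies: {missing}")
        # A new node may only point at existing nodes, so no cycle can form.
        self.roles[node] = role
        self.parents[node] = set(depends_on)

    def order(self):
        """Return node ids so that every step follows the steps it builds on."""
        return list(TopologicalSorter(self.parents).static_order())


dag = ReasoningDAG()
dag.add("p1", "proposition")
dag.add("c1", "critique", depends_on=["p1"])
dag.add("r1", "refinement", depends_on=["p1", "c1"])
dag.add("v1", "verification", depends_on=["r1"])
print(dag.order())
```

Because a refinement can depend on both a proposition and its critique, the structure is a graph rather than a chain or a tree, which is the distinction the entry draws.
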
| Paper | Links |
|---|---|
| 1) Learning to Reason with LLMs - a new family of LLMs trained with reinforcement learning to reason before responding to complex tasks; the models produce a long internal chain of thought and excel in science, code, and math-related tasks; they ranked in the 49th percentile in the 2024 International Olympiad in Informatics and exceed human PhD-level accuracy on science-related benchmarks. | Paper, Tweet |
| 2) Chai-1 - a new multi-modal foundation model for molecular structure prediction that can predict proteins, small molecules, DNA, RNA, and more; it achieves state-of-the-art results on a variety of tasks in drug discovery; achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold 3), as well as a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B). | Paper, Tweet |
| 3) Can LLMs Generate Novel Research Ideas - finds that LLM-generated research ideas are judged as more novel (p < 0.05) than human expert ideas; however, they were rated slightly weaker in terms of feasibility; they also report that LLM agents lack diversity in the idea generation process and are not reliable evaluators. | Paper, Tweet |
| 4) DataGemma - includes a series of fine-tuned Gemma 2 models to help LLMs access and incorporate numerical and statistical data; proposes a new approach called Retrieval Interleaved Generation (RIG) which can reliably incorporate public statistical data from Data Commons into LLM responses; RIG is a tool-inspired approach that can interleave statistical tokens with natural language questions suitable for retrieval from Data Commons; to attain such capability, they fine-tune the LLM on an instruction-response dataset generated with the help of Gemini 1.5; the RIG approach improves factuality from 5-7% to about 58%. | Paper, Tweet |
| 5) Agent Workflow Memory - introduces Agent Workflow Memory to induce commonly reused workflows and provide these to the agent on demand; works offline and online and is meant to guide the agent's subsequent generations; it’s inspired by how humans learn reusable workflows from past experiences and use them to guide future actions; claims to substantially improve the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while doing it in a more efficient way. | Paper, Tweet |
| 6) The Role of Small Language Models in the LLM Era - closely examines the relationship between LLMs and SLMs; common applications of SLMs include data curation, training stronger models, efficient inference, evaluators, retrievers, and much more; includes insights for practitioners to better understand the value of these SLMs. | Paper, Tweet |
| 7) LLaMA-Omni - a model architecture for low-latency speech interaction with LLMs; it is based on Llama-3.1-8B-Instruct and can simultaneously generate both text and speech responses given speech instructions; responses can be generated with a response latency as low as 226ms; architecture-wise, it involves a speech encoder (Whisper-large-v3), a speech adaptor, an LLM, and a speech decoder; they also created a dataset of 200K speech interactions and responses. | Paper, Tweet |
| 8) Can LLMs Unlock Novel Scientific Research Ideas - investigates whether LLMs can generate novel scientific research ideas; reports that Claude and GPT models tend to align more with the author's perspectives on future research ideas; this is measured across different domains like science, economics, and medicine. | Paper, Tweet |
| 9) Theory, Analysis, and Best Practices for Sigmoid Self-Attention - proposes Flash-Sigmoid, a hardware-aware and memory-efficient implementation of sigmoid attention; it yields up to a 17% inference kernel speed-up over FlashAttention-2 on H100 GPUs; shows that SigmoidAttn matches SoftmaxAttn in various tasks and domains. | Paper, Tweet |
| 10) Achieving Peak Performance for LLMs - a systematic review of methods for improving and speeding up LLMs from three points of view: training, inference, and system serving; summarizes the latest optimization and acceleration strategies around training, hardware, scalability, and reliability. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) AlphaProteo - presents a family of ML models trained for protein design; reports 3- to 300-fold better binding affinities and higher experimental success rates compared to other existing methods on seven target proteins; shows that AlphaProteo’s performance on hundreds of target proteins from the PDB is comparable to the seven targets. | Paper, Tweet |
| 2) RAG in the Era of Long-Context LLMs - reports that longer-context LLMs suffer from a diminished focus on relevant information, which is one of the primary issues that a RAG system addresses (i.e., uses more relevant information); they propose an order-preserving RAG mechanism that improves performance on long-context question answering; it's not perfect: as the number of retrieved chunks increases, response quality first goes up and then declines; they mention a sweet spot where it can achieve better quality with a lot fewer tokens than long-context LLMs. | Paper, Tweet |
| 3) Strategic Chain-of-Thought - a method to refine LLM performance by incorporating strategic knowledge before the intermediate CoT reasoning steps; the problem-solving strategy helps to guide the generation of the CoT paths and final answers; claims to achieve a 21.05% increase on the GSM8K dataset using the Llama3-8b model. | Paper |
| 4) Effects of AI on High Skilled Work - studies the impact of generative AI on software developers; reveals a 26.08% increase in the number of completed tasks among the developers that use AI tools like GitHub Copilot; also shows that less experienced developers are more likely to adopt the AI tools and have greater productivity gains. | Paper, Tweet |
| 5) OLMoE - introduces a fully-open LLM that leverages sparse Mixture-of-Experts. OLMoE is a 7B parameter model and uses 1B active parameters per input token; there is also an instruction-tuned version that claims to outperform Llama-2-13B-Chat and DeepSeekMoE 16B. | Paper, Tweet |
| 6) LongCite - synthesizes a large-scale SFT dataset with off-the-shelf LLMs to improve long-context question answering with citations; it trains 8B and 9B parameter models that enhance citation generation capabilities from lengthy contexts while improving response correctness; claims to even surpass GPT-4o on their proposed LongBench-Cite benchmark. | Paper, Tweet |
| 7) MemLong - utilizes an external retriever for retrieving historical information which enhances the capabilities of long-context LLMs; it consistently outperforms other SoTA LLMs on long-context benchmarks and can extend the context length on a single 3090 GPU from 4k up to 80k. | Paper, Tweet |
| 8) Role of RAG Noise in LLMs - proposes a benchmark (NoiserBench) to measure how different kinds of noisy information affect RAG's performance; reports that among the kinds of beneficial noise studied (e.g., semantic, datatype, and illegal sentence), illegal sentence noise yields the largest improvement in model performance across models and datasets. | Paper, Tweet |
| 9) Beyond Preference in AI Alignment - challenges the dominant practice of AI alignment known as human preference tuning; explains in what ways human preference tuning fails to capture the thick semantic content of human values; argues that AI alignment needs reframing: instead of aligning on human preferences, AI should align on normative standards appropriate to their social roles. | Paper, Tweet |
| 10) LLM-Based Agents for Software Engineering - a survey paper on LLM-based agents for software engineering, covering perspectives ranging from requirement engineering to test generation to software maintenance. | Paper, Tweet |
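
The "RAG in the Era of Long-Context LLMs" entry in the table above mentions an order-preserving RAG mechanism. A minimal sketch of that idea, assuming relevance scores have already been computed by some retriever: keep the `k` best-scoring chunks, but pass them to the model in their original document order rather than ranked by score. The function name and the toy scores are illustrative assumptions, not the paper's code.

```python
def order_preserving_top_k(chunks, scores, k):
    """Pick the k highest-scoring chunks, then return them in their
    original document order instead of ranked by score."""
    if len(chunks) != len(scores):
        raise ValueError("chunks and scores must have the same length")
    # Indices of the k most relevant chunks (ties keep the earlier chunk).
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort the selected indices by position to restore document order.
    return [chunks[i] for i in sorted(top)]


# Toy example: chunk positions follow the source document.
chunks = ["intro", "method", "results", "appendix"]
scores = [0.2, 0.1, 0.9, 0.5]
print(order_preserving_top_k(chunks, scores, k=3))
```

Varying `k` in a sketch like this is one way to explore the trade-off the entry describes, where quality rises with more retrieved chunks and then declines.
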
| Paper | Links |
|---|---|
| 1) GameNGen - a game engine powered by a diffusion model that enables real-time interaction with complex environments over long trajectories; uses a two-phase training process involving an RL agent to learn and a diffusion model to generate frames; it can interactively simulate DOOM at over 20 fps on a single TPU. | Paper, Tweet |
| 2) Agentic RAG for Time Series Analysis - proposes an agentic RAG framework for time series analysis; uses a multi-agent architecture where an agent orchestrates specialized sub-agents to complete time-series tasks; the sub-agents leverage tuned small language models and can retrieve relevant prompts containing knowledge about historical patterns and trends; this helps to improve predictions on new data. | Paper, Tweet |
| 3) AutoGen Studio - a low-code interface for rapidly prototyping AI agents. It's built on top of the AutoGen framework and can also be used for debugging and evaluating multi-agent workflows. | Paper, Tweet |
| 4) Persuasion Games with LLMs - claims that a multi-agent framework can be used to improve the persuasive efficacy of LLMs; the primary agent engages in persuasive dialogue while auxiliary agents perform key tasks like response analysis and information retrieval; finds that LLMs are capable of creating a perspective change in the users and persuading them to make a purchase decision; for instance, Sales agents can achieve a 71% positive shift in user perspectives. | Paper, Tweet |
| 5) Smaller, Weaker, Yet Better - finds that weaker + cheaper (WC) models can generate better synthetic data for fine-tuning models compared to data generated with stronger but more expensive models; overall, results suggest that WC models may be a compute-optimal approach for training advanced LLM reasoners. | Paper, Tweet |
| 6) Transfusion - presents a training recipe to train multi-modal models over discrete and continuous data; combines next token prediction with diffusion to train transformer models over mixed-modality sequences; shows that it’s possible to scale the recipe to 7B parameter models trained on 2T multi-modal tokens that can compete in performance with similar scale diffusion and language models. | Paper, Tweet |
| 7) ReMamba - investigates the long-context capabilities and efficiencies of Mamba models; the long-context deficiency issues are due to Mamba's RNN-like nature; it addresses them by condensing information with a two-pass compression strategy: it selects the top-k hidden states during the first forward pass and leverages Mamba’s selective mechanism to incorporate them into the state space during the second forward pass; achieves a 3.2-point improvement over the baseline on LongBench and a 1.6-point improvement on L-Eval; the strategy seems to also transfer to Mamba 2. | Paper, Tweet |
| 8) Text2SQL is Not Enough - proposes Table-Augmented Generation (TAG), a unified framework for answering natural language questions over databases; it represents a wider range of unexplored interactions between LLMs and databases; develops a benchmark and finds that standard methods answer no more than 20% of queries correctly. | Paper, Tweet |
| 9) Foundation Models for Music - provides a comprehensive overview of state-of-the-art pre-trained models and foundation models in music. | Paper, Tweet |
| 10) Guide to Continual Multimodal Pretraining - a comprehensive guide on continual multimodal pretraining; introduces FoMo-In-Flux, a large-scale fine-grained and long horizon continual pretraining benchmark. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Automated Design of Agentic Systems - presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries; claims that with their approach it is possible to learn any possible agentic system including prompts, tool use, control flows, and more; they achieve this by focusing on three main components referred to as search space (define agents), search algorithm (explore search space), and the evaluation function (evaluate candidate agents). | Paper, Tweet |
| 2) LLM Pruning and Distillation in Practice - provides a comprehensive report on effective methods for compressing Llama 3.1 and Mistral NeMo models; it presents pruning and distillation approaches applied to the original models to produce 4B and 8B parameter models, respectively; before pruning, they also fine-tune the teacher model on their datasets leading to better distillation; their compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) which outperforms all similarly-sized models on common language modeling benchmarks. | Paper, Tweet |
| 3) Vizier Gaussian Process Bandit Algorithm - presents Vizier, an algorithm based on Gaussian process bandit optimization used by Google for millions of optimizations and research; it provides an open-source Python implementation of the Vizier algorithm, including benchmarking results that demonstrate its wider applicability. | Paper, Tweet |
| 4) Language Modeling on Tabular Data - presents a comprehensive survey of language modeling techniques for tabular data; includes topics such as categorization of tabular data structures and data types, datasets used for model training and evaluation, modeling techniques and training objectives, data processing methods, popular architectures, and challenges and future research directions. | Paper, Tweet |
| 5) Enhancing Robustness in LLMs - proposes a two-stage prompting technique to remove irrelevant information from context; it serves as a self-mitigation process that first identifies the irrelevant information and then filters it out; this leads to enhancement in robustness of the model and overall better performance on reasoning tasks. | Paper, Tweet |
| 6) A Comprehensive Overview of GraphRAG Methods - focuses on techniques applied to the GraphRAG workflow (graph-based indexing, graph-guided retrieval, and graph-enhanced generation); examines tasks, applications, evaluation, and industrial use cases of GraphRAG. | Paper, Tweet |
| 7) MagicDec - shows how speculative decoding can enhance throughput, reduce latency, and maintain accuracy in long context generation scenarios; it finds that as sequence length and batch size increase, bottlenecks shift from compute-bound to memory-bound; using these insights, they show it's possible to more effectively use speculative decoding for longer sequences, even when using large batch sizes. | Paper, Tweet |
| 8) Controllable Text Generation for LLMs - provides a comprehensive survey on methods for controllable text generation in LLMs; discusses issues like safety, consistency, style, and helpfulness. | Paper, Tweet |
| 9) PEDAL - uses a hybrid self-ensembling approach (based on diverse exemplars) to improve the overall performance of LLMs; specifically, it uses diverse exemplars to generate multiple candidate responses and then aggregates them using an LLM to generate a final response; this approach achieves better accuracy compared to greedy decoding and lower cost compared to self-consistency approaches. | Paper, Tweet |
| 10) Challenges and Responses in the Practice of LLMs - curates a set of important questions with insightful answers; questions are categorized across topics such as infrastructure, software architecture, data, application, and brain science. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) The AI Scientist - a novel AI agent that can develop and write a full conference-level scientific paper costing less than $15; it automates scientific discovery by enabling frontier LLMs to perform independent research and summarize findings; it also uses an automated reviewer to evaluate the generated papers; claims to achieve near-human performance in evaluating paper scores; claims to produce papers that exceed the acceptance threshold at a top machine learning conference as judged by their automated reviewer. | Paper, Tweet |
| 2) Grok-2 - a new frontier model with strong code, math, and reasoning capabilities which includes a large and small model; outperforms both Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS Chatbot Arena; claims to improve capabilities including instruction following, retrieval, tool use, and enhancing factuality; competes with Claude 3.5 Sonnet (June release) and GPT-4o (May release) on MMLU and HumanEval. | Paper, Tweet |
| 3) LongWriter - proposes AgentWrite to enable off-the-shelf LLMs to generate coherent outputs beyond 20K words; AgentWrite takes a divide-and-conquer approach: the agent breaks the long generation task into multiple writing subtasks and concatenates the outputs to get a final output (i.e., plan + write); the approach is then used to build SFT datasets that are used to tune LLMs to generate coherent longer outputs automatically; a 9B parameter model, further improved through DPO, achieves state-of-the-art performance on their benchmark, and surpasses proprietary models. | Paper, Tweet |
| 4) EfficientRAG - trains an auto-encoder LM to label and tag chunks; it retrieves relevant chunks, tags them as either `<Terminate>` or `<Continue>`, and annotates `<Continue>` chunks for continued processing; then a filter model is trained to formulate the next-hop query based on the original question and previous annotations; this is done iteratively until all chunks are tagged as `<Terminate>` or the maximum number of iterations is reached; after the process above has gathered enough information to answer the initial question, the final generator (an LLM) generates the final answer. | Paper, Tweet |
| 5) RAGChecker - a fine-grained evaluation framework for diagnosing retrieval and generation modules in RAG; shows that RAGChecker has better correlations with human judgment; reports several insightful patterns and trade-offs in design choices of RAG architectures. | Paper, Tweet |
| 6) HybridRAG - combines GraphRAG and VectorRAG leading to a HybridRAG system that outperforms both individually; it was tested on a set of financial earnings call transcripts. Combining the advantages of both approaches provides more accurate answers to queries. | Paper, Tweet |
| 7) rStar - introduces self-play mutual reasoning to improve the reasoning capabilities of small language models without fine-tuning or superior models; MCTS is augmented with human-like reasoning actions, obtained from SLMs, to build richer reasoning trajectories; a separate SLM provides unsupervised feedback on the trajectories and the target SLM selects the final reasoning trajectory as the answer; rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B and consistently improves the accuracy of other SLMs. | Paper, Tweet |
| 8) Scaling LLM Test-Time Compute Optimally - investigates the scaling behaviors of inference-time computation in LLMs; in particular, it analyses how much an LLM can be improved provided a fixed amount of inference-time compute; finds that the effectiveness of different scaling approaches varies by difficulty of prompt; it then proposes an adaptive compute-optimal strategy that can improve efficiency by more than 4x compared to a best-of-N baseline; reports that in a FLOPs-matched evaluation, optimally scaling test-time compute can outperform a 14x larger model. | Paper, Tweet |
| 9) MedGraphRAG - a graph-based framework for the medical domain with a focus on enhancing LLMs and generating evidence-based results; leverages a hybrid static-semantic approach to chunk documents to improve context capture; entities and medical knowledge are represented through graphs which leads to an interconnected global graph; this approach improves precision and outperforms state-of-the-art models on multiple medical Q&A benchmarks. | Paper, Tweet |
| 10) Survey of NL2SQL - a comprehensive overview of NL2SQL techniques powered by LLMs; covers models, data collection, evaluation methods, and error analysis. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) SAM 2 - an open unified model for real-time, promptable object segmentation in images and videos; can be applied to unseen visual content without the need for custom adaptation; to enable accurate mask prediction in videos, a memory mechanism is introduced to store information on the object and previous interactions; the memory module also allows real-time processing of arbitrarily long videos; SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets while requiring three times fewer human-in-the-loop interactions. | Paper, Tweet |
| 2) Structured Generation Limits Reasoning - investigates if structured generation can impact an LLM’s reasoning and domain knowledge comprehension capabilities; observes that there is a significant decline in LLM’s reasoning abilities when applying format restrictions compared to free-form responses; this degradation effect is further amplified when applying stricter format constraints to reasoning tasks. | Paper, Tweet |
| 3) From LLMs to LLM-based Agents for Software Engineering - a survey paper on current practices and solutions for LLM-based agents for software engineering; covers important topics such as requirement engineering, code generation, test generation, and autonomous decision making; it also includes benchmarks, metrics, and models used in different software engineering applications. | Paper, Tweet |
| 4) Transformer Explainer - presents an open-source interactive tool to learn about the inner workings of a Transformer model; it runs a GPT-2 instance locally in the user's browser and allows experimenting with your own inputs. | Paper, Tweet |
| 5) Enhancing LLMs for RAG - introduces RAGFoundry, an open-source framework for augmenting LLMs for RAG use cases; it supports data creation, training, inference, and evaluation; one useful application is the creation of data-augmented datasets for tuning and evaluating LLMs in RAG settings. | Paper, Tweet |
| 6) Synthesizing Text-to-SQL Data from Weak and Strong LLMs - proposes integrated synthetic data to build a highly specialized SoTA text-to-SQL model called SENSE; the synthetic data from strong models enhances data diversity, while valuable erroneous data from weaker models is combined with an executor to learn from execution feedback; preference learning is used to instruction-tune LLMs to learn from both correct and incorrect samples; SENSE achieves state-of-the-art results on the SPIDER and BIRD benchmarks, which bridges the performance gap between open-source models and methods that use closed-source models. | Paper, Tweet |
| 7) Conversational Prompt Engineering - proposes an approach to help users create personalized prompts by articulating the preferred outputs via interactions; it involves two stages: 1) an initial instruction shaped by the model based on user-provided unlabeled data, and 2) the model shares the output and the user provides feedback with refinements on outputs and instruction; this iterative process results in a personalized few-shot prompt that performs better on the desired task. | Paper, Tweet |
| 8) Self-Taught Evaluators - an approach to improve model-based evaluators using synthetic training data only; it first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments; the self-improvement scheme repeats the training process in an iterative way using its improved predictions; claims to outperform LLM-judges such as GPT-4 and match top-performing reward models trained on labeled examples; improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. | Paper, Tweet |
| 9) RAGEval - proposes a simple framework to automatically generate evaluation datasets to assess knowledge usage of different LLMs under different scenarios; it defines a schema from seed documents and then generates diverse documents which leads to question-answering pairs; the QA pairs are based on both the articles and configurations. | Paper, Tweet |
| 10) Survey of Mamba - provides a systematic review of existing Mamba-based models across domains and tasks; specifically, focuses on advancements of Mamba-based models, techniques for adapting Mamba to diverse data, applications where Mamba excels, and promising research directions. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Meta-Rewarding LLMs - proposes a self-improving alignment technique (no human supervision) where the LLM judges its own judgements and uses the feedback to improve its judgment skills; shows that leveraging this LLM-as-a-Meta-Judge approach improves the LLM's ability to judge and follow instructions; just doing self-improvement to generate better responses (act) saturates quickly; this work improves the LLM's ability to judge itself (judge) to avoid issues like reward hacking; in addition to the act and judge roles, a third role called meta-judge is used to evaluate the model's own judgements. | Paper, Tweet |
| 2) MindSearch - presents an LLM-based multi-agent framework to perform complex web-information seeking and integration tasks; a web planner effectively decomposes complex queries followed by a web searcher that performs hierarchical information retrieval on the Internet to improve the relevancy of the retrieved information; the planning component is powered by an iterative graph construction which is used to better model complex problem-solving processes; the multi-agent framework handles long context problems better by distributing reasoning and retrieval tasks to specialized agents. | Paper, Tweet |
| 3) Improved RAG with Self-Reasoning - presents an end-to-end self-reasoning framework to improve the reliability and traceability of RAG systems; leverages the reasoning trajectories generated by the LLM itself; the LLM is used to carry out the following 3 processes: 1) relevance-aware: judges the relevance between the retrieved documents and the question, 2) evidence-aware selective: chooses and cites relevant documents, and then automatically selects snippets of key sentences as evidence from the cited documents, and 3) trajectory analysis: generates a concise analysis based on all gathered self-reasoning trajectories generated by the previous 2 processes and then provides the final inferred answer; this method helps the model to be more selective, reason and distinguish relevant and irrelevant documents, therefore improving the accuracy of the overall RAG system; the framework achieves comparable performance to GPT-4 with only 2K training samples (generated by GPT-4). | Paper, Tweet |
| 4) Constrained-CoT - limits the model reasoning output length without sacrificing performance; shows that constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on GSM8K, while reducing the average output length by 28 words. | Paper, Tweet |
| 5) Adaptive RAG for Conversational Systems - develops a gating model that predicts if a conversational system requires RAG to improve its responses; shows that RAG-based conversational systems have the potential to generate high-quality responses and high generation confidence; it also claims to identify a correlation between the generation's confidence level and the relevance of the augmented knowledge. | Paper, Tweet |
| 6) ShieldGemma - offers a comprehensive suite of LLM-based safety content moderation models built on Gemma 2; includes classifiers for key harm types such as dangerous content, toxicity, hate speech, and more. | Paper, Tweet |
| 7) Evaluating Persona Agents - proposes a benchmark to evaluate persona agent capabilities in LLMs; finds that Claude 3.5 Sonnet only has a 2.97% relative improvement in PersonaScore compared to GPT 3.5 despite being a much more advanced model. | Paper, Tweet |
| 8) Machine Unlearning Survey - provides a comprehensive survey on machine unlearning in generative AI. | Paper, Tweet |
| 9) ThinK - proposes an approach to address inefficiencies in KV cache memory consumption; it focuses on long-context scenarios and inference; it presents a query-dependent KV cache pruning method that minimizes attention weight loss while selectively pruning the least significant channels. | Paper, Tweet |
| 10) The Art of Refusal - a survey of the current methods used to achieve refusal in LLMs; provides evaluation benchmarks and metrics used to measure abstention in LLMs. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Llama 3.1 - a collection of LLMs that includes 8B, 70B, and 405B parameter models; supports eight languages and extends the context window to 128K tokens; performs competitively and in some cases outperforms state-of-the-art models across capabilities like general knowledge, math reasoning, and tool use. | Paper, Tweet |
| 2) AlphaProof & AlphaGeometry 2 - solved 4 out of 6 problems in this year’s IMO, which is the equivalent of a silver-medal score; AlphaProof consists of a Gemini model that automatically translates natural language problem statements into formal statements (i.e., a formalizer network); then a solver network searches for proofs/disproofs and progressively trains itself using AlphaZero to learn to solve even more complex problems; AlphaGeometry 2, a neuro-symbolic hybrid system, proved the geometry problem; it is based on the Gemini model and trained from scratch on large amounts of synthetic data. | Paper, Tweet |
| 3) RAG vs. Long-Context LLMs - compares RAG and long-context LLMs and finds that long-context LLMs outperform RAG on average performance while RAG is significantly less expensive; proposes Self-Route, leveraging self-reflection to route queries to RAG or LC; reports that Self-Route significantly reduces computational cost while maintaining comparable performance to LC. | Paper, Tweet |
| 4) OpenDevin - presents a platform to develop generalist agents that interact with the world through software; features include 1) a mechanism for interaction between agents, interfaces, and environments, 2) an environment including a sandboxed operating system and web browser available to the agents, 3) an interface to create and execute code, 4) multi-agent support, and 5) an evaluation framework. | Paper, Tweet |
| 5) LazyLLM - introduces a novel dynamic token pruning method for efficient long-context LLM inference; it can accelerate the prefilling stage of a Llama 2 7B model by 2.34x and maintain high accuracy; it selectively computes the KV for tokens that are important for the next token prediction in both the prefilling and decoding stages; it allows language models to dynamically select different subsets of tokens from the context in different generation steps, even though they might be pruned in previous steps. | Paper, Tweet |
| 6) Teaching LLM Agents to Self-Improve - claims it is possible to iteratively fine-tune LLMs with the ability to improve their own response over multiple turns with additional environment feedback; the LLM learns to recursively detect and correct its previous mistakes in subsequent iterations; improves the self-improvement abilities of 7B models on reasoning tasks (GSM8K and MATH), attaining an improvement over turns that’s unseen in strong proprietary models. | Paper, Tweet |
| 7) Text-to-SQL Survey - provides a survey on employing LLMs for Text-to-SQL tasks, including prompt engineering techniques, fine-tuning methods, benchmarks, and more. | Paper, Tweet |
| 8) MINT-1T - open-sources a large-scale multimodal interleaved dataset consisting of 1 trillion tokens which has 3.4 billion images; it also includes new sources such as PDFs and ArXiv papers. | Paper, Tweet |
| 9) Model Collapse on Synthetic Data - investigates the effects of training models on recursively generated data; finds that training on model-generated content can cause irreversible defects where the original content distribution disappears; shows that the effect, referred to as model collapse, occurs in LLMs, VAEs, and GMMs; while tested on smaller scale models (~100M params), the authors suggest this effect is highly likely to transfer to larger models over time. | Paper, Tweet |
| 10) Mitigating Hallucination via Generation Constraint - proposes a new training-free approach to mitigate hallucination in LLMs; it scales the readout vector that constrains generation in a memory-augmented LLM decoder; recent works claim that LLMs with explicit memory mechanisms can help lower hallucination; this work uses a memory-augmented LLM and constrains generation in the decoder by applying lightweight memory primitives to reduce hallucination. | Paper, Tweet |
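
The Self-Route idea from item 3 in the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ask_llm` is a hypothetical callable standing in for any chat-completion client, and the prompt wording and the `unanswerable` sentinel are assumptions made for this example.

```python
from typing import Callable, Dict, List

UNANSWERABLE = "unanswerable"  # assumed sentinel the model is asked to emit


def self_route(
    query: str,
    retrieved_chunks: List[str],
    full_context: str,
    ask_llm: Callable[[str], str],  # hypothetical LLM client: prompt -> text
) -> Dict[str, str]:
    """Try the cheap RAG path first; fall back to long context if the model declines."""
    rag_prompt = (
        "Answer the question using only the passages below. "
        f"If they are insufficient, reply exactly '{UNANSWERABLE}'.\n\n"
        + "\n\n".join(retrieved_chunks)
        + f"\n\nQuestion: {query}"
    )
    answer = ask_llm(rag_prompt)
    if answer.strip().lower() != UNANSWERABLE:
        return {"route": "rag", "answer": answer}

    # The model judged the retrieved passages insufficient: pay for the full context.
    lc_prompt = f"{full_context}\n\nQuestion: {query}"
    return {"route": "long_context", "answer": ask_llm(lc_prompt)}
```

The long-context call is only made for queries the model itself flags as unanswerable from the retrieved chunks, which is where the cost saving would come from.
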
| Paper | Links |
|---|---|
| 1) Improving Legibility of LLM Outputs - iteratively trains small verifiers to predict solution correctness, helpful provers to produce correct solutions accepted by the verifier, and sneaky provers that produce incorrect solutions that fool the verifier; this process helps train models that can produce text that is correct and easy to understand by both humans and AI systems which leads to more trustworthy systems. | Paper, Tweet |
| 2) SpreadsheetLLM - presents an efficient encoding method to optimize an LLM’s understanding and reasoning capability on spreadsheets; develops a sheet compressor consisting of structural-anchor-based compression, inverse index translation, and data-format-aware aggregation modules to efficiently compress and encode spreadsheets; in GPT-4’s in-context learning, it improves performance in spreadsheet table detection by 25.6%. | Paper, Tweet |
| 3) Context Embeddings for Efficient Answer Generation in RAG - proposes an effective context compression method to reduce long context and speed up generation time in RAG systems; the long contexts are compressed into a small number of context embeddings which allow different compression rates that trade off decoding time for generation quality; reduces inference time by up to 5.69x and GFLOPs by up to 22x while maintaining high performance. | Paper, Tweet |
| 4) Weak-to-Strong Reasoning - demonstrates the use of weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; reports that strong models can automatically refine their training data without explicitly being trained to do so; enables expanding a model's learning scope and scaling performance on reasoning. | Paper, Tweet |
| 5) A Survey of Prompt Engineering Methods in LLMs - a collection of prompt engineering methods for a variety of NLP tasks. | Paper, Tweet |
| 6) Does Refusal Training in LLMs Generalize to the Past Tense? - finds that simply reformulating an LLM request into past tense can jailbreak many state-of-the-art LLMs; for example "How to make a Molotov cocktail?" can be rephrased as "How did people make a Molotov cocktail?"; finds that the attack success rate on GPT-4o can increase from 1% with direct requests to 88% with past-tense reformulations; concludes that current alignment techniques may not always generalize as intended. | Paper, Tweet |
| 7) Can LLMs Do Retrieval and Reasoning in 1 Million Context Window? - proposes a framework (NeedleBench) of progressively challenging tasks to assess the long-context retrieval and reasoning capabilities of LLMs; they also present the Ancestral Trace Challenge that increases the need for complex logical reasoning which is common in real-world long-context tasks; their findings suggest that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens. | Paper, Tweet |
| 8) Distilling System 2 into System 1 - investigates self-supervised methods to distill high-quality outputs from System 2 techniques and then fine-tune System 1 to match the predictions of the System 2 technique but without generating intermediate steps; the process of distilling reasoning into System 1 results in less inference cost. | Paper, Tweet |
| 9) Exploring Advanced LLMs with LLMSuite - shares practical tips for developing with and evaluating LLMs; solutions covered range from ReAct to RAG to parameter-efficient methods. | Paper, Tweet |
| 10) Beyond Euclid - provides an illustrated guide and graphical taxonomy of recent advances in non-Euclidean machine learning. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) FlashAttention-3 - proposes to adapt FlashAttention to take advantage of modern hardware; the techniques used to speed up attention on modern GPUs include producer-consumer asynchrony, interleaving block-wise matmul and softmax operations, and block quantization and incoherent processing; achieves speedup on H100 GPUs by 1.5-2.0x with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. | Paper, Tweet |
| 2) RankRAG - introduces a new instruction fine-tuning framework to perform effective context ranking and answer generation to enhance an LLM’s RAG capabilities; it leverages a small ranking dataset to outperform existing expert ranking models; shows that Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. | Paper, Tweet |
| 3) Mixture of A Million Experts - introduces a parameter-efficient expert retrieval mechanism that leverages the product key technique for sparse retrieval from a million tiny experts; it attempts to decouple computational cost from parameter count by efficiently routing to a very large number of tiny experts through a learned index structure used for routing; demonstrates superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers. | Paper, Tweet |
| 4) Reasoning in LLMs: A Geometric Perspective - explores the reasoning of LLMs from a geometrical perspective; reports that a higher intrinsic dimension implies greater expressive capacity of the LLM; establishes a connection between the expressive power of LLMs and the density of their self-attention graphs; the analysis demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. | Paper, Tweet |
| 5) Contextual Hallucinations Mitigation in LLMs - proposes a new method that detects and significantly reduces contextual hallucinations in LLMs (e.g., reduces by 10% in the XSum summarization task); builds a hallucination detection model based on input features given by the ratio of attention weights on the context vs. newly generated tokens (for each attention head); the hypothesis is that contextual hallucinations are related to the extent to which an LLM attends to the provided contextual information; they also propose a decoding strategy based on their detection method which mitigates the contextual hallucination; the detector can also be transferred across models without the need for retraining. | Paper, Tweet |
| 6) RouteLLM - proposes efficient router models to dynamically select between stronger and weaker LLMs during inference to achieve a balance between cost and performance; the training framework leverages human preference data and data augmentation techniques to boost performance; is shown to reduce costs by over 2x in certain cases while maintaining the quality of responses. | Paper, Tweet |
| 7) A Survey on Mixture of Experts - a survey paper on Mixture of Experts (MoE), including the technical details of MoE, open-source implementations, evaluation techniques, and applications of MoE in practice. | Paper, Tweet |
| 8) Internet of Agents - a new framework to address several limitations in multi-agent frameworks such as integrating diverse third-party agents and adaptability to dynamic task requirements; introduces an agent integration protocol, instant messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents. | Paper, Tweet |
| 9) 3DGen - a new pipeline for end-to-end text-to-3D asset generation in under a minute; integrates state-of-the-art components like AssetGen and TextureGen to represent 3D objects in three ways, namely in view space, in volumetric space, and in UV space; achieves a win rate of 68% with respect to the single-stage model. | Paper, Tweet |
| 10) Learning at Test Time - proposes new sequence modeling layers with linear complexity and an expressive hidden state; defines the hidden state as an ML model itself, capable of updating even on test sequences; instantiations with a linear model and a two-layer MLP as the hidden state are found to match or exceed baseline models like Transformers, Mamba, and modern RNNs; the linear variant is faster than a Transformer at 8k context and matches Mamba in wall-clock time. | Paper, Tweet |
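
The per-head input feature described in item 5 of the list above (attention on the context versus attention on newly generated tokens) could be computed along these lines. This is a sketch under stated assumptions: the function names are hypothetical, and normalizing by the sum of the two mean attention weights is one plausible choice rather than a confirmed detail of the paper.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    """Mean of a sequence, defined as 0.0 for an empty sequence."""
    return sum(values) / len(values) if values else 0.0


def lookback_ratio(attn_weights: Sequence[float], num_context_tokens: int) -> float:
    """Share of one head's (mean) attention that falls on the context.

    attn_weights: attention from the current decoding step to all previous
    positions, with the context tokens first and generated tokens after them.
    """
    ctx = _mean(attn_weights[:num_context_tokens])
    new = _mean(attn_weights[num_context_tokens:])
    total = ctx + new
    return ctx / total if total > 0 else 0.0


def lookback_features(
    per_head_weights: Sequence[Sequence[float]], num_context_tokens: int
) -> List[float]:
    """One ratio per attention head; these would feed a small detector model."""
    return [lookback_ratio(head, num_context_tokens) for head in per_head_weights]
```

A low ratio for many heads would indicate the model is attending mostly to its own generated tokens, which is the signal the summary associates with contextual hallucination.
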
| Paper | Links |
|---|---|
| 1) APIGen - presents an automated data generation pipeline to synthesize high-quality datasets for function-calling applications; shows that 7B models trained on curated datasets outperform GPT-4 models and other state-of-the-art models on the Berkeley Function-Calling Benchmark; a dataset consisting of 60K entries is also released to help with research in function-calling enabled agents. | Paper, Tweet |
| 2) CriticGPT - a new model based on GPT-4 to help write critiques for responses generated by ChatGPT; trained with RLHF on a large number of inputs containing mistakes that it had to critique; built to help human trainers spot mistakes during RLHF and claims that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs. | Paper, Tweet |
| 3) Searching for Best Practices in RAG - shows the best practices for building effective RAG workflows; proposes strategies that focus on performance and efficiency, including emerging multimodal retrieval techniques. | Paper, Tweet |
| 4) Scaling Synthetic Data Creation - proposes 1 billion diverse personas to facilitate the creation of diverse synthetic data for different scenarios; uses a novel persona-driven data synthesis methodology to generate diverse and distinct data covering a wide range of perspectives; to measure the quality of the synthetic datasets, they performed an out-of-distribution evaluation on MATH; a model fine-tuned on their 1.07M synthesized math problems achieves 64.9% on MATH, matching the performance of gpt-4-turbo-preview at only 7B scale. | Paper, Tweet |
| 5) Self-Evaluation as a Defense Against Adversarial Attacks on LLMs - proposes the use of self-evaluation to defend against adversarial attacks; uses a pre-trained LLM to build a defense that is more effective than fine-tuned models, dedicated safety LLMs, and enterprise moderation APIs; they evaluate different settings like attacks on the generator only and generator + evaluator combined; it shows that building a dedicated evaluator can significantly reduce the success rate of attacks. | Paper, Tweet |
| 6) Agentless - introduces OpenAutoCoder-Agentless, which offers an agentless system that solves 27.3% of GitHub issues on SWE-bench Lite; claims to outperform all other open-source AI-powered software engineering agents. | Paper, Tweet |
| 7) Adaptable Logical Control for LLMs - presents the Ctrl-G framework to facilitate control of LLM generations that reliably follow logical constraints; it combines LLMs and Hidden Markov Models to enable following logical constraints (represented as deterministic finite automata); Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT-4. | Paper, Tweet |
| 8) LLM See, LLM Do - closely investigates the effects and effectiveness of synthetic data and how it shapes a model’s internal biases, calibration, attributes, and preferences; finds that LLMs are sensitive towards certain attributes even when the synthetic data prompts appear neutral; demonstrates that it’s possible to steer the generation profiles of models towards desirable attributes. | Paper, Tweet |
| 9) Summary of a Haystack - proposes a new task, SummHay, to test a model’s ability to process a Haystack and generate a summary that identifies the relevant insights and cites the source documents; reports that long-context LLMs score 20% on the benchmark, which lags the human performance estimate (56%); RAG components are found to boost performance on the benchmark, which makes it a viable option for holistic RAG evaluation. | Paper, Tweet |
| 10) AI Agents That Matter - analyzes current agent evaluation practices and reveals shortcomings that potentially hinder real-world application; proposes an implementation that jointly optimizes cost and accuracy and a framework to avoid overfitting agents. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) ESM3 - a new LLM-based biological model that generates a new green fluorescent protein called esmGFP; builds on a bidirectional transformer, uses masked language models for the objective function, leverages geometric attention to represent atomic coordinates, and applies chain-of-thought prompting to generate fluorescent proteins; estimates that esmGFP represents an equivalent of over 500 million years of natural evolution performed by an evolutionary simulator. | Paper, Tweet |
| 2) Gemma 2 - presents a family of open models ranging from 2B to 27B parameters; demonstrates strong capabilities in reasoning, math, and code generation, outperforming models twice its size. | Paper, Tweet |
| 3) LLM Compiler - a suite of open pre-trained models (7B and 13B parameters) designed for code optimization tasks; it’s built on top of Code Llama and trained on a corpus of 546 billion tokens of LLVM-IR and assembly code; it’s also instruction fine-tuned to interpret compiler behavior; achieves 77% of the optimizing potential of autotuning search and performs accurate disassembling 14% of the time compared to the autotuning technique on which it was trained. | Paper, Tweet |
| 4) Enhancing RAG with Long-Context LLMs - proposes LongRAG, which combines RAG with long-context LLMs to enhance performance; uses a long retriever to significantly reduce the number of extracted units by operating on longer retrieval units; the long reader takes in the long retrieval units and leverages the zero-shot answer extraction capability of long-context LLMs to improve performance of the overall system; claims to achieve 64.3% on HotpotQA (full-wiki), which is on par with the state-of-the-art model. | Paper, Tweet |
| 5) Improving Retrieval in LLMs through Synthetic Data - proposes a fine-tuning approach to improve the accuracy of retrieving information in LLMs while maintaining reasoning capabilities over long-context inputs; the fine-tuning dataset comprises numerical dictionary key-value retrieval tasks (350 samples); finds that this approach mitigates the "lost-in-the-middle" phenomenon and improves performance on both information retrieval and long-context reasoning. | Paper, Tweet |
| 6) GraphReader - proposes a graph-based agent system to enhance the long-context abilities of LLMs; it structures long text into a graph and employs an agent to explore the graph (using predefined functions guided by a step-by-step rational plan) to effectively generate answers for questions; consistently outperforms GPT-4-128k across context lengths from 16k to 256k. | Paper, Tweet |
| 7) Faster LLM Inference with Dynamic Draft Trees - presents a context-aware dynamic draft tree to increase the speed of inference; the previous speculative sampling method used a static draft tree for sampling which only depended on position but lacked context awareness; achieves speedup ratios ranging from 3.05x-4.26x, which is 20%-40% faster than previous work; these speedup ratios occur because the new method significantly increases the number of accepted draft tokens. | Paper, Tweet |
| 8) Following Length Constraints in Instructions - presents an approach to deal with length bias and train instruction-following language models that better follow length constraint instructions; fine-tunes a model using DPO with a length-instruction-augmented dataset and shows fewer length constraint violations while keeping a high response quality. | Paper, Tweet |
| 9) On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation - survey on LLM-based synthetic data generation, curation, and evaluation. | Paper, Tweet |
| 10) Adam-mini - a new optimizer that reduces memory footprint (by 45%-50%) by using fewer learning rates and performs on par with or better than AdamW; it carefully partitions parameters into blocks and assigns each block a single high-quality learning rate that outperforms Adam; achieves consistent results on language models sized from 125M to 7B for pre-training, SFT, and RLHF. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Claude 3.5 Sonnet - a new model that achieves state-of-the-art performance on several common benchmarks such as MMLU and HumanEval; it outperforms Claude 3 Opus and GPT-4o on several benchmarks with the exception of math word problem-solving tasks; achieves strong performance on vision tasks which also helps power several new features like image-text transcription and generation of artifacts. | Paper, Tweet |
| 2) DeepSeek-Coder-V2 - competes with closed-source models on code and math generation tasks; achieves 90.2% on HumanEval and 75.7% on MATH; these results are higher than GPT-4-Turbo-0409 performance according to their report; includes a 16B and 236B parameter model with 128K context length. | Paper, Tweet |
| 3) TextGrad - a new framework for automatic differentiation through backpropagation on textual feedback provided by an LLM; this improves individual components and the natural language helps to optimize the computation graph; it works by providing an objective function without tuning prompts or components; claims to achieve LeetCodeHard best scores and SoTA performance on GPQA when combined with GPT-4o. | Paper, Tweet |
| 4) Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? - conducts a deep performance analysis of long-context LLMs on in-context retrieval and reasoning; they first present a benchmark with real-world tasks requiring 1M token context; reports that long-context LLMs can rival state-of-the-art retrieval and RAG systems, without any explicit training on the tasks; suggests that compositional reasoning (required in SQL-like tasks) is still challenging for these LLMs; they also encourage continued research on advanced prompting strategies, noting significant boosts in performance when applying them to long-context problems. | Paper, Tweet |
| 5) PlanRAG - enhances decision making with a new RAG technique called iterative plan-then-RAG (PlanRAG); involves two steps: 1) an LM generates the plan for decision making by examining data schema and questions and 2) the retriever generates the queries for data analysis; the final step checks if a new plan for further analysis is needed and iterates on previous steps or makes a decision on the data; PlanRAG is found to be more effective than iterative RAG on the proposed Decision QA tasks. | Paper, Tweet |
| 6) Mitigating Memorization in LLMs - presents a modification of the next-token prediction objective called goldfish loss to help mitigate the verbatim generation of memorized training data; it uses a simple technique that excludes a pseudorandom subset of training tokens at training time; they show that the goldfish loss resists memorization and keeps the model useful; however, it may need to train for longer to more effectively learn from the training data. | Paper, Tweet |
| 7) Monte Carlo Tree Self-Refine - reports achieving GPT-4 level mathematical olympiad solutions using an approach that integrates LLMs with Monte Carlo Tree Search; this approach focuses on enhancing the mathematical reasoning performance of the system through capabilities such as systematic exploration, self-refinement, and self-evaluation. | Paper, Tweet |
| 8) From RAG to Rich Parameters - investigates more closely how LLMs utilize external knowledge over parametric information for factual queries; finds that in a RAG pipeline, LLMs take a “shortcut” and display a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. | Paper, Tweet |
| 9) Open-Sora - an open-source video generation model that can generate 16-second 720p videos; it’s a 1.1B parameter model trained on more than 30M data samples and now supports image-to-video; presents an enhanced diffusion model and video compression network for spatial and temporal compression; increases controllability of generations and reduces training costs. | Paper, Tweet |
| 10) Tree Search for Language Model Agents - proposes an inference-time tree search algorithm for LM agents to perform exploration and enable multi-step reasoning; it’s tested on interactive web environments and applied to GPT-4o to significantly improve performance; demonstrates that performance scales when increasing test-time compute. | Paper, Tweet |
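
The goldfish loss from item 6 in the list above (excluding a pseudorandom subset of tokens from the next-token objective) can be illustrated with a small framework-free sketch. The hashing scheme, the drop rate `k`, the window size `context`, and the function names are assumptions chosen for this example, not details confirmed by the summary.

```python
import hashlib
from typing import List, Sequence


def goldfish_mask(token_ids: Sequence[int], k: int = 4, context: int = 3) -> List[bool]:
    """Return True where a token contributes to the loss.

    Assumed scheme: hash the `context` preceding token ids and drop the
    current token when the hash is divisible by k (roughly 1 in k tokens).
    Hashing the local context keeps the mask identical for repeated passages.
    """
    mask = []
    for i in range(len(token_ids)):
        window = token_ids[max(0, i - context): i]
        digest = hashlib.sha256(",".join(map(str, window)).encode()).digest()
        mask.append(int.from_bytes(digest[:8], "big") % k != 0)
    return mask


def goldfish_loss(
    token_nll: Sequence[float], token_ids: Sequence[int], k: int = 4, context: int = 3
) -> float:
    """Mean negative log-likelihood over the tokens that survive the mask."""
    mask = goldfish_mask(token_ids, k=k, context=context)
    kept = [nll for nll, keep in zip(token_nll, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0
```

Because dropped tokens never receive a gradient, the model cannot learn to reproduce a training passage verbatim end to end, at the price of seeing fewer supervised tokens per step.
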
| Paper | Links |
|---|---|
| 1) Nemotron-4 340B - provides an instruct model to generate high-quality data and a reward model to filter out data on several attributes; demonstrates strong performance on common benchmarks like MMLU and GSM8K; it’s competitive with GPT-4 on several tasks, including high scores in multi-turn chat; a preference dataset is also released along with the base model. | Paper, Tweet |
| 2) Discovering Preference Optimization Algorithms with LLMs - proposes LLM-driven objective discovery of state-of-the-art preference optimization; no human intervention is used and an LLM is prompted to propose and implement the preference optimization loss functions based on previously evaluated performance metrics; discovers an algorithm that adaptively combines logistic and exponential losses. | Paper, Tweet |
| 3) SelfGoal - a framework to enhance an LLM-based agent's capabilities to achieve high-level goals; adaptively breaks down a high-level goal into a tree structure of practical subgoals during interaction with the environment; improves performance on various tasks, including competitive, cooperative, and deferred feedback environments. | Paper, Tweet |
| 4) Mixture-of-Agents - an approach that leverages the collective strengths of multiple LLMs through a Mixture-of-Agents methodology; layers are designed with multiple LLM agents and each agent builds on the outputs of other agents in the previous layers; surpasses GPT-4o on AlpacaEval 2.0, MT-Bench and FLASK. | Paper, Tweet |
| 5) Transformers Meet Neural Algorithmic Reasoners - a new hybrid architecture that enables tokens in the LLM to cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR); the resulting model, called TransNAR, demonstrates improvements in OOD reasoning across algorithmic tasks. | Paper, Tweet |
| 6) Self-Tuning with LLMs - improves an LLM’s ability to effectively acquire new knowledge from raw documents through self-teaching; the three steps involved are 1) a self-teaching component that augments documents with a set of knowledge-intensive tasks focusing on memorization, comprehension, and self-reflection, 2) uses the deployed model to acquire knowledge from new documents while reviewing its QA skills, and 3) the model is configured to continually learn using only the new documents which helps with thorough acquisition of new knowledge. | Paper, Tweet |
| 7) Sketching as a Visual Chain of Thought - a framework that enables a multimodal LLM to access a visual sketchpad and tools to draw on the sketchpad; it can equip a model like GPT-4 with the capability to generate intermediate sketches to reason over complex tasks; improves performance on many tasks over strong base models with no sketching; GPT-4o equipped with SketchPad sets a new state of the art on all the tasks tested. | Paper, Tweet |
| 8) Mixture of Memory Experts - proposes an approach to significantly reduce hallucination (10x) by tuning millions of expert adapters (e.g., LoRAs) to learn exact facts and retrieve them from an index at inference time; the memory experts are specialized to ensure faithful and factual accuracy on the data it was tuned on; claims to enable scaling to a high number of parameters while keeping the inference cost fixed. | Paper, Tweet |
| 9) Multimodal Table Understanding - introduces Table-LLaVA 7B, a multimodal LLM for multimodal table understanding; it’s competitive with GPT-4V and significantly outperforms existing MLLMs on multiple benchmarks; also develops a large-scale dataset MMTab, covering table images, instructions, and tasks. | Paper, Tweet |
| 10) Consistent Middle Enhancement in LLMs - proposes an approach to tune an LLM to effectively utilize information from the middle part of the context; it first proposes a training-efficient method to extend LLMs to longer context lengths (e.g., 4K -> 256K); it uses a truncated Gaussian to encourage sampling from the middle part of the context during fine-tuning; the approach helps to alleviate the so-called "Lost-in-the-Middle" problem in long-context LLMs. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) NLLB - proposes a massive multilingual model that leverages transfer learning across 200 languages; it’s based on a Sparsely Gated Mixture of Experts architecture and trained on data via an approach tailored for low-resource languages; evaluates on 40K translations and achieves an average of 44% improvement in translation quality. | Paper, Tweet |
| 2) Extracting Concepts from GPT-4 - proposes a new scalable method based on sparse autoencoders to extract around 16 million interpretable patterns from GPT-4; the method demonstrates predictable scaling and is more efficient than previous techniques. | Paper, Tweet |
| 3) Mamba-2 - a new architecture that combines state space models (SSMs) and structured attention; it uses 8x larger states and trains 50% faster; the new state space duality layer is more efficient and scalable compared to the approach used in Mamba; it also improves results on tasks that require large state capacity. | Paper, Tweet |
| 4) MatMul-free LLMs - proposes an implementation that eliminates matrix multiplication operations from LLMs while maintaining performance at billion-parameter scales; the performance gap between full-precision Transformers and the MatMul-free models narrows as the model size increases; claims that by using an optimized kernel during inference, memory consumption is reduced by more than 10x. | Paper, Tweet |
| 5) Buffer of Thoughts - presents a thought-augmented reasoning approach to enhance the accuracy, efficiency, and robustness of LLM-based reasoning; it leverages a meta-buffer containing high-level thoughts (thought templates) distilled from problem-solving processes; the relevant thought template is then retrieved and instantiated with task-specific reasoning structures for the thought-augmented reasoning process; it demonstrates SOTA performance on 10 challenging tasks while requiring 12% of the cost of multi-query prompting methods like Tree-of-Thoughts. | Paper, Tweet |
| 6) SaySelf - a training framework to teach LLMs to express more accurate fine-grained confidence estimates and self-reflective rationales; it performs supervised finetuning on a dataset that contains summaries of the difference between multiple reasoning chains; reinforcement learning is then applied to calibrate confidence estimates, encouraging the LLM to produce accurate, high-confidence predictions and penalizing overconfidence in erroneous outputs. | Paper, Tweet |
| 7) The Geometry of Concepts in LLMs - studies the geometry of categorical concepts and how the hierarchical relations between them are encoded in LLMs; finds that simple categorical concepts are represented as simplices by the LLMs and complex concepts are represented as polytopes constructed from direct sums of simplices, which reflect the hierarchical structure. | Paper, Tweet |
| 8) Aligning LLMs with Demonstrated Feedback - proposes a method to align LLMs to a specific setting via a very small number of demonstrations as feedback; it aligns LLM outputs to a user’s demonstrated behaviors and can learn fine-grained style and task alignment across domains; outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks. | Paper, Tweet |
| 9) Towards Scalable Automated Alignment of LLMs - provides an overview of methods used for alignment of LLMs; explores the 4 following directions: 1) aligning through inductive bias, 2) aligning through behavior imitation, 3) aligning through model feedback, and 4) aligning through environment feedback. | Paper, Tweet |
| 10) AgentGym - a new framework featuring various environments and tasks for broad, real-time, and concurrent agent exploration; builds a generally capable LLM-based agent with self-evolution abilities and explores its potential beyond previously seen data across tasks and environments. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Contextual Position Encoding - proposes a new position encoding method, CoPE, to enable the position to be conditioned on context by incrementing position only on certain tokens; the position encoding is context-dependent and can represent different levels of position abstraction; the general position encoding method can attend to the i-th particular word, noun, or sentence; improves perplexity on language modeling and coding tasks. | Paper, Tweet |
| 2) Symbolic Chain-of-Thought - proposes a method that improves the logical reasoning capabilities of LLMs by integrating symbolic expressions and logical rules with chain-of-thought (CoT) prompting; the prompting technique is called Symbolic Chain-of-Thought and it’s a fully LLM-based framework with the following key steps: 1) translates natural language context to symbolic format, 2) derives step-by-step plan to solve problems following symbolic logical rules, and 3) uses a verifier to check the translation and reasoning chain. | Paper, Tweet |
| 3) Abacus Embeddings - achieves 99% accuracy on 100-digit addition problems by training on only 20-digit numbers with a single GPU; the main challenge this work addresses is the inability of transformers to track the exact position of digits; they do this by adding an embedding to each digit that encodes its position relative to the start of the number; these gains also transfer to multi-step reasoning tasks that include sorting and multiplication. | Paper, Tweet |
| 4) Introduction to Vision-Language Modeling - presents an introduction to vision-language models along with key details of how they work and how to effectively train these models. | Paper, Tweet |
| 5) GNN-RAG - combines the language understanding abilities of LLMs with the reasoning abilities of GNNs in a RAG style; the GNN extracts useful and relevant graph information while the LLM takes the information and leverages its capabilities to perform question answering over knowledge graphs (KGQA); GNN-RAG improves vanilla LLMs on KGQA and outperforms or matches GPT-4 performance with a 7B tuned LLM. | Paper, Tweet |
| 6) Attention as an RNN - presents a new attention mechanism that can be trained in parallel (like Transformers) and be updated efficiently with new tokens, requiring constant memory at inference time (like RNNs); the attention formulation is based on the parallel prefix scan algorithm which enables efficient computation of attention’s many-to-many RNN output; achieves comparable performance to Transformers on 38 datasets while being more time and memory-efficient. | Paper, Tweet |
| 7) Aya23 - a family of multilingual language models that can serve up to 23 languages; it intentionally focuses on fewer languages and allocates more capacity to these languages; shows that it can outperform other massively multilingual models on those specific languages. | Paper, Tweet |
| 8) Are Long-LLMs A Necessity For Long-Context Tasks? - claims that long-LLMs are not a necessity to solve long-context tasks; proposes a reasoning framework to enable short-LLMs to address long-context tasks by adaptively accessing and utilizing the context based on the presented tasks; it decomposes the long context into short contexts and processes them using a decision-making process. | Paper, Tweet |
| 9) Financial Statement Analysis with LLMs - claims that LLMs can generate useful insights from their analysis of trends and financial ratios; shows that GPT-4 performs on par with narrowly specialized models and that a trading strategy based on GPT’s predictions is profitable. | Paper, Tweet |
| 10) SimPO - a simpler and more effective approach for preference optimization with a reference-free reward; uses the average log probability of a sequence as an implicit reward (i.e., no reference model required) which makes it more compute and memory efficient; demonstrates that it outperforms existing approaches like DPO and claims to produce the strongest 8B open-source model. | Paper, Tweet |
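
The reference-free reward described in item 10 above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the reward is the length-averaged log probability scaled by `beta`, and the pairwise loss also subtracts a target margin `gamma`, which follows my reading of the SimPO paper rather than the summary above. The function names and default values are assumptions.

```python
import math


def avg_logprob(token_logprobs):
    """Length-normalized sequence log probability (the implicit reward, before scaling)."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=0.5):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - gamma)).

    Each reward is beta * average token log probability under the policy,
    so no reference model is involved. beta and gamma defaults are
    illustrative assumptions, not tuned values.
    """
    r_chosen = beta * avg_logprob(chosen_logprobs)
    r_rejected = beta * avg_logprob(rejected_logprobs)
    margin = r_chosen - r_rejected - gamma
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    good = [-0.1, -0.2, -0.1]   # per-token log probs of the preferred response
    bad = [-2.0, -3.0]          # per-token log probs of the rejected response
    print(simpo_loss(good, bad), simpo_loss(bad, good))
```

Because the reward is averaged over tokens, a longer response does not earn a lower reward merely for being longer, which is the main contrast with summed log-probability rewards.
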
| Paper | Links |
|---|---|
| 1) Extracting Interpretable Features from Claude 3 Sonnet - presents an effective method to extract millions of abstract features from an LLM that represent specific concepts; these concepts could represent people, places, programming abstractions, emotion, and more; reports that some of the discovered features are directly related to the safety aspects of the model; finds features directly related to security vulnerabilities and backdoors in code, bias, deception, sycophancy, dangerous/criminal content, and more; these features are also used to intuitively steer the model’s output. | Paper, Tweet |
| 2) Agent Planning with World Knowledge Model - introduces a parametric world knowledge model to facilitate agent planning; the agent model can self-synthesize knowledge from expert and sampled trajectories; this is used to train the world knowledge model; prior task knowledge is used to guide global planning and dynamic state knowledge is used to guide the local planning; demonstrates superior performance compared to various strong baselines when adopting open-source LLMs like Mistral-7B and Gemma-7B. | Paper, Tweet |
| 3) Risks and Opportunities of Open-Source Generative AI - analyzes the risks and opportunities of open-source generative AI models; argues that the overall benefits of open-source generative AI outweigh its risks. | Paper, Tweet |
| 4) Enhancing Answer Selection in LLMs - proposes a hierarchical reasoning aggregation framework for improving the reasoning capabilities of LLMs; the approach, called Aggregation of Reasoning (AoR), selects answers based on the evaluation of reasoning chains; AoR uses dynamic sampling to adjust the number of reasoning chains with respect to the task complexity; it uses results from the evaluation phase to determine whether to sample additional reasoning chains; a known flaw of majority voting is that it fails in scenarios where the correct answer is in the minority; AoR focuses on evaluating the reasoning chains to improve the selection of the final answer; AoR outperforms various prominent ensemble methods and can be used with various LLMs to improve performance on complex reasoning tasks. | Paper, Tweet |
| 5) How Far Are We From AGI - presents an opinion paper addressing important questions to understand the proximity to artificial general intelligence (AGI); it provides a summary of strategies necessary to achieve AGI which includes a detailed survey, discussion, and original perspectives. | Paper |
| 6) Efficient Inference of LLMs - proposes a layer-condensed KV cache to achieve efficient inference in LLMs; only computes and caches the key-values (KVs) of a small number of layers which leads to saving memory consumption and improved inference throughput; can achieve up to 26x higher throughput than baseline transformers while maintaining satisfactory performance. | Paper, Tweet |
| 7) Guide for Evaluating LLMs - provides guidance and lessons for evaluating large language models; discusses challenges and best practices, along with the introduction of an open-source library for evaluating LLMs. | Paper, Tweet |
| 8) Scientific Applications of LLMs - presents INDUS, a comprehensive suite of LLMs for Earth science, biology, physics, planetary sciences, and more; includes an encoder model, embedding model, and small distilled models. | Paper, Tweet |
| 9) DeepSeek-Prover - introduces an approach to generate Lean 4 proof data from high-school and undergraduate-level mathematical competition problems; it uses the synthetic data, comprising 8 million formal statements and proofs, to fine-tune a DeepSeekMath 7B model; achieves whole-proof generation accuracies of 46.3% with 64 samples and 52% cumulatively on the Lean 4 miniF2F test; this surpasses the baseline GPT-4 (23.0%) with 64 samples and a tree search RL method (41.0%). | Paper, Tweet |
| 10) Efficient Multimodal LLMs - provides a comprehensive and systematic survey of the current state of efficient multimodal large language models; discusses efficient structures and strategies, applications, limitations, and promising future directions. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) GPT-4o - a new model with multimodal reasoning capabilities with real-time support across audio, vision, and text; it can accept as input any combination of text, audio, image, and video to generate combinations of text, audio, and image outputs; it’s reported to match GPT-4 Turbo performance while being much faster and 50% cheaper via the API. | Paper, Tweet |
| 2) Gemini 1.5 Flash - a lightweight transformer decoder model with a 2M context window with multimodal capabilities; it is designed for efficiency and yields the fastest output generation of all models on several evaluated languages; overall, Gemini 1.5 Flash performs uniformly better compared to Gemini 1.0 Pro and even performs at a similar level to 1.0 Ultra on several benchmarks. | Paper, Tweet |
| 3) Veo - Google DeepMind’s most capable video generation model; it generates high-quality, 1080p resolution videos beyond 1 minute; it supports masked editing on videos and can also generate videos with an input image along with text; the model can extend video clips to 60 seconds and more while keeping consistency with its latent diffusion transformer. | Paper, Tweet |
| 4) Chameleon - a family of token-based mixed-modal models for generating images and text in any arbitrary sequence; reports state-of-the-art performance in image captioning and outperforms Llama 2 in text-only tasks and is also competitive with Mixtral 8x7B and Gemini-Pro; exceeds the performance of Gemini Pro and GPT-4V on a new long-form mixed-modal generation evaluation. | Paper, Tweet |
| 5) Fine-tuning and Hallucinations - studies the impact of fine-tuning on new knowledge on the hallucination tendencies of LLMs; the setup includes fine-tuning examples that include new knowledge; shows that LLMs struggle to acquire new factual knowledge via fine-tuning; also finds that as new knowledge is learned it increases the model’s tendency to hallucinate. | Paper, Tweet |
| 6) Zero-shot Tokenizer Transfer - trains a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings; it demonstrates generalization to new tokenizers both with encoder and decoder LLMs; reports that the method achieves performance close to the original models' performance in cross-lingual and coding tasks while reducing the length of the tokenized sequence. | Paper, Tweet |
| 7) WavCraft - leverages LLMs to connect task-specific models for audio content creation and editing; decomposes users' instructions into several tasks and tackles each task collaboratively with the particular module; it can enable users to interact and produce audio content without explicit commands. | Paper |
| 8) RLHF Workflow - provides an easily reproducible recipe for online iterative RLHF; discusses theoretical insights and algorithmic principles of online iterative RLHF and practical implementation. | Paper, Tweet |
| 9) You Only Cache Once - a decoder-decoder LLM architecture that only caches key-value pairs once; it involves a cross-decoder stacked upon a self-decoder; the self-decoder efficiently encodes global key-value caches and the cross-decoder reuses the cache via cross-attention; this leads to a significant reduction in GPU memory use without sacrificing capabilities; achieves comparable performance to Transformer in various settings of scaling up model size and number of training tokens. | Paper, Tweet |
| 10) CAT3D - presents a method for creating anything in 3D by simulating the real-world capture process using a multi-view diffusion model; it can generate consistent novel views of a scene which can be used as input to 3D reconstruction techniques to produce 3D representation rendered in real-time; the scene from CAT3D can be generated in less than one minute and is reported to outperform existing methods on single image and few-view 3D scene creation tasks. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) AlphaFold 3 - releases a new state-of-the-art model for accurately predicting the structure and interactions of molecules; it can generate the 3D structures of proteins, DNA, RNA, and smaller molecules; the model uses an improved version of the Evoformer module and then assembles its predictions using a diffusion network; the diffusion process starts with a cloud of atoms which converges to its final molecular structure. | Paper, Tweet |
| 2) xLSTM: Extended Long Short-Term Memory - attempts to scale LSTMs to billions of parameters using the latest techniques from modern LLMs and mitigating common limitations of LSTMs; to give LSTMs the ability to revise storage decisions, they introduce exponential gating and a new memory mixing mechanism (termed sLSTM); to enhance the storage capacities of LSTMs, they add a matrix memory and a covariance update rule (termed mLSTM); both the sLSTM and mLSTM cells stabilize their exponential gates using the same technique; these extensions lead to xLSTM blocks that are residually stacked into the final xLSTM architecture; compared to Transformers, xLSTMs have linear computation and constant memory complexity with respect to the sequence length; the xLSTM architecture is shown to be efficient at handling different aspects of long context problems; achieves better validation perplexities when compared to different model classes like Transformers, SSMs, and RNNs. | Paper, Tweet |
| 3) DeepSeek-V2 - a strong MoE model comprising 236B parameters, of which 21B are activated for each token; supports a context length of 128K tokens and uses Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache into a latent vector; DeepSeek-V2 and its chat versions achieve top-tier performance among open-source models. | Paper, Tweet |
| 4) AlphaMath Almost Zero - enhances LLMs with Monte Carlo Tree Search (MCTS) to improve mathematical reasoning capabilities; the MCTS framework extends the LLM to achieve a more effective balance between exploration and exploitation; for this work, the idea is to generate high-quality math reasoning data without professional human annotations; the assumption is that a well pre-trained LLM already possesses mathematical knowledge to generate reasoning steps but needs better stimulation such as an advanced prompting or search strategy; unlike other methods such as Program-of-thought and Chain-of-thought, no solutions are required for the training data, just the math questions and the answers; the integration of LLMs, a value model, and the MCTS framework enables an effective and autonomous process of generating high-quality math reasoning data; the value model also aids the policy model in searching for effective solution paths. | Paper, Tweet |
| 5) DrEureka: Language Model Guided Sim-To-Real Transfer - investigates using LLMs to automate and accelerate sim-to-real design; it requires the physics simulation for the target task and automatically constructs reward functions and domain randomization distributions to support real-world transfer; discovers sim-to-real configurations competitive with existing human-designed ones on quadruped locomotion and dexterous manipulation tasks. | Paper, Tweet |
| 6) Consistency LLMs - proposes efficient parallel decoders that reduce inference latency by decoding an n-token sequence per inference step; the inspiration for this work comes from the human ability to form complete sentences before articulating word by word; this process can be mimicked and learned through fine-tuning pre-trained LLMs to perform parallel decoding; it is trained to perform parallel decoding by mapping randomly initialized n-token sequences to the same result yielded by autoregressive (AR) decoding in as few steps as possible; a consistency loss helps with multiple-token prediction and a standard AR loss prevents deviation from the target LLM and ensures generation quality; shows 2.4x to 3.4x improvements in generation speed while preserving the generation quality. | Paper, Tweet |
| 7) Is Flash Attention Stable? - develops an approach to understanding the effects of numeric deviation and applies it to the widely-adopted Flash Attention optimization; finds that Flash Attention sees roughly an order of magnitude more numeric deviation as compared to Baseline Attention at BF16. | Paper, Tweet |
| 8) Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond - presents an overview of generative methodologies in video generation, where world models facilitate the synthesis of highly realistic visual content; examines challenges and limitations of world models, and discusses their potential future directions. | Paper, Tweet |
| 9) MAmmoTH2 - harvests 10 million naturally existing instruction examples from the pre-training web corpus to enhance LLM reasoning; the approach first recalls relevant documents, extracts instruction-response pairs, and then refines the extracted pairs using open-source LLMs; MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K. | Paper, Tweet |
| 10) Granite Code Models - introduces Granite, a series of code models trained with code written in 116 programming languages; it consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from application modernization tasks to on-device memory-constrained use cases; demonstrates that the models reach state-of-the-art performance among available open-source code LLMs. | Paper, Code, Tweet |
| Paper | Links |
|---|---|
| 1) Kolmogorov-Arnold Networks - proposes Kolmogorov-Arnold Networks (KANs) as alternatives to Multi-Layer Perceptrons (MLPs); KANs apply learnable activation functions on edges that represent the weights; with no linear weights used, KANs can outperform MLPs and possess faster neural scaling laws; the authors show that KANs can be used as collaborators to help scientists discover mathematics and physical laws. | Paper, Tweet |
| 2) Better and Faster LLMs via Multi-token Prediction - proposes a multi-token prediction approach that performs language modeling by training the model to predict the following n tokens using n independent output heads; the output heads operate on top of a shared transformer trunk; multi-token prediction is shown to be useful when using larger model sizes and can speed up inference up to 3x; the proposed 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. | Paper, Tweet |
| 3) Med-Gemini - presents a family of multimodal models specialized in medicine and based on the strong multimodal and long-context reasoning capabilities of Gemini; achieves state-of-the-art performance on 10/14 benchmarks surpassing GPT-4 models; it achieves 91% accuracy on MedQA (USMLE) benchmark using an uncertainty-guided search strategy. | Paper, Tweet |
| 4) When to Retrieve? - presents an approach to train LLMs to effectively utilize information retrieval; it first proposes a training approach to teach an LLM to generate a special token when it's not confident or doesn't know the answer to a question; the fine-tuned model outperforms a base LLM in two fixed alternate settings that include never retrieving and always retrieving context. | Paper, Tweet |
| 5) A Survey on Retrieval-Augmented Language Models - covers the most important recent developments in RAG and RAU systems; it includes evolution, taxonomy, and an analysis of applications; there is also a section on how to enhance different components of these systems and how to properly evaluate them; it concludes with a section on limitations and future directions. | Paper, Tweet |
| 6) An Open-source LM Specialized in Evaluating Other LMs - open-sources Prometheus 2 (7B & 8x7B), state-of-the-art open evaluator LLMs that closely mirror human and GPT-4 judgments; they support both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria; according to the experimental results, this open-source model seems to be the strongest among all open evaluator LLMs; the key seems to be in merging evaluator LMs trained on either direct assessment or pairwise ranking formats. | Paper, Tweet |
| 7) Self-Play Preference Optimization - proposes a self-play-based method for aligning language models; this optimization procedure treats the problem as a constant-sum two-player game to identify the Nash equilibrium policy; it addresses the shortcomings of DPO and IPO and effectively increases the log-likelihood of chosen responses and decreases that of rejected ones; SPPO outperforms DPO and IPO on MT-Bench and the Open LLM Leaderboard. | Paper, Tweet |
| 8) Inner Workings of Transformer Language Models - presents a technical introduction to current techniques used to interpret the inner workings of Transformer-based language models; it provides a detailed overview of the internal mechanisms implemented in these models. | Paper, Tweet |
| 9) Multimodal LLM Hallucinations - provides an overview of the recent advances in identifying, evaluating, and mitigating hallucination in multimodal LLMs; it also provides an overview of causes, evaluation benchmarks, metrics, and other strategies to deal with challenges related to detecting hallucinations. | Paper, Tweet |
| 10) In-Context Learning with Long-Context Models - studies the behavior of in-context learning in LLMs at extreme context lengths with long-context models; shows that performance increases as hundreds or thousands of demonstrations are used; demonstrates that long-context ICL is less sensitive to random input shuffling than short-context ICL; concludes that the effectiveness of long-context LLMs is not due to task learning but from attending to similar examples. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Phi-3 - Microsoft's Phi-3 is a family of small language models (3.8B, 7B, 14B) trained on 3.3-4.8T tokens of heavily filtered web data combined with synthetic data. The flagship phi-3-mini rivals Mixtral 8x7B and GPT-3.5 while being small enough to run locally on a phone. ● Data quality over scale: Training prioritizes curated web and synthetic data over raw token volume, suggesting that data quality is the main lever for small-model capability rather than sheer parameter count. ● Benchmark results: phi-3-mini reaches 69% on MMLU and 8.38 on MT-bench; phi-3-small hits 75% on MMLU and phi-3-medium hits 78%, closing much of the gap with models an order of magnitude larger. ● Long-context variant: A phi-3-mini-128K version extends the default 4K window to 128K tokens while preserving the base model's quality. ● On-device deployment: The 3.8B mini model can be quantized and deployed on an iPhone 14, making it one of the first genuinely usable LLMs at that form factor. | Paper, Tweet |
| 2) OpenELM - Apple's OpenELM is a fully-open small language model family (270M, 450M, 1.1B, 3B) that uses layer-wise parameter scaling instead of uniform layer widths. At ~1B parameters it improves on OLMo by 2.36% accuracy while using half the pre-training tokens. ● Layer-wise scaling: Rather than allocating parameters uniformly across transformer layers, OpenELM adjusts the width of each layer to place capacity where it matters most for downstream performance. ● Efficient training: The ~1B variant achieves 2.36% higher accuracy than OLMo while requiring 2x fewer pre-training tokens, showing that architectural choices can match or beat raw data scale for small models. ● Complete open release: Apple ships not just weights but training logs, intermediate checkpoints, pre-training configs, and MLX inference code - a far more complete release than most commercial labs offer. ● Reproducibility focus: The release explicitly targets open research, giving the community the artifacts needed to re-trace and audit every step of the training pipeline. | Paper, Tweet |
| 3) Arctic - Snowflake's Arctic is an Apache 2.0 open LLM with a Dense-MoE Hybrid transformer (480B total / 17B active) that matches Llama 3 70B on enterprise metrics while using under 3K GPU weeks (~$2M) of training compute - roughly 17x less than Llama 3 70B. ● Dense-MoE Hybrid architecture: A 10B dense backbone is paired with 128 fine-grained 3.66B experts selected via top-2 gating, balancing model capacity with communication efficiency across GPUs. ● Enterprise-first curriculum: A three-stage data curriculum emphasizes SQL, coding, and instruction-following over broad world knowledge, yielding strong results on Spider (SQL), HumanEval+/MBPP+ (code), and IFEval. ● Compute efficiency: Training ran in under 3K GPU weeks for roughly $2M, enabled by architecture-system co-design that overlaps communication with computation to hide MoE routing latency. ● Fully open release: Apache 2.0 weights, training code, serving code, fine-tuning pipelines, and a detailed cookbook, available across HuggingFace, NVIDIA API, AWS, and Azure. | Paper, Tweet |
| 4) Make Your LLM Fully Utilize the Context (FILM-7B) - FILM-7B targets the lost-in-the-middle problem where long-context LLMs fail to retrieve information buried between the start and end of their input. The authors apply an information-intensive (IN2) training recipe to Mistral-7B that forces uniform attention across the full 32K window. ● IN2 synthetic data: Training examples are constructed so the answer requires either fine-grained awareness of a ~128-token segment placed anywhere in a 4K-32K context, or integration of information from two or more such segments. ● Position robustness: After IN2 training, FILM-7B retrieves reliably from arbitrary positions across document, code, and structured-data contexts, with forward, backward, and bidirectional retrieval patterns all supported. ● Real-world gains with minimal regression: NarrativeQA F1 jumps from 23.5 to 26.9 and short-context capability is effectively preserved (MMLU drops only 59.3 to 59.2), showing the fix is targeted rather than disruptive. ● Generalizable recipe: IN2 training is positioned as a cheap, bolt-on remedy that can be applied to existing long-context models to correct position-dependent attention failures systematically. | Paper, Tweet |
| 5) FineWeb - HuggingFace's FineWeb is a 15 trillion token English web dataset built from 96 CommonCrawl snapshots (2013-2024). In 1.8B-parameter ablations, models trained on FineWeb beat C4, RefinedWeb, Dolma, The Pile, SlimPajama, and RedPajama2 across aggregated benchmarks. ● Scale: 15T tokens spanning 52.5B documents (~50TB on disk), with released sample subsets at 10B, 100B, and 350B tokens for smaller experiments. ● Filtering pipeline: Built on the open-source datatrove library - URL blocklists, Trafilatura text extraction, FastText English filter (>0.65), Gopher + C4 quality filters, custom FineWeb heuristics, MinHash deduplication, and PII anonymization. ● Per-dump deduplication insight: Ablations show per-dump MinHash dedup beats global dedup, a non-obvious finding that shaped the final pipeline and contradicts a common assumption that more aggressive dedup is always better. ● Fully reproducible: Released under ODC-By 1.0 with the complete datatrove pipeline and all ablation model checkpoints public, making it one of the most transparent large-scale web corpora to date. | Paper, Tweet |
| 6) AI-powered Gene Editors - Profluent's OpenCRISPR-1 paper demonstrates that a large protein language model trained on biological diversity at scale can design programmable gene editors from scratch. The AI-designed editors successfully perform precision editing in the human genome. | Paper, Tweet |
| 7) AutoCrawler - AutoCrawler is a two-stage framework that combines LLMs with the hierarchical structure of HTML to auto-generate reusable web scrapers. Wrapper-based scrapers break on new sites and pure LLM agents don't reuse well across pages; AutoCrawler addresses both limitations. ● Hierarchical DOM understanding: The agent walks the HTML tree with top-down and step-back operations, progressively refining its understanding of a page before emitting a complete and executable scraper. ● Similarity across pages: Patterns learned on one page of a site generalize to structurally similar pages, making the generated scraper durable rather than a one-shot extraction. ● New executability metric: The paper introduces an executability metric for evaluating scraper-generation systems, filling a gap in prior benchmarks that focused only on extraction accuracy. ● EMNLP 2024: Experiments across multiple LLM backends validate the framework on diverse websites; the work was accepted to EMNLP 2024. | Paper, Tweet |
| 8) Graph Machine Learning in the Era of LLMs - This survey maps the intersection of Graph ML and LLMs, covering both how LLMs enhance graph learning and how graphs (especially knowledge graphs) strengthen LLMs. The authors organize the literature into a taxonomy and highlight where open problems remain. ● Dual-direction coverage: Two complementary threads are surveyed - LLMs augmenting GNNs (feature quality, OOD generalization, few-shot learning) and graphs augmenting LLMs (knowledge grounding in pre-training and inference). ● Taxonomy of methods: Existing work is categorized by how LLMs interact with graphs - as feature extractors, as predictors, or as integral components of graph pipelines. ● Core problem domains: The paper explicitly covers graph heterogeneity, out-of-distribution generalization, explainability, and hallucination mitigation as key challenges at the intersection. ● Open directions: Identifies gaps in practical applications, reliable factual grounding from knowledge graphs, and broader empirical evaluation of graph-language approaches. | Paper, Tweet |
| 9) Self-Evolution of LLMs - This survey organizes the emerging literature on self-evolving LLMs - models that improve through their own generated experience rather than additional human supervision. The authors propose a unified four-phase cycle and taxonomize existing methods across both standalone models and agent systems. ● Four-phase cycle: Self-evolution is framed as iterative cycles of experience acquisition, experience refinement, model updating, and evaluation, mirroring how humans learn from practice. ● Two application domains: The taxonomy separates standalone-LLM self-evolution (e.g., self-instruct, self-reward) from LLM-agent self-evolution (e.g., tool-use refinement, long-horizon planning). ● Motivation: Reduce dependence on costly human annotation and break through plateaus as task complexity grows, with self-evolution positioned as a possible path toward more autonomous capability growth. ● Open directions: The paper closes with concrete gaps - stable update dynamics, evaluation of self-evolved capabilities, and safety considerations as self-improvement loops tighten. | Paper, Tweet |
| 10) Naturalized Execution Tuning (NExT) - NExT teaches LLMs to reason about program runtime behavior by generating synthetic chain-of-thought rationales over execution traces. The approach bootstraps training data through self-training rather than manual annotation, and the learned reasoning transfers to scenarios where traces are unavailable at inference. ● Execution-aware rationales: The model inspects variable states at each step of a program's execution and produces natural-language rationales explaining what is happening, then fine-tunes on those rationales. ● Self-training bootstrap: No human-annotated rationales are needed - rationales for correct repairs are kept and used to teach the model, scaling naturally across many programs. ● Large repair gains: On PaLM 2, NExT improves fix rate by +26.1% on MBPP and +14.3% on HumanEval (absolute), with both automated metrics and human evaluators rating the rationales as higher quality. ● Trace-free generalization: At inference the model applies the same reasoning patterns without live execution traces, showing it learned transferable execution-aware reasoning rather than trace-copying. | Paper, Tweet |
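
The NExT entry above describes rationales written over execution traces, i.e. the variable states at each step of a program. As a rough illustration of what such a trace can look like, here is a minimal sketch that records local variables line by line using Python's standard `sys.settrace` hook. It is not the tooling used in the paper; `trace_locals` and `last_item_bug` are hypothetical names introduced only for this example.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args) and return (result, steps).

    Each step is (line offset inside func, snapshot of the local
    variables taken just before that line executes).
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore frames that belong to other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, steps


def last_item_bug(xs):
    """Hypothetical buggy 'sum': overwrites total instead of adding to it."""
    total = 0
    for x in xs:
        total = x
    return total


result, steps = trace_locals(last_item_bug, [1, 2, 3])
```

Reading `steps` shows `total` being replaced on every iteration rather than accumulated, which is the kind of runtime evidence a repair rationale could point to.
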
| Paper | Links |
|---|---|
| 1) Llama 3 - Meta's Llama 3 launches with 8B and 70B pretrained and instruction-tuned variants. Llama 3 8B beats Gemma 7B and Mistral 7B Instruct, and Llama 3 70B is competitive with Gemini Pro 1.5 and Claude 3 Sonnet on standard benchmarks. ● Sizes and release: Meta ships 8B and 70B base and Instruct variants first; larger 400B+ models are still training and planned for later releases with multimodality and longer context. ● Training data: Pretrained on 15T+ tokens (7x more than Llama 2) including 4x more code, with 5%+ non-English across 30+ languages. A new 128K-token tokenizer improves encoding efficiency by roughly 15%. ● Benchmark results: The 70B Instruct model wins human preference rankings against Claude Sonnet, Mistral Medium, and GPT-3.5 across 12 use-case categories; the 8B sets a new state of the art for open models in its size class. ● Open deployment: Weights are available across AWS, HuggingFace, Databricks, Google Cloud, Azure, and more, alongside a safety suite - Llama Guard 2, Code Shield, and CyberSec Eval 2 - for responsible deployment. | Paper, Tweet |
| 2) Mixtral 8x22B - Mistral's Mixtral 8x22B is a sparse Mixture-of-Experts model with 141B total / 39B active parameters and a 64K context window, released under Apache 2.0. It leads open models on MMLU and posts strong math, code, and multilingual numbers. ● Sparse MoE setup: 8 experts with 2 active per token yields 39B active parameters out of 141B total, giving Llama-2-70B-class quality at roughly half the inference cost. ● Multilingual and tooling: Fluent across English, French, Italian, German, and Spanish, with native function calling built in - a practical unlock for agent and tool-use pipelines. ● Benchmark results: Best-in-class among open models on MMLU; 90.8% on GSM8K (maj@8), 44.6% on MATH (maj@4), and state-of-the-art HumanEval pass@1 among contemporary open-weight models. ● License and openness: Apache 2.0 with weights freely available, explicitly framed by Mistral as promoting innovation and unrestricted deployment. | Paper, Tweet |
| 3) Chinchilla Scaling: A replication attempt - This paper re-examines the third estimation procedure behind the Chinchilla scaling law of Hoffmann et al. (2022) and finds it is inconsistent with the original paper's own first two methods, fails to fit the extracted data, and reports implausibly narrow confidence intervals. ● What was audited: Chinchilla proposed three independent methods to estimate the compute-optimal ratio of parameters to training tokens; this paper digs into the parametric-loss-fitting approach (method 3). ● Inconsistent estimates: The published estimates from method 3 do not match the predictions of methods 1 and 2, and the parametric fit does not actually pass through the reconstructed data points. ● Implausible confidence intervals: The reported intervals would statistically require over 600,000 training runs, whereas the authors likely ran fewer than 500 - suggesting methodological errors in the uncertainty quantification. ● Rederivation: A corrected fit using method 3 produces scaling estimates that are consistent with methods 1 and 2, restoring internal coherence and slightly revising the compute-optimal guidance. | Paper, Tweet |
| 4) How Faithful are RAG Models? (ClashEval) - ClashEval constructs a 1,200-question benchmark across six domains with intentionally corrupted retrieved documents to measure when RAG helps and when it misleads GPT-4 and other top LLMs. ● Controlled conflict setup: Questions span drug dosages, Olympic records, locations, and other verifiable facts. Retrieved documents are perturbed from subtle to obvious errors so the authors can study model behavior under realistic vs implausible corruption. ● RAG can override correct priors: LLMs abandon their correct internal knowledge over 60% of the time when the retrieved document is wrong but plausible - a strong demonstration that retrieval can hurt as much as help. ● Prior strength matters: The weaker a model's initial confidence (measured via token probabilities), the more it capitulates to the retrieved content. Models with strong, high-probability priors resist incorrect retrieval more effectively. ● Simple interventions help: The authors show that exploiting confidence signals - for example, gating acceptance of retrieved content on the model's prior probability - measurably improves accuracy under conflicting information. | Paper, Tweet |
| 5) A Survey on Retrieval-Augmented Text Generation for LLMs - This survey organizes the RAG literature into a four-stage framework (pre-retrieval, retrieval, post-retrieval, generation) and traces the paradigm's evolution alongside open challenges. ● Four-pillar framework: The survey breaks each RAG system into pre-retrieval (indexing, query formulation), retrieval (dense/sparse/hybrid), post-retrieval (reranking, compression), and generation (prompt assembly and synthesis). ● Evolution of the paradigm: Traces RAG from early dense-retrieval + reader pipelines to modern multi-hop, agentic, and graph-based variants, highlighting how each pillar has been refined. ● Evaluation methodology: Covers benchmarks and metrics for retrieval quality, faithfulness, and end-task performance, noting that evaluation remains a bottleneck for comparing systems fairly. ● Open directions: Identifies gaps in multimodal RAG, long-context RAG, real-time and streaming RAG, and safe integration with agent workflows. | Paper, Tweet |
| 6) The Illusion of State in State-Space Models - This paper proves that modern state-space models (Mamba, S4, etc.) share the same expressive ceiling as transformers: they cannot compute anything outside the TC^0 complexity class, despite the RNN-like "state" vocabulary they borrow. ● Expressive-power result: SSMs with typical parameterizations are provably confined to TC^0, which means the "state" that accumulates through the recurrence cannot simulate general sequential computation. ● Tasks they cannot solve: Permutation composition, code evaluation with branches, and entity tracking across long narratives all require state beyond TC^0 and therefore cannot be learned reliably by these SSMs. ● Transformer parity: The result places SSMs on the same theoretical footing as transformers rather than above them, pushing back against the intuition that recurrence automatically grants richer state. ● Practical implication: If you need genuine state tracking (interpreters, stateful agents, long-horizon planning), architectural changes beyond current SSMs are required - a clean "state is an illusion" framing for the field. | Paper, Tweet |
| 7) Reducing Hallucination in Structured Outputs via RAG - This paper deploys a compact RAG pipeline - small retriever plus small LM - for an enterprise workflow-generation task and shows it reduces hallucination while improving out-of-domain generalization vs a baseline LLM. ● Target setting: A production system that turns natural-language requirements into executable workflows, where hallucinated fields or missing steps break the downstream pipeline. ● Small retriever + small LM: Instead of a massive generator, the authors train a specialized retriever encoder and pair it with a much smaller LM, cutting compute and memory without losing output quality. ● Hallucination and generalization gains: The RAG-augmented small system reduces factual errors in the structured output and generalizes better to out-of-domain inputs than the baseline LM alone. ● Deployment implication: The setup shows that high-quality structured generation does not require frontier-sized LLMs - a disciplined retriever + small LM can be cheaper to run and easier to productionize. | Paper, Tweet |
| 8) Emerging AI Agent Architectures - A short survey mapping the current landscape of LLM-based agent architectures, focused on reasoning, planning, and tool calling as the three capability pillars for complex agentic workflows. ● Capability pillars: Reasoning, planning, and tool/API execution are treated as the core primitives; most modern agent frameworks are characterized as different combinations and orchestrations of these three. ● Single- vs multi-agent patterns: The survey separates single-agent architectures (ReAct-style loops, tool-augmented chains) from multi-agent patterns (leader/follower, debate, specialized role teams) and contrasts their trade-offs. ● Phases and meta-design: Describes how planning, execution, and reflection phases combine inside agents, and how choices like leadership structure and communication style materially affect reliability. ● Honest assessment: The survey explicitly calls out present-day limitations - brittleness, evaluation difficulty, and the gap between demos and deployable systems - grounding future-direction discussions. | Paper, Tweet |
| 9) LM In-Context Recall is Prompt Dependent - Using needle-in-a-haystack tests across multiple models, this paper shows that in-context recall is highly sensitive to prompt wording and that training data biases can silently degrade a model's ability to retrieve from its own context. ● Needle-in-a-haystack methodology: A factoid is embedded at various positions inside a long filler context, and recall is measured as prompt length and needle depth vary across several frontier LLMs. ● Prompt sensitivity: Small rewordings of the query can move recall accuracy dramatically, indicating that existing "context-window size" numbers overstate practical recall capability. ● Training-data interference: When the needle conflicts with content the model likely saw in pre-training, the model often returns its memorized answer instead of the in-context fact - a subtle but important failure mode. ● Paths to improve recall: The paper shows that larger size, stronger attention mechanisms, alternative training objectives, and targeted fine-tuning each independently improve recall under prompt variation. | Paper, Tweet |
| 10) A Survey on State Space Models - A comprehensive survey of modern SSMs with a principles-first walkthrough, taxonomy of existing variants, and experimental comparison across NLP, vision, graph, multimodal, point-cloud, event-stream, and time-series tasks. ● Principles first: The survey introduces the core SSM recurrence, discretization, and parallel-scan tricks up front so readers can reason about why variants like S4, Mamba, and H3 differ. ● Broad variant coverage: Catalogs architectural and parameterization choices across major SSM families, highlighting which choices matter for which modalities. ● Cross-domain applications: Demonstrates how SSMs have been applied beyond language - vision backbones, graph learning, multimodal fusion, and long time-series modeling - with comparative results. ● Open challenges: Identifies theoretical limits (e.g. state capacity), scaling behavior, hybridization with attention, and hardware-efficient training as the main frontiers, alongside a live GitHub tracker. | Paper, Tweet |
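
The Mixtral 8x22B entry above describes a sparse MoE layer in which 2 of 8 experts run per token. The sketch below illustrates that routing idea in plain Python: pick the two highest router scores, normalize them with a softmax, and combine only those experts' outputs. It is a toy under stated assumptions, not Mistral's implementation; the function names, the toy experts, and the example logits are all hypothetical.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k best experts; weights sum to 1."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in order)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in order]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(order, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_logits, k):
        y = experts[index](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Eight toy "experts": expert i simply scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
logits = [0.0, 2.0, -1.0, 2.0, 0.5, -0.5, 1.0, 0.0]  # made-up router scores

routes = top_k_route(logits)
output = moe_forward([1.0, 2.0], experts, logits)
```

Because only the selected experts are evaluated, per-token compute tracks the active parameter count rather than the total, which is the effect the 39B-active / 141B-total figures describe.
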
| Paper | Links |
|---|---|
| 1) Leave No Context Behind (Infini-attention) - Google's Infini-attention extends Transformer LLMs to effectively infinite context with bounded memory and compute. It blends a compressive memory module with both masked local attention and linear long-term attention inside a single Transformer block. ● Unbounded context, bounded memory: The compressive memory stores historical key-value statistics in a fixed-size matrix, so token streaming does not grow memory over time and enables true long-form inference. ● Dual-attention block: Each Infini-attention layer combines local windowed attention for recent tokens with long-term linear attention over the compressive memory, giving the same block access to short- and long-range dependencies simultaneously. ● Empirical results: 1B and 8B Infini-Transformers outperform baseline long-context models on book summarization (500K tokens) and 1M-token passkey retrieval, while achieving a 114x memory compression ratio on long-context language modeling. ● Streaming implication: Because the memory footprint is bounded, Infini-Transformer is a clean fit for streaming inference, document-flow agents, and any setting where the input grows indefinitely. | Paper, Tweet |
| 2) OpenEQA - Meta's OpenEQA is an open-vocabulary benchmark for embodied question answering: 1,600+ human-written questions across 180+ real-world environments, with a calibrated LLM-as-judge metric that tracks human agreement closely. ● Benchmark setup: Questions cover episodic-memory use cases (smart glasses) and active-exploration use cases (mobile robots), demanding that agents reason about the environment they occupy rather than just a single image. ● LLM-as-judge scoring: The paper introduces an automatic LLM-powered evaluation protocol with strong correlation to human judgment, solving the open-vocabulary scoring problem that blocks most EQA benchmarks from scaling. ● Frontier-model performance: GPT-4V, Claude 3, and Gemini Pro significantly outperform text-only baselines, but their gains come mostly from object recognition - for several question categories, they barely beat a blind-LLM baseline. ● Gap to humans: Across the board, state-of-the-art multimodal LLMs lag well behind human performance on OpenEQA, positioning the benchmark as a concrete target for embodied-agent research. | Paper, Tweet |
| 3) CodeGemma - CodeGemma is a family of open code LLMs built on Gemma, released in 2B (pretrained), 7B (pretrained), and 7B-IT (instruction-tuned) variants. The 2B model is optimized for low-latency code completion, and the 7B-IT model leads its weight class on HumanEval. ● Three-variant family: 2B for fast on-device completion, 7B as a capable pretrained coder, and 7B-IT for chat-style code assistance, all derived from Gemma and released with open weights. ● Training recipe: Trained on 500B additional tokens of code, math, and synthetic data with a Fill-in-the-Middle objective (80% FIM rate, 50/50 PSM/SPM split), plus novel dependency-graph-based packing and unit-test-based lexical packing. ● Benchmark results: HumanEval pass@1 of 31.1% (2B), 44.5% (7B), and 56.1% (7B-IT). Single-line infilling reaches 78.4% for the 2B model, making it a strong low-latency IDE companion. ● Deployment focus: FIM tokens enable direct use in IDE auto-completion pipelines, and quantized builds are already available for llama.cpp, LM Studio, Jan, and Ollama for local deployment. | Paper, Tweet |
| 4) LM-Guided Chain-of-Thought - This paper offloads rationale generation to a small, trained LM while keeping a frozen large LM as the answer predictor. The small model is optimized with knowledge distillation and reinforcement learning so it produces rationales that steer the large model more effectively. ● Split responsibilities: A small (<1B) LM writes the chain-of-thought rationale; a frozen large (>10B) LM reads the rationale and produces the final answer. Only the small LM is trained, cutting cost sharply. ● Two-stage training: First, distill rationales from the large LM into the small one (knowledge distillation). Then fine-tune the small LM with RL using rationale-oriented and task-oriented reward signals. ● Multi-hop QA gains: Evaluated on HotpotQA and 2WikiMultiHopQA, LM-guided CoT outperforms standard prompting and vanilla CoT prompting on answer-prediction accuracy; self-consistency decoding compounds the gains. ● Cost-aware reasoning: The recipe is a pragmatic template for teams that cannot fine-tune frontier models - train a tiny rationale generator instead and leave the big model frozen as an API. | Paper, Tweet |
| 5) Best Practices and Lessons on Synthetic Data - Google DeepMind's survey-style position paper on synthetic data for LLMs. It covers applications, quality-assurance principles, and the open challenges of factuality, fidelity, bias, and privacy. ● Why synthetic data now: Addresses the shortage of large, diverse, high-quality natural data and the privacy constraints around using real user content, positioning synthetic data as a complementary source rather than a replacement. ● Quality-assurance triad: Emphasizes factuality (claims are true), fidelity (distribution matches real use), and unbiasedness (no over/under-representation) as the three tests every synthetic pipeline should run. ● Responsible use: Discusses provenance tagging, contamination risks with eval sets, and how to avoid model-collapse feedback loops when synthetic data is used for pretraining. ● Open directions: Calls out evaluation of synthetic pipelines, hybrid natural+synthetic recipes, and privacy-preserving generation (e.g., differential privacy) as the main frontiers. | Paper, Tweet |
| 6) Reasoning with Intermediate Revision and Search (THOUGHTSCULPT) - THOUGHTSCULPT is a graph-based reasoning framework that combines Monte Carlo Tree Search with an explicit revision action, letting an LLM iteratively rewrite earlier thoughts instead of only extending them. ● Revision as a first-class action: Unlike Tree-of-Thoughts, THOUGHTSCULPT allows each node to either extend or revise previous reasoning, producing an interwoven graph of thoughts rather than a pure tree. ● MCTS-driven search: Monte Carlo Tree Search navigates the solution space efficiently; evaluation is done with either domain-specific heuristics or an LLM evaluator, giving flexibility across tasks. ● Concrete gains: +30% on story outline "interestingness," +16% word success rate on mini-crosswords, and +10% concept coverage on constrained generation, all vs. competitive ToT-style baselines. ● Fit for open-ended work: Because revision is built in, THOUGHTSCULPT is especially strong on tasks like creative ideation, multi-step reasoning, and open-ended generation where first drafts are rarely optimal. | Paper, Tweet |
| 7) Overview of Multilingual LLMs - A first-of-its-kind survey on multilingual LLMs, organized by multilingual alignment principles rather than model-family hierarchy. The authors propose a unified taxonomy and collect open resources to accelerate future research. ● Alignment-first organization: The survey groups methods by how they align languages internally (shared embeddings, cross-lingual pre-training, translation-based alignment, in-context alignment) rather than by model family. ● Unified taxonomy: Provides a single framework that covers pretraining-only multilingual models, adapter-based variants, and LLMs adapted post-hoc with translation or code-switching data. ● Emerging frontiers: Highlights low-resource languages, cross-lingual transfer for reasoning, and multilingual evaluation as the three frontiers where the community still lacks strong benchmarks. ● Open resources: The paper accompanies a curated list of papers, datasets, and leaderboards, lowering the barrier to entry for teams launching new multilingual LLM projects. | Paper, Tweet |
| 8) The Physics of Language Models - This paper measures how many bits of factual knowledge a language model can store per parameter and finds a remarkably stable 2-bits-per-parameter ceiling, even after int8 quantization. A 7B model can therefore hold ~14B bits - more than the English Wikipedia and textbooks combined. ● Knowledge-tuple methodology: Rather than using loss or benchmarks, the authors measure storage capacity directly by training models on controlled (entity, relation, value) tuples and counting how many are retrievable. ● The 2-bits-per-parameter law: Across model sizes and architectures, trained LLMs store about 2 bits of knowledge per parameter - a ceiling that stays roughly constant under post-training int8 quantization. ● Architecture and data effects: GPT-2 with rotary embeddings matches Llama/Mistral capacity, and prepending domain names to training data significantly boosts how much the model retains, hinting at implicit domain prioritization. ● Implications for scaling: If the 2-bit ceiling is real, compute-efficient scaling should match parameter count to the amount of factual knowledge you want to store, and quantization can compress weights without losing knowledge. | Paper, Tweet |
| 9) Aligning LLMs to Quote from Pre-Training Data (Quote-Tuning) - Quote-Tuning aligns LLMs to quote verbatim from trusted pre-training sources, turning the attribution step from post-hoc fact-checking into a built-in model behavior. ● Membership inference at train time: A fast membership-inference function checks whether generated spans exist verbatim in a trusted corpus, producing a reward signal without any human annotation. ● Preference-based alignment: The authors build a synthetic preference dataset (quoted vs non-quoted outputs) and align the model with preference optimization, teaching it when to quote. ● Strong verbatim gains: Quote-Tuning achieves up to a 130% relative increase in verbatim quotes from high-quality documents while preserving response quality across tasks, domains, and model families. ● Verification advantage: Because quoted passages can be matched exactly to the source, downstream verification becomes trivial, helping regulated domains like medicine, law, and journalism. | Paper, Tweet |
| 10) The Influence Between NLP and Other Fields - This EMNLP 2023 analysis quantifies NLP's cross-disciplinary engagement using a Citation Field Diversity Index across 23 academic fields. The headline: NLP has become dramatically more insular over four decades. ● CFDI drop: The Citation Field Diversity Index fell from 0.58 in 1980 to 0.31 in 2022 - an all-time low, indicating NLP increasingly cites itself rather than drawing from neighboring fields. ● CS dominance: Over 80% of NLP citations now go to computer science, with less than 8% to linguistics and less than 3% each to mathematics and psychology. ● Growing echo chamber: The paper quantifies an explicit rise in intra-field citation alongside a decline in multidisciplinary works, painting NLP as a community that has closed off from cognate disciplines. ● Call to action: The authors argue the field needs to actively rebuild engagement with linguistics, psychology, and the social sciences if it wants research foundations broader than "scale and transformers." | Paper, Tweet |
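
The Infini-attention entry above says historical key-value statistics are kept in a fixed-size matrix. A minimal sketch of that idea, written as a linear-attention-style associative memory, follows. It assumes an ELU+1 feature map and additive updates, and it leaves out the local attention and the gate that mixes the two paths; `CompressiveMemory` and its methods are hypothetical names, not the paper's code.

```python
import math


def feature_map(v):
    """ELU(x) + 1, which keeps every component strictly positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]


class CompressiveMemory:
    """Fixed-size store: a d_key x d_value matrix plus a d_key normalizer."""

    def __init__(self, d_key, d_value):
        self.M = [[0.0] * d_value for _ in range(d_key)]
        self.z = [0.0] * d_key

    def update(self, keys, values):
        """Fold a segment of (key, value) pairs into the memory."""
        for k, v in zip(keys, values):
            sk = feature_map(k)
            for i, ski in enumerate(sk):
                self.z[i] += ski
                for j, vj in enumerate(v):
                    self.M[i][j] += ski * vj

    def retrieve(self, query):
        """Read a value estimate for one query vector."""
        sq = feature_map(query)
        denom = sum(a * b for a, b in zip(sq, self.z))
        d_value = len(self.M[0])
        if denom == 0.0:  # nothing stored yet
            return [0.0] * d_value
        return [sum(sq[i] * self.M[i][j] for i in range(len(sq))) / denom
                for j in range(d_value)]


memory = CompressiveMemory(d_key=2, d_value=2)
memory.update([[0.5, -1.0]], [[3.0, 4.0]])
retrieved = memory.retrieve([1.0, 0.2])  # with one stored pair, this recovers its value

# Stream many more segments: the memory never changes shape.
for step in range(100):
    memory.update([[0.01 * step, -0.02 * step]], [[1.0, 0.0]])
memory_shape = (len(memory.M), len(memory.M[0]), len(memory.z))
```

However many segments are folded in, storage stays at `d_key * d_value + d_key` numbers, which is the bounded-memory property the entry highlights.
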
| Paper | Links |
|---|---|
| 1) Many-shot Jailbreaking - proposes a jailbreaking technique called many-shot jailbreaking to evade the safety guardrails of LLMs; this jailbreaking technique exploits the longer context window supported by many modern LLMs; it includes a very large number of faux dialogues (~256) preceding the final question which effectively steers the model to produce harmful responses. | Paper, Tweet |
| 2) SWE-Agent - a new open-source agentic system that can automatically solve GitHub issues with accuracy similar to Devin's on SWE-bench; the agent interacts with a specialized terminal that lets it process files and execute tests to achieve good performance; on SWE-bench, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance on the full test set. | Paper, Tweet |
| 3) Mixture-of-Depths - demonstrates that transformer models can learn to efficiently and dynamically allocate FLOPs to specific positions in a sequence; this helps to optimize the allocation along the sequence for different layers across model depth; findings suggest that for a given FLOP budget models can be trained to perform faster and better than their baseline counterparts. | Paper, Tweet |
| 4) Long-context LLMs Struggle with Long In-Context Learning - evaluates 13 long-context LLMs on long in-context learning and finds that they perform relatively well at context lengths under 20K tokens; once the context exceeds 20K tokens, the performance of most LLMs except GPT-4 drops dramatically. | Paper, Tweet |
| 5) Visualization-of-Thought - inspired by the human cognitive capacity to imagine unseen worlds, this new work proposes Visualization-of-Thought (VoT) prompting to elicit spatial reasoning in LLMs; VoT enables LLMs to "visualize" their reasoning traces, creating internal mental images that help to guide subsequent reasoning steps; when tested on multi-hop spatial reasoning tasks like visual tiling and visual navigation, VoT outperforms existing multimodal LLMs. | Paper, Tweet |
| 6) The Unreasonable Ineffectiveness of the Deeper Layers - finds that a simple layer-pruning strategy applied to popular open-weight pretrained LLMs shows minimal performance degradation until after a large fraction (up to half) of the layers are removed; using a layer similarity mechanism, optimal blocks are identified and pruned, followed by a small amount of fine-tuning to heal the damage. | Paper, Tweet |
| 7) JetMoE - an 8B model trained for less than $0.1 million that nonetheless outperforms LLaMA2-7B; shows that LLM training can be much cheaper than generally thought; JetMoE-8B has 24 blocks where each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE); each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token, giving 2.2B active parameters. | Paper, Tweet |
| 8) Representation Finetuning for LMs - proposes a method for representation fine-tuning (ReFT) that operates on a frozen base model and learns task-specific interventions on hidden representations; in other words, by manipulating a small fraction of model representations it is possible to effectively steer model behavior to achieve better downstream performance at inference time; also proposes LoReFT as a drop-in replacement for PEFTs that is 10-50x more parameter efficient. | Paper, Tweet |
| 9) Advancing LLM Reasoning - proposes a suite of LLMs (Eurus) optimized for reasoning and achieving SoTA among open-source models on tasks such as mathematics and code generation; Eurus-70B outperforms GPT-3.5 Turbo in reasoning largely due to a newly curated, high-quality alignment dataset designed for complex reasoning tasks; the data includes instructions with a preference tree consisting of reasoning chains, multi-turn interactions, and pairwise data for preference learning. | Paper, Tweet |
| 10) Training LLMs over Neurally Compressed Text - explores training LLMs with neural text compressors; the proposed compression technique segments text into blocks that each compress to the same bit length; the approach improves at scale and outperforms byte-level baselines on both perplexity and inference speed benchmarks; latency is reduced thanks to the shorter sequence length. | Paper, Tweet |
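
The Mixture-of-Depths entry above describes models that learn which sequence positions should receive compute at a given layer. One way to picture this is a per-layer router that sends only the highest-scoring tokens through the block while the rest pass through unchanged on the residual path. The sketch below assumes that top-k formulation with a fixed capacity; the function name, the scores, and the toy block are hypothetical.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Run `block` on the `capacity` highest-scoring tokens only.

    Selected tokens get h + score * block(h); all other tokens are
    copied through unchanged, so they cost no block FLOPs.
    """
    ranked = sorted(range(len(tokens)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    output = []
    for i, h in enumerate(tokens):
        if i in chosen:
            update = block(h)
            output.append([a + router_scores[i] * b for a, b in zip(h, update)])
        else:
            output.append(list(h))
    return output


block_calls = []  # records which inputs actually reached the block


def toy_block(h):
    """Stand-in for an attention + MLP block."""
    block_calls.append(h)
    return [10.0 * v for v in h]


tokens = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.9, 0.1, 0.8, 0.2]  # made-up router outputs
layer_out = mixture_of_depths_layer(tokens, scores, toy_block, capacity=2)
```

With `capacity=2`, the block runs twice for four tokens, so this layer's block compute is halved regardless of sequence content.
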
| Paper | Links |
|---|---|
| 1) DBRX - a new 132B parameter open LLM that outperforms all the established open-source models on common benchmarks like MMLU and GSM8K; DBRX was pretrained on 12T tokens (text and code) and uses a mixture-of-experts (MoE) architecture; its inference is up to 2x faster than LLaMA2-70B and is about 40% of the size of Grok-1 in terms of both total and active parameter counts; there is also DBRX Instruct which demonstrates good performance in programming and mathematics; while DBRX is trained as a general-purpose LLM, it still surpasses CodeLLaMA-70B Instruct, a model built explicitly for code generation. | Paper, Tweet |
| 2) Grok-1.5 - xAI’s latest long-context LLM with advanced understanding, reasoning, and problem-solving capabilities; Grok-1.5 achieved a 50.6% score on the MATH benchmark and a 90% score on the GSM8K benchmark; this model can process long contexts of up to 128K tokens and demonstrates powerful retrieval capabilities. | Paper, Tweet |
| 3) SEEDS - a generative AI model based on diffusion models that shows powerful capabilities to quantify uncertainty in weather forecasting; it can generate a large ensemble conditioned on as few as one or two forecasts from an operational numerical weather prediction system. | Paper, Tweet |
| 4) LLMs for University-Level Coding Course - finds that the latest LLMs have not surpassed human proficiency in physics coding assignments; also finds that GPT-4 significantly outperforms GPT-3.5 and prompt engineering can further enhance performance. | Paper, Tweet |
| 5) Mini-Gemini - a simple framework to enhance multi-modality vision language models; specifically, visual tokens are enhanced through an additional visual encoder for high-resolution refinement without increasing the token count; achieves top performance in several zero-shot benchmarks and even surpasses the developed private models. | Paper, Tweet |
| 6) Long-form factuality in LLMs - investigates long-form factuality in open domains by generating a prompt set of questions spanning 38 topics; also proposes an LLM-based agent to perform evaluation for the task; finds that LLM agents can achieve superhuman rating performance and are reported to be 20 times cheaper than human annotations. | Paper, Tweet |
| 7) Agent Lumos - a unified framework for training open-source LLM-based agents; it consists of a modular architecture with a planning module that can learn subgoal generation and a module trained to translate subgoals into actions with tool usage. | Paper, Tweet |
| 8) AIOS - an LLM agent operating system that integrates LLMs into operating systems as a brain; it can optimize resource allocation, handle context switching, enable concurrent execution of agents, provide tool services, and even maintain access control for agents. | Paper, Tweet |
| 9) FollowIR - a dataset with an instruction evaluation benchmark and a separate set for teaching information retrieval models to follow real-world instructions; a FollowIR-7B model shows significant improvements (over 13%) after fine-tuning on the training set. | Paper, Tweet |
| 10) LLM2LLM - an iterative data augmentation strategy that leverages a teacher LLM to enhance a small seed dataset by generating additional data that can be used to effectively fine-tune models; it significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. | Paper, Tweet |
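
The LLM2LLM entry above describes an iterative loop in which a teacher LLM expands a small seed dataset. A rough sketch of such a loop follows, under the assumption that each round fine-tunes a student, finds the seed examples it still gets wrong, and asks the teacher for more data targeted at those failures. All callables (`train`, `predict`, `teacher_augment`) and the toy stand-ins are hypothetical placeholders, not the paper's code.

```python
from collections import Counter


def llm2llm(seed, train, predict, teacher_augment, rounds=3):
    """Iteratively grow training data around the student's mistakes.

    seed: list of (question, answer) pairs.
    Returns (final student, list of teacher-generated examples).
    """
    augmented = []
    student = None
    for _ in range(rounds):
        student = train(seed + augmented)  # retrain on seed plus synthetic data
        wrong = [(q, a) for q, a in seed if predict(student, q) != a]
        if not wrong:
            break  # the student now answers every seed example
        augmented.extend(teacher_augment(wrong))
    return student, augmented


# Toy stand-ins so the loop can run without any model.
def toy_train(data):
    """'Learns' an answer only after seeing the same example at least twice."""
    counts = Counter(data)
    return {q: a for (q, a), n in counts.items() if n >= 2}


def toy_predict(student, question):
    return student.get(question)


def toy_teacher(wrong_examples):
    """A real teacher would write new, similar examples; this one just copies."""
    return list(wrong_examples)


seed_data = [("2+2", "4"), ("3+5", "8")]
final_student, synthetic = llm2llm(seed_data, toy_train, toy_predict, toy_teacher)
```

Checking errors only against the seed set, rather than against the synthetic data, is a design choice made in this sketch so that the augmentation stays anchored to the original examples.
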
| Paper | Links |
|---|---|
| 1) Grok-1 - a mixture-of-experts model with 314B parameters which includes the open release of the base model weights and network architecture; the MoE model activates 25% of the weights for a given token and its pretraining cutoff date is October 2023. | Paper, Tweet |
| 2) Evolutionary Model Merge - an approach for automating foundation model development using evolution to combine open-source models; facilitates cross-domain merging where a Japanese Math LLM achieved state-of-the-art performance on Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for these tasks. | Paper, Tweet |
| 3) TacticAI - an AI-powered assistant for football tactics developed and evaluated in collaboration with domain experts from Liverpool FC; the system offers coaches a way to sample and explore alternative player setups for a corner kick routine and select the tactic with the highest predicted likelihood of success; TacticAI’s model suggestions are favored over existing tactics 90% of the time and it offers an effective corner kick retrieval system. | Paper, Tweet |
| 4) Tool Use in LLMs - provides an overview of tool use in LLMs, including a formal definition of the tool-use paradigm, scenarios where LLMs leverage tool usage, and for which tasks this approach works well; it also provides an analysis of complex tool usage and summarizes testbeds and evaluation metrics across LM tooling works. | Paper, Tweet |
| 5) Step-by-Step Comparisons Make LLMs Better Reasoners - proposes RankPrompt, a prompting method to enable LLMs to self-rank their responses without additional resources; this self-ranking approach ranks candidates through a systematic, step-by-step comparative evaluation; it seems to work well as it leverages the capabilities of LLMs to generate chains of comparisons as demonstrations; RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4 on many arithmetic and commonsense reasoning tasks. | Paper, Tweet |
| 6) LLM4Decompile - a family of open-access decompilation LLMs ranging from 1B to 33B parameters; these models are trained on 4 billion tokens of C source code and corresponding assembly code; the authors also introduce Decompile-Eval, a dataset for assessing re-compilability and re-executability for decompilation and evaluating from the perspective of program semantics; LLM4Decompile demonstrates the capability to decompile 21% of the assembly code, achieving a 50% improvement over GPT-4. | Paper, Tweet |
| 7) *Agent-FLAN* - designs data and methods to effectively fine-tune language models for agents, referred to as Agent-FLAN; this enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets; Agent-FLAN greatly alleviates the hallucination issues and consistently improves the agent capability of LLMs when scaling model sizes while generally improving the LLM. | Paper, Tweet |
| 8) LLMs Leak Proprietary Information - shows that it’s possible to learn a large amount of non-public information about an API-protected LLM using the logits; with a relatively small number of API queries, the approach estimates the embedding size of OpenAI's gpt-3.5-turbo to be about 4,096; the paper also proposes guardrails against the attacks used. | Paper, Tweet |
| 9) DROID - an open-source, large-scale robot manipulation dataset to train and build more capable and robust robotic manipulation policies; it contains 76K demonstration trajectories, collected across 564 scenes and 86 tasks; training with DROID leads to higher-performing policies and better generalization. | Paper, Tweet |
| 10) Retrieval-Augmented Fine-Tuning - combines the benefits of RAG and fine-tuning to improve a model's ability to answer questions in "open-book" in-domain settings; combining it with RAFT's CoT-style response helps to improve reasoning. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) SIMA - a generalist AI agent for 3D virtual environments that follows natural-language instructions in a broad range of 3D virtual environments and video games; SIMA is evaluated across 600 basic skills, spanning navigation, object interaction, and menu use. Language seems to be a huge factor in performance. | Paper, Tweet |
| 2) Retrieval Augmented Thoughts - shows that iteratively revising a chain of thoughts with information retrieval can significantly improve LLM reasoning and generation in long-horizon generation tasks; the key idea is that each thought step is revised with retrieved information relevant to the task query and to the current and past thought steps; Retrieval Augmented Thoughts (RAT) can be applied to different models like GPT-4 and CodeLlama-7B to improve long-horizon generation tasks (e.g., creative writing and embodied task planning); RAT is a zero-shot prompting approach and provides significant improvements over baselines that include zero-shot CoT prompting and vanilla RAG. | Paper, Tweet |
| 3) LMs Can Teach Themselves to Think Before Speaking - presents a generalization of STaR, called Quiet-STaR, to enable language models (LMs) to learn to reason in more general and scalable ways; Quiet-STaR enables LMs to generate rationales at each token to explain future text; it proposes a token-wise parallel sampling algorithm that helps improve LM predictions by efficiently generating internal thoughts; the rationale generation is improved using REINFORCE. | Paper, Tweet |
| 4) Knowledge Conflicts for LLMs - an overview of the common issue of knowledge conflict when working with LLMs; the survey paper categorizes these conflicts into context-memory, inter-context, and intra-memory conflict; it also provides insights into causes and potential ways to mitigate these knowledge conflict issues. | Paper, Tweet |
| 5) Stealing Part of a Production Language Model - presents the first model-stealing attack that extracts information from production language models like ChatGPT or PaLM-2; shows that it's possible to recover the embedding projection layer of a transformer-based model through typical API access; as an example, the entire projection matrix was extracted from the OpenAI ada and babbage models for under $20. | Paper, Tweet |
| 6) Branch-Train-MiX - proposes mixing expert LLMs into a Mixture-of-Experts LLM as a more compute-efficient approach for training LLMs; it's shown to be more efficient than training a larger generalist LLM or several separate specialized LLMs; the approach, BTX, first trains (in parallel) multiple copies of a seed LLM specialized in different domains (i.e., expert LLMs) and merges them into a single LLM using MoE feed-forward layers, followed by fine-tuning of the overall unified model. | Paper, Tweet |
| 7) LLMs Predict Neuroscience Results - proposes a benchmark, BrainBench, for evaluating the ability of LLMs to predict neuroscience results; finds that LLMs surpass experts in predicting experimental outcomes; an LLM tuned on neuroscience literature was shown to perform even better. | Paper, Tweet |
| 8) C4AI Command-R - a 35B parameter model, with a context length of 128K, optimized for use cases that include reasoning, summarization, and question answering; Command-R has the capability for multilingual generation evaluated in 10 languages and performant tool use and RAG capabilities; it has been released for research purposes. | Paper, Tweet |
| 9) Is Cosine-Similarity Really About Similarity? - studies embeddings derived from regularized linear models and derives analytically how cosine-similarity can yield arbitrary and meaningless similarities; also finds that for some linear models, the similarities are not even unique and others are controlled by regularization; the authors caution against blindly using cosine similarity and present considerations and alternatives. | Paper, Tweet |
| 10) Multimodal LLM Pre-training - provides a comprehensive overview of methods, analysis, and insights into multimodal LLM pre-training; studies different architecture components and finds that carefully mixing image-caption, interleaved image-text, and text-only data is key for state-of-the-art performance; it also proposes a family of multimodal models up to 30B parameters that achieve SOTA in pre-training metrics and show properties such as enhanced in-context learning and multi-image reasoning, which enable few-shot chain-of-thought prompting. | Paper, Tweet |
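
The cosine-similarity caveat in item 9 above can be illustrated with a small sketch. It assumes a model whose predictions depend only on dot products between two embedding tables, so rescaling the latent dimensions of one table and inversely rescaling the other leaves every prediction unchanged. The matrices `A` and `B`, the scaling vector `d`, and all helper names are made up for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matmul_t(A, B):
    """Return A @ B^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def rescale(M, d):
    """Multiply column k of M by d[k] (i.e. M @ diag(d))."""
    return [[x * s for x, s in zip(row, d)] for row in M]

# Hypothetical embedding tables: rows are items, columns are latent dimensions.
A = [[1.0, 0.0], [1.0, 1.0]]
B = [[2.0, 1.0], [0.5, 3.0]]

# Arbitrary positive rescaling of the latent dimensions (an assumption of this sketch).
d = [10.0, 0.1]
A2 = rescale(A, d)
B2 = rescale(B, [1.0 / s for s in d])

scores_before = matmul_t(A, B)    # model predictions
scores_after = matmul_t(A2, B2)   # same predictions after rescaling
cos_before = cosine(A[0], A[1])   # similarity between the two rows of A
cos_after = cosine(A2[0], A2[1])  # changes although predictions did not

print(scores_before, scores_after)
print(round(cos_before, 4), round(cos_after, 4))
```

Both embedding pairs produce the same predictions, yet the cosine similarity between the two rows of `A` differs, which is the kind of non-uniqueness the summary refers to.
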
| Paper | Links |
|---|---|
| 1) Claude 3 - consists of a family of three models (Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus); Claude 3 Opus (the strongest model) seems to outperform GPT-4 on common benchmarks like MMLU and HumanEval; Claude 3 capabilities include analysis, forecasting, content creation, code generation, and conversing in non-English languages like Spanish, Japanese, and French; 200K-token context windows are supported and can be extended to 1M tokens for select customers; the models also have strong vision capabilities for processing formats like photos, charts, and graphs; Anthropic claims these models have a more nuanced understanding of requests and make fewer refusals. | Paper, Tweet |
| 2) Robust Evaluation of Reasoning - proposes functional benchmarks for the evaluation of the reasoning capabilities of LLMs; finds a reasoning gap in current models ranging from 58.35% to 80.31%; however, the authors also report that those gaps can be reduced with more sophisticated prompting strategies. | Paper, Tweet |
| 3) GaLore - proposes a memory-efficient approach for training LLMs through low-rank projection; the training strategy allows full-parameter learning and is more memory-efficient than common low-rank adaptation methods such as LoRA; reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures. | Paper, Tweet |
| 4) Can LLMs Reason and Plan? - a position paper that discusses reasoning and planning for LLMs; here is a summary of the author's conclusion: "To summarize, nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood. What they do instead, armed with web-scale training, is a form of universal approximate retrieval, which, as I have argued, can sometimes be mistaken for reasoning capabilities". | Paper, Tweet |
| 5) RAG for AI-Generated Content - provides an overview of RAG used in different generation scenarios like code, image, and audio, including a taxonomy of RAG enhancements with reference to key papers. | Paper, Tweet |
| 6) KnowAgent - proposes an approach to enhance the planning capabilities of LLMs through explicit action knowledge; uses an action knowledge base and a knowledgeable self-learning phase to guide the model's action generation, mitigate planning hallucination, and enable continuous improvement; outperforms existing baselines and shows the potential of integrating external action knowledge to streamline planning with LLMs and solve complex planning challenges. | Paper, Tweet |
| 7) Sora Overview - a comprehensive review of Sora and some of the key developments powering this model, including limitations and opportunities of large vision models. | Paper, Tweet |
| 8) LLM for Law - introduces SaulLM-7B, a large language model for the legal domain explicitly designed for legal text comprehension and generation; presents an instructional fine-tuning method that leverages legal datasets to further enhance performance in legal tasks. | Paper, Tweet |
| 9) Design2Code - investigates the use of multimodal LLMs for converting a visual design into code implementation, which is key for automating front-end engineering; introduces a benchmark of 484 diverse real-world webpages and a set of evaluation metrics to measure the design-to-code capability; further develops a suite of multimodal prompting methods and shows their effectiveness on GPT-4V and Gemini Pro Vision; an open-source fine-tuned Design2Code model matches the performance of Gemini Pro Vision, although GPT-4V performs best on the task. | Paper, Tweet |
| 10) TripoSR - a transformer-based 3D reconstruction model for fast feed-forward 3D generation; it can produce a 3D mesh from a single image in under 0.5 seconds; improvements include better data processing, model design, and training. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Genie - a foundation model trained from internet videos and with the ability to generate a variety of action-controllable 2D worlds given an image prompt; Genie has 11B parameters and consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a scalable latent action model; the latent action space enables training agents to imitate behaviors from unseen videos, which is promising for building more generalist agents. | Paper, Tweet |
| 2) Mistral Large - a new LLM with strong multilingual, reasoning, maths, and code generation capabilities; features include: 1) 32K tokens context window, 2) native multilingual capacities, 3) strong abilities in reasoning, knowledge, maths, and coding benchmarks, and 4) function calling and JSON format natively supported. | Paper, Tweet |
| 3) The Era of 1-bit LLMs - introduces a high-performing and cost-effective 1-bit LLM variant called BitNet b1.58 where every parameter is ternary {-1, 0, 1}; given the same model size and training tokens, BitNet b1.58 can match the perplexity and task performance of a full precision Transformer LLM (i.e., FP16); the benefits of this 1-bit LLM are significantly better latency, memory, throughput, and energy consumption. | Paper, Tweet |
| 4) Dataset for LLMs - a comprehensive overview (180+ pages) and analysis of LLM datasets. | Paper, Tweet |
| 5) LearnAct - explores open-action learning for language agents through an iterative learning strategy that creates and improves actions using Python functions; on each iteration, the proposed framework (LearnAct) expands the action space and enhances action effectiveness by revising and updating available actions based on execution feedback; the LearnAct framework was tested on Robotic planning and AlfWorld environments; it improves agent performance by 32% in AlfWorld compared to ReAct+Reflexion. | Paper, Tweet |
| 6) EMO - a new framework for generating expressive video by utilizing a direct audio-to-video synthesis approach; by leveraging an Audio2Video diffusion model it bypasses the need for intermediate 3D models or facial landmarks; EMO can produce convincing speaking videos and singing videos in various styles while outperforming existing methods in terms of expressiveness and realism. | Paper, Tweet |
| 7) On the Societal Impact of Open Foundation Models - a position paper with a focus on open foundation models and their impact, benefits, and risks; proposes a risk assessment framework for analyzing risk and explains why the marginal risk of open foundation models is low in some cases; it also offers a more grounded assessment of the societal impact of open foundation models. | Paper, Tweet |
| 8) StarCoder 2 - a family of open LLMs for code with three different sizes (3B, 7B, and 15B); the 15B model was trained on over 4 trillion tokens and 600+ programming languages with a context window of 16K tokens and employing a fill-in-the-middle objective; it matches 33B+ models on many evaluations like code completion, code reasoning, and math reasoning aided through PAL. | Paper, Tweet |
| 9) LLMs on Tabular Data - an overview of LLMs for tabular data tasks including key techniques, metrics, datasets, models, and optimization approaches; it covers limitations and unexplored ideas with insights for future research directions. | Paper, Tweet |
| 10) PlanGPT - shows how to leverage LLMs and combine multiple approaches like retrieval augmentation, fine-tuning, tool usage, and more; the proposed framework is applied to urban and spatial planning but there are a lot of insights and practical tips that apply to other domains. | Paper, Tweet |
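
The ternary weights mentioned for BitNet b1.58 in item 3 above can be sketched in a few lines. This is a minimal illustration that assumes weights are scaled by their mean absolute value and then rounded and clipped to {-1, 0, 1}; the summary does not spell out the quantization function, so the scaling choice, the function name `ternarize`, and the example matrix are assumptions.

```python
def ternarize(weights, eps=1e-8):
    """Map a weight matrix (list of rows) to values in {-1, 0, 1}.

    Assumed scheme: divide by the mean absolute weight, round to the
    nearest integer, then clip to the range [-1, 1].
    """
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes) + eps
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return quantized, scale

# Hypothetical full-precision weights.
W = [[0.9, -0.05, 0.4],
     [-1.2, 0.02, -0.3]]

W_ternary, W_scale = ternarize(W)
print(W_ternary)
```

The returned scale would be kept alongside the ternary matrix so that outputs can be rescaled after the matrix multiplication.
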
| Paper | Links |
|---|---|
| 1) Stable Diffusion 3 - a suite of image generation models ranging from 800M to 8B parameters; combines diffusion transformer architecture and flow matching for improved performance in multi-subject prompts, image quality, and spelling abilities; technical report to be published soon and linked here. | Paper, Tweet |
| 2) Gemma - a series of open models inspired by the same research and tech used for Gemini; includes 2B (trained on 2T tokens) and 7B (trained on 6T tokens) models including base and instruction-tuned versions; trained on a context length of 8192 tokens; generally outperforms Llama 2 7B and Mistral 7B. | Paper, Tweet |
| 3) LLMs for Data Annotation - an overview and a good list of references that apply LLMs for data annotation; includes a taxonomy of methods that employ LLMs for data annotation; covers three aspects: LLM-based data annotation, assessing LLM-generated annotations, and learning with LLM-generated annotations. | Paper, Tweet |
| 4) GRIT - presents generative representational instruction tuning where an LLM is trained to perform both generative and embedding tasks and designed to distinguish between them via the instructions; produces new state-of-the-art on MTEB and the unification is reported to speed up RAG by 60% for long documents. | Paper, Tweet |
| 5) LoRA+ - proposes LoRA+ which improves performance and finetuning speed (up to ∼2x speedup), at the same computational cost as LoRA; the key difference between LoRA and LoRA+ is how the learning rate is set; LoRA+ sets different learning rates for the two LoRA adapter matrices, while in LoRA the learning rate is the same for both. | Paper, Tweet |
| 6) Revisiting REINFORCE in RLHF - shows that many components of PPO are unnecessary in an RLHF context; it also shows that a simpler REINFORCE variant outperforms both PPO and newly proposed alternatives such as DPO and RAFT; overall, it shows that online RL optimization can be beneficial and low cost. | Paper, Tweet |
| 7) Recurrent Memory Finds What LLMs Miss - explores the capability of transformer-based models in extremely long context processing; finds that both GPT-4 and RAG performance heavily rely on the first 25% of the input, which means there is room for improved context processing mechanisms; reports that recurrent memory augmentation of transformer models achieves superior performance on documents of up to 10 million tokens. | Paper, Tweet |
| 8) When is Tree Search Useful for LLM Planning - investigates how LLMs solve multi-step problems through a framework consisting of a generator, discriminator, and planning method (e.g., iterative correction and tree search); reports that planning methods demand discriminators with at least 90% accuracy but current LLMs don’t demonstrate these discrimination capabilities; finds that tree search is at least 10 to 20 times slower, which makes it impractical for real-world applications regardless of its good performance. | Paper, Tweet |
| 9) CoT Reasoning without Prompting - proposes a chain-of-thought (CoT) decoding method to elicit the reasoning capabilities from pre-trained LLMs without explicit prompting; claims to significantly enhance a model’s reasoning capabilities over greedy decoding across reasoning benchmarks; finds that the model's confidence in its final answer increases when CoT is present in its decoding path. | Paper, Tweet |
| 10) OpenCodeInterpreter - a family of open-source systems for generating, executing, and iteratively refining code; proposes a dataset of 68K multi-turn interactions; integrates execution and human feedback for dynamic code refinement and produces high performance on benchmarks like HumanEval and EvalPlus. | Paper, Tweet |
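
The learning-rate split described for LoRA+ in item 5 above can be sketched as a helper that builds per-parameter-group settings in the list-of-dicts shape commonly passed to optimizers. The parameter-name markers `lora_A` and `lora_B`, the ratio of 16, and the choice to give the second matrix the larger rate are all assumptions made for illustration; the summary only states that the two adapter matrices receive different learning rates.

```python
def loraplus_param_groups(named_params, base_lr=1e-4, lr_ratio=16.0):
    """Split parameters into groups with separate learning rates.

    named_params: iterable of (name, parameter) pairs.
    Parameters whose name contains "lora_A" use base_lr; those containing
    "lora_B" use base_lr * lr_ratio (hypothetical naming and ratio).
    """
    group_a, group_b, others = [], [], []
    for name, param in named_params:
        if "lora_A" in name:
            group_a.append(param)
        elif "lora_B" in name:
            group_b.append(param)
        else:
            others.append(param)

    groups = [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},
    ]
    if others:
        # Any remaining trainable parameters fall back to the base rate.
        groups.append({"params": others, "lr": base_lr})
    return groups

# Stand-in parameters: plain strings instead of real tensors.
named = [
    ("layer0.lora_A.weight", "pA"),
    ("layer0.lora_B.weight", "pB"),
    ("head.bias", "pH"),
]
groups = loraplus_param_groups(named)
for g in groups:
    print(g["lr"], g["params"])
```

With plain LoRA, the same helper would be called with `lr_ratio=1.0`, which collapses the two adapter groups to a single learning rate.
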
| Paper | Links |
|---|---|
| 1) Sora - a text-to-video AI model that can create videos of up to a minute of realistic and imaginative scenes given text instructions; it can generate complex scenes with multiple characters, different motion types, and backgrounds, and understand how they relate to each other; other capabilities include creating multiple shots within a single video with persistence across characters and visual style. | Paper, Tweet |
| 2) Gemini 1.5 - a compute-efficient multimodal mixture-of-experts model that focuses on capabilities such as recalling and reasoning over long-form content; it can reason over long documents potentially containing millions of tokens, including hours of video and audio; improves the state-of-the-art performance in long-document QA, long-video QA, and long-context ASR. Gemini 1.5 Pro matches or outperforms Gemini 1.0 Ultra across standard benchmarks and achieves near-perfect retrieval (>99%) up to at least 10 million tokens, a significant advancement compared to other long-context LLMs. | Paper, Tweet |
| 3) V-JEPA - a collection of vision models trained on a feature prediction objective using 2 million videos; relies on self-supervised learning and doesn’t use pretrained image encoders, text, negative examples, reconstruction, or other supervision sources; claims to achieve versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model’s parameters. | Paper, Tweet |
| 4) Large World Model - a general-purpose 1M context multimodal model trained on long videos and books using RingAttention; sets new benchmarks in difficult retrieval tasks and long video understanding; uses masked sequence packing for mixing different sequence lengths, loss weighting, and model-generated QA dataset for long sequence chat; open-sources a family of 7B parameter models that can process long text and videos of over 1M tokens. | Paper, Tweet |
| 5) The boundary of neural network trainability is fractal - finds that the boundary between trainable and untrainable neural network hyperparameter configurations is fractal; observes fractal hyperparameter landscapes for every neural network configuration tested, including deep linear networks; also observes that the best-performing hyperparameters are at the edge of stability. | Paper, Tweet |
| 6) OS-Copilot - a framework to build generalist computer agents that interface with key elements of an operating system like Linux or MacOS; it also proposes a self-improving embodied agent for automating general computer tasks; this agent outperforms the previous methods by 35% on the general AI assistants (GAIA) benchmark. | Paper, Tweet |
| 7) TestGen-LLM - uses LLMs to automatically improve existing human-written tests; reports that after an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases were built correctly, 57% passed reliably, and 25% increased coverage. | Paper, Tweet |
| 8) ChemLLM - a dedicated LLM trained for chemistry-related tasks; claims to outperform GPT-3.5 on principal tasks such as name conversion, molecular caption, and reaction prediction; it also surpasses GPT-4 on two of these tasks. | Paper, Tweet |
| 9) Survey of LLMs - reviews three popular families of LLMs (GPT, Llama, PaLM), their characteristics, contributions, and limitations; includes a summary of capabilities and techniques developed to build and augment LLMs; it also discusses popular datasets for LLM training, fine-tuning, and evaluation, and LLM evaluation metrics; concludes with open challenges and future research directions. | Paper, Tweet |
| 10) LLM Agents can Hack - shows that LLM agents can automatically hack websites and perform tasks like SQL injections without human feedback or explicit knowledge about the vulnerability beforehand; this is enabled by an LLM’s tool usage and long context capabilities; shows that GPT-4 is capable of such hacks, including finding vulnerabilities in websites in the wild; open-source models did not show the same capabilities. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Grandmaster-Level Chess Without Search - trains a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games with up to 15 billion data points; reaches a Lichess blitz Elo of 2895 against humans, and solves a series of challenging chess puzzles; it shows the potential of training at scale for chess and without the need for any domain-specific tweaks or explicit search algorithms. | Paper, Tweet |
| 2) AnyTool - an LLM-based agent that can utilize 16K APIs from Rapid API; proposes a simple framework consisting of 1) a hierarchical API-retriever to identify relevant API candidates to a query, 2) a solver to resolve user queries, and 3) a self-reflection mechanism to reactivate AnyTool if the initial solution is impracticable; this tool leverages the function calling capability of GPT-4 so no further training is needed; the hierarchical API-retriever is inspired by a divide-and-conquer approach to help reduce the search scope of the agents which leads to overcoming limitations around context length in LLMs; the self-reflection component helps with resolving easy and complex queries efficiently. | Paper, Tweet |
| 3) A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention - investigates and expands the theoretical understanding of learning with attention layers by exploring the interplay between positional and semantic attention; it employs a toy model of dot-product attention and identifies an emergent phase transition between semantic and positional learning; shows that if provided with sufficient data, a dot-product attention layer outperforms a linear positional baseline when using the semantic mechanism. | Paper, Tweet |
| 4) Indirect Reasoning with LLMs - proposes an indirect reasoning method to strengthen the reasoning power of LLMs; it employs the logic of contrapositives and contradictions to tackle indirect reasoning (IR) tasks such as factual reasoning and mathematical proof; it consists of two key steps: 1) enhance the comprehensibility of LLMs by augmenting data and rules (i.e., the logical equivalence of contrapositive), and 2) design prompt templates to stimulate LLMs to implement indirect reasoning based on proof by contradiction; experiments on LLMs like GPT-3.5-turbo and Gemini Pro show that the proposed method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43% compared to traditional direct reasoning methods. | Paper, Tweet |
| 5) ALOHA 2 - a low-cost system for bimanual teleoperation that improves the performance, user-friendliness, and durability of ALOHA; efforts include hardware improvements such as grippers and gravity compensation with a higher quality simulation model; this potentially enables large-scale data collection on more complex tasks to help advance research in robot learning. | Paper, Tweet |
| 6) More Agents is All You Need - presents a study on the scaling property of raw agents instantiated by LLMs; finds that performance scales when increasing agents by simply using a sampling-and-voting method. | Paper, Tweet |
| 7) Self-Discovered Reasoning Structures - proposes a new framework, Self-Discover, that enables LLMs to select from multiple reasoning techniques (e.g., critical thinking and thinking step-by-step) to compose task-specific reasoning strategies; outperforms CoT (applied to GPT-4 and PaLM 2) on BigBench-Hard experiments and requires 10-40x less inference compute than other inference-intensive methods such as CoT-Self-Consistency; the self-discovered reasoning structures are also reported to transfer well between LLMs and small language models (SLMs). | Paper, Tweet |
| 8) DeepSeekMath - continues pretraining a code base model with 120B math-related tokens; introduces GRPO (a variant of PPO) to enhance mathematical reasoning and reduce training resources via a memory usage optimization scheme; DeepSeekMath 7B achieves 51.7% on MATH which approaches the performance level of Gemini-Ultra (53.2%) and GPT-4 (52.9%); when self-consistency is used the performance improves to 60.9%. | Paper, Tweet |
| 9) LLMs for Table Processing - provides an overview of LLMs for table processing, including methods, benchmarks, prompting techniques, and much more. | Paper, Tweet |
| 10) LLM-based Multi-Agents - discusses the essential aspects of LLM-based multi-agent systems; it includes a summary of recent applications for problem-solving and world simulation; it also discusses datasets, benchmarks, challenges, and future opportunities to encourage further research and development from researchers and practitioners. | Paper, Tweet |
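
The sampling-and-voting idea from item 6 above (and the related self-consistency setting mentioned in item 8) can be sketched as a plain majority vote over repeated samples. The function name `sample_and_vote`, the stub `fake_agent`, and its canned answers are hypothetical stand-ins; a real setup would call an LLM and would likely normalize answers before counting.

```python
from collections import Counter

def sample_and_vote(sample_fn, query, n_samples=5):
    """Query the same agent n_samples times and return the majority answer.

    sample_fn: callable that takes the query and returns one answer string.
    Returns (winning_answer, fraction_of_samples_that_agreed).
    """
    answers = [sample_fn(query) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Deterministic stand-in for a stochastic LLM agent.
canned = iter(["42", "41", "42", "42", "40"])

def fake_agent(query):
    return next(canned)

answer, agreement = sample_and_vote(fake_agent, "What is 6 * 7?", n_samples=5)
print(answer, agreement)
```

Increasing `n_samples` corresponds to adding more agents in the summary's terms; the vote only helps when the correct answer is the most frequent one among the samples.
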
| Paper | Links |
|---|---|
| 1) OLMo - introduces Open Language Model (OLMo), a 7B parameter model; it includes open training code, open data, full model weights, evaluation code, and fine-tuning code; it shows strong performance on many generative tasks; there is also a smaller version of it, OLMo 1B. | Paper, Tweet |
| 2) Advances in Multimodal LLMs - a comprehensive survey outlining design formulations for model architecture and training pipeline around multimodal large language models. | Paper, Tweet |
| 3) Corrective RAG - proposes Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation in a RAG system; the core idea is to implement a self-correct component for the retriever and improve the utilization of retrieved documents for augmenting generation; the retrieval evaluator helps to assess the overall quality of retrieved documents given a query; using web search and optimized knowledge utilization operations can improve automatic self-correction and efficient utilization of retrieved documents. | Paper, Tweet |
| 4) LLMs for Mathematical Reasoning - introduces an overview of research developments in LLMs for mathematical reasoning; discusses advancements, capabilities, limitations, and applications to inspire ongoing research on LLMs for Mathematics. | Paper, Tweet |
| 5) Compression Algorithms for LLMs - covers compression algorithms like pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. | Paper, Tweet |
| 6) MoE-LLaVA - employs Mixture of Experts tuning for Large Vision-Language Models, which constructs a sparse model with a large number of parameters but a constant computational cost; this approach also helps to address performance degradation associated with multi-modal learning and model sparsity. | Paper, Tweet |
| 7) Rephrasing the Web - uses an off-the-shelf instruction-tuned model prompted to paraphrase web documents in specific styles and formats such as “like Wikipedia” or “question-answer format” to jointly pre-train LLMs on real and synthetic rephrases; it speeds up pre-training by ~3x, improves perplexity, and improves zero-shot question answering accuracy on many tasks. | Paper, Tweet |
| 8) Redefining Retrieval in RAG - a study that focuses on the components needed to improve the retrieval component of a RAG system; confirms that relevant information should be placed near the query, since the model struggles to attend to it otherwise; surprisingly, it finds that related documents don't necessarily lead to improved performance for the RAG system; even more unexpectedly, irrelevant and noisy documents can help drive up accuracy if placed correctly. | Paper, Tweet |
| 9) Hallucination in LVLMs - discusses hallucination issues and techniques to mitigate hallucination in Large Vision-Language Models (LVLM); it introduces LVLM hallucination evaluation methods and benchmarks; provides tips and a good analysis of the causes of LVLM hallucinations and potential ways to mitigate them. | Paper, Tweet |
| 10) SliceGPT - a new LLM compression technique that proposes a post-training sparsification scheme that replaces each weight matrix with a smaller dense matrix; helps reduce the embedding dimension of the network and can remove up to 25% of model parameters for Llama2-70B and Phi-2 models while retaining most of the zero-shot performance of the dense models. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Depth Anything - a robust monocular depth estimation solution that can deal with any image under any circumstances; automatically annotates large-scale unlabeled data (~62M) which helps to reduce generalization error; proposes effective strategies to leverage the power of the large-scale unlabeled data; besides generalization ability, it establishes a new state of the art through fine-tuning and even yields an enhanced depth-conditioned ControlNet. | Paper, Tweet |
| 2) Knowledge Fusion of LLMs - proposes FuseLLM with the core idea of externalizing knowledge from multiple LLMs and transferring their capabilities to a target LLM; leverages the generative distributions of source LLMs to externalize both their collective knowledge and individual strengths and transfer them to the target LLM through continual training; finds that FuseLLM can improve the performance of the target model across a range of capabilities such as reasoning, common sense, and code generation. | Paper, Tweet |
| 3) MambaByte - adapts the Mamba SSM to learn directly from raw bytes; bytes lead to longer sequences, on which autoregressive Transformers scale poorly; this work reports substantially faster inference and even outperforms subword Transformers. | Paper, Tweet |
| 4) Diffuse to Choose - a diffusion-based image-conditioned inpainting model that balances fast inference with high fidelity while enabling accurate semantic manipulations in a given scene content; outperforms existing zero-shot diffusion inpainting methods and even few-shot diffusion personalization algorithms such as DreamPaint. | Paper, Tweet |
| 5) WARM - introduces weight averaged reward models (WARM), which involve fine-tuning multiple reward models and then averaging them in the weight space; weight averaging improves efficiency compared to traditional prediction ensembling; it improves the quality and alignment of LLM predictions. | Paper, Tweet |
| 6) Resource-efficient LLMs & Multimodal Models - a survey of resource-efficient LLMs and multimodal foundation models; provides a comprehensive analysis and insights into ML efficiency research, including architectures, algorithms, and practical system designs and implementations. | Paper, Tweet |
| 7) Red Teaming Visual Language Models - first presents a red teaming dataset of 10 subtasks (e.g., image misleading, multi-modal jailbreaking, face fairness, etc.); finds that 10 prominent open-source VLMs struggle with red teaming to different degrees and have up to a 31% performance gap with GPT-4V; also applies red teaming alignment to LLaVA-v1.5 with SFT using the proposed red teaming dataset, which improves model performance by 10% on the test set. | Paper, Tweet |
| 8) Lumiere - a text-to-video space-time diffusion model for synthesizing videos with realistic and coherent motion; introduces a Space-Time U-Net architecture to generate the entire temporal duration of a video at once via a single pass; achieves state-of-the-art text-to-video generation results and supports a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation. | Paper, Tweet |
| 9) Medusa - a simple framework for LLM inference acceleration using multiple decoding heads that predict multiple subsequent tokens in parallel; parallelization substantially reduces the number of decoding steps; it can achieve over 2.2x speedup without compromising generation quality, while Medusa-2 further improves the speedup to 2.3-3.6x. | Paper, Tweet |
| 10) AgentBoard - a comprehensive benchmark with an open-source evaluation framework to perform analytical evaluation of LLM agents; helps to assess the capabilities and limitations of LLM agents and demystifies agent behaviors, which helps in building stronger and more robust LLM agents. | Paper, Tweet |
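
The weight-space averaging described in the WARM entry above can be illustrated with a toy sketch. It assumes each fine-tuned reward model is just a dictionary mapping parameter names to lists of floats; the names `average_weights` and `linear_reward`, and the toy numbers, are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Average several models parameter by parameter (uniform weighting)."""
    if not models:
        raise ValueError("need at least one model")
    averaged: Weights = {}
    for name in models[0]:
        # zip(*) lines up the i-th entry of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


def linear_reward(weights: Weights, features: List[float]) -> float:
    """Toy linear reward head: w . x + b (an illustrative stand-in)."""
    return sum(w * x for w, x in zip(weights["w"], features)) + weights["b"][0]


# Two hypothetical reward models fine-tuned from the same starting point.
reward_models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
merged = average_weights(reward_models)

# One forward pass through the merged model ...
merged_score = linear_reward(merged, [1.0, 1.0])
# ... versus one forward pass per model for a prediction ensemble.
ensemble_score = sum(linear_reward(m, [1.0, 1.0]) for m in reward_models) / 2
```

For this linear toy the merged model and the prediction ensemble give the same score; for deep networks the two generally differ. The efficiency argument is that the merged model needs a single forward pass regardless of how many reward models were averaged.
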
| Paper | Links |
|---|---|
| 1) AlphaGeometry - an AI system that acts as a theorem prover that can solve Olympiad geometry problems without human demonstrations; this system is trained on synthetic data involving millions of theorems and proofs across different levels of complexity; the data is used to train a neural language model that can solve olympiad-level problems and approaches the performance of an average International Mathematical Olympiad (IMO) gold medallist. | Paper, Tweet |
| 2) AlphaCodium - a code-oriented iterative flow that improves LLMs on code generation; it involves two key steps to improve code generation capabilities in LLMs: i) additional generated data (problem self-reflection and test reasoning) to aid the iterative process, and ii) enriching public tests using additional AI-generated tests; using the CodeContests validation dataset, GPT-4 pass@5 accuracy increased from 19% using a single well-crafted prompt to 44% using the AlphaCodium flow; it even outperforms AlphaCode using a significantly smaller computation budget and 4 orders of magnitude fewer LLM calls. | Paper, Tweet |
| 3) RAG vs. Finetuning - report discussing the tradeoff between RAG and fine-tuning when using LLMs like Llama 2 and GPT-4; performs a detailed analysis and highlights insights when applying the pipelines on an agricultural dataset; observes that there is an accuracy increase of over 6 p.p. when fine-tuning the model and this is cumulative with RAG, which increases accuracy by 5 p.p. further. | Paper, Tweet |
| 4) Self-Rewarding Models - proposes a self-alignment method that uses the model itself for LLM-as-a-Judge prompting to provide its own rewards during training; Iterative DPO is used for instruction-following training using the preference pairs built from the generated data, which comes from a self-instruction creation phase; using this approach, fine-tuning a Llama 2 70B model for three iterations can lead to a model that outperforms LLMs like Claude 2 and Gemini Pro on the AlpacaEval 2.0 leaderboard. | Paper, Tweet |
| 5) Tuning Language Models by Proxy - introduces proxy-tuning, a decoding-time algorithm that shifts the logits of a target LLM by the difference between the logits of a small fine-tuned model and those of its untuned base version; this can enable a larger base model to approach the performance of a directly fine-tuned version of itself; proxy-tuning is applied to Llama2-70B using proxies of only 7B size to close 88% of the gap between Llama2-70B and its tuned chat version. | Paper, Tweet |
| 6) Reasoning with Reinforced Fine-Tuning - proposes an approach, ReFT, to enhance the generalizability of LLMs for reasoning; it starts with applying SFT and then applies online RL for further refinement while automatically sampling reasoning paths to learn from; this differs from RLHF in that it doesn’t utilize a reward model learned from human-labeled data; ReFT demonstrates improved performance and generalization abilities on math problem-solving. | Paper, Tweet |
| 7) Overview of LLMs for Evaluation - thoroughly surveys methodologies for LLM-based evaluation and explores their strengths and limitations; provides a taxonomy of different approaches involving prompt engineering or calibrating open-source LLMs for evaluation. | Paper, Tweet |
| 8) Patchscopes - proposes a framework that leverages a model itself to explain its internal representations; it decodes information from LLM hidden representations which is possible by “patching” representations into a separate inference pass that encourages the extraction of that information; it can be used to answer questions about an LLM’s computation and can even be used to fix latent multi-hop reasoning errors. | Paper, Tweet |
| 9) The Unreasonable Effectiveness of Easy Training Data for Hard Tasks - suggests that language models often generalize well from easy to hard data, i.e., easy-to-hard generalization; it argues that it can be better to train on easy data as opposed to hard data, even when the emphasis is on improving performance on hard data, and suggests that the scalable oversight problem may be easier than previously thought. | Paper, Tweet |
| 10) MoE-Mamba - an approach to efficiently scale LLMs by combining state space models (SSMs) with Mixture of Experts (MoE); MoE-Mamba outperforms both Mamba and Transformer-MoE; it reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer. | Paper, Tweet |
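
The logit arithmetic behind the proxy-tuning entry above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a made-up three-token vocabulary, invented logit values, and hypothetical helper names (`proxy_tuned_logits`, `softmax`); it is not the authors' implementation.

```python
import math
from typing import List


def proxy_tuned_logits(
    target: List[float], tuned_small: List[float], base_small: List[float]
) -> List[float]:
    """Shift the large model's logits by (small tuned - small base)."""
    if not (len(target) == len(tuned_small) == len(base_small)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    return [t + (e - a) for t, e, a in zip(target, tuned_small, base_small)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented next-token logits over a three-token vocabulary.
target_base = [2.0, 1.0, 0.0]  # large untuned model prefers token 0
small_tuned = [0.0, 2.0, 0.0]  # small tuned model prefers token 1
small_base = [0.0, 0.0, 0.0]   # small untuned model has no preference

shifted = proxy_tuned_logits(target_base, small_tuned, small_base)
probs = softmax(shifted)
```

In this toy case the small pair's preference for token 1 is transferred to the large model, whose own weights are never updated; only the decoding-time distribution changes.
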
| Paper | Links |
|---|---|
| 1) InseRF - a method for text-driven generative object insertion in neural 3D scenes; it enables users to provide textual descriptions and a 2D bounding box in a reference viewpoint to generate new objects in 3D scenes; InseRF is also capable of controllable and 3D-consistent object insertion without requiring explicit 3D information as input. | Paper, Tweet |
| 2) Sleeper Agents - shows that LLMs can learn deceptive behavior that persists through safety training; for instance, an LLM was trained to write secure code when the prompt states one year but to insert exploitable code when it states a different year; this backdoor behavior can persist even when training LLMs with techniques like reinforcement learning and adversarial training. | Paper, Tweet |
| 3) Blending Is All You Need - shows that effectively combining existing small models of different sizes (6B/13B parameters) can result in systems that can compete with ChatGPT-level performance; the goal is to build a collaborative conversational system that can effectively leverage these models to improve engagement and quality of chat AIs and generate more diverse responses. | Paper, Tweet |
| 4) MagicVideo-V2 - proposes an end-to-end video generation pipeline that integrates a text-to-image model, video motion generator, reference image embedding module, and frame interpolation module; it can generate high-resolution video with advanced fidelity and smoothness compared to other leading and popular text-to-video systems. | Paper, Tweet |
| 5) Trustworthiness in LLMs - a comprehensive study (100+ pages) of trustworthiness in LLMs, discussing challenges, benchmarks, evaluation, analysis of approaches, and future directions; proposes a set of principles for trustworthy LLMs that span 8 dimensions, including a benchmark across 6 dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics); it also presents a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets; while proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, there are a few open-source models that are closing the gap. | Paper, Tweet |
| 6) Prompting LLMs for Table Understanding - a new framework, inspired by Chain-of-Thought prompting, to instruct LLMs to dynamically plan a chain of operations that transforms a complex table to reliably answer the input question; an LLM is used to iteratively generate operations, step-by-step, that will perform necessary transformations to the table (e.g., adding columns or deleting info). | Paper, Tweet |
| 7) Jailbreaking Aligned LLMs - proposes 40 persuasion techniques to systematically jailbreak LLMs; their adversarial prompts (also referred to as persuasive adversarial prompts) achieve a 92% attack success rate on aligned LLMs, like Llama 2-7B and GPT-4, without specialized optimization. | Paper, Tweet |
| 8) From LLM to Conversational Agents - proposes RAISE, an advanced architecture to enhance LLMs for conversational agents; it's inspired by the ReAct framework and integrates a dual-component memory system; it utilizes a scratchpad and retrieved examples to augment the agent's capabilities; the scratchpad serves as transient storage (akin to short-term memory) and the retrieval module operates as the agent's long-term memory; this system mirrors human short-term and long-term memory and helps to maintain context and continuity which are key in conversational systems. | Paper, Tweet |
| 9) Quantifying LLM’s Sensitivity to Spurious Features in Prompt Design - finds that widely used open-source LLMs are extremely sensitive to prompt formatting in few-shot settings; subtle changes in prompt formatting using a Llama 2 13B model can result in a performance difference of up to 76 accuracy points. | Paper, Tweet |
| 10) Adversarial Machine Learning - a comprehensive survey that covers the current state of adversarial ML with a proper taxonomy of concepts, discussions, adversarial methods, mitigation tactics, and remaining challenges. | Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Mobile ALOHA - proposes a system that learns bimanual mobile manipulation with low-cost whole-body teleoperation; it first collects high-quality demonstrations and then performs supervised behavior cloning; finds that co-training with existing ALOHA datasets increases performance on complex mobile manipulation tasks such as sautéing and serving a piece of shrimp and opening a two-door wall cabinet to store heavy cooking pots, while keeping the budget under $32K. | Paper, Tweet |
| 2) Mitigating Hallucination in LLMs - summarizes 32 techniques to mitigate hallucination in LLMs; introduces a taxonomy categorizing methods like RAG, Knowledge Retrieval, CoVe, and more; provides tips on how to apply these methods and highlights the challenges and limitations inherent in them. | Paper, Tweet |
| 3) Self-Play Fine-tuning - shows that without acquiring additional human-annotated data, a supervised fine-tuned LLM can be improved; inspired by self-play, it first uses the LLM to generate its training data from its previous iterations; it then refines its policy by distinguishing the self-generated responses from those obtained from human-annotated data; shows that the method can improve LLM’s performance and outperform models trained via DPO with GPT-4 preference data. | Paper, Tweet |
| 4) LLaMA Pro - proposes a post-pretraining method to improve an LLM’s knowledge without catastrophic forgetting; it achieves this by tuning expanded identity blocks using only the new corpus while freezing the inherited blocks; uses math and code data to train a LLaMA Pro-8.3B initialized from Llama2-7B; these models achieve advanced performance on various benchmarks compared to base models while preserving the original general capabilities. | Paper, Tweet |
| 5) LLM Augmented LLMs - explores composing existing foundation models with specific models to expand capabilities; introduces cross-attention between models to compose representations that enable new capabilities; as an example, a PaLM2-S model was augmented with a smaller model trained on low-resource languages to improve translation into English and arithmetic reasoning for low-resource languages; this was also done with a code-specific model which led to a 40% improvement over the base code model on code generation and explanation tasks. | Paper, Tweet |
| 6) Fast Inference of Mixture-of-Experts - achieves efficient inference of Mixtral-8x7B models through offloading; it applies separate quantization for attention layers and experts to fit the model in combined GPU and CPU memory; designs an MoE-specific offloading strategy that enables running Mixtral-8x7B on desktop hardware and free-tier Google Colab instances. | Paper, Tweet |
| 7) GPT-4V is a Generalist Web Agent - explores the potential of GPT-4V as a generalist web agent, in particular whether such a model can follow natural language instructions to complete tasks on a website; the authors first developed a tool to enable web agents to run on live websites; findings suggest that GPT-4V can complete 50% of tasks on live websites, provided its textual plans are manually grounded into actions on the websites. | Paper, Tweet |
| 8) DocLLM - a lightweight extension to traditional LLMs for reasoning over visual documents; focuses on using bounding box information to incorporate spatial layout structure; proposes a pre-training objective that addresses irregular layout and heterogeneous content present in visual documents; it is then fine-tuned on an instruction dataset and demonstrates SoTA performance on 14 out of 16 datasets across several document intelligence tasks. | Paper, Tweet |
| 9) How Code Empowers LLMs - a comprehensive overview of the benefits of training LLMs with code-specific data. Some capabilities include enhanced code generation, enabling reasoning, function calling, automated self-improvements, and serving intelligent agents. | Paper, Tweet |
| 10) Instruct-Imagen - proposes an image generation model that tackles heterogeneous image generation tasks and generalizes across unseen tasks; it first enhances the model’s ability to ground its generation on external multimodal context and then fine-tunes on image generation tasks with multimodal instructions. | Paper, Tweet |