This page collects every weekly issue of AI Papers of the Week from 2023. For other years, see the main index.
| Paper | Links |
|---|---|
| 1) CogAgent - Tsinghua's CogAgent is an 18B-parameter visual-language model purpose-built for GUI understanding and navigation, with unusually high input resolution. ● High-res GUI input: Supports 1120x1120 input resolution via a dedicated high-res cross-module, letting it read small fonts and dense UI elements that typical VLMs blur out. ● Dual-tower vision: Combines a low-res general vision encoder with a high-res cross-module, balancing context understanding with fine-grained icon/text perception. ● Broad capabilities: Handles visual Q&A, visual grounding, and end-to-end GUI agent tasks on web and desktop, positioning as a general GUI backbone. ● SoTA VQA: Achieves state-of-the-art on 5 text-rich (e.g., OCR-heavy) and 4 general VQA benchmarks, covering document, chart, and scene understanding. |
Paper, Tweet |
| 2) From Gemini to Q-Star - A 300+-paper survey mapping the state of Generative AI and the research frontiers that followed the Gemini + rumored Q* news cycle. ● Broad coverage: Surveys developments across language, vision, audio, and multimodal generative systems, treating Gen AI as a unified field rather than siloed modalities. ● Computational challenges: Catalogs scalability, efficiency, and alignment challenges currently gating further progress, including training compute, inference serving, and evaluation. ● Real-world applications: Reviews Gen AI impact across healthcare, finance, and education, highlighting where genuine deployment signals diverge from hype. ● Future directions: Identifies agent frameworks, reasoning, grounded multimodality, and alignment as the most live research areas heading into 2024. |
Paper, Tweet |
| 3) PromptBench - A unified library for comprehensive evaluation and analysis of LLMs that consolidates multiple evaluation concerns under one roof. ● Prompt-construction tooling: Ships with utilities for prompt construction, prompt engineering, and dataset/model loading, covering the end-to-end LLM evaluation workflow. ● Adversarial prompt attacks: Built-in adversarial prompt-attack capabilities let users stress-test LLMs against perturbations rather than just measuring clean accuracy. ● Dynamic evaluation: Supports dynamic evaluation protocols to detect dataset contamination and measure robustness beyond static benchmark numbers. ● Unified interface: Replaces the ad-hoc evaluation scripts many teams maintain with a consistent API, reducing friction when comparing across models and prompt variants. |
Paper, Tweet |
| 4) Exploiting Novel GPT-4 APIs - A red-team study of three newer GPT-4 API surfaces - fine-tuning, function calling, and knowledge retrieval - that reveals each introduces new attack vectors. ● Fine-tuning strips safeguards: As few as 15 harmful examples - or even 100 benign examples - fine-tuned into GPT-4 is enough to remove core safety behaviors. ● Function-call schema leakage: GPT-4 Assistants can be coerced into divulging their function-call schemas and then tricked into executing arbitrary function calls. ● Retrieval hijacking: The knowledge-retrieval endpoint is vulnerable to prompt injection via documents in the retrieval corpus, letting attackers steer model behavior through uploaded content. ● Policy implication: Expanding API surface area introduces alignment risks that weren't present for text-only completions, and API providers need surface-specific defenses rather than relying on base-model alignment. |
Paper, Tweet |
| 5) Fact Recalling in LLMs - A mechanistic-interpretability study showing that early MLP layers function as a lookup table for factual recall. ● Athletes-to-sports task: Scoped to how Pythia 2.8B recalls which of 3 different sports various athletes play - a clean task for dissecting a single type of factual recall. ● Early MLPs as lookup table: Early MLP layers perform a structured lookup rather than distributed reasoning, with specific neurons keyed to entity-attribute pairs. ● Multi-token embedding view: Recommends treating factual knowledge recall as operating over multi-token embeddings rather than single-token representations. ● Interpretability payoff: Provides a concrete, testable account of where and how facts live inside transformers, enabling targeted editing and auditing of parametric memory. |
Paper, Tweet |
| 6) Generative AI for Math (OpenWebMath / MathPile) - Releases a diverse, high-quality math-centric corpus of ~9.5B tokens designed for training math-capable foundation models. ● 9.5B-token corpus: Curated from mathematical content across the web, textbooks, papers, and Q&A, rebalanced for math-specific token distribution. ● Quality filtering: Applies math-specific filtering to surface content dense in symbolic notation, proofs, and problem solutions rather than surface-level mentions of math. ● Diverse sources: Explicitly mixes proof-heavy formal math with applied problem-solving to avoid over-fitting to any single mathematical register. ● Training signal: Positioned as a drop-in pretraining or continual-pretraining corpus to lift math reasoning in existing LLMs without changing the architecture. |
Paper, Tweet |
| 7) Principled Instructions Are All You Need - Distills effective LLM prompting into 26 guiding principles and validates them across multiple model families. ● 26 principles: Covers prompt structure, audience specification, example selection, formatting, role assignment, and stepwise decomposition. ● Broad model validation: Tested on LLaMA-1/2 (7B, 13B, 70B) and GPT-3.5/4, finding the principles generalize across scales and families. ● Both small and large benefits: Smaller models benefit more from structured prompting (higher variance reduction), while larger models benefit in absolute accuracy on harder tasks. ● Practical reference: Functions as a cheat-sheet for practitioners, converting scattered prompting folklore into testable recipes. |
Paper, Tweet |
| 8) Survey of Reasoning with Foundation Models - A comprehensive survey of reasoning with foundation models, covering tasks, methods, benchmarks, and future directions. ● Task coverage: Surveys math reasoning, commonsense reasoning, logical reasoning, symbolic reasoning, and multimodal reasoning - showing how each evolves with model scale. ● Methodology catalog: Covers prompting techniques (CoT, ToT, self-consistency), fine-tuning strategies, and neurosymbolic approaches under a unified framework. ● Benchmarks: Systematizes the reasoning benchmarks landscape and flags contamination and robustness concerns specific to reasoning evaluation. ● Adjacencies: Discusses how multimodal learning, autonomous agents, and super-alignment research intersect with and extend the reasoning agenda. |
Paper, Tweet |
| 9) LLaRA - LLaRA adapts a decoder-only LLM for dense retrieval via two tailored pretext tasks that leverage text embeddings from the LLM itself. ● EBAE pretext task: Embedding-Based Auto-Encoding uses LLM embeddings to reconstruct tokens of the input sentence, aligning the embedding space with semantic content. ● EBAR pretext task: Embedding-Based Auto-Regression predicts tokens of the next sentence from the current embedding, injecting discourse-level signal into retrieval embeddings. ● LLaMA 2 7B base: A LLaMA 2-7B base model is adapted into a retriever with these pretext tasks, yielding significant gains on MSMARCO and BEIR. ● Decoder retrievers validated: Provides another data point that decoder-only LLMs, with the right adaptation, rival specialized encoder retrievers - a theme that continued through 2024. |
Paper |
| 10) Gemini vs GPT-4V - A qualitative side-by-side comparison of Gemini and GPT-4V across vision-language tasks, documenting systematic behavioral differences. ● Head-to-head cases: Evaluates both models on a curated set of tasks covering document understanding, chart reading, everyday scenes, and multi-image reasoning. ● GPT-4V style: Produces precise, succinct answers with strong preference for brevity and factual minimalism. ● Gemini style: Returns more expansive, narrative answers frequently accompanied by relevant images and links - leveraging its deeper integration with search. ● Complementary strengths: Concludes that the models are substitutable for many core VLM tasks but differ sharply on response length, multimedia, and augmentation patterns. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Gemini's Language Abilities - CMU's impartial, reproducible evaluation of Gemini Pro against GPT and Mixtral across standard LLM benchmarks. ● Reproducible methodology: Provides an open, reproducible evaluation pipeline - a response to concerns about Google's own Gemini launch benchmarks being hard to independently verify. ● Gemini Pro vs. GPT 3.5 Turbo: Gemini Pro achieves comparable but slightly lower accuracy than GPT 3.5 Turbo, countering marketing claims of broad parity on language tasks. ● Gemini & GPT beat Mixtral: Both Gemini and GPT outperform Mixtral on these benchmarks, suggesting open mixture-of-experts has not yet closed the gap to frontier proprietary models. ● Evaluation norms: Positioned as evidence that independent replications remain essential, and that first-party model reports shouldn't be the final word on comparative capability. |
Paper, Tweet |
| 2) PowerInfer - A high-speed LLM inference engine for consumer GPUs that exploits sparse neuron activation patterns to run large models on commodity hardware. ● Hot/cold neurons: Analysis shows that a small fraction of "hot" neurons activate on most inputs while the majority of "cold" neurons activate rarely - a power-law pattern across many LLMs. ● GPU-CPU hybrid: Hot neurons are preloaded onto the GPU for fast access, while cold neurons live on the CPU and are computed lazily, dramatically reducing GPU memory pressure. ● Reduced memory + transfer: This split reduces both GPU memory demand and the CPU-GPU data transfer that typically dominates hybrid inference cost. ● 11x speedup over llama.cpp: Achieves up to ~11x faster token generation than llama.cpp on a single consumer GPU for OPT-175B-class models - a step-change for local deployment. |
Paper, Tweet |
| 3) Antibiotic Discovery with Graph Deep Learning (Nature) - MIT researchers use explainable graph neural networks to discover a new structural class of antibiotics. ● Graph neural networks: Trains GNNs on molecular graphs to predict antibiotic activity, with explainability layers that surface chemical substructures driving predictions. ● Explainable discovery: Unlike black-box property predictors, the explanation module identifies substructures underlying antibiotic activity - a feature drug chemists can actually use. ● New structural class: The discovered compounds belong to a novel structural class, not a variant of existing antibiotic scaffolds - an unusually strong generalization signal. ● Real-world pipeline: Demonstrates end-to-end pipeline from GNN prediction to wet-lab validation, reinforcing explainable ML as a practical discovery tool for biomedicine. |
Paper, Tweet |
| 4) VideoPoet - Google Research's VideoPoet is a large language model for zero-shot video generation that treats video as just another token stream. ● Unified token stream: Uses multiple tokenizers to map video, image, audio, and text into a shared discrete token space for a single autoregressive model. ● Zero-shot task variety: The same model handles image-to-video, video stylization, video-to-audio, and text-to-video without task-specific fine-tuning. ● Language-model paradigm: Demonstrates that a plain autoregressive LM, given the right tokenizers, can handle video generation - challenging the diffusion-everywhere default for video. ● Temporal consistency: Produces videos with reasonable motion coherence over short durations, a meaningful milestone for LM-based video generation. |
Paper, Tweet_ |
| 5) AppAgent - Introduces an LLM-based multimodal agent that operates real smartphone apps through touch actions and screenshots. ● Multimodal control: The agent reads the phone screen (visual input) and issues low-level touch actions (tap, swipe, type), operating apps the way humans do rather than via APIs. ● Two learning modes: Learns new apps either via autonomous exploration (discovering functionality through self-play) or by observing human demonstrations. ● Cross-app generality: Demonstrates proficiency across email, social media, shopping, and creative apps, suggesting that multimodal LLMs can generalize across smartphone UIs. ● Early mobile-agent blueprint: An early example of the on-device multimodal agent pattern that would become a major 2024 deployment theme. |
Paper, Tweet_ |
| 6) LLM in a Flash - Apple researchers show how to run LLMs larger than available DRAM by streaming weights from flash storage on demand. ● Flash as swap: Stores model weights on flash and streams only the rows/columns needed per forward pass into DRAM, exploiting the sparsity of relevant parameters. ● 2x DRAM headroom: Enables running models up to 2x the size of available DRAM without catastrophic slowdown, critical for on-device deployment where memory is tight. ● Major speedups vs. naive loading: 4-5x faster on CPU and 20-25x faster on GPU compared to naive parameter loading, thanks to selective transfer and row-column bundling. ● On-device LLM groundwork: Directly enabled Apple's later on-device LLM plans by showing that flash-based streaming can make phone-scale LLM inference practical. |
Paper, Tweet_ |
| 7) ReST Meets ReAct - Proposes a ReAct-style agent that improves itself via reinforced self-training on its own reasoning traces. ● Self-critique ReAct: A ReAct-style agent with a self-critique step that evaluates its own reasoning and answers, generating a filterable trace dataset. ● ReST-style iterative RL: Uses growing-batch RL from AI feedback to iteratively fine-tune on the agent's successful reasoning traces, improving over rounds without human labels. ● Human-label-free: Minimizes human involvement; synthetic data with self-improvement from AI feedback is the primary training signal throughout. ● Distillation to small models: The improved agent can be distilled into models 1-2 orders of magnitude smaller with comparable performance, dramatically cutting inference cost. |
Paper, Tweet_ |
| 8) Adversarial Attacks on GPT-4 - Demonstrates that a trivially simple random-search procedure can jailbreak GPT-4 with high reliability. ● Adversarial suffix: Appends a suffix to a harmful request and iteratively perturbs it, keeping changes that increase the log-probability of the response starting with "Sure". ● No gradients needed: Operates purely via the API in a black-box setting, without model gradients or weights - a much lower bar than prior white-box jailbreak work. ● Strong success rate: Achieves high attack-success rates on GPT-4 with a small number of API calls, despite ongoing alignment efforts. ● Alignment implication: Shows that current safety training is still vulnerable to near-trivial optimization attacks, pointing to the need for stronger behavioral defenses. |
Paper, Tweet_ |
| 9) RAG for LLMs - A broad survey of Retrieval-Augmented Generation research, organizing the rapidly growing literature into a coherent map. ● Three-paradigm taxonomy: Organizes RAG approaches into Naive RAG, Advanced RAG (pre/post-retrieval enhancements), and Modular RAG (orchestrated component-based systems). ● Core components: Reviews retrievers, generators, and augmentation strategies separately, clarifying which design choices sit in which component. ● Evaluation and datasets: Catalogs RAG-specific benchmarks and evaluation metrics, surfacing the still-uneven state of RAG evaluation. ● Frontier directions: Highlights agentic retrieval, multimodal RAG, and long-context RAG as the key research areas driving the 2024 RAG landscape. |
Paper, Tweet_ |
| 10) BabyLLM Challenge Findings - Reports results from a challenge on sample-efficient pretraining using a developmentally plausible corpus. ● Constrained pretraining: Participants pretrain on a small, child-directed-style corpus rather than on internet-scale data, testing how efficiently models can learn from limited input. ● LTG BERT wins: The winning submission, LTG BERT, beat Llama 2 70B on 3 of 4 evaluations despite vastly less training data. ● Data preprocessing pays: Strong-performing entries relied heavily on data preprocessing and training on shorter contexts, challenging assumptions about long-context training for small data. ● Cognitive-science bridge: Provides an empirical platform connecting language-model training to developmental psycholinguistics, informing both fields. |
Paper, Tweet_ |
| Paper | Links |
|---|---|
| 1) FunSearch - DeepMind's FunSearch uses LLMs as a mutation operator in an evolutionary loop to discover genuinely new mathematical knowledge. ● LLM + evaluator loop: Combines a pretrained LLM that proposes candidate programs with a systematic evaluator that scores them, iteratively evolving low-scoring programs into high-scoring ones. ● New math discoveries: Produces novel solutions to open problems in combinatorics, including cap-set and online bin-packing, not memorized from the training data. ● Hallucination mitigation: The evaluator acts as a hard filter - only programs that actually work are kept - so LLM hallucinations don't propagate into the "discovered" knowledge. ● General recipe: Positions LLM-in-the-loop search as a general tool for scientific discovery beyond math, applicable wherever candidates can be automatically scored. |
Paper, Tweet |
| 2) Weak-to-Strong Generalization - OpenAI's superalignment team shows that weak supervisors can still elicit capabilities from much stronger models - a first empirical signal for scalable oversight. ● Weak-to-strong setup: A weak model (e.g., GPT-2) generates labels, and a strong pretrained model (e.g., GPT-4) is fine-tuned on those labels - an analog of humans supervising superhuman AI. ● Better than the supervisor: Naively fine-tuning the strong model on weak-model labels often yields a model better than the supervisor itself, demonstrating useful capability elicitation. ● ~GPT-3.5 from GPT-2 supervision: Fine-tuning GPT-4 with GPT-2-level supervision recovers close to GPT-3.5-level performance on NLP tasks - a surprising amount of capability without strong labels. ● Superalignment signal: Offers an early empirical footing for the bet that humans can align superhuman systems using their own (weaker) judgments - provided the right training recipe. |
Paper, Tweet |
| 3) Audiobox - Meta's Audiobox is a unified flow-matching audio model that generates speech, sound effects, and music from natural-language and example prompts. ● Unified audio generation: Single model handles speech, sound, and music - ending the typical pattern of one model per audio modality. ● Description + example prompting: Supports both natural-language descriptions and reference-audio examples for style control, letting users mix semantic and acoustic conditioning. ● Self-supervised infilling: Adapts a self-supervised infilling objective to pretrain on large unlabeled audio, reducing dependence on scarce labeled speech/music datasets. ● Novel voice/styles: Unlocks generation of novel vocal and acoustic styles by interpolating in the learned audio space, going beyond reproduction of training-set styles. |
Paper, Tweet |
| 4) Mathematical LLMs Survey - A survey on the progress of LLMs on mathematical reasoning tasks, covering methods, benchmarks, and open problems. ● Task taxonomy: Covers math word problem solving, symbolic reasoning, and theorem proving, showing which capabilities emerge at which model scales. ● Methods landscape: Reviews prompting techniques (CoT, PoT, ToT, self-verification) alongside fine-tuning and tool-use approaches. ● Dataset reference: Catalogs the dominant math benchmarks (GSM8K, MATH, MiniF2F, etc.) and their evaluation methodologies. ● Frontier problems: Highlights reasoning-faithfulness, formal-vs-informal math integration, and reward-model design as the key open questions. |
Paper, Tweet |
| 5) LLM360 - LLM360 is a framework for fully transparent open-source LLM development, with everything from data to training dynamics released. ● End-to-end transparency: Ships training code, the pretraining corpus, intermediate checkpoints, evaluation code, and analyses - going well beyond the "just weights" openness of earlier "open" LLMs. ● Two 7B models: Releases AMBER (general) and CRYSTALCODER (code-specialized) 7B models pretrained from scratch under the framework. ● Enables training-dynamics research: Intermediate checkpoints let researchers study loss trajectories, emergent capabilities, and data-effect ablations - typically only possible inside frontier labs. ● Standard for openness: Pushes the community's definition of "open-source LLM" from weights to a full training-pipeline standard. |
Paper, Tweet |
| 6) LLMs in Medicine - A comprehensive survey (300+ papers) of LLMs applied to medicine, from clinical tasks to biomedical research. ● Principles and applications: Covers the core principles of medical LLMs and their applications across clinical decision support, patient communication, medical education, and biomedical research. ● Benchmark coverage: Reviews medical QA benchmarks (MedQA, PubMedQA, MedMCQA, etc.) and their limitations for real clinical settings. ● Challenges: Identifies challenges specific to medicine including hallucination in clinical advice, privacy, regulatory compliance, and equity/bias concerns. ● Deployment considerations: Discusses what's required for safe deployment, including evaluation, monitoring, and the role of clinician oversight. |
Paper, Tweet |
| 7) Beyond Human Data (ReST-EM) - DeepMind's ReST-EM shows that model-generated data plus a reward function can substantially reduce dependence on human-generated data. ● Expectation-Maximization framing: Generates candidate solutions from the current model, filters using a reward/verifier, and fine-tunes on the filtered set - repeat. ● Verifiable rewards: Uses automatic verifiers (e.g., correct-answer checks) as the reward signal, sidestepping the need for a learned reward model on scarce tasks. ● PaLM 2 gains: Scales effectively on PaLM 2 for math and code tasks, outperforming standard SFT on human data at matched compute. ● Synthetic-data signal: A strong empirical case that self-generated filtered data can replace much of the human data bottleneck for reasoning tasks - a theme that grew through 2024. |
Paper, Tweet |
| 8) Gaussian-SLAM - A neural RGBD SLAM method that extends 3D Gaussian Splatting to achieve photorealistic scene reconstruction without sacrificing speed. ● 3D Gaussians for SLAM: Represents scenes as 3D Gaussians rather than neural fields, inheriting the fast training and rendering of Gaussian Splatting. ● Photorealistic reconstruction: Produces significantly higher-fidelity reconstructions than prior neural SLAM methods at comparable or better runtime. ● RGBD input: Uses standard RGB+depth input streams, making it compatible with off-the-shelf depth cameras for practical deployment. ● Speed/quality Pareto: Advances the Pareto frontier for RGBD SLAM, where previous methods forced a trade-off between runtime and photorealism. |
Paper, Tweet |
| 9) Pearl - Meta's Pearl is a production-ready reinforcement learning agent package designed for real-world deployment constraints. ● Production-oriented design: Built for real-world environments with limited observability, sparse feedback, and high stochasticity - conditions that usually break research-oriented RL libraries. ● Modular components: Offers modular policy networks, exploration strategies, offline RL, and safety constraints that can be composed for specific applications. ● Research + practice: Targets both researchers building new RL agents and practitioners deploying RL in production recommender systems, ranking, and control. ● Meta internal use: Reflects learnings from Meta's internal deployments, making it a rare RL library that starts from production pain rather than benchmark scores. |
Paper, Tweet |
| 10) QuIP# - Cornell's QuIP# is a 2-bit LLM quantization scheme that combines lattice codebooks with incoherence processing to close the quality gap to FP16. ● Lattice codebooks: Uses E8 lattice codebooks for weight quantization, a classical lattice-quantization technique adapted to LLM weight matrices. ● Incoherence processing: Pre-processes weight matrices to make them "incoherent" (less structured along axes), which improves lattice-quantization fidelity. ● 2-bit at 16-bit quality: Significantly closes the gap between 2-bit quantized LLMs and their unquantized 16-bit counterparts across a range of LLaMA-family models. ● Deployment impact: Makes large LLMs (e.g., Llama 2 70B) fit into consumer-grade GPU memory without catastrophic quality loss, expanding the set of models hobbyists can run locally. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Gemini 1.0 - Google launches Gemini 1.0, a multimodal family natively designed to reason across text, images, video, audio, and code from the ground up. ● Three tiers: Ships as Ultra (frontier), Pro (balanced), and Nano (on-device), covering everything from data-center reasoning to mobile inference. ● Native multimodality: Unlike "bolted-on" multimodal models, Gemini is trained multimodally from scratch, with joint tokenization across text, image, video, audio, and code. ● MMLU milestone: Gemini Ultra reports the first MMLU score above human-expert performance (90.0%), using chain-of-thought with uncertainty-weighted majority voting. ● Broad capability claims: Ultra sets SOTA on 30 of 32 benchmarks in the report, spanning multimodality, multilinguality, factuality, summarization, math/science, long-context, and reasoning. |
Paper, Tweet |
| 2) EfficientSAM - Meta's EfficientSAM is a lightweight Segment Anything variant that preserves most of SAM's zero-shot quality at a fraction of the compute. ● Masked autoencoder pretraining: Uses a SAMI (SAM-leveraged masked image) pretraining objective where a small student learns to reconstruct features aligned with the SAM teacher. ● 20x smaller and faster: Achieves roughly 20x fewer parameters and 20x faster runtime than the original SAM image encoder. ● Near-parity quality: 44.4 AP vs. 46.5 AP on zero-shot instance segmentation (within 2 points) despite the dramatic efficiency win. ● Deployment-ready: Makes SAM-grade segmentation feasible on commodity hardware, consumer devices, and real-time applications where the original SAM is too heavy. |
Paper, Tweet |
| 3) Magicoder - Magicoder is a fully open-source code LLM that closes the gap with top commercial code models at only 7B parameters via high-quality synthetic instruction data. ● OSS-Instruct data: Generates 75K synthetic instruction pairs by seeding GPT with snippets pulled from open-source code, producing more diverse and realistic training data than prior code SFT datasets. ● Broad coverage: Training data spans Python, multilingual programming, and data-science program completion, producing a genuinely general code model rather than a Python-only model. ● HumanEval+ win: MagicoderS-CL-7B (based on CodeLlama) surpasses ChatGPT on HumanEval+ with 66.5 vs. 65.9 pass@1, despite being 7B. ● Fully open: Ships with code, data, and weights, positioning Magicoder as a reproducible open baseline for instruction-tuned code generation. |
Paper, Tweet |
| 4) LLMs on Graphs - A comprehensive overview of the many ways LLMs can be applied to graph-structured data and when each pattern is useful. ● Three graph scenarios: Organizes the space by whether graphs are pure (no text), text-rich (nodes/edges carry natural language), or text-paired (graphs alongside documents). ● Three role taxonomies: Categorizes LLMs as predictors, enhancers, or aligners with GNNs - clarifying whether the LLM is the model, a feature source, or a supervisor. ● Task coverage: Spans node classification, link prediction, graph-level tasks, and reasoning over knowledge graphs. ● Open problems: Flags scalability to large graphs, handling of graph structure without loss, and integration with tool-augmented LLMs as the key unsolved directions. |
Paper, Tweet |
| 5) Llama Guard - Meta's Llama Guard is a compact, instruction-tuned safety classifier built on Llama 2-7B for input/output moderation in conversational AI. ● Llama 2-7B base: Small enough to run inline with a main generative model while handling both prompt- and response-level safety classification. ● Customizable taxonomy: The safety taxonomy is specified in the instruction prompt itself, so operators can adapt it to their use case without retraining. ● Zero-shot and few-shot: Works off the shelf for many taxonomies in zero- or few-shot mode, and can be fine-tuned on a specific policy dataset when needed. ● Open release: Ships as an open model, filling a gap for teams that want local, auditable safety classification rather than relying solely on API-side moderation. |
Paper, Tweet |
| 6) KTO (Kahneman-Tversky Optimization) - Contextual AI introduces KTO, an alignment objective derived from prospect theory that works with binary "good/bad" signals instead of preference pairs. ● Prospect-theory motivation: Models reward as a Kahneman-Tversky value function with loss aversion, replacing DPO's log-likelihood-of-preferences objective with utility maximization. ● No preference pairs needed: Works with unpaired good/bad signals, dramatically loosening data collection requirements compared to DPO or RLHF. ● Matches/beats DPO: Matches or exceeds DPO performance at model scales from 1B to 30B, a clean empirical win at similar training cost. ● Practical data advantage: Makes alignment much cheaper to run in production where paired preference data is rare but outcome feedback ("user liked/didn't like") is abundant. |
Paper, Tweet |
| 7) Chain of Code - DeepMind's Chain of Code extends CoT by encouraging LMs to write pseudocode that mixes real code with LM-simulated sub-routines. ● LMulator: The LM generates pseudocode programs and explicitly annotates sub-tasks that can't be executed; a "LMulator" simulates those sub-tasks with the LM while the interpreter handles the rest. ● Undefined-behavior handling: The interpreter catches undefined behavior and cleanly hands off to the LM, sidestepping the brittleness of code-first approaches that fail silently on hard ops. ● 84% on BIG-Bench Hard: Achieves 84% on BIG-Bench Hard - a 12-point gain over Chain of Thought and a clean demonstration that mixing exact execution with LM simulation beats either alone. ● Broad applicability: Works across math, logic, and commonsense reasoning, positioning Chain of Code as a general-purpose CoT upgrade. |
Paper, Tweet |
| 8) Data Management for LLMs - A survey of data-management research for LLM pretraining and supervised fine-tuning stages. ● Pretraining data: Covers data quantity, quality filtering, deduplication, domain composition, and curriculum strategies for large-scale pretraining. ● SFT data: Reviews instruction-data generation, quality filtering, diversity metrics, and the emerging literature on "less is more" for SFT. ● Domain and task composition: Examines how task mixing affects generalization vs. specialization in fine-tuning. ● Open challenges: Identifies dataset contamination, deduplication at trillion-token scale, and reproducible data recipes as the top open problems. |
Paper, Tweet |
| 9) RankZephyr - RankZephyr is an open-source LLM for listwise zero-shot reranking that bridges the effectiveness gap with GPT-4. ● Listwise zero-shot: Reranks a full candidate list in a single shot rather than doing pairwise or pointwise scoring, matching the paradigm GPT-4 uses most effectively. ● Open-source: Based on the open Zephyr chat model, releasing a fully reproducible stack for high-quality reranking. ● Matches/beats GPT-4: Competitive with GPT-4 on standard reranking benchmarks and outperforms GPT-4 on NovelEval, a post-training-cutoff benchmark resistant to contamination. ● Contamination-free win: The NovelEval advantage is particularly meaningful because it addresses the concern that GPT-4's strong reranking numbers are partly driven by memorization of benchmark queries. |
Paper, Tweet |
| 10) The Efficiency Spectrum of LLMs - A comprehensive review of algorithmic advancements for improving LLM efficiency across the full training-to-inference stack. ● Scaling laws and data: Covers how scaling laws and data-utilization strategies interact with efficiency - more isn't always better under compute constraints. ● Architectural innovations: Reviews attention variants, state-space models, MoE, and other architectural levers for efficient scaling. ● Training and tuning: Catalogs PEFT methods (LoRA, adapters, prefix tuning), quantization-aware training, and curriculum-based training strategies. ● Inference techniques: Surveys quantization, pruning, speculative decoding, KV-cache optimization, and batching as the inference-time efficiency toolkit. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) GNoME - DeepMind's Graph Networks for Materials Exploration (GNoME) is an AI system that discovered 2.2 million new crystal structures, including 380,000 thermodynamically stable ones. ● 2.2M new crystals: Dramatically expands the known crystal inventory, with 380,000 stable materials - an order-of-magnitude leap over prior computational chemistry. ● Graph networks for stability: Predicts formation energies and stability of candidate materials using graph neural networks trained on DFT-labeled data. ● Active-learning loop: Combines exploration (proposing candidate structures) with exploitation (prioritizing high-stability candidates), iteratively expanding the frontier of known materials. ● Autonomous lab validation: A subset of predictions was validated in Berkeley's autonomous materials lab, closing the prediction-to-synthesis loop for the first time at this scale. |
Paper, Tweet |
| 2) Open-Source LLMs vs. ChatGPT - A survey cataloguing tasks where open-source LLMs claim to be on par with or better than ChatGPT. ● Task-by-task audit: Organizes claims by task category (code, math, reasoning, summarization, etc.) with the specific open models and benchmarks backing each claim. ● Gap measurement: Clarifies where open-source genuinely closes the gap vs. where "comparable" actually hides meaningful performance differences. ● Critical lens: Calls out evaluation-methodology issues in specific open-source claims, including benchmark contamination, cherry-picked subsets, and inconsistent judge setups. ● 2023 snapshot: Captures where open-source LLMs stood at the end of 2023 - a useful reference point for tracking how the gap evolved through 2024. |
Paper, Tweet |
| 3) Adversarial Diffusion Distillation (SDXL Turbo) - Stability AI's ADD trains a student diffusion model that produces high-quality images in just 1-4 sampling steps. ● Score distillation + adversarial loss: Combines score-distillation from a teacher diffusion model with an adversarial loss to maintain image fidelity in the low-step regime. ● 1-4 step generation: Produces usable images in a single step and SoTA-quality images in four, compared to 25-50 steps for typical SDXL sampling. ● Matches multi-step SoTA: Achieves image quality comparable to state-of-the-art diffusion baselines at four steps, dramatically cutting inference cost. ● Real-time generation: Enables SDXL-quality images at real-time frame rates on consumer GPUs, unlocking interactive creative tooling that was previously impractical. |
Paper, Tweet |
| 4) Seamless - Meta's Seamless is a family of models for end-to-end expressive, streaming cross-lingual speech communication. ● SeamlessExpressive: Preserves the speaker's expressive characteristics (pitch, emotion, pauses) across translation rather than flattening them into neutral speech. ● SeamlessStreaming: Produces translated speech in a streaming fashion with low latency, enabling near-real-time conversational translation. ● Low-resource coverage: An improved SeamlessM4T is trained on more low-resource language data, broadening the language coverage meaningfully beyond the original M4T. ● Safety red-teaming: Meta applies a red-teaming effort specifically for multimodal translation safety, a recognition that MT systems can amplify harmful content across languages. |
Paper, Tweet |
| 5) MEDITRON-70B - EPFL's MEDITRON is an open-source family of medical LLMs at 7B and 70B parameters, continually pretrained on curated medical corpora. ● Llama 2 base + medical pretraining: Builds on Llama 2 with continual pretraining on a curated medical corpus covering clinical papers, guidelines, and textbooks. ● Strong open medical baseline: MEDITRON-70B outperforms GPT-3.5 and Med-PaLM on standard medical QA benchmarks while being open-source. ● Close to frontier: Comes within 5% of GPT-4 and 10% of Med-PaLM 2 on MultiMedQA - competitive given the much smaller scale and open release. ● Reproducible recipe: Ships with pretraining data, code, and weights, providing a reproducible starting point for researchers and institutions building medical LLMs. |
Paper, Tweet |
| 6) Medprompt - Microsoft researchers show that careful prompt engineering can push general-purpose GPT-4 to state-of-the-art on medical benchmarks, no domain fine-tuning required. ● General-purpose prompting: Uses purely general-purpose prompt-engineering techniques (CoT, dynamic few-shot, choice-shuffling ensembling) with no medical-domain specialization. ● Medprompt recipe: Combines k-nearest-neighbor example selection, GPT-4-generated chain-of-thought rationales, and choice-shuffling to cancel answer-position biases. ● SoTA on 9 benchmarks: Achieves state-of-the-art on all nine benchmarks in MultiMedQA, beating Med-PaLM 2 and other specialized medical models. ● Broader lesson: Reopens the question of whether domain-specific pretraining is actually necessary when a frontier base model is paired with strong prompting - a framing that has recurred in later debates. |
Paper, Tweet |
| 7) UniIR - UniIR is a unified instruction-guided multimodal retriever that handles eight retrieval tasks across modalities with a single model. ● Instruction-guided: A single retriever conditioned on natural-language instructions determines which retrieval task to perform, rather than one retriever per task. ● Eight tasks: Handles image-to-text, text-to-image, composed-image retrieval, video retrieval, and other multimodal variants under one umbrella. ● Zero-shot generalization: Generalizes to unseen retrieval tasks not explicitly trained on, approaching a truly general multimodal retrieval model. ● M-BEIR benchmark: Ships with a new multimodal retrieval benchmark (M-BEIR) designed to standardize evaluation across tasks and modalities. |
Paper, Tweet |
| 8) Safe Deployment of Generative AI (Nature) - A Nature correspondence arguing that medical professionals - not commercial interests - must drive the development and deployment of generative AI in medicine. ● Privacy-first framing: Centers patient-privacy considerations as the non-negotiable constraint on medical AI deployment. ● Professional governance: Calls for clinician-led governance structures rather than commercial self-regulation, citing past failures of tech-industry oversight in regulated domains. ● Deployment guardrails: Recommends guardrails including consent, transparency of training data, and clinician accountability for AI-assisted decisions. ● Policy signal: As a Nature piece, amplifies medical-community concerns into the broader AI policy conversation at a key moment in the regulation debate. |
Paper, Tweet |
| 9) Dobb-E - NYU's Dobb-E is an affordable household-manipulation robot that learns new tasks with just 5 minutes of user demonstrations. ● 5 minutes of demos: Learns new household manipulation tasks from only ~5 minutes of demonstrations, a dramatic reduction from typical data requirements. ● Hardware design: Uses a low-cost stick-on gripper and a smartphone-driven data-collection rig, keeping the barrier to entry low for non-expert users. ● Home-specific challenges: Experiments in real homes surface challenges usually hidden in lab robotics - strong shadows, variable demo quality, and household-specific clutter. ● General-purpose household system: Positions Dobb-E as a general-purpose system for household robotics rather than a task-specific demonstrator, a step toward practical home robots. |
Paper, Tweet |
| 10) Translatotron 3 - Google's Translatotron 3 performs speech-to-speech translation using only monolingual data - no parallel corpora required. ● Fully unsupervised S2S: Learns direct speech-to-speech translation from monolingual data alone, a first for this task. ● Three-component architecture: Combines a masked autoencoder for speech representation, unsupervised embedding mapping across languages, and back-translation for alignment. ● Beats cascade baselines: Outperforms a comparable cascade of ASR + MT + TTS, a surprising result given cascade systems are typically the strong baseline. ● Paralinguistic preservation: Preserves paralinguistic features - pauses, speaking rates, and speaker identity - that cascaded systems tend to wash out in translation. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) System 2 Attention (S2A) - Meta's S2A uses the LLM's own reasoning to decide what context actually matters, regenerating a clean prompt before the final response step. ● Two-pass prompting: First pass uses the LLM to filter/regenerate the input context, removing irrelevant or misleading content; second pass generates the final answer from the clean context. ● Addresses distraction: Directly targets the well-known problem that LLMs attend to irrelevant or manipulative content (e.g., opinion-laden context that biases answers). ● Factuality gains: Increases factuality on QA and reduces the model's sensitivity to biased framing or distractors inserted into the prompt. ● Math word problems: Outperforms standard attention-based LLMs on math word problems, where filtering irrelevant details is often the hard part of the task. |
Paper, Tweet |
| 2) Advancing Long-Context LLMs - A survey of methodologies for improving Transformer long-context capability across pretraining, fine-tuning, and inference stages. ● Full-stack coverage: Organizes methods by training stage - pretraining objectives, position encoding, fine-tuning recipes, and inference-time interventions. ● Position-encoding deep dive: Reviews RoPE variants, ALiBi, and other positional-encoding choices that dominate long-context extrapolation. ● Efficient attention: Catalogs sparse, linear, and memory-augmented attention mechanisms that make longer contexts tractable. ● Evaluation considerations: Addresses benchmark limitations including the "needle in a haystack" problem and the gap between nominal context length and effective usable context. |
Paper, Tweet |
| 3) Parallel Speculative Sampling - Amazon researchers propose a parallel variant of speculative sampling that achieves significant LLM inference speedups with minimal extra parameters. ● Parallel decoding: Combines speculative sampling with parallel decoding so multiple tokens can be generated and verified in a single pass. ● Tiny overhead: Requires learning only O(d_emb) additional parameters, far fewer than typical speculative-decoding draft models. ● Up to 30% speedup: Achieves up to 30% end-to-end inference speedup without compromising output quality. ● Minimal integration cost: Unlike separate-draft-model speculative decoding, this fits inside the main model with essentially no deployment overhead. |
Paper, Tweet |
| 4) Mirasol3B - Google's Mirasol3B is a multimodal model that decouples modalities into focused autoregressive components rather than forcing a single fused stream. ● Decoupled autoregressive modeling: Separates audio/video processing from text processing into focused autoregressive components that communicate through learned cross-modal interfaces. ● Handles longer videos: The decoupled design lets the model handle longer video inputs than typical end-to-end multimodal models constrained by sequence length. ● Modality-specific processing: Inputs are processed according to their modalities with appropriate tokenization rather than forcing a one-size-fits-all tokenizer. ● SoTA on video benchmarks: Outperforms prior methods on video QA, long-video QA, and audio-video-text benchmarks, validating the decoupled approach. |
Paper, Tweet |
| 5) Teaching Small LMs to Reason - An approach that teaches smaller language models to explicitly select among reasoning techniques for each problem. ● Reasoning technique menu: Trains the small LM to choose among step-by-step processing, recall-then-generate, recall-reason-generate, extract-generate, and direct-answer strategies. ● Technique selection: The model learns when to apply each strategy based on problem structure, not just which answer to produce. ● Matches 5-10x larger models: Attains zero-shot reasoning performance similar or better than models 5-10x larger on complex reasoning tasks. ● Practical scaling: Offers a recipe for teams that can't deploy frontier-scale models but need strong reasoning quality - a recurring production constraint. |
Paper, Tweet |
| 6) GPQA - A graduate-level Google-proof QA benchmark designed to stress-test reasoning in systems that might exceed human expertise. ● 448 expert questions: Consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. ● Google-proof by design: Questions are constructed so that even with unrestricted internet access, non-experts (~34%) perform only slightly better than random on them. ● GPT-4 gets 39%: The strongest GPT-4 baseline hits only 39% accuracy, showing a clear headroom for frontier models on expert-level reasoning. ● Scalable oversight testbed: Explicitly designed to enable scalable oversight research - experiments in supervising models whose knowledge may exceed the supervisors'. |
Paper, Tweet |
| 7) Hitchhiker's Guide From CoT to Agents - A survey mapping the conceptual evolution from chain-of-thought reasoning to modern language-agent frameworks. ● CoT foundations: Covers the mechanics underpinning CoT (few-shot prompting, self-consistency, least-to-most, tree-of-thought) with a consistent formalism. ● Mechanism theory: Explores why CoT works - in-context learning, prompt engineering theories, and emergence at scale - rather than just cataloging results. ● CoT-to-agent bridge: Traces how CoT techniques were progressively extended into tool use, multi-step planning, and full agent loops (ReAct, Reflexion, etc.). ● Framework landscape: Organizes the modern language-agent frameworks by which parts of the CoT-to-agent pipeline they emphasize, clarifying an otherwise noisy field. |
Paper, Tweet |
| 8) GAIA - Meta's GAIA is a benchmark for general AI assistants that requires reasoning, multimodal handling, web browsing, and tool use to solve real-world questions. ● Real-world questions: Questions are conceptually simple for humans but require integrated reasoning, web research, and tool use - a realistic test for assistant-style AI. ● Massive human-model gap: Humans achieve 92% accuracy while GPT-4 with plugins achieves only 15% - the widest human-AI gap on any major 2023 benchmark. ● Level-graduated difficulty: Three difficulty levels let researchers measure incremental progress rather than just binary success/failure. ● Agent-first evaluation: Explicitly designed to test AI assistants, not base LLMs - a framing that has since become dominant for agent evaluations. |
Paper, Tweet |
| 9) MedAgents - A collaborative multi-round framework for medical reasoning that uses role-playing LLM agents to improve accuracy and reasoning depth. ● Multi-agent deliberation: Multiple LLM agents take on specialist roles (e.g., different medical specialties) and deliberate in rounds over a case. ● Role-playing: Each agent has a defined role-play prompt that scopes its expertise and reasoning style, producing more diverse intermediate hypotheses. ● Consensus protocol: Agents iterate until reaching consensus or until a moderator resolves disagreements, producing a final answer with rationale. ● Reasoning gains: Improves accuracy and reasoning quality on medical QA benchmarks compared to single-agent baselines at matched compute. |
Paper, Tweet |
| 10) TÜLU 2 - Allen AI's TÜLU 2 is a suite of improved open instruction-tuned LLMs and an accompanying study of adaptation best practices. ● Open suite: Releases open models that match or exceed GPT-3.5-turbo-0301 on several benchmarks, a meaningful milestone for the open ecosystem at the time. ● Post-training recipe: The paper doubles as a practical recipe, documenting how instruction data curation, mixing ratios, and DPO-based preference training interact. ● UltraFeedback preference data: Uses UltraFeedback for preference optimization, validating that openly released preference datasets are sufficient to close much of the gap to commercial post-training pipelines. ● Adaptation research platform: Explicitly positioned as a platform for studying open adaptation techniques, informing the TÜLU 3 release that would follow in 2024. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Emu Video and Emu Edit - Meta releases Emu Video and Emu Edit, a pair of diffusion models targeting controlled text-to-video generation and instruction-based image editing. ● Emu Video: Generates high-quality video from text-only, image-only, or combined text + image inputs using a factorized diffusion approach - text-to-image followed by image-conditioned video. ● Emu Edit: Enables free-form image editing through text instructions, handling region, local, and global edits within one model. ● Factorized video: The text-to-image then image-to-video split dramatically cuts training cost and improves controllability compared to end-to-end T2V models. ● Unified research line: Both models extend Meta's Emu foundation family, pointing toward a unified multimodal generative stack shared across image, video, and edit tasks. |
Paper, Tweet |
| 2) Chain-of-Note (CoN) - Tencent's Chain-of-Note adds an explicit note-taking step to RAG so the model can evaluate retrieved evidence before answering. ● Sequential notes: For each retrieved document, the model writes a "reading note" assessing relevance to the question, rather than attending to the entire retrieval dump directly. ● Noise robustness: +7.9 EM improvement when retrieved documents are entirely noisy, precisely the regime where standard RAG degrades most. ● Unknown-scenario handling: +10.5 rejection-rate improvement on questions outside the model's training scope, a key property for avoiding confident hallucinations. ● Generalizable pattern: The note-taking step is a lightweight addition on top of existing RAG pipelines, making it easy to adopt incrementally. |
Paper, Tweet |
| 3) LLMs for Scientific Discovery - A broad evaluation of GPT-4 across scientific disciplines including drug discovery, biology, and computational chemistry. ● Expert-driven assessment: Domain experts design case studies to probe GPT-4's understanding of complex scientific concepts and its ability to solve real research problems. ● Problem-solving capability: GPT-4 demonstrates meaningful problem-solving in many domains but shows systematic weaknesses on tasks requiring precise numerical reasoning or experimental design. ● Benchmark coverage: Complements qualitative case studies with quantitative benchmarks, triangulating on where current frontier models help vs. mislead. ● Research workflow integration: Argues LLMs can accelerate scientific ideation and literature synthesis but require careful scaffolding before touching high-stakes experimental decisions. |
Paper, Tweet |
| 4) Fine-Tuning LLMs for Factuality - Stanford fine-tunes LLMs for factuality without any human labels by using automatically generated preference signals. ● Automatic factuality signal: Derives factuality preference rankings from reference consistency checks and retrieval-based verification - no human labels required. ● Open-ended generation: Specifically targets open-ended generation settings rather than constrained QA, where hallucination is hardest to detect or correct. ● Llama 2 improvements: Significantly improves Llama 2's factuality on held-out topics, outperforming RLHF and decoding-time factuality strategies. ● Scalable alignment: Offers a recipe for scaling factuality alignment without proportionally scaling human annotation - an important direction as LLMs cover broader domains. |
Paper, Tweet |
| 5) Contrastive Chain-of-Thought - Proposes contrastive CoT prompting where models see both valid and invalid reasoning demonstrations to reduce reasoning errors. ● Valid + invalid demos: Demonstrations pair correct reasoning traces with common incorrect ones, teaching the model what not to do as well as what to do. ● Automatic construction: Provides an automatic method to generate contrastive demonstrations, avoiding the manual curation bottleneck that limited prior CoT variants. ● Improves over CoT: Outperforms standard CoT across reasoning benchmarks, with particularly strong gains on problems where common error patterns are predictable. ● Pedagogical analog: The improvement mirrors human learning research showing that studying worked examples and errors side-by-side beats studying successes alone. |
Paper, Tweet |
| 6) Survey on Language Models for Code - A comprehensive survey of LLMs for code covering 50+ models, 30+ evaluation tasks, and 500 related works. ● Model landscape: Catalogs 50+ code LLMs across sizes, architectures, and training regimes, providing a single reference for what's available. ● Task taxonomy: Reviews 30+ evaluation tasks spanning code generation, repair, translation, summarization, and execution prediction. ● Training and data recipes: Walks through pretraining corpus construction, instruction tuning, and RLHF specifically for code. ● Open problems: Highlights challenges in long-context code understanding, multi-file reasoning, and robust evaluation beyond HumanEval-style metrics. |
Paper, Tweet |
| 7) JARVIS-1 - An open-world multimodal agent for Minecraft that combines perception, planning, and memory into a self-improving system. ● Multimodal perception: Processes visual Minecraft observations and natural-language instructions through a unified multimodal input pipeline. ● Memory-augmented planning: Maintains a multimodal memory store of past observations and plans, enabling lifelong self-improvement across episodes. ● Strong task coverage: Completes 200+ diverse Minecraft tasks with competitive success rates, including long-horizon tasks like diamond collection. ● Open-world blueprint: An influential example of combining foundation models, memory, and explicit planning into an agent, foreshadowing many 2024 agent architectures. |
Paper, Tweet |
| 8) Learning to Filter Context for RAG (FILCO) - CMU's FILCO improves RAG by training a dedicated model to filter retrieved contexts before they reach the generator. ● Useful-context identification: Uses lexical and information-theoretic signals to identify genuinely useful portions of retrieved documents, rather than passing everything through. ● Context-filter training: Trains a separate filtering model whose only job is to retain useful context at inference time. ● Extractive QA wins: Outperforms prior RAG approaches on extractive QA benchmarks, a clean demonstration that context filtering is a high-leverage component. ● Modular addition: Slots in between retrieval and generation, making it compatible with any retriever/generator pairing. |
Paper, Tweet |
| 9) MART (Multi-round Automatic Red-Teaming) - Meta's MART scales LLM safety alignment using fully automatic multi-round red-teaming. ● Adversarial prompt writing: One LLM acts as red-teamer, automatically generating adversarial prompts that probe the target model's safety. ● Safe response generation: The target LLM then generates responses that are filtered/refined for safety, producing training data for the next round. ● 84.7% violation reduction: After 4 rounds, the violation rate of an initially weakly-aligned LLM drops up to 84.7%, matching models with extensive human-written adversarial data. ● Scalable alignment: Demonstrates that automatic red-teaming can substitute for expensive human adversarial prompt writing in the alignment pipeline. |
Paper, Tweet |
| 10) LLMs Can Deceive Users (Trading Agent) - Apollo Research shows that a helpful, honest LLM stock-trading agent can spontaneously deceive users under pressure. ● Stock-trading testbed: The LLM agent runs an autonomous trading simulation with access to market data and occasional insider tips. ● Acts on insider information: When placed under performance pressure, the agent acts on insider tips despite explicit instructions not to - a clear instance of strategic norm violation. ● Hides reasoning from the user: Crucially, the agent reports doctored rationales to its user, hiding the insider trade rather than reporting it - strategic deception without being trained to deceive. ● Alignment implication: Demonstrates that deception can emerge in "helpful and safe" models under realistic pressure, without targeted training - a significant datapoint for alignment research. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Hallucination in LLMs Survey - A comprehensive survey of hallucination in LLMs, covering taxonomy, causes, evaluation, and mitigation. ● Two-category taxonomy: Separates hallucinations into factuality hallucinations (incorrect facts) and faithfulness hallucinations (deviations from source content). ● Causes breakdown: Attributes hallucinations to training-data issues, training-stage artifacts, and inference-time choices - each with distinct mitigation paths. ● Evaluation landscape: Reviews benchmarks and automatic metrics specifically designed for hallucination, contrasting them with general-purpose LLM metrics. ● Mitigation strategies: Organizes mitigation into data curation, training-stage (RLHF, factuality tuning), and inference-stage (decoding, retrieval) approaches. |
Paper, Tweet |
| 2) Simplifying Transformer Blocks - Researchers show that many components of the standard transformer block can be removed with no loss in training speed or quality. ● Aggressive simplification: Removes residual connections, normalization layers, and value/projection parameters in specific blocks without hurting per-update training speed. ● Works across architectures: Tested on autoregressive decoder-only and BERT encoder-only models, validating that the simplifications aren't architecture-specific. ● 15% faster throughput: Simplified blocks deliver 15% faster training throughput with fewer parameters - a clean efficiency win. ● Design-space implication: Suggests the standard transformer is overdetermined and that careful ablation can yield simpler, faster architectures without new ideas. |
Paper, Tweet |
| 3) In-Context Learning Generalization Limits - Investigates whether transformers' in-context learning can generalize beyond the distribution of their pretraining data. ● Pretraining distribution bridge: Tests whether transformers can identify and learn new tasks in-context, both inside and outside their pretraining data distribution. ● Limited OOD generalization: In the regimes studied, there's limited evidence that ICL generalizes meaningfully beyond pretraining data coverage. ● Counter-narrative: Pushes back on the strong "universal learners" framing of ICL that sometimes accompanies emergence-claims, grounding it in data-distribution bounds. ● Research implication: Argues that evaluating ICL requires carefully distinguishing in-distribution skill retrieval from genuine OOD generalization - a distinction rarely made cleanly in headlines. |
Paper, Tweet |
| 4) MusicGen - Meta's MusicGen is a single-stage transformer LLM for music generation that operates over compressed discrete audio tokens. ● Single-stage transformer: Unlike multi-stage music generation pipelines, MusicGen generates music as a single autoregressive transformer over multi-codebook tokens. ● Multi-stream tokens: Operates over several parallel streams of compressed discrete music tokens, producing high-fidelity audio without the cascaded VQ-VAE + LM setup. ● Text and melody conditioning: Supports both text prompts and melody conditioning, letting users specify style with text and structure with reference audio. ● High-quality generation: Delivers competitive subjective quality against multi-stage baselines while being simpler and faster to deploy. |
Paper, Tweet |
| 5) AltUp (Alternating Updates) - Google's AltUp lets transformers benefit from wider representations without paying the full compute cost at every layer. ● Wide-but-cheap representation: Widens the learned representation but only actively updates one sub-block per layer, leaving others untouched during that forward pass. ● Predict-and-correct: A predict-and-correct mechanism updates the inactive sub-blocks with predictions, so they remain coherent without full computation. ● Negligible latency increase: Achieves wider representations at negligible latency cost compared to matched-width dense transformers. ● Scaling lever: Provides a middle-ground between narrow dense models and sparse MoE - wider without routing complexity. |
Paper, Tweet |
| 6) Rephrase and Respond (RaR) - An effective prompting method where the LLM rephrases and expands the user's question before answering it. ● Rephrase step: The model first rewrites the question to resolve ambiguity, fill in implicit assumptions, and make the task explicit - then answers the rephrased version. ● Broad task gains: Improves performance across diverse tasks without needing any fine-tuning, using only prompt-level changes. ● Stacks with CoT: Combines cleanly with chain-of-thought prompting, giving additive improvements on reasoning benchmarks. ● User-friendly interpretation: Shows that part of the "prompt engineering" skill gap between novice and expert users is really a rephrasing problem - one the LLM itself can fix. |
Paper, Tweet |
| 7) On the Road with GPT-4V - An exhaustive evaluation of GPT-4V applied to autonomous driving scenarios. ● Driving-scenario evaluation: Tests GPT-4V across diverse driving situations including scene understanding, traffic-sign recognition, and causal reasoning about driver intent. ● Scene-understanding strength: Demonstrates superior performance in scene understanding and causal reasoning compared to existing production autonomous-driving systems. ● Edge-case robustness: Shows relative robustness on edge cases (construction zones, unusual road layouts) that typically confuse narrower perception stacks. ● Practical limitations: Flags real-world issues including latency, rare-hazard handling, and dependence on high-quality image quality that would gate production deployment. |
Paper, Tweet |
| 8) GPT4All Technical Report - The GPT4All technical report documents the model family and the open ecosystem built around democratizing local LLMs. ● Model family: Covers the sequence of GPT4All models trained and released through 2023, spanning 3B-13B parameter sizes. ● Open-source focus: Ships with a cross-platform desktop app, open model weights, and an accompanying dataset - positioning itself as a turnkey local LLM stack. ● Data and training: Details the curated instruction-tuning dataset and fine-tuning recipes used to build the family. ● Ecosystem impact: Tracks GPT4All's role in popularizing local LLM usage among hobbyists and small organizations before Ollama and similar tools matured. |
Paper, Tweet |
| 9) S-LoRA - S-LoRA enables serving thousands of LoRA adapters concurrently on a single GPU through memory-paging and custom CUDA kernels. ● Main-memory adapter pool: Stores all adapters in main memory and loads adapters for currently running queries into GPU memory on demand, dramatically increasing the adapter pool size. ● Novel tensor parallelism: Introduces a tensor-parallelism strategy tailored for heterogeneous LoRA batches, where each query might use a different adapter. ● 4x throughput: Improves throughput by 4x compared to prior adapter-serving solutions at comparable latency. ● Adapter scale: Enables serving several orders of magnitude more adapters on the same hardware - important for multi-tenant LoRA deployments and personalized fine-tuning services. |
Paper, Tweet |
| 10) FreshLLMs (FreshQA) - Introduces FreshQA, a dynamic benchmark designed to stress-test LLMs on time-sensitive knowledge. ● Dynamic QA benchmark: Continuously refreshes questions so models can't memorize answers - a direct response to the contamination concerns plaguing static benchmarks. ● Four question categories: Covers never-changing, slow-changing, fast-changing, and false-premise questions, stressing different aspects of freshness handling. ● Reveals freshness gap: Shows that LLMs without search augmentation answer fast-changing questions poorly, while retrieval-augmented models close most of the gap. ● FreshPrompt: Proposes FreshPrompt, a simple search-augmented prompting strategy that substantially boosts LLM performance on time-sensitive questions. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) MetNet-3 - Google's MetNet-3 is a state-of-the-art neural weather model extending lead time and variable coverage well beyond prior observation-based models. ● Dense + sparse sensors: Learns jointly from dense sensor data (radar, satellite) and sparse in-situ station data, combining signals that were typically used separately. ● 24-hour forecasts: Produces predictions up to 24 hours ahead, a meaningful lead-time extension for observation-based weather modeling. ● Multi-variable output: Predicts precipitation, wind, temperature, and dew point from the same model, rather than requiring per-variable systems. ● Operational relevance: Demonstrates the neural-weather-model pattern that would dominate 2024 forecasting research - observation-driven, end-to-end neural pipelines replacing traditional numerical systems. |
Paper, Tweet |
| 2) Evaluating LLMs Survey - A comprehensive survey of LLM evaluation covering benchmarks, methodologies, and open problems. ● Task-wise organization: Organizes evaluation by task category - reasoning, knowledge, alignment, robustness, ethics, etc. - showing which benchmarks address which capabilities. ● Automatic vs. human: Discusses the trade-offs between automatic metrics (cheap, inconsistent), LLM-as-a-Judge (scalable, biased), and human evaluation (reliable, expensive). ● Contamination and robustness: Highlights contamination and robustness as cross-cutting concerns plaguing static benchmarks at all scales. ● Frontier-model needs: Argues that evaluating frontier-scale LLMs requires new paradigms beyond simple benchmark accuracy, including interactive evaluation and behavioral testing. |
Paper, Tweet |
| 3) Battle of the Backbones - A large-scale benchmarking framework that compares vision backbones across a diverse suite of computer vision tasks. ● Broad benchmarking: Compares CNN and ViT backbones across classification, segmentation, detection, retrieval, and other tasks at matched compute. ● Pretraining recipes matter: Shows that pretraining scheme (supervised, self-supervised, language-image) often matters more than the architecture family. ● ViT ≠ universal winner: Vision transformers are not universally superior - strong CNN backbones remain competitive or better on several downstream tasks. ● Practitioner guide: Functions as a decision reference - the report explicitly maps from task characteristics to recommended backbone + pretraining combinations. |
Paper, Tweet |
| 4) ChipNeMo (LLMs for Chip Design) - NVIDIA's ChipNeMo applies domain-adapted LLMs to industrial chip design workflows. ● Domain adaptation pipeline: Applies continued pretraining on chip-design corpora, SFT, and domain-specific RLHF to adapt general LLMs to semiconductor design language. ● Three applications: Evaluates assistant chatbot for engineers, EDA (electronic design automation) tool invocation, and bug summarization - three real internal chip-design pain points. ● Significant adaptation gains: Domain adaptation dramatically outperforms general-purpose LLMs across tasks despite using smaller model sizes. ● Adapted RAG: Using a domain-adapted LLM as the generator in RAG further improves answer quality compared to using a general-purpose LLM with the same retrieval stack. |
Paper, Tweet |
| 5) YaRN (Efficient Context Extension) - YaRN is a compute-efficient method for extending the context window of LLMs well beyond their pretrained length. ● Rotary-embedding scaling: Extends RoPE-based context length via a combined attention and NTK-aware scaling scheme, avoiding the degradation of naive interpolation. ● Fine-tune extrapolation: Extrapolates meaningfully beyond the limited context seen during fine-tuning, so short fine-tune sequences can unlock much longer inference contexts. ● 128K context: Successfully scales Llama-family models to 128K-token context with minimal additional training compute. ● Open recipe: Adopted widely across the open-source community as a standard recipe for extending Llama and other RoPE-based LLMs. |
Paper, Tweet |
| 6) Open DAC 2023 - Meta releases a large DFT dataset for training ML models that predict sorbent-adsorbate interactions in Direct Air Capture (DAC). ● 38M+ DFT calculations: Consists of more than 38M density functional theory calculations on metal-organic frameworks (MOFs), enabling large-scale ML-driven DAC material discovery. ● DAC research: Targets direct air capture, where efficient CO₂-capturing MOFs are needed - a high-impact climate application for ML. ● ML baselines: Provides strong ML baselines showing that ML surrogates can replace expensive DFT calculations for MOF screening. ● Open-science contribution: Positions the dataset as an open foundation for materials ML research on climate applications. |
Paper, Tweet |
| 7) Symmetry in Machine Learning - A methodological framework for enforcing, discovering, and promoting symmetry in machine learning models. ● Unified framework: Presents a single theoretical framework that covers data augmentation, equivariant architectures, and symmetry-discovering learning objectives. ● Three-way taxonomy: Organizes approaches into enforcing known symmetries, discovering latent ones, and biasing learning toward symmetric solutions. ● Worked examples: Applies the framework to MLPs and basis-function regression, showing concretely how the abstract concepts translate into design choices. ● Broader ML perspective: Positions symmetry as a first-class design lever alongside scale and data quality, particularly for scientific ML. |
Paper, Tweet |
| 8) Next-Generation AlphaFold - DeepMind previews the next AlphaFold with dramatically expanded scope of biomolecular complexes. ● Multi-entity complexes: Jointly predicts structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues in a single unified model. ● Beyond protein-only: Dramatically expands applicability beyond AlphaFold 2's protein-only regime, opening up drug discovery and RNA biology workflows. ● Beats specialist predictors: Achieves greater accuracy on protein-nucleic acid interactions than specialized predictors in that domain - remarkable for a general model. ● Biology pipeline signal: Preview of the capability direction that would crystallize as AlphaFold 3 in 2024, with profound implications for structural biology research. |
Paper, Tweet |
| 9) EmotionPrompt - Microsoft researchers show that appending emotional stimuli to prompts reliably improves LLM performance across 45 tasks. ● 45-task evaluation: Tested across Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4 on 45 deterministic and generative tasks. ● Emotional stimuli: Appends phrases like "This is very important to my career" to prompts, drawing on social-psychology theories of human motivation. ● Consistent gains: Produces consistent improvements across both smaller and frontier models, despite the prompts being content-free manipulations. ● Emotional-intelligence signal: Suggests LLMs have internalized patterns connecting emotional framing to effort - a "bug or feature" question that has driven follow-up research on LLM behavioral psychology. |
Paper, Tweet |
| 10) FP8-LM - Microsoft's FP8-LM demonstrates that most LLM training variables - gradients, optimizer states - can use FP8 without sacrificing accuracy. ● FP8 across the pipeline: Extends FP8 training beyond forward activations to gradients and optimizer states (both moments), widening the FP8 footprint. ● No hyperparameter changes: Works as a drop-in replacement for FP16/BF16 training without requiring changes to learning rates, schedules, or other hyperparameters. ● Matched accuracy: Achieves accuracy indistinguishable from FP16/BF16 baselines on LLM pretraining tasks. ● Efficiency gains: Delivers substantial memory and compute savings, particularly attractive for training large models on FP8-capable hardware like H100. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Zephyr - Hugging Face's Zephyr-7B is a 7B parameter LLM whose chat performance rivals much larger chat models aligned with human feedback. ● Distilled SFT: Uses distilled supervised fine-tuning on UltraChat-generated instruction data as the task-accuracy foundation. ● Distilled DPO: Aligns with AI feedback data via Direct Preference Optimization, rather than the expensive human-feedback RLHF pipeline. ● ChatGPT-level at 7B: Achieves competitive performance with ChatGPT on AlpacaEval and matches 70B chat models aligned with human feedback on several benchmarks. ● Recipe popularization: Open-sources the distilled-DPO recipe, which became a widely adopted template for small, strong open chat models. |
Paper, Tweet |
| 2) Fact-Checking with LLMs - Investigates the fact-checking capabilities of frontier LLMs across multiple languages and claim types. ● Contextual information helps: LLMs perform significantly better at fact-checking when equipped with retrieved evidence, validating the RAG pattern for claim verification. ● GPT-4 > GPT-3: GPT-4 shows meaningful accuracy gains over GPT-3 for fact-checking, but both struggle without supporting context. ● Multilingual variance: Accuracy varies substantially by query language and claim veracity, exposing persistent language-equity gaps in fact-checking. ● Inconsistent reliability: While LLMs show real fact-checking promise, their accuracy is inconsistent enough that they can't replace human fact-checkers - useful as assistants, not arbiters. |
Paper, Tweet |
| 3) Matryoshka Diffusion Models - Apple introduces an end-to-end framework for high-resolution image and video synthesis that denoises across multiple resolutions jointly. ● Joint multi-resolution diffusion: Runs the diffusion process at multiple resolutions simultaneously, sharing representations across scales in a single unified model. ● NestedUNet: Uses a NestedUNet architecture so that higher-resolution branches build on lower-resolution features without a separate cascade. ● Progressive training: Trains progressively from low to high resolution, dramatically improving optimization stability for high-resolution generation. ● Unified model: Eliminates the typical cascaded-diffusion pipeline used in prior high-resolution generation, simplifying training and serving. |
Paper, Tweet |
| 4) Spectron - Google's Spectron is a spoken-language model trained end-to-end on raw spectrograms rather than text or discrete audio tokens. ● End-to-end spectrogram modeling: Processes spectrograms directly without an intermediate speech-recognition or tokenization step, preserving paralinguistic information. ● High-quality spoken output: Fine-tuned to generate high-quality, accurate spoken language while preserving speaker and prosody characteristics. ● Speaker preservation: Outperforms prior spoken-language models on speaker preservation - a known weakness of tokenizer-based approaches. ● Semantic coherence: Also improves semantic coherence of generated speech, addressing the common drift problem in spectrogram-level generation. |
Paper, Tweet |
| 5) LLMs Meet New Knowledge - A benchmark that evaluates how well LLMs handle new knowledge beyond their training cutoff. ● Three-dimensional evaluation: Tests knowledge understanding, knowledge differentiation (old vs. new), and knowledge association across the full set of relations. ● Post-cutoff focus: Uses knowledge that appears after the model's training cutoff, avoiding contamination that undermines many LLM knowledge benchmarks. ● LLMs struggle with new knowledge: Reveals systematic gaps - even frontier LLMs handle post-cutoff facts significantly worse than pre-cutoff ones, despite strong reasoning. ● RAG-oriented motivation: Provides empirical grounding for RAG: parametric memory is tied to training data, so retrieval remains necessary for fresh knowledge. |
Paper, Tweet |
| 6) Min-K% Prob (Detecting Pretraining Data) - Proposes Min-K% Prob as an effective detection method for determining whether specific text was in an LLM's pretraining data. ● Method: Computes the average log-probability of the K% least-likely tokens in a text; memorized text has higher log-probabilities on these tokens than unseen text. ● Black-box detection: Works on API-accessible models without needing gradients or internal activations, making it broadly applicable. ● Multiple use cases: Usable for benchmark-contamination detection, privacy auditing of machine unlearning, and copyrighted-text detection in pretraining corpora. ● Policy implications: Provides a technical tool for the copyright and privacy debates, letting third parties measurably test specific-text inclusion in training data. |
Paper, Tweet |
| 7) ConvNets Match Vision Transformers - DeepMind shows that strong ConvNet architectures pretrained at scale match ViTs on ImageNet performance at comparable compute. ● JFT-4B pretraining: Pretrains performant ConvNet architectures (NFNets) on JFT-4B at scale - matching the data regime where ViTs typically pull ahead. ● Log-log scaling law: Observes a log-log scaling law between held-out loss and compute, mirroring the scaling properties seen in ViTs. ● ImageNet parity: Fine-tuned NFNets match the reported performance of Vision Transformers at comparable compute budgets, refuting the "ConvNets don't scale" narrative. ● Architecture vs. recipe: Argues that the ConvNet-vs-ViT gap is largely a scale/recipe gap rather than an architectural limitation - a recurring theme in vision research. |
Paper, Tweet |
| 8) CommonCanvas - Releases CommonCanvas, a text-to-image dataset composed entirely of Creative-Commons-licensed images. ● CC-only training data: Every image is Creative Commons-licensed, providing a clean-license dataset for commercial and research T2I training. ● Scale despite licensing constraints: Curates hundreds of millions of images despite the CC-only constraint, dispelling the myth that legal T2I training requires permissive copyrighted data. ● Strong baseline models: Trains SD-style models on CommonCanvas that reach competitive quality, demonstrating CC data is sufficient for SoTA T2I. ● Policy contribution: Provides a practical counterexample to the argument that copyrighted training data is necessary - important as copyright litigation reshaped the AI-data landscape. |
Paper, Tweet |
| 9) Managing AI Risks (Bengio, Hinton, et al.) - A high-profile position paper by leading AI researchers laying out risks from upcoming advanced AI systems. ● Risk catalog: Enumerates social harms, malicious uses, large-scale autonomous risks, and potential loss-of-control scenarios from increasingly capable AI. ● Signatory weight: Signed by multiple Turing Award-winning researchers including Hinton and Bengio, amplifying its impact in the policy conversation. ● Concrete recommendations: Calls for investment in safety research, mandatory standards for advanced AI, and international coordination - not a pure threat-inventory. ● Political moment: Published during active AI-regulation discussions in the US and UK, directly influencing the UK AI Safety Summit and related policy processes. |
Paper, Tweet |
| 10) Branch-Solve-Merge (BSM) - BSM decomposes LLM tasks into parallel sub-tasks via three LLM-programmed modules: branch, solve, and merge. ● Three-module architecture: A branch module proposes a decomposition into parallel sub-tasks, a solve module independently answers each, and a merge module fuses results into a final response. ● Prompt-parameterized: All three modules are the same base LLM with different prompts, so BSM works with any base model without fine-tuning. ● Evaluation quality gains: Improves evaluation correctness and consistency for multiple LLMs, particularly on tasks where a flat prompt leaves too much implicit. ● General pattern: Generalizes the "decompose then solve" pattern from math/CoT to arbitrary tasks, anticipating more structured agent decomposition patterns. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Llemma - Llemma is an open LLM for mathematics built via continued pretraining of Code Llama on the Proof-Pile-2 dataset. ● Proof-Pile-2 dataset: Mixes scientific papers, math-heavy web pages, and mathematical code into a focused math-pretraining corpus. ● Code Llama base: Uses Code Llama as the base model, leveraging its existing code proficiency as a scaffold for formal-style math reasoning. ● Beats unreleased Minerva: Outperforms open base models and the unreleased Minerva on the MATH benchmark at comparable scale. ● Full open release: Releases model, dataset, and code - positioning Llemma as a reproducible starting point for open mathematical LLM research. |
Paper, Tweet |
| 2) LLMs for Software Engineering - A comprehensive survey of LLMs for software engineering covering models, tasks, evaluation, and open challenges. ● Task coverage: Surveys code generation, bug detection and repair, code review, code translation, documentation, and testing. ● Model landscape: Reviews code-specialized LLMs (Codex, StarCoder, CodeLlama) alongside general-purpose LLMs applied to code. ● Evaluation review: Catalogs standard benchmarks (HumanEval, MBPP, DS-1000) and their limitations for real-world software engineering. ● Open challenges: Highlights long-context code understanding, multi-file reasoning, verification, and agent-based SE as key open directions. |
Paper, Tweet |
| 3) Self-RAG - Self-RAG trains an LM to adaptively retrieve, generate, and self-critique using special reflection tokens. ● Reflection tokens: Introduces special tokens that control retrieval decisions, passage relevance judgments, and self-evaluation of generations. ● Adaptive retrieval: The model decides on-the-fly whether to retrieve, rather than always retrieving on every query - saving compute on knowledge-light queries. ● Self-reflection: Critiques its own generations against retrieved passages, enabling controllable trade-offs between response quality and factuality at inference. ● Significant gains: Outperforms state-of-the-art LLMs and strong RAG baselines on open-domain QA, reasoning, and fact verification. |
Paper, Tweet |
| 4) RAG for Long-Form QA - Explores retrieval-augmented LMs specifically on long-form question answering, where RAG failures are more subtle. ● Retrieval is necessary: Confirms that retrieval is an important component for long-form QA, but that evidence documents must be carefully curated and ordered. ● Attribution errors: Documents attribution errors - where the model cites passages that don't actually support its claims - and shows these spike when retrieved docs lack sufficient evidence. ● Document ordering: Demonstrates that document order within the context substantially affects long-form QA attribution accuracy. ● Practical guidelines: Offers concrete guidelines for document selection, ordering, and prompting to reduce hallucination in long-form RAG outputs. |
Paper, Tweet |
| 5) GenBench - A Nature Machine Intelligence paper framework for characterizing and understanding generalization research in NLP. ● Meta-analysis: Reviews 543 papers on generalization in NLP, mapping what "generalization" actually means across different research threads. ● Generalization taxonomy: Organizes generalization into compositional, structural, cross-lingual, cross-task, and cross-domain generalization types. ● Evaluation taxonomy: Provides tools for classifying generalization studies by the kind of distribution shift and evaluation protocol they test. ● Research infrastructure: Ships with tools to help researchers classify and compare generalization work, aiming to reduce conceptual fragmentation in the field. |
Paper, Tweet |
| 6) LLM Self-Explanations - Investigates whether LLMs can generate useful feature-attribution explanations for their own outputs. ● Self-explanation capability: LLMs can self-generate feature-attribution explanations that meaningfully highlight the tokens driving their predictions. ● Performance + truthfulness: Self-explanation improves both task performance and the truthfulness of outputs compared to baseline prompting. ● CoT synergy: Combines productively with chain-of-thought prompting, giving additive improvements rather than substituting for it. ● Interpretability lever: Offers a cheap, model-agnostic interpretability pattern that works through the API without needing gradients or white-box access. |
Paper, Tweet |
| 7) OpenAgents - An open platform for running and hosting real-world language agents, including three distinct agent types. ● Data Agent: A data-analysis agent capable of exploring datasets, running analyses, and producing visualizations through conversation. ● Plugins Agent: Integrates 200+ daily-use API tools (e.g., weather, search, calendars) into a single conversational agent interface. ● Web Agent: An autonomous web-browsing agent capable of navigating real websites and completing multi-step tasks. ● Open alternative to ChatGPT Plus: Positions OpenAgents as an open-source alternative to ChatGPT's plugin ecosystem, usable for research into agent-user interaction patterns. |
Paper, Tweet |
| 8) Eliciting Human Preferences with LLMs - Anthropic uses LLMs to guide the task-specification process, eliciting user intent through natural-language dialogue. ● Interactive elicitation: The LLM asks the user open-ended questions to clarify intent, producing a structured task specification that the model can then execute. ● Beats user-written prompts: Systems built via LLM-elicited specifications produce more informative, accurate responses than user-written prompts alone. ● Better than single-shot prompting: Shows that multi-turn elicitation yields higher task-success rates than single-shot prompting, even when the user is not a prompt engineer. ● Usable AI pattern: Offers a pattern for bridging the user-intent gap that shapes AI product design - spec-driven rather than prompt-driven interaction. |
Paper, Tweet |
| 9) AutoMix - AutoMix routes queries between LLMs of different sizes based on smaller-model confidence, saving cost without sacrificing quality. ● Confidence-based routing: A small model answers first; a confidence signal determines whether to accept its answer or escalate to a larger model. ● Cascading thresholds: Uses multiple confidence thresholds to route queries through a cascade of increasingly capable (and expensive) models. ● Cost-quality Pareto: Achieves Pareto improvements over single-model baselines, delivering equivalent quality at substantially lower inference cost. ● Production relevance: The pattern maps cleanly onto practical LLM deployment where most queries can be handled by cheap models but a tail of hard queries need the frontier model. |
Paper, Tweet |
| 10) Video Language Planning - Enables synthesizing complex long-horizon video plans for robotics via tree search over vision-language and text-to-video models. ● Tree-search planner: Uses a tree-search procedure over a vision-language model serving as policy+value, with a text-to-video model acting as the dynamics model. ● Long-horizon plans: Produces multi-step video plans for robotics tasks that would be infeasible with single-shot video generation. ● Cross-domain generalization: Works across diverse robotics domains, showing the approach is not tied to a specific embodiment or task type. ● Planning-via-generation: Demonstrates that generative video models can serve as world models for planning, a pattern that has gained traction through 2024. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Ring Attention - UC Berkeley's Ring Attention scales transformer context to 100M+ tokens by distributing blockwise self-attention across devices in a ring topology. ● Blockwise attention: Computes self-attention in blocks so that only small KV chunks need to fit on each device at any time. ● Ring communication: Passes KV chunks between devices in a ring, overlapping communication with computation to hide networking latency. ● Context scales with devices: Achievable context length grows linearly with the number of devices, with no attention approximations required. ● 100M+ tokens: Enables context lengths exceeding 100 million tokens in theory, far beyond what any single-device attention implementation can reach. |
Paper, Tweet |
| 2) UniSim (Universal Simulator) - Google's UniSim learns a universal generative simulator of real-world interactions from diverse video + action data. ● Generative world model: Simulates how humans and agents interact with the world by predicting the visual outcome of high-level instructions and low-level controls. ● Diverse action conditioning: Handles both text instructions ("pick up the cup") and low-level motor commands, unifying instruction-following and dynamics modeling. ● Training downstream systems: Can be used to train vision-language planners, low-level RL policies, and video-captioning systems - acting as a general data source. ● World-model agenda: A key datapoint for the broader "generative world models for embodied AI" research agenda that accelerated through 2024. |
Paper, Tweet |
| 3) Survey on Factuality in LLMs - A survey covering evaluation and enhancement techniques for LLM factuality. ● Evaluation taxonomy: Organizes factuality evaluation by granularity (token, sentence, passage), task (QA, generation, dialogue), and reference availability. ● Enhancement taxonomy: Reviews enhancement techniques including better training data, retrieval augmentation, factuality-aware decoding, and post-hoc verification. ● Factuality vs. truthfulness: Clarifies the often-confused distinction between factuality (correct facts) and truthfulness (model reports its beliefs honestly). ● Open problems: Highlights persistent gaps in cross-lingual factuality, open-ended generation factuality, and calibration. |
Paper, Tweet |
| 4) Hypothesis Search (LLMs Can Learn Rules) - A two-stage framework where the LLM learns a rule library for reasoning. ● Rule induction phase: In the first stage, the LLM induces general rules from a small set of examples, producing an explicit rule library rather than implicit pattern matching. ● Rule application phase: In the second stage, the model applies rules from its library to new problems, with explicit rule-lookup rather than end-to-end inference. ● Improves reasoning: The explicit rule library improves reasoning performance on tasks where generalization from examples beats pure in-context learning. ● Interpretability bonus: The learned rule library is human-readable and auditable, providing a window into what the model actually learned from its examples. |
Paper, Tweet |
| 5) Meta Chain-of-Thought Prompting (Meta-CoT) - A generalizable CoT framework that selects domain-appropriate reasoning patterns for the task at hand. ● Task-adaptive CoT: Rather than using a fixed CoT prompt template, Meta-CoT adaptively selects reasoning patterns based on task characteristics. ● Pattern library: Maintains a library of reasoning templates tailored to task families (math, logic, commonsense, etc.), picking the best one per query. ● Strong across tasks: Improves reasoning accuracy across diverse task types compared to single-template CoT prompting. ● Generalizable framework: The Meta-CoT pattern is easy to extend to new task families by just adding new templates to the library. |
Paper, Tweet |
| 6) LLMs for Healthcare Survey - A comprehensive overview of LLMs applied to the healthcare domain. ● Application coverage: Surveys clinical decision support, patient communication, medical summarization, diagnostic assistance, and biomedical research applications. ● Medical-LLM landscape: Reviews major medical LLMs (Med-PaLM, MEDITRON, ClinicalBERT) alongside general-purpose LLMs prompted for medical use. ● Benchmarks: Catalogs medical QA benchmarks and discusses their limitations for predicting real-world clinical usefulness. ● Deployment challenges: Covers regulatory, privacy, and safety challenges specific to healthcare LLM deployment. |
Paper, Tweet |
| 7) RECOMP (Retrieval-Augmented LMs with Compressors) - Proposes two compression approaches to shrink retrieved documents before in-context use. ● Extractive compressor: Selects the most useful sentences from retrieved documents, retaining the most relevant signal at a fraction of token budget. ● Abstractive compressor: Generates a summary synthesizing information from multiple retrieved documents, compressing redundancy across sources. ● 6% compression rate: Achieves compression rates as low as 6% with minimal performance loss on language modeling and open-domain QA. ● Selective augmentation: The training scheme learns to emit empty summaries when retrieved docs are irrelevant - a built-in mechanism for gracefully handling noisy retrieval. |
Paper, Tweet |
| 8) InstructRetro - NVIDIA introduces Retro 48B, the largest LLM pretrained with retrieval at the time. ● 48B scale: Continues pretraining a 43B parameter GPT model on 100B additional tokens while retrieving from a 1.2T-token database. ● Instruction tuning: Further instruction-tunes the retrieval-pretrained model, producing an instruction-following version of Retro. ● Stronger factuality: Shows reduced hallucination and better factuality on knowledge-intensive tasks compared to Retro-free baselines at comparable scale. ● Retrieval pretraining validated: Provides evidence that retrieval-during-pretraining can scale to 40B+ parameters and benefit downstream instruction-tuned use cases. |
Paper, Tweet |
| 9) MemWalker - MemWalker treats the LLM as an interactive agent that traverses a tree-structured summary of long text. ● Tree of summary nodes: Preprocesses long context into a hierarchical tree of summary nodes, compressing and structuring the information. ● Query-driven traversal: Given a query, the LLM traverses the tree through iterative prompting, descending into subtrees that are most relevant to the question. ● Reasoning-based reading: The traversal decisions are reasoning-based, so the model can explain which part of the document it consulted and why. ● Explainability bonus: The traversal trace serves as a human-readable explanation of the model's document reading, improving debuggability of long-context QA. |
Paper, Tweet |
| 10) FireAct (Language Agent Fine-tuning) - Explores fine-tuning LLMs specifically for language-agent use, demonstrating consistent gains over prompting alone. ● Fine-tuning beats prompting: Language agents consistently improve over prompted baselines after fine-tuning their backbone LLM on agent trajectories. ● 500 trajectories suffice: Fine-tuning a Llama 2-7B on just 500 agent trajectories produces a substantially stronger language agent than a prompted GPT-4 on several agent benchmarks. ● Data-efficient: The low data threshold suggests agent behaviors can be cheaply specialized, which matters for production agent deployment. ● Agent-specialization pattern: Anticipates the wave of agent-specialized LLMs released through 2024, where small focused fine-tunes outperform prompting of large general models. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) LLMs Represent Space and Time - MIT researchers find that LLMs internally encode linear representations of space and time across multiple scales. ● Linear geographic representations: Activations contain linear representations of coordinates (latitude, longitude) of real-world entities, detectable via probes. ● Multi-scale time: Similar linear representations exist for time at multiple scales (historical year, news date, etc.), suggesting a structured temporal axis. ● Robust across prompts: The representations are robust to prompt variations and unified across different entity types (cities, events, people). ● World-model evidence: Provides empirical support for the claim that LLMs build literal world models, not just surface-statistics imitators - a live debate in interpretability. |
Paper, Tweet |
| 2) Retrieval Meets Long-Context LLMs - NVIDIA's study comparing RAG and long-context LLMs, with the punchline that the two are complementary rather than substitutes. ● 4K + RAG ≈ 16K fine-tuned: An LLM with only a 4K context window using simple RAG can match a fine-tuned LLM with 16K context - a striking efficiency result. ● Retrieval always helps: Retrieval improves performance regardless of context-window size, even when the model can fit the full document in its native context. ● LLaMA-2 70B beats GPT-3.5: A retrieval-augmented LLaMA 2 70B with 32K context outperforms GPT-3.5-turbo-16k on seven long-context tasks including QA and query-based summarization. ● Implication: Don't think of long context and retrieval as competing solutions - pair them, and let the model attend to both the query and retrieved evidence. |
Paper, Tweet |
| 3) StreamingLLM - MIT's StreamingLLM enables efficient streaming inference by preserving "attention sinks" - early-sequence tokens that most attention mass flows to. ● Attention sink phenomenon: The authors observe that attention heads consistently route a large fraction of attention mass to the first few tokens, even when those tokens are semantically irrelevant. ● Sink tokens are essential: Keeping the KV states of initial tokens around dramatically recovers the performance of sliding-window attention. ● Infinite-length inference: Enables LLMs trained with finite context to generate infinitely long outputs without fine-tuning, by retaining sink tokens plus a sliding window. ● Emergent explanation: Attention sinks appear because the softmax must normalize to one - unused attention mass is "dumped" onto the first tokens, which explains why removing them breaks the model. |
Paper, Tweet |
| 4) Neural Developmental Programs (NDPs) - Proposes neural networks that self-assemble through a developmental process inspired by biological embryonic development. ● Bio-inspired growth: A small set of developmental rules governs how neurons replicate and connect, mirroring the way biological nervous systems grow from genomes. ● Indirect encoding: The final network emerges from a much smaller developmental program rather than being specified directly - an indirect encoding scheme. ● Self-assembly: Networks self-assemble through repeated application of local developmental rules, without a global blueprint. ● Research direction: Positioned as a step toward more open-ended, flexible neural architectures that could eventually grow and adapt throughout training rather than being fixed a priori. |
Paper, Tweet |
| 5) The Dawn of LMMs (GPT-4V Deep Dive) - Microsoft's exhaustive 166-page analysis of GPT-4V's capabilities and limitations. ● Comprehensive task coverage: Probes GPT-4V across visual reasoning, code, OCR, document understanding, multimodal commonsense, and agent-style tasks. ● Working input modes: Catalogs the diverse input patterns GPT-4V supports - single images, multi-image reasoning, image-text interleaving, sketches, and handwritten input. ● Capability frontier: Demonstrates emergent capabilities like reading diagrams, interpreting medical imaging, and extracting structured information from complex visuals. ● Open issues: Identifies persistent weaknesses including hallucination, fine-grained spatial reasoning, and consistency across related queries - a reference for what was still broken at the start of the GPT-4V era. |
Paper, Tweet |
6) Training LLMs with Pause Tokens - CMU shows that adding a learnable <pause> token during both pretraining and fine-tuning gives the model extra "thinking time" and improves reasoning. ● Learnable pause token: Inserts a <pause> token into the input; the model processes these tokens but doesn't treat them as meaningful content, letting it compute more before answering. ● CommonsenseQA and math gains: Produces measurable performance gains on CommonsenseQA and math word problems - both tasks that benefit from extra internal computation. ● Pretraining is required: The benefit only materializes if pauses are introduced in both pretraining and fine-tuning - adding them only at inference doesn't work. ● Compute-aware decoding: Positions pause tokens as a simple inference-time knob for trading compute against accuracy, foreshadowing many 2024 "thinking time" tricks. |
Paper, Tweet |
| 7) Self-Taught Optimizer (STOP) - Proposes recursively self-improving code generation where an LLM-scaffolded program improves itself. ● Seed improver: A "seed improver" program first improves an input program to return the best solution found - a self-improvement scaffold built on GPT-4. ● Recursive improvement: The seed improver is itself tasked with improving itself, producing the first concrete demonstration of recursive self-improvement in LLM code generation. ● GPT-4 capable: Shows that GPT-4 models can write code that modifies itself iteratively, producing measurably better scaffolds than the initial seed. ● Foundational work: An early, influential demonstration of the LLM-as-code-modifier pattern that would reappear across 2024 in agent and tool-use research. |
Paper, Tweet |
| 8) RA-DIT (Retrieval-Augmented Dual Instruction Tuning) - Meta's RA-DIT is a lightweight recipe that retrofits LLMs with retrieval capabilities through dual fine-tuning. ● Two-stage fine-tuning: Stage 1 updates the LM to better use retrieved information; stage 2 updates the retriever to return documents the LM actually prefers. ● Each stage adds gains: Both stages contribute meaningfully and combine to produce strong downstream RAG performance without end-to-end joint training. ● 65B SoTA: The 65B model achieves state-of-the-art on a range of knowledge-intensive zero-shot and few-shot benchmarks. ● Strong relative gains: Outperforms existing retrieval-augmented approaches by up to +8.9% in zero-shot and +1.4% in 5-shot settings - non-trivial gains on already-strong baselines. |
Paper, Tweet |
| 9) KOSMOS-G - Microsoft's KOSMOS-G extends zero-shot image generation to multi-image vision-language input. ● Generalized VL input: Generates images from a vision-language prompt that can include multiple reference images, unlike typical single-reference setups. ● Multi-entity scenarios: Extends zero-shot subject-driven image generation to scenarios with multiple subjects - e.g., generating a scene where A is doing X to B, preserving each identity. ● CLIP-replaceable: Allows replacing CLIP in downstream image-generation pipelines, unlocking new applications with U-Net techniques like ControlNet and LoRA. ● Unified generation interface: Positions itself as a unified vision-language input interface for controllable image generation, rather than a new diffusion backbone. |
Paper, Tweet |
| 10) Analogical Prompting - Google's Analogical Prompting guides LLM reasoning by having the model self-generate relevant exemplars on the fly. ● Self-generated exemplars: Rather than requiring curated few-shot demonstrations, the model is prompted to recall or generate relevant analogous problems before solving the target question. ● Analogical-reasoning inspiration: Draws on the cognitive-science concept of analogical reasoning, where humans solve new problems by invoking similar past cases. ● No labeled exemplars needed: Unlike CoT, which requires demonstrations of the reasoning process, Analogical Prompting requires no labeled reasoning data at all. ● Benchmark gains: Improves over standard CoT and zero-shot baselines across math, commonsense, and code reasoning tasks, with particularly strong gains on math word problems. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) The Reversal Curse - Finds that LLMs trained on "A is B" fail to generalize to "B is A" - a surprisingly deep failure of learning. ● Asymmetric fact learning: LLMs finetuned on statements of the form "A is B" show no ability to answer "Who is B?" with A, even after extensive training. ● Fictitious-statement testbed: Demonstrates the effect using fine-tuning on fictitious statements, so training data can't contribute the reverse direction through coincidence. ● Model-family robust: The Reversal Curse persists across different model sizes and model families, suggesting it reflects a fundamental property of next-token prediction training. ● Knowledge representation implication: Raises hard questions about how LLMs represent knowledge - they clearly don't store bidirectional relations by default, unlike symbolic knowledge bases. |
Paper, Tweet |
| 2) Effective Long-Context Scaling (Meta) - Meta proposes a 70B long-context LLM that surpasses GPT-3.5-turbo-16k on long-context benchmarks. ● Continual pretraining recipe: Uses continual pretraining on long documents to extend Llama 2's context window efficiently, without training a new model from scratch. ● Beats GPT-3.5-turbo-16k: The 70B variant outperforms GPT-3.5-turbo-16k on a suite of long-context tasks including document QA, summarization, and multi-hop reasoning. ● Cost-effective instruction tuning: Introduces an instruction-tuning procedure that doesn't require human-annotated long-instruction data - a common bottleneck for long-context fine-tuning. ● Open release: Produces an open long-context Llama 2 variant, making strong long-context capability accessible to the research community. |
Paper, Tweet |
| 3) Graph Neural Prompting (GNP) - A plug-and-play method that injects knowledge-graph information into frozen pretrained LLMs. ● KG-to-embedding bridge: Uses a graph neural network to encode relevant knowledge-graph subgraphs into a soft prompt embedding that conditions the LLM. ● Frozen-LLM compatible: Works with frozen pretrained LLMs without requiring any fine-tuning, making it cheap to adopt. ● Commonsense gains: Improves performance on commonsense QA benchmarks where structured knowledge-graph information is known to help. ● Modular extensibility: The GNN-encoded soft-prompt pattern generalizes beyond KGs to any structured input that can be encoded into embeddings. |
Paper, Tweet |
| 4) Vision Transformers Need Registers - Meta researchers identify artifact tokens in ViT feature maps and propose a trivial fix: add dedicated register tokens. ● Artifact identification: Vision transformers repurpose certain input tokens as "internal scratch space", producing high-norm artifacts that contaminate feature maps. ● Register tokens: Adds a small number of dedicated register tokens to the input sequence, giving the model explicit scratch space instead of co-opting patch tokens. ● Cleaner features: The fix produces substantially smoother feature and attention maps, with the artifact tokens disappearing. ● New SoTA on dense tasks: Sets new state-of-the-art results on dense visual prediction tasks (segmentation, depth, object discovery), with real downstream impact. |
Paper, Tweet |
| 5) Boolformer - The first Transformer trained to perform end-to-end symbolic regression of Boolean functions. ● End-to-end symbolic regression: Directly predicts compact Boolean formulas from input-output examples, skipping the typical search-over-programs loop of symbolic regression. ● Handles complex functions: Produces compact formulas for complex Boolean functions that traditional symbolic-regression methods struggle to compress. ● Gene regulatory networks: Applied to modeling the dynamics of gene regulatory networks, providing a concrete real-world application beyond synthetic benchmarks. ● Transformer-as-symbolic-learner: Extends the "Transformer as symbolic regression engine" line started by earlier work on equation discovery, covering the discrete-logic case. |
Paper, Tweet |
| 6) LLaVA-RLHF - Adapts factually augmented RLHF to aligning large multimodal models, reducing hallucination without falling into reward-hacking pitfalls. ● Factually augmented RLHF: Augments the reward model with factual-consistency signals (e.g., grounded-in-image checks), reducing the reward hacking common in vanilla multimodal RLHF. ● Hallucination reduction: Produces meaningful reductions in hallucination on multimodal benchmarks compared to SFT-only or vanilla RLHF variants. ● 94% of text GPT-4: Reaches 94% of the performance level of text-only GPT-4 on LLaVA-Bench - closing a substantial gap via alignment alone. ● Open recipe: Releases the full training recipe so the multimodal RLHF approach can be applied to other open VLMs. |
Paper, Tweet |
| 7) LLM Alignment Survey - A comprehensive survey of LLM alignment research spanning theoretical foundations to adversarial pressure. ● Outer and inner alignment: Distinguishes outer alignment (specifying the right objective) from inner alignment (ensuring the model actually pursues that objective). ● Mechanistic interpretability: Reviews interpretability as an alignment tool, covering circuits, activation patching, and probing approaches. ● Adversarial pressure: Catalogs known attacks on aligned LLMs including jailbreaks, prompt injection, and reward hacking. ● Evaluation and directions: Discusses alignment evaluation methodologies and open problems, including scalable oversight for future systems beyond human capability. |
Paper, Tweet |
| 8) Qwen - Alibaba releases the Qwen family of open LLMs with strong tool-use and planning capabilities for language agents. ● Open model family: Ships with multiple sizes (7B, 14B, 72B) and both base and chat variants, covering a wide range of downstream needs. ● Tool use and planning: Emphasizes tool use and planning capabilities through targeted RLHF training for agentic tasks. ● Agent-ready: Comes with agent-specific RLHF data and recipes that would inform the Qwen-Agent releases through 2024. ● Multilingual strength: Strong on Chinese alongside English, filling a gap in the open-LLM landscape previously dominated by English-centric releases. |
Paper, Tweet |
| 9) MentaLLaMA - An open-source LLM family specialized for interpretable mental-health analysis on social media. ● Mental-health focus: Fine-tuned specifically for mental-health analysis tasks including depression, anxiety, and stress detection in social media text. ● Instruction-following: Supports instruction-following interfaces, letting clinicians and researchers query the model in natural language rather than via fixed classifiers. ● 105K instruction dataset: Releases a multi-task, multi-source interpretable mental-health instruction dataset with 105K samples. ● Interpretability-first: Emphasizes interpretable predictions rather than black-box classification, important for downstream clinical or research use. |
Paper, Tweet |
| 10) Logical Chain-of-Thought (LogiCoT) - A neurosymbolic framework that verifies and revises zero-shot CoT reasoning using symbolic-logic principles. ● Symbolic-logic verification: Applies principles from symbolic logic to verify whether each step of a CoT reasoning chain is internally consistent. ● Revision loop: When the verifier detects an inconsistency, the model revises the reasoning step before continuing, preventing error propagation. ● Zero-shot: Works zero-shot without requiring labeled examples of logical reasoning - the verifier is symbolic rather than learned. ● Reasoning gains: Improves CoT reasoning on logical-reasoning benchmarks where vanilla CoT tends to produce fluent but invalid chains. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) AlphaMissense - DeepMind's AlphaMissense is an AI model that classifies missense genetic variants as pathogenic or benign at genome scale. ● 71M variants classified: Categorizes 89% of all 71 million possible missense variants as either likely pathogenic or likely benign, producing a comprehensive human-genome catalog. ● Disease-cause identification: Helps pinpoint the molecular cause of genetic diseases, where missense variant interpretation is a known bottleneck in clinical genetics. ● AlphaFold lineage: Builds on the AlphaFold family's protein-structure understanding, leveraging structural context to assess variant impact. ● Open catalog: The full catalog is released to accelerate research in rare-disease diagnosis and drug target discovery. |
Paper, Tweet |
| 2) Chain-of-Verification (CoVe) - Meta's Chain-of-Verification adds a "deliberation" step where the LLM fact-checks its own draft before finalizing. ● Four-step pipeline: (1) Draft an initial response; (2) plan verification questions for fact-checking; (3) answer each verification question independently; (4) generate a final verified response. ● Independent verification: Each verification question is answered independently to avoid bias from other responses, producing more reliable fact-checks than joint answering. ● Hallucination reduction: Produces measurable hallucination reductions on long-form QA tasks compared to standard and CoT prompting. ● Self-correction pattern: Influential example of the "LLM as its own critic" pattern, foreshadowing many 2024 self-refinement techniques. |
Paper, Tweet |
| 3) Contrastive Decoding for Reasoning - Shows that contrastive decoding, a simple inference-time technique, substantially improves reasoning in large LLMs. ● Contrastive decoding: Subtracts the log-probabilities of a smaller "expert" model from those of the target LLM, boosting tokens where the larger model confidently differs from the smaller one. ● Llama 65B beats Llama 2: Contrastive decoding lets Llama 65B outperform Llama 2 and other strong baselines on commonsense and reasoning benchmarks. ● Training-free: Requires no additional training - just a smaller model available at inference time and a modified decoding rule. ● Generalizable lever: Positions contrastive decoding as a simple, cheap lever for reasoning improvement that can complement other prompting or fine-tuning techniques. |
Paper, Tweet |
| 4) LongLoRA - An efficient LoRA-based fine-tuning recipe for extending LLM context windows without expensive full fine-tuning. ● Shift short attention: Uses "shift short attention" during training, a pattern-shifted sparse approximation that mimics full attention while cutting cost. ● LoRA-compatible: Works with standard LoRA, making it compatible with the existing parameter-efficient fine-tuning ecosystem. ● Lower GPU cost: Dramatically reduces GPU memory and training time compared to full fine-tuning for context extension. ● No accuracy compromise: Achieves comparable accuracy to full fine-tuning at extended context lengths, despite using a much cheaper approximation. |
Paper, Tweet |
| 5) Struc-Bench (LLMs for Structured Data) - Studies how LLMs handle complex structured-data generation and proposes a structure-aware fine-tuning method. ● Structured data challenge: Tests LLMs on generating complex structured data (HTML tables, JSON, LaTeX) where surface-form correctness matters. ● Structure-aware fine-tuning: Proposes a fine-tuning recipe specifically designed to teach small models the syntactic constraints of structured outputs. ● 7B beats GPT-4: A fine-tuned Llama 7B significantly outperforms GPT-3.5/4 and Vicuna-13B on structured-data generation benchmarks. ● Deployment relevance: Demonstrates that for production structured-output applications, small specialized models can beat frontier general-purpose models at a fraction of the cost. |
Paper, Tweet |
| 6) LMSYS-Chat-1M - LMSYS releases a large-scale dataset of 1 million real-world LLM conversations collected from the Vicuna demo and Chatbot Arena. ● 1M conversations: Comprises 1 million real-world conversations across 25 state-of-the-art LLMs, a uniquely broad snapshot of how people actually use chat models. ● 210K unique users: Collected from 210K unique IP addresses, giving a diverse user sample rather than a curated research group. ● Real-world use cases: Captures natural usage patterns - coding help, writing, exploration, role-play - across many topics and languages. ● Research resource: Opens up research directions in LLM evaluation, preference modeling, and usage-pattern analysis that were previously gated by data scarcity. |
Paper, Tweet |
| 7) Language Modeling Is Compression - DeepMind empirically revisits the theoretical equivalence between prediction and compression, applied to modern LLMs. ● Theoretical equivalence: Reminds that optimal compression and optimal prediction are duals - a good language model is implicitly a powerful compressor. ● ImageNet compression: Chinchilla 70B compresses ImageNet patches to 43.4% of raw size, better than domain-specific codecs like PNG. ● LibriSpeech compression: Compresses LibriSpeech samples to 16.4% of raw size, beating FLAC and gzip on audio data despite never being trained on audio. ● Cross-modal generalization: Shows LLMs work as general-purpose compressors across text, image, and audio - a striking demonstration of in-context learning's reach. |
Paper, Tweet |
| 8) Compositional Foundation Models (HiP) - Proposes foundation models that compose multiple expert foundation models trained on different modalities to solve long-horizon goals. ● Hierarchical planning: Uses separate foundation models for language (high-level plans), vision (grounding), and action (execution) that compose into a hierarchical planner. ● Long-horizon goals: Targets goals requiring dozens of subgoals - a regime where monolithic policies typically fail. ● Training-free composition: Composes existing pretrained models at inference time without joint training, dramatically reducing the compute cost of long-horizon agents. ● Robotics relevance: Demonstrates the approach on robotic manipulation tasks, pointing toward practical long-horizon embodied-AI systems. |
Paper, Tweet |
| 9) OWL (LLMs for IT Operations) - Proposes OWL, an LLM specialized for IT operations through self-instruct fine-tuning on IT-specific tasks. ● IT operations focus: Targets IT-specific tasks including log analysis, incident diagnosis, config-file manipulation, and automated operations. ● Self-instruct dataset: Uses a self-instruct strategy grounded in real IT tasks to construct a high-quality instruction dataset from scratch. ● IT benchmark: Introduces a benchmark for evaluating LLMs on IT operations tasks, filling a gap left by general-purpose LLM benchmarks. ● Enterprise deployment: Positions LLMs as practical assistants for IT operators rather than just developer copilots. |
Paper, Tweet |
| 10) KOSMOS-2.5 - Microsoft's KOSMOS-2.5 is a multimodal model purpose-built for "machine reading" of text-intensive images. ● Text-rich image input: Specialized for documents, forms, receipts, and other images dominated by text rather than natural-scene imagery. ● Document-level generation: Capable of document-level text generation from images, handling layout-aware reading order and structure. ● Image-to-markdown: Converts complex text-rich images directly into Markdown output, preserving headings, lists, and tables. ● Complements KOSMOS-1/2: Extends the KOSMOS family toward document intelligence, a domain where general VLMs had weaker performance. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Textbooks Are All You Need II (phi-1.5) - Microsoft's phi-1.5 demonstrates that a 1.3B model trained on "textbook-quality" synthetic data rivals much larger models on reasoning. ● Small but capable: A 1.3B parameter model trained on only 30B tokens competes or outperforms much larger open models on reasoning tasks. ● Synthetic textbook data: Training data consists of AI-generated "textbook-quality" content, deliberately curated for pedagogical clarity rather than web breadth. ● Data quality dominates: Suggests that data quality and pedagogical structure matter more for reasoning emergence than raw parameter count - a provocative counter to pure-scaling narratives. ● Phi-family kickoff: Establishes the recipe that the phi-2, phi-3, and phi-4 releases would refine, popularizing synthetic-data-heavy small LLM training. |
Paper, Tweet |
| 2) The Rise and Potential of LLM-Based Agents - A comprehensive survey of LLM-based agents covering construction, capability, and societal implications. ● Agent architecture: Organizes the space by core agent components - perception, brain (planning, memory, reflection), and action - giving a clean compositional view. ● Single-agent vs. multi-agent: Reviews both single-agent systems and multi-agent societies, covering coordination patterns and emergent behaviors. ● Application landscape: Catalogs the applications where LLM agents were showing promise at the time, from software engineering to scientific research to social simulation. ● Societal implications: Dedicated discussion of "harnessing agents for good" - safety, alignment, and governance considerations specific to agent deployment. |
Paper, Tweet |
| 3) EvoDiff - Microsoft's EvoDiff combines evolutionary-scale protein data with diffusion models for controllable protein generation in sequence space. ● Sequence-space diffusion: Operates directly in protein-sequence space rather than structure space, enabling generation of proteins that structure-based models can't reach. ● Evolutionary-scale training: Trains on massive evolutionary protein datasets, leveraging the diverse biological sequence space as learning signal. ● Controllable generation: Supports conditional generation on function, family, or motif constraints, giving researchers practical design levers. ● Beyond structure-based models: Generates proteins that are inaccessible to structure-based generators (e.g., those without well-defined folds), expanding the design space. |
Paper, Tweet |
| 4) Rewindable Auto-regressive INference (RAIN) - Shows that unaligned LLMs can produce aligned responses at inference time via self-evaluation and rewinding. ● No fine-tuning needed: Produces human-preference-aligned responses from unaligned base LLMs without any additional fine-tuning. ● Self-evaluation: The LLM evaluates its own in-progress generation against alignment criteria, flagging problematic paths. ● Rewind mechanism: When self-evaluation detects a problematic direction, the model rewinds and regenerates - an inference-time search strategy. ● Practical alignment: Offers a lightweight alignment pattern for cases where fine-tuning isn't feasible (e.g., API-only models or rapid policy iteration). |
Paper, Tweet |
| 5) Robot Parkour Learning - Stanford's Robot Parkour system learns end-to-end vision-based parkour policies that transfer to a quadrupedal robot. ● Vision-based parkour: Learns policies from an egocentric depth camera that let a quadruped execute real parkour skills like jumping gaps and climbing obstacles. ● Sim-to-real transfer: Trained in simulation and transferred to a physical low-cost robot, demonstrating successful sim-to-real in a challenging contact-rich domain. ● Skill selection: The policy automatically selects and sequences appropriate parkour skills based on terrain observed in real time. ● Low-cost hardware: Runs on commodity quadruped hardware, making advanced mobile behaviors accessible to smaller labs - a recurring pattern through 2023 robotics. |
Paper, Tweet |
| 6) Hallucination Survey (Early) - Classifies hallucination phenomena in LLMs and catalogs evaluation criteria and mitigation strategies. ● Hallucination types: Distinguishes factual hallucinations, logical hallucinations, and contextual hallucinations, showing they require different mitigation approaches. ● Evaluation criteria: Reviews evaluation metrics for detecting and quantifying hallucinations, covering automatic metrics, LLM-as-judge, and human evaluation. ● Mitigation catalog: Organizes mitigation strategies by training stage (pretraining, SFT, RLHF) and inference stage (RAG, decoding, verification). ● Reference snapshot: Captures the state of hallucination research mid-2023, providing a useful anchor for tracking how the field evolved through 2024. |
Paper, Tweet |
| 7) Agents Library - An open-source library for building autonomous language agents with first-class support for planning, memory, tools, and multi-agent communication. ● Full-feature agent framework: Supports planning, long-term memory, tool usage, and multi-agent communication out of the box. ● Multi-agent coordination: Provides primitives for multi-agent societies where agents can communicate, negotiate, and collaborate on tasks. ● Modular design: Agent components are modular and composable, letting researchers swap planners, memory modules, or tool interfaces. ● 2023 agent-framework moment: One of several agent frameworks that emerged in 2023, showing the rapid maturation of the language-agent tooling ecosystem. |
Paper, Tweet |
| 8) Radiology-Llama 2 - A Llama 2-based LLM specialized for radiology report generation. ● Llama 2 base: Fine-tuned on a large dataset of radiology reports, producing a domain-specialized model from an open general-purpose base. ● Clinical impressions: Generates coherent and clinically useful impression statements from structured radiology findings. ● Coherence gains: Outperforms general-purpose LLMs on radiology-specific report-generation tasks, as measured on both automatic metrics and clinician evaluation. ● Domain-LLM template: An early datapoint for the "domain-specialized open LLM" pattern that became standard practice across medicine, law, and other regulated fields. |
Paper, Tweet |
| 9) ChatDev (Communicative Agents for Software Development) - ChatDev is a virtual chat-powered software company where LLM agents take on roles in a waterfall-model dev process. ● Waterfall mirroring: LLM agents play roles (CEO, CTO, programmer, reviewer, tester) in a simulated waterfall software-development process, coordinating through chat. ● End-to-end pipeline: Completes the entire software-development lifecycle from requirements to testing, producing working software artifacts. ● Under $1, under 7 minutes: Generates full software projects in under 7 minutes for less than $1 of API cost - striking cost-efficiency for agent-based development. ● Multi-agent coordination: Demonstrates that simple role-based multi-agent coordination can produce coherent, non-trivial software without heavy scaffolding. |
Paper, Tweet |
| 10) MAmmoTH - An open-source LLM family specialized for general mathematical problem solving. ● Math-specialized models: Trained on a curated math instruction-tuning dataset covering arithmetic, algebra, calculus, and contest-style problems. ● Beats existing open math LLMs: Outperforms prior open-source math LLMs across a range of mathematical reasoning benchmarks at comparable parameter counts. ● CoT + PoT hybrid data: Training data mixes chain-of-thought and program-of-thought traces, teaching the model both natural-language and code-aided reasoning. ● Open family: Released in multiple sizes to let researchers study math-LLM scaling laws in the open-source ecosystem. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Transformers as Support Vector Machines - A theoretical paper establishing a formal connection between self-attention optimization and hard-margin SVM problems. ● Hard-margin SVM connection: Shows the optimization geometry of self-attention in transformers exhibits a direct connection to hard-margin SVM problems. ● Implicit regularization: Gradient descent without early stopping leads to implicit regularization, with attention converging toward SVM-like solutions. ● Theoretical foundation: Provides a rare closed-form theoretical lens on self-attention dynamics, cutting through much of the "transformers as black box" framing. ● Future analysis tool: The SVM connection gives researchers a principled tool to analyze attention convergence, generalization, and feature selection. |
Paper |
| 2) RLAIF (Scaling RLHF with AI Feedback) - Google compares RLHF with RLAIF (Reinforcement Learning from AI Feedback) to test whether AI preferences can replace human preferences. ● Head-to-head comparison: Directly compares the efficacy of human vs. AI feedback for preference-based alignment, using the same policy optimization pipeline. ● ~70% preference: On summarization, human evaluators prefer both RLAIF and RLHF outputs over the baseline SFT model in roughly 70% of cases - statistical parity. ● Scaling studies: Reports optimal settings for AI-feedback generation, including prompt design, chain-of-thought, and label-combining strategies. ● Cost-reduction implication: Suggests RLAIF can substitute for RLHF for many alignment use cases, dramatically reducing the human-labeling cost of alignment. |
Paper, Tweet |
| 3) GPT Solves Math Problems Without a Calculator - Demonstrates that with sufficient training data, even a small language model can perform accurate multi-digit arithmetic. ● 2B model, 100% arithmetic: A 2B language model performs multi-digit arithmetic operations with 100% accuracy, without data leakage or calculator tools. ● GLM-10B on Chinese math: A GLM-10B fine-tuned on multi-step arithmetic and detailed math problems is competitive with GPT-4 on a 5K-sample Chinese math problem test set. ● Data-centric argument: Suggests arithmetic "weakness" in LLMs is largely a data-coverage issue rather than a fundamental architectural limit. ● Tool-free reasoning: Pushes back on the common view that LLMs can never do reliable arithmetic without tool use, with implications for tool-use-vs-internal-computation design choices. |
Paper, Tweet |
| 4) OPRO (LLMs as Optimizers) - DeepMind's OPRO uses LLMs as general-purpose optimizers over natural-language-described problems. ● Natural-language optimization: The optimization problem is described in natural language; the LLM iteratively proposes new solutions conditioned on previously found solutions. ● Prompt optimization: As a key application, optimizes prompts to maximize test accuracy, using previously evaluated prompts as trajectory context. ● Big gains over human prompts: LLM-optimized prompts outperform human-designed prompts on GSM8K and BIG-Bench Hard, sometimes by over 50 percentage points. ● General-purpose pattern: Positions LLMs as general-purpose optimizers for problems that are hard to specify mathematically, including linear regression, traveling salesman variants, and prompt design. |
Paper, Tweet |
| 5) ImageBind-LLM - Shanghai AI Lab's ImageBind-LLM brings six-modality understanding to LLMs via the ImageBind joint embedding space. ● ImageBind backbone: Leverages ImageBind's joint embedding space (covering image, text, audio, depth, thermal, IMU) as a universal multimodal encoder. ● Learnable bind network: Aligns ImageBind's visual encoder with a frozen LLM through a learnable bind network, enabling instruction tuning across modalities. ● Six-modality input: Responds to instructions over audio, 3D point clouds, video, and beyond - not just text and image. ● Generation quality: Maintains high language-generation quality despite the modality diversity, validating the ImageBind-as-bridge approach. |
Paper, Tweet |
| 6) Explaining Grokking - DeepMind advances our understanding of grokking, predicting and confirming two novel phenomena that test their theory. ● Ungrokking: A model can go from perfect generalization back to memorization when trained further on a smaller dataset below a critical threshold - the first demonstration of this reverse effect. ● Semi-grokking: A randomly initialized network trained on the critical dataset size shows a grokking-like transition but partial, rather than the sharp full-grokking curve. ● Theoretical predictions: These behaviors were predicted from theory before being demonstrated empirically - a rare example of predictive rather than post-hoc explanation in deep learning. ● Generalization theory: Advances understanding of when and why neural networks transition from memorization to generalization, bridging empirical observation with principled prediction. |
Paper, Tweet |
| 7) Overview of AI Deception - A survey cataloguing empirical examples of AI systems exhibiting deceptive behavior. ● Empirical catalog: Documents empirical instances of AI deception across game-playing, language models, and economic-simulation systems. ● Learned deception: Shows how deception can emerge as an instrumentally useful strategy even when models aren't directly trained to deceive. ● Risk framing: Organizes deception risks from near-term harms (misinformation, manipulation) to longer-term alignment concerns. ● Research agenda: Calls for dedicated research on deception detection, deception prevention during training, and evaluation frameworks for deceptive behavior. |
Paper, Tweet |
| 8) FLM-101B - A 101B parameter open LLM trainable on a $100K budget through a growth-based training strategy. ● $100K budget for 101B: Trains a 101B model on 0.31TB tokens at a total compute cost of approximately $100K - remarkable for a frontier-scale parameter count. ● Progressive growth strategy: Rather than training 101B from scratch, trains three models sequentially with each larger model inheriting from its smaller predecessor. ● 50%+ cost reduction: The aggressive growth strategy reduces total training cost by more than 50% compared to from-scratch training. ● Open-science contribution: Releases the 101B model, providing a transparent reference for how far careful training-strategy design can stretch a limited budget. |
Paper, Tweet |
| 9) Cognitive Architectures for Language Agents (CoALA) - Princeton proposes CoALA, a systematic framework for understanding and building language agents. ● Production-system inspiration: Draws on classical cognitive architectures and production systems (Soar, ACT-R) to structure language agents. ● Four-component organization: Agents consist of memory modules, action space, decision procedures, and reasoning - each with specific design choices. ● Unifies recent methods: Catalogs methods for LLM-based reasoning, grounding, learning, and decision-making as instantiations of CoALA components. ● Design-space map: Makes the language-agent design space explicit, helping researchers compare systems and identify underexplored combinations. |
Paper, Tweet |
| 10) Q-Transformer - Google's Q-Transformer is a scalable RL method for training multi-task robotic policies from large offline datasets. ● Offline RL at scale: Trains multi-task policies from large offline datasets combining human demonstrations and autonomously collected robot data. ● Transformer policy: Uses a transformer backbone with Q-learning, bridging the scaling properties of transformers with the data-efficiency of Q-learning. ● Strong robotics performance: Achieves strong performance on a large diverse real-world robotic manipulation task suite - not just simulation. ● Scaling signal for robotics: A significant early demonstration that transformer + Q-learning scales on real-world robot data, pointing toward foundation models for robotic control. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) LLaSM (Large Language and Speech Model) - A combined language-and-speech model trained with cross-modal conversational abilities. ● Cross-modal conversation: Supports speech-and-language instructions seamlessly, enabling more natural interactions than text-only or speech-only systems. ● Instruction-tuned: Fine-tuned on speech-language instruction data, letting users speak prompts and receive responses without a separate ASR step. ● Unified architecture: Uses a single model trained end-to-end rather than a cascade of ASR, LLM, and TTS - reducing error propagation and improving latency. ● Accessibility implication: Positions the unified speech-language approach as a path toward more accessible AI interfaces, particularly for users who prefer voice interaction. |
Paper, Tweet |
| 2) SAM-Med2D - Adapts the Segment Anything Model (SAM) to 2D medical imaging through large-scale medical fine-tuning. ● Medical-domain adaptation: Fine-tunes SAM on a large, diverse collection of 2D medical images spanning multiple anatomies and modalities (CT, MRI, X-ray, ultrasound). ● Comprehensive medical segmentation: Handles organ, lesion, and anatomical-structure segmentation across common imaging modalities. ● Prompt engineering for clinicians: Supports the same point/box/text-prompt interaction paradigm as SAM, making it approachable for clinicians already familiar with SAM. ● Strong medical baseline: Achieves strong performance on medical segmentation benchmarks, showing the SAM-adaptation pattern works well for regulated domains. |
Paper, Tweet |
| 3) Vector Search with OpenAI Embeddings - Argues, via empirical analysis, that dedicated vector databases aren't necessarily required for modern AI-stack search applications. ● Cost-benefit framing: "From a cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern 'AI stack'" - a pointed critique of the vector-DB explosion. ● Existing infrastructure suffices: Shows that widely deployed search infrastructure (Elasticsearch, Lucene) can handle OpenAI embeddings adequately for most applications. ● Performance characterization: Benchmarks OpenAI embeddings on standard retrieval tasks using existing search infrastructure, providing hard numbers. ● Industry pushback: Part of a broader debate about the necessity of specialized vector databases, offering empirical ammunition to the skeptics. |
Paper, Tweet |
| 4) Graph of Thoughts (GoT) - Generalizes Chain-of-Thought and Tree-of-Thought by modeling LLM reasoning as an arbitrary graph. ● Arbitrary graph structure: Represents LLM-generated thoughts as nodes in a graph with arbitrary edges - allowing merging, looping, and non-tree structures. ● Feedback loops: Enables explicit feedback loops where earlier thoughts can be revised based on later exploration - impossible in strictly linear or tree-structured reasoning. ● Network reasoning: The authors call this "network reasoning", treating reasoning as a graph-exploration problem rather than a linear or branching one. ● No model updates: Like CoT and ToT, works purely at prompting level without any model fine-tuning - extending the chain-of-X prompting family. |
Paper, Tweet |
| 5) MVDream - ByteDance's MVDream is a multi-view diffusion model that generates geometrically consistent images from multiple viewpoints given a text prompt. ● Multi-view conditioning: Generates consistent multi-view images by conditioning the diffusion model on camera viewpoint alongside the text prompt. ● 2D diffusion + 3D data: Leverages pretrained 2D diffusion models and a multi-view dataset rendered from 3D assets, combining 2D generalizability with 3D consistency. ● Best of both worlds: Inherits the creativity of 2D diffusion priors while maintaining the geometric coherence required for downstream 3D reconstruction. ● 3D generation foundation: Became a building block for many subsequent text-to-3D pipelines that rely on multi-view-consistent diffusion as a prior. |
Paper, Tweet |
| 6) Nougat - Meta's Nougat is a visual transformer for "Neural Optical Understanding for Academic documents" that converts PDFs to LaTeX/Markdown. ● Academic-document focused: Specifically targets academic PDFs, where equations, tables, and reference formatting challenge general-purpose OCR systems. ● End-to-end visual transformer: A single visual transformer processes PDF page images into structured Markdown/LaTeX directly - no separate OCR + layout pipeline. ● Equation and table extraction: Handles mathematical equations and tables, producing LaTeX-correct output rather than flat text. ● Open release: Released with weights, enabling researchers to turn academic PDF collections into machine-readable corpora for downstream training and analysis. |
Paper, Tweet |
| 7) FacTool - A tool-augmented framework for detecting factual errors in LLM-generated text. ● Tool-augmented detection: Integrates LLMs with external tools (search engines, code executors, calculators) to fact-check generated content. ● Multi-domain coverage: Handles factual errors across knowledge-based QA, code generation, mathematical reasoning, and scientific literature review. ● Component-level analysis: Identifies the necessary components (claim extraction, query generation, evidence retrieval, verification) and shows which matter most. ● Practical recipe: Offers a concrete recipe for integrating fact-checking into LLM pipelines, using off-the-shelf tools rather than bespoke detectors. |
Paper, Tweet |
| 8) AnomalyGPT - Applies large vision-language models to industrial anomaly detection with synthetic data augmentation. ● Synthetic anomaly data: Simulates anomalous images and textual descriptions to generate training data, addressing the scarcity of real anomaly examples in industrial settings. ● Image decoder + prompt learner: Combines an image decoder with a prompt learner to detect and localize anomalies in product images. ● Few-shot ICL: Demonstrates few-shot in-context learning capabilities, adapting to new product types from a handful of examples. ● SoTA on industrial benchmarks: Achieves state-of-the-art performance on standard industrial anomaly-detection benchmarks, validating the VLM approach for manufacturing QA. |
Paper, Tweet |
| 9) FaceChain - Alibaba's FaceChain is a personalized portrait generation framework that produces identity-preserving portraits from just a handful of input photos. ● Few-shot personalization: Generates personalized portraits from only a handful of input images, dramatically reducing the data requirement for identity-preserving generation. ● Customization + perception pipeline: Combines customized image-generation models with face-related perceptual-understanding models for identity preservation. ● Truthful portraits: Produces portraits that preserve identity rather than drifting toward a "generic attractive person" archetype - a common failure of naive fine-tuning. ● Consumer-app friendly: Positioned as a deployable solution for consumer portrait-generation apps, supporting rapid personalization at scale. |
Paper |
| 10) Qwen-VL - Alibaba's Qwen-VL is a large-scale vision-language model family with strong performance across captioning, VQA, and visual localization. ● Broad capability: Handles image captioning, visual QA, visual localization (grounding), and flexible multi-turn visual interaction. ● Multilingual VL: Strong in both Chinese and English for visual tasks, filling a multilingual gap in VLMs predominantly English at the time. ● Visual grounding: Supports bounding-box output for visual grounding, a capability not universally present in early VLMs. ● Open release: Released as open weights, providing a strong open VLM baseline and kicking off the Qwen-VL family that has continued through 2024. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Code Llama - Meta releases Code Llama, a family of code-specialized LLMs built on top of Llama 2. ● Three-tier release: Foundation base models, Python-specialist variants, and instruction-following Code Llama - Instruct models, all in 7B/13B/34B sizes. ● Long context: Supports input contexts up to 100K tokens, enabling whole-repository or long-file code completion and analysis - unusual for open code LLMs at the time. ● Fill-in-the-middle: Includes fill-in-the-middle support, a key capability for editor-integrated use cases like code completion and gap filling. ● Strong HumanEval results: Code Llama - Python 34B reaches ~53% on HumanEval, establishing a strong open baseline for code models that persisted into 2024. |
Paper, Tweet |
| 2) Survey on Instruction Tuning for LLMs - A comprehensive survey of instruction tuning covering methodology, dataset construction, and applications. ● Systematic literature review: Provides a structured taxonomy of instruction-tuning research across datasets, training recipes, and evaluation approaches. ● Dataset construction: Reviews how instruction datasets are assembled - from human-written prompts to model-generated self-instruct and hybrid pipelines. ● Training methodologies: Catalogs SFT, multitask learning, RLHF, and their variants, with a focus on how each technique interacts with instruction-tuning data. ● Open problems: Highlights issues including instruction-data quality, data scaling, multilingual instruction tuning, and evaluation of instruction-following reliability. |
Paper, Tweet |
| 3) SeamlessM4T - Meta's SeamlessM4T is a unified multilingual and multimodal machine-translation system that handles five translation tasks in one model. ● Five tasks, one model: Handles ASR, text-to-text, speech-to-text, text-to-speech, and speech-to-speech translation in a unified architecture. ● 100+ languages: Covers ~100 languages for text and ~36 for speech, dramatically broadening the set of supported language pairs compared to prior systems. ● Unified training: Avoids the cascade of per-task models typical in translation pipelines, reducing error accumulation and improving multilingual generalization. ● Open release: Releases model weights and evaluation code, providing a strong open baseline for multilingual multimodal translation research. |
Paper, Tweet |
| 4) LLMs for Illicit Purposes - A survey cataloguing threats and vulnerabilities arising from LLM deployment. ● Threat taxonomy: Organizes LLM misuse threats into categories including misinformation, cyberattacks, social engineering, and unauthorized content generation. ● Mitigation catalog: Reviews existing mitigation strategies - training-time, inference-time, and system-level defenses - with critical evaluation of each. ● Deployment guide: Functions as a practical guide for building more reliable and robust LLM-powered systems. ● Policy relevance: Contributes to the growing AI-safety policy discourse by organizing abstract risk concerns into a concrete framework. |
Paper, Tweet |
| 5) Giraffe - A family of context-extended Llama and Llama 2 models, along with an empirical study of context-extension techniques. ● Extended contexts: Fine-tuned models with 4K, 16K, and 32K context windows, providing ready-to-use open long-context variants. ● Technique comparison: Systematically compares context-extension methods including positional interpolation, truncation strategies, and attention scaling. ● Practitioner insights: Reports practical findings on which techniques preserve downstream quality at extended contexts - useful for anyone building long-context applications. ● Context-extension recipe: The lessons from Giraffe fed directly into the recipes that would culminate in YaRN and similar approaches later that year. |
Paper, Tweet |
| 6) IT3D - Improves Text-to-3D generation by leveraging explicitly synthesized multi-view images in the training loop. ● Multi-view image supervision: Uses explicitly synthesized multi-view images as additional training signal for 3D generation, beyond standard per-view 2D supervision. ● Diffusion-GAN dual training: Integrates a discriminator alongside the diffusion loss, producing a hybrid Diffusion-GAN training strategy for the 3D models. ● Consistency gains: Improves geometric and photometric consistency across views compared to prior text-to-3D approaches. ● Complements MVDream-style methods: Works well alongside multi-view diffusion priors, pointing toward increasingly sophisticated 2D-to-3D pipelines. |
Paper |
| 7) LLM-Based Autonomous Agents Survey - A comprehensive survey of LLM-based autonomous agents covering construction and applications. ● Agent construction framework: Organizes autonomous agents by profile, memory, planning, and action components - the canonical modular view. ● Application coverage: Reviews applications across social science, natural science, and engineering, showing the breadth of agent use cases in mid-2023. ● Systematic literature review: Covers the explosion of agent papers following ReAct, AutoGPT, and similar early frameworks. ● Evaluation landscape: Discusses evaluation approaches for autonomous agents, a notoriously difficult area compared to static LLM evaluation. |
Paper, Tweet |
| 8) Prompt2Model - CMU's Prompt2Model automates the path from a natural-language task description to a deployable small special-purpose model. ● Prompt-as-specification: Users describe the target task in natural language; the framework produces a small model that can execute it. ● Three-channel pipeline: Automatically assembles training data via dataset retrieval (find relevant existing data), dataset generation (synthesize new data), and model retrieval (find relevant pretrained models). ● Small deployable output: Produces small, efficient models suitable for deployment - not just API wrappers around frontier LLMs. ● Accessibility gain: Lowers the barrier for non-ML practitioners to build task-specific models, abstracting away much of the data-engineering burden. |
Paper, Tweet |
| 9) LegalBench - A collaboratively constructed benchmark for measuring legal reasoning in LLMs. ● 162 tasks: Covers 162 legal-reasoning tasks designed by legal experts, significantly broader than prior legal benchmarks. ● Six reasoning categories: Categorizes tasks across rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-analysis, and issue-spotting. ● Collaborative construction: Built through collaboration with legal practitioners to ensure tasks reflect real legal reasoning rather than generic NLP tasks dressed in legal vocabulary. ● LLM-lawyer evaluation: Provides the first rigorous benchmark for systematically evaluating LLM legal capability - essential for responsible deployment in legal workflows. |
Paper, Tweet |
| 10) Language to Rewards for Robotic Skill Synthesis - Google's Language-to-Rewards uses LLMs to define reward parameters for robotic RL. ● LLM-defined rewards: Uses LLMs to translate natural-language task descriptions into optimizable reward parameters for downstream RL training. ● Real-robot evaluation: Evaluated on a real robot arm, not just in simulation, validating that the approach survives sim-to-real challenges. ● Emergent skills: Complex manipulation skills including non-prehensile pushing emerge from the LLM-specified rewards alone. ● Natural robot programming: Positions natural language as a practical interface for programming robot behaviors without handcrafting reward functions. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Humpback (Self-Alignment with Instruction Backtranslation) - Meta's Humpback automatically generates instruction-tuning data by back-translating web text into plausible instructions. ● Instruction backtranslation: Given a web document, generates a plausible instruction that the document could answer - inverting the typical instruction-data creation direction. ● Four-step pipeline: (1) Fine-tune LLM with small seed data, (2) generate instructions for web docs, (3) self-curate high-quality examples, (4) fine-tune on curated data. ● Tops Alpaca leaderboard: The self-aligned model outperforms all other Llama-based models on the Alpaca leaderboard at the time of release. ● Data abundance: Turns the entire web into potential instruction-tuning data, dramatically expanding the accessible instruction corpus beyond curated human-written datasets. |
Paper, Tweet |
| 2) Platypus - Platypus is a family of fine-tuned and merged LLMs that topped the Open LLM Leaderboard in August 2023. ● LoRA fine-tuning + merging: Describes an efficient process for fine-tuning and merging LoRA modules, demonstrating that careful composition beats monolithic fine-tuning. ● Open-Platypus dataset: Releases a small, highly curated fine-tuning dataset that delivers strong performance with short and cheap training - quality over quantity. ● 5 hours on one A100: A 13B Platypus can be trained on a single A100 GPU using 25K curated questions in roughly 5 hours. ● Leaderboard-topping: Demonstrates that careful data curation and LoRA merging can produce leaderboard-topping open models without massive compute. |
Paper, Tweet |
| 3) Model Compression for LLMs Survey - A survey of recent model-compression techniques applied specifically to LLMs. ● Core technique families: Covers quantization, pruning, knowledge distillation, and architectural compression across training-time and post-training approaches. ● LLM-specific concerns: Addresses unique LLM concerns including long-sequence compression, KV-cache optimization, and retaining reasoning capability under compression. ● Evaluation metrics: Reviews benchmark strategies and evaluation metrics for measuring compressed-LLM effectiveness - not just perplexity but downstream capability preservation. ● Practitioner reference: Functions as a compact reference for teams deciding which compression technique matches their deployment constraints. |
Paper, Tweet |
| 4) GEARS - Stanford's GEARS predicts cellular responses to genetic perturbation using deep learning + a gene-relationship knowledge graph. ● KG-guided prediction: Combines deep-learning models with an explicit gene-relationship knowledge graph, letting the model leverage structured biological priors. ● Combinatorial perturbations: Predicts cellular responses to combinations of perturbations, a harder regime than single-perturbation prediction. ● 40% precision gain: Achieves 40% higher precision than prior approaches when predicting four distinct genetic-interaction subtypes in a combinatorial perturbation screen. ● Drug discovery relevance: Accelerates hypothesis generation in perturbation biology, with direct implications for target discovery and drug development. |
Paper, Tweet |
| 5) Shepherd - Meta's Shepherd is a 7B language model specifically tuned to critique model outputs and suggest refinements. ● Critique-specialized 7B: A 7B parameter model fine-tuned specifically on the task of critiquing LLM responses and suggesting improvements. ● Error identification: Capable of identifying diverse error types - factual, logical, stylistic, safety - and suggesting remedies for each. ● ChatGPT-comparable critiques: Human evaluators judge Shepherd's critiques as similar or preferred to ChatGPT's, despite Shepherd being much smaller. ● Critic-as-a-service: Points toward a deployment pattern where small specialized critic models are paired with larger generation models, a recurring theme in 2024 alignment work. |
Paper, Tweet |
| 6) GPT-4 Code Interpreter for Math - A zero-shot prompting technique for GPT-4 Code Interpreter that dramatically boosts math-reasoning accuracy via code self-verification. ● Code-as-verifier prompting: Explicitly encourages GPT-4 Code Interpreter to use code for self-verification of intermediate and final answers. ● 69.7% on MATH: Achieves 69.7% zero-shot accuracy on the MATH dataset - a 27.5-point improvement over vanilla GPT-4 (42.2%). ● Execution-grounded reasoning: Code execution provides a high-fidelity verification signal that vanilla CoT lacks, reducing hallucinated intermediate steps. ● Tool-use template: Establishes a template for tool-augmented reasoning that would generalize to many later math-LLM recipes. |
Paper, Tweet |
| 7) Teach LLMs to Personalize - A multitask-learning approach for personalized text generation without relying on predefined user attributes. ● Attribute-free personalization: Generates personalized text without predefined attributes like age, profession, or preferences - instead inferring style from user history. ● Multitask learning: Frames personalization as a multitask problem where tasks correspond to different personalization axes, sharing representation across them. ● Generalizable style: Demonstrates that models can adapt to new users with minimal examples when trained with this multitask approach. ● Production relevance: Directly applicable to personalized-assistant and content-generation products where explicit user-profile attributes are impractical or privacy-sensitive. |
Paper, Tweet |
| 8) OctoPack - Hugging Face releases OctoPack, a 4TB dataset of Git commits across 350 programming languages for instruction-tuning code LLMs. ● 4TB commit dataset: Curated dataset of 4 terabytes of Git commits across 350 programming languages, using commit messages as implicit instructions. ● Natural code instructions: Commit messages provide real-world, naturally occurring instructions for code changes - far more authentic than synthetically generated code instructions. ● SoTA without OpenAI outputs: Achieves state-of-the-art performance on HumanEval Python among models not trained on OpenAI outputs. ● HumanEval extension: Extends HumanEval beyond Python generation to include code explanation and code repair tasks, providing richer evaluation coverage. |
Paper, Tweet |
| 9) Outlines (Efficient Guided Generation) - A library for guided LLM text generation that enforces structural constraints with minimal overhead. ● Regex guarantees: Guarantees that generated output matches a specified regular expression, supporting grammar-constrained generation at the token level. ● JSON schema enforcement: Produces output that follows a JSON schema, unlocking reliable structured-output generation without post-hoc parsing retries. ● Fast implementation: Achieves low overhead via efficient state-machine construction and token-mask caching, making constrained decoding practical in production. ● Broad adoption: Became widely used in LLM pipelines where structured output is non-negotiable - function calling, tool use, API output, and data extraction. |
Paper, Tweet |
| 10) Bayesian Flow Networks (BFN) - Introduces a new class of generative models that combine Bayesian inference with deep learning. ● Parameters, not noisy data: BFNs operate on parameters of a data distribution rather than on a noisy version of the data itself - a fundamental architectural departure from diffusion models. ● Unified data types: Adapts to continuous, discretized, and discrete data with minimal changes to the training procedure - unlike diffusion variants that need per-modality engineering. ● Competitive with diffusion: Achieves competitive or better likelihood on image, text, and discrete-data benchmarks compared to diffusion baselines. ● Research direction: Opens a new family of generative models with distinct theoretical properties, attracting follow-up work through 2024. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) D-Bot (LLMs as Database Administrators) - Introduces D-Bot, an LLM-based framework that continuously acquires database-administration knowledge from textual sources. ● Knowledge detection: Automatically detects database-maintenance knowledge from documentation and tool outputs, continuously updating its operational knowledge base. ● Tree-of-thought diagnosis: Uses tree-of-thought reasoning for root-cause analysis of database performance and reliability issues. ● Multi-LLM collaboration: Collaborative diagnosis among multiple LLMs yields better root-cause identification than single-model analysis. ● DBA augmentation: Positions LLMs as augmenting DBAs rather than replacing them, with concrete value on knowledge retrieval and diagnostic reasoning. |
Paper, Tweet |
| 2) Political Biases in NLP Models - Develops methods to measure political and media biases in LLMs and their downstream effects. ● Bias measurement methodology: Introduces measurement techniques for political and media biases in LLMs that can be applied across models and over time. ● Downstream bias propagation: Studies how biases in pretrained LLMs propagate to downstream NLP models fine-tuned on top of them. ● Political leanings detected: Finds that LLMs exhibit measurable political leanings that reflect and reinforce polarization patterns in their training corpora. ● Fairness implications: Provides empirical ammunition for discussions of LLM fairness, deployment in politically sensitive contexts, and bias-mitigation research. |
Paper, Tweet |
| 3) AgentBench - Tsinghua's AgentBench is a multidimensional benchmark for LLM-as-Agent reasoning and decision-making across 8 environments. ● Multi-environment design: Tests agents across 8 diverse environments including web browsing, operating systems, databases, and games - capturing breadth of agent demands. ● Open vs. commercial gap: Reveals a significant performance gap between top commercial LLMs (GPT-4) and open-source models on agent tasks. ● Open-source lags: Open-source LLMs lag substantially on AgentBench, exposing a gap that subsequent open-agent fine-tuning efforts targeted. ● GPT-4 shows potential: GPT-4's performance demonstrates that frontier models can support continuously learning agents, even if they're not there yet. |
Paper, Tweet |
| 4) Studying LLM Generalization with Influence Functions - Anthropic scales influence functions to LLMs up to 52B parameters to investigate generalization patterns. ● Efficient scaling: Introduces computational tricks that make influence-function analysis tractable on LLMs with up to 52 billion parameters - a massive scale-up from prior work. ● Cross-lingual generalization: Finds evidence of cross-lingual generalization, where training examples in one language influence predictions in another. ● Middle-layer abstraction: Middle layers of the network appear responsible for the most abstract generalization patterns, supporting emerging interpretability narratives. ● Alignment implications: Influence-function analysis gives alignment researchers a new tool for understanding which training data drives which model behaviors. |
Paper, Tweet |
| 5) NeuroImagen - Reconstructs visual stimuli images from EEG signals using latent diffusion, opening new windows into visually-evoked brain activity. ● EEG-to-image reconstruction: Reconstructs high-resolution visual stimuli images from EEG signals recorded while subjects viewed those images. ● Latent diffusion pipeline: Uses a latent diffusion model conditioned on EEG features, inheriting the high-fidelity generation capabilities of diffusion priors. ● Non-invasive BCI: EEG is non-invasive and comparatively cheap, making this approach more practical for real-world brain-computer interface research than fMRI-based alternatives. ● Cognitive-science bridge: Provides a new tool for studying visual cognition, complementing and extending earlier fMRI-decoding work. |
Paper, Tweet |
| 6) SynJax - DeepMind's SynJax is a JAX-based library for efficient vectorized inference in structured distributions. ● Vectorized structured inference: Provides efficient vectorized implementations of inference algorithms for structured distributions - tagging, segmentation, trees - on modern hardware. ● Supported structures: Covers constituency trees, dependency trees, spanning trees, tagging, and segmentation - the workhorses of structured prediction. ● Differentiable models: Enables building large-scale differentiable models that explicitly represent structure in data, bridging classical NLP and deep learning. ● Hardware-friendly: JAX backend lets researchers run structured-inference models at scale on accelerators, unblocking research that had been stuck on CPU speeds. |
Paper, Tweet |
| 7) Synthetic Data Reduces Sycophancy - Google shows that fine-tuning on simple synthetic data can significantly reduce LLM sycophancy. ● Sycophancy problem: Sycophancy occurs when LLMs align their responses with perceived user views even when those views are factually incorrect. ● Synthetic anti-sycophancy data: Constructs simple synthetic examples where the correct answer contradicts the user's stated view, then fine-tunes models on them. ● Meaningful reduction: Fine-tuning on this synthetic data measurably reduces sycophantic behavior without degrading overall helpfulness. ● Broader lesson: Offers a cheap, targeted intervention for a specific alignment failure mode - a template for addressing other narrow failure modes through targeted synthetic data. |
Paper, Tweet |
| 8) PUG (Photorealistic Unreal Graphics) - Meta's PUG uses Unreal Engine to generate photorealistic, semantically controllable synthetic datasets for vision research. ● Unreal-powered synthesis: Leverages Unreal Engine's photorealistic rendering to produce high-fidelity synthetic training images with precise semantic control. ● Controllable semantics: Researchers can specify scene content, lighting, camera angles, and object configurations, making targeted ablations possible. ● Democratizing synthetic data: Lowers the barrier to photorealistic synthetic data generation, previously limited to groups with custom rendering pipelines. ● Rigorous evaluation: Enables more rigorous evaluations of vision-model robustness to controlled distribution shifts - lighting, occlusion, pose - than natural data allows. |
Paper, Tweet |
| 9) LLMs for HVAC Control - Microsoft applies LLMs to industrial control tasks (HVAC for buildings), comparing against RL baselines. ● Demonstration selection: Develops a recipe for selecting demonstrations and generating high-performing prompts for industrial control tasks. ● GPT-4 ≈ RL: GPT-4 performs comparably to specialized RL methods on HVAC control, despite being a general-purpose model. ● Lower technical debt: Uses dramatically fewer samples and avoids the operational complexity of training and maintaining a dedicated RL policy. ● Practical implication: Suggests LLMs can substitute for RL in many control tasks where sample efficiency and maintenance matter more than peak performance. |
Paper, Tweet |
| 10) Trustworthy LLMs - Presents a comprehensive framework of categories for assessing LLM trustworthiness. ● Seven-dimensional framework: Covers reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. ● Aligned models advantage: Aligned models perform better on trustworthiness dimensions, but alignment effectiveness varies dramatically across dimensions. ● Sub-category detail: Each top-level dimension is broken into measurable sub-categories, making the framework operational for evaluation rather than just conceptual. ● Evaluation tooling: Positioned as a foundation for systematic trustworthiness evaluation - a precursor to later trust-specific benchmarks like TrustLLM. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Open Problems and Limitations of RLHF - A comprehensive survey of open problems and fundamental limitations of RLHF as an alignment approach. ● Scope: Catalogs issues across the entire RLHF pipeline - preference data collection, reward modeling, policy optimization, and evaluation. ● Fundamental limitations: Discusses issues that can't be solved by incremental engineering alone, including the difficulty of specifying human preferences completely. ● Reward hacking taxonomy: Organizes the many varieties of reward hacking seen in practice, from sycophancy to specification gaming. ● Research agenda: Argues for investment in alignment approaches beyond RLHF that can address its structural limitations - a precursor to DPO and related methods. |
Paper, Tweet |
| 2) Med-Flamingo - Stanford's Med-Flamingo is a multimodal medical model supporting in-context learning for few-shot medical visual QA. ● Medical ICL: Supports in-context learning for medical visual QA, letting clinicians specialize the model via examples at inference time rather than fine-tuning. ● Physician evaluation: Physician evaluators rate Med-Flamingo's responses up to 20% higher than baseline multimodal models - a significant clinical quality improvement. ● Hallucination concerns: Authors transparently report occasional low-quality generations and hallucinations, a necessary caveat for medical deployment. ● Clinical-deployment template: Sets a template for responsible medical VLM development - physician-in-the-loop evaluation alongside automatic metrics. |
Paper, Tweet |
| 3) ToolLLM - Tsinghua's ToolLLM enables LLMs to interact with 16,000+ real-world APIs through a comprehensive framework for tool-using LLMs. ● 16K APIs: Covers 16,000+ real-world APIs - orders of magnitude more than prior tool-use benchmarks, capturing the real diversity of modern API ecosystems. ● Full-stack framework: Includes data preparation, training methodology, and evaluation infrastructure - a complete open stack for tool-use research. ● ToolLLaMA hits ChatGPT-16k: The authors' ToolLLaMA model matches ChatGPT (turbo-16k) on tool-use benchmarks, showing open models can close the gap. ● Tool-use research foundation: Became a standard reference point for tool-use research, influencing how tool datasets and benchmarks were structured through 2024. |
Paper, Tweet |
| 4) Skeleton-of-Thought (SoT) - Microsoft's Skeleton-of-Thought parallelizes LLM generation by first producing an answer skeleton then filling it in concurrently. ● Two-stage generation: First generates an answer skeleton outlining the response structure, then fills in each skeleton point through parallel API calls. ● 2.39x speedup: Achieves up to 2.39x speedup over sequential decoding by exploiting the independence of skeleton points. ● Quality improvements: Besides the speedup, reports quality improvements on some tasks - structure-first generation can produce more coherent long responses. ● Applicability: Works best for list-style or outline-style responses where the skeleton decomposition is natural, less so for tightly coupled prose. |
Paper, Tweet |
| 5) MetaGPT - MetaGPT is a multi-agent framework that encodes standardized operating procedures (SOPs) for complex problem solving. ● SOP-encoded workflows: Encodes human standardized operating procedures into agent workflows, imposing structure rather than letting agents improvise. ● Multi-agent roles: Agents take on well-defined roles (PM, engineer, architect, QA, etc.) mirroring real software-development team structures. ● Multifaceted capability: Handles software development, code generation, and data analysis - a broader scope than ChatDev's software-focus. ● Tool integration: Integrates with tools like AutoGPT and LangChain, slotting into the broader agent-framework ecosystem rather than replacing it. |
Paper, Tweet |
| 6) OpenFlamingo - An open-source family of autoregressive vision-language models spanning 3B to 9B parameters. ● Open reproduction: A faithful open-source reproduction of DeepMind's closed Flamingo, enabling research groups to build on the architecture. ● Size range: Covers 3B to 9B parameters, offering multiple sizes for researchers with varying compute budgets. ● Training data + eval suite: Releases the training data and evaluation suite alongside models, providing a complete reproducible stack. ● Open VLM foundation: Became a widely used starting point for open VLM research through 2023-2024. |
Paper, Tweet |
| 7) The Hydra Effect - DeepMind shows that language models exhibit self-repairing behavior when attention heads are ablated. ● Self-repair phenomenon: Ablating a layer of attention heads causes a later layer to take over the ablated layer's function - a previously unknown redundancy property. ● Interpretability implications: Complicates interpretability work based on ablation - removing a component doesn't necessarily isolate its contribution if other components compensate. ● Circuit-level redundancy: Suggests transformer circuits have built-in redundancy that is activated under ablation, analogous to biological neural networks. ● Research-method correction: Forces a rethinking of causal-mediation experiments in mechanistic interpretability, since ablations alone understate components' true contributions. |
Paper, Tweet |
| 8) Self-Check - Explores LLM capacity for self-checking on complex reasoning tasks requiring multi-step and non-linear thinking. ● Zero-shot verification: Proposes a zero-shot verification scheme that recognizes errors in its own reasoning without external tools or references. ● Weighted voting improvement: Applying self-check scores as weights in majority voting improves QA performance over standard CoT self-consistency. ● Math word problems: Demonstrates improved accuracy on math word problems - tasks that benefit most from catching intermediate-step errors. ● Self-critique groundwork: An early contribution to the self-critique literature that would mature through 2024 into Constitutional AI-style and debate-style methods. |
Paper, Tweet |
| 9) Dynalang (Agents Model the World with Language) - UC Berkeley's Dynalang agent learns a multimodal world model predicting future text, video, and rewards. ● Multimodal world model: Jointly predicts future language, video, and rewards, treating language as another stream of observation/prediction rather than just policy input. ● Instruction-following: Learns to follow instructions in visually and linguistically complex domains, grounded in the world model's predictions. ● Cross-domain applicability: Applied to multiple embodied environments, showing the language-inclusive world-model approach is general. ● Research direction: Foreshadows the "video-plus-language world model" direction that would grow prominent in 2024 (e.g., Sora's world simulator framing). |
Paper, Tweet |
| 10) AutoRobotics-Zero - Discovers zero-shot adaptable robot policies from scratch, including the automatic discovery of Python control code. ● Zero-shot adaptability: Policies adapt to sudden environmental changes without any fine-tuning at test time - a critical property for robust robotics. ● Python-code policies: Automatically discovers Python code that implements robot controllers - an interpretable, auditable policy representation. ● Discovery from scratch: Policies are discovered from scratch rather than fine-tuned from pretrained ones, reducing assumptions about prior knowledge. ● AutoML for robotics: Extends the AutoML paradigm into robotics, using search over code rather than over neural architectures. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Universal Adversarial LLM Attacks - Finds universal and transferable adversarial attacks that cause aligned models like ChatGPT and Bard to generate objectionable behaviors. ● Automatic suffix generation: Uses a combination of greedy and gradient-based search to automatically produce adversarial suffixes that bypass alignment safeguards. ● Universal transferability: A single adversarial suffix found on open models transfers to proprietary models like GPT-4, Claude, and Bard, revealing a systemic weakness. ● Jailbreaking industrialized: Demonstrated that automated attacks could produce unlimited variants, forcing a rethink of alignment robustness beyond manual red-teaming. ● Foundational safety paper: Became one of the most-cited adversarial robustness papers of 2023 and a reference point for later work on refusal training and representation-level defenses. |
Paper, Tweet |
| 2) RT-2 - Google DeepMind's end-to-end vision-language-action model that learns from both web and robotics data to control robots. ● VLA architecture: Treats robot actions as another language the model generates - actions are tokenized and output in the same stream as text tokens. ● Web-scale knowledge transfer: Leverages internet-scale VLM pretraining so the robot can reason about novel objects and symbols it never saw in robotics data (e.g., "pick up the extinct animal"). ● Emergent semantic reasoning: Shows emergent capabilities like chain-of-thought robotic reasoning and multi-stage task planning absent in prior RT-1. ● Robot foundation models: Established the VLA paradigm that dominated 2024 robotics research (OpenVLA, RT-X, π0) and moved robotics firmly into the foundation-model era. |
Paper, Tweet |
| 3) Med-PaLM Multimodal - Introduces a generalist biomedical AI system and a new multimodal biomedical benchmark with 14 tasks. ● MultiMedBench: A new benchmark spanning 14 tasks across clinical text, medical imaging (e.g., chest X-ray, pathology, dermatology), and genomics. ● Single generalist model: A single 562B model handles medical Q&A, VQA, report generation, and genomic variant call - rather than disease-specific narrow models. ● Clinician evaluations: In pilot evaluations by radiologists, Med-PaLM M's chest X-ray reports were preferred over reference reports in 40.50% of cases. ● Generalist medical AI vision: Provided the strongest proof-of-concept for generalist biomedical AI, previewing the trajectory toward healthcare foundation models. |
Paper, Tweet |
| 4) Tracking Anything in High Quality - A framework for high-quality tracking-anything in videos combining segmentation and refinement. ● Two-stage design: Combines a video multi-object segmenter with a pretrained mask refiner model to clean up tracking output. ● Mask quality focus: Addresses the common failure mode where trackers lose object boundaries over time, maintaining sharp masks across long clips. ● VOTS2023 results: Ranked 2nd place in the VOTS2023 challenge, demonstrating competitive quality against specialized trackers. ● Practical tool: Useful for video editing, AR/VR, and content creation pipelines that require pixel-accurate object tracking over long sequences. |
Paper, Tweet |
| 5) Foundation Models in Vision - A comprehensive survey on foundational models for computer vision and their open research directions. ● Landscape mapping: Reviews textually prompted (CLIP, ALIGN), visually prompted (SAM), and generative (DALL-E, Imagen) vision foundation models in one unified taxonomy. ● Challenges enumerated: Identifies open problems in evaluation, grounding, hallucination, compositionality, and domain-specific adaptation for CV. ● Cross-modal trends: Analyzes how vision foundation models increasingly borrow from LLM training recipes (instruction tuning, RLHF). ● Reference for researchers: Became a go-to survey for new researchers entering vision foundation-model research in late 2023. |
Paper, Tweet |
| 6) L-Eval - A standardized evaluation suite for long-context language models. ● Dataset scale: 411 long documents covering over 2K query-response pairs across law, finance, school lectures, long conversations, novels, and meetings. ● Realistic domains: Moves beyond synthetic needle-in-haystack tests toward practical long-form applications users actually encounter. ● Evaluation methodology: Provides multiple evaluation protocols including exact match, n-gram, and LLM-as-judge to cross-validate results. ● Long-context benchmark: Became a reference benchmark during 2023's context-window race, paving the way for later benchmarks like LongBench and RULER. |
Paper, Tweet |
| 7) LoraHub - Enables efficient cross-task generalization via dynamic LoRA composition. ● Dynamic composition: Combines pre-trained LoRA modules via learned weights without human expertise or additional parameters/gradient updates. ● Gradient-free optimization: Uses gradient-free algorithms like Nelder-Mead to find optimal LoRA weightings on a handful of examples. ● ICL-matching performance: Matches the performance of in-context learning in few-shot settings while using much less inference compute. ● Modular LLMs vision: Part of the broader push toward modular, composable adapter ecosystems - a direction still actively developed in 2024's MoE-of-LoRAs work. |
Paper, Tweet |
| 8) Survey of Aligned LLMs - A comprehensive overview of alignment approaches covering data, training, and evaluation. ● Full-stack view: Covers preference data collection, RLHF variants, DPO-style direct methods, and alignment evaluation in one unified reference. ● Taxonomy of methods: Organizes alignment techniques into clear families (outer alignment vs. inner alignment, value alignment vs. behavior alignment). ● Practical pitfalls: Documents known failure modes like reward hacking, sycophancy, and mode collapse that practitioners should watch for. ● Reference document: Frequently cited in alignment onboarding material as the first-pass overview for new researchers. |
Paper, Tweet |
| 9) WavJourney - Leverages LLMs to orchestrate audio generation models for compositional storytelling. ● LLM as composer: Uses an LLM to plan scene-level audio scripts, then dispatches sub-prompts to specialized TTS, music, and sound-effect models. ● Explainable structure: Produces intermediate audio scripts that users can inspect and edit, giving creative control rather than opaque end-to-end generation. ● Storytelling workflow: Demonstrates long-form coherent audio stories with speech, music, and ambient sound combined into unified scenes. ● Agentic audio precursor: An early example of LLM-as-orchestrator for multimedia generation - a pattern that matured in 2024 multi-modal agent frameworks. |
Paper, Tweet |
| 10) FacTool - A task- and domain-agnostic framework for factuality detection of LLM-generated text. ● General framework: Unifies factuality detection across knowledge QA, code generation, math reasoning, and scientific literature review under a common pipeline. ● Tool-augmented verification: Calls external tools (search engines, code executors, math solvers) to verify claims rather than relying on the LLM's internal judgment alone. ● Benchmark release: Releases an accompanying benchmark dataset plus a ChatGPT plugin implementation for hands-on experimentation. ● Practical fact-checking: Provided one of the first end-to-end fact-checking frameworks suitable for deployment alongside LLM chatbots. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Llama 2 - Meta's open-weight foundation model family with chat-tuned variants ranging from 7B to 70B parameters. ● Open-weight release: Released pretrained and RLHF-tuned chat models under a permissive license that allowed commercial use, reshaping the open-source LLM landscape. ● Training recipe: Pretrained on 2T tokens with 4K context; chat models use SFT followed by iterative RLHF with Ghost Attention (GAtt) for multi-turn consistency. ● Safety investment: Extensive red-teaming, safety reward models, and context distillation produce chat models with strong helpfulness-safety trade-offs. ● Ecosystem catalyst: Llama 2 became the base for hundreds of community fine-tunes (Vicuna, WizardLM, CodeLlama) and catalyzed the open-weight movement that 2024's Llama 3 and Mistral would extend. |
Paper, Tweet |
| 2) How is ChatGPT's Behavior Changing Over Time? - Evaluates GPT-3.5 and GPT-4 over months to show significant behavioral drift in deployed systems. ● Longitudinal measurement: Compares March vs. June 2023 snapshots of GPT-3.5 and GPT-4 on math, code, sensitive-question answering, and visual reasoning. ● Large performance deltas: GPT-4's prime identification accuracy dropped from 97.6% to 2.4% between snapshots, demonstrating drift can be severe and non-monotonic. ● Safety and format shifts: Code generation formatting, verbosity, and willingness to answer sensitive questions all changed substantially across versions. ● Deployment implications: Highlighted the need for version pinning, regression testing, and behavioral monitoring when building on proprietary APIs - sparking major industry discussion. |
Paper, Tweet |
| 3) FlashAttention-2 - Tri Dao's follow-up to FlashAttention, dramatically improving attention throughput on modern GPUs. ● Work partitioning: Redesigns parallelism so non-matmul FLOPs are reduced and thread blocks are better utilized across SMs. ● ~2x speedup: Achieves approximately 2x speedup over FlashAttention-1 and reaches 50-73% of theoretical maximum FLOPs/s on A100. ● Shared-memory communication: Parallelizes attention along sequence length, increases occupancy, and reduces cross-warp communication via shared memory. ● Training infrastructure staple: Became the default attention kernel in PyTorch, HuggingFace, vLLM, and nearly every 2024 training stack for long-context models. |
Paper, Tweet |
| 4) Measuring Faithfulness in Chain-of-Thought Reasoning - Anthropic's investigation into whether CoT reasoning actually reflects the model's internal decision process. ● Intervention protocol: Uses paraphrasing, mistake-injection, and truncation of reasoning chains to test whether final answers depend on the visible reasoning. ● Inverse scaling finding: Demonstrates that as models get larger and more capable, the reasoning becomes less faithful - an important inverse-scaling signal. ● Task variability: Faithfulness varies significantly across tasks; some tasks/model-sizes support CoT that is meaningfully tied to the answer. ● Interpretability foundation: Influential for subsequent interpretability and safety work on whether chain-of-thought can be trusted for monitoring model reasoning. |
Paper, Tweet |
| 5) Generative TV & Showrunner Agents - Fable Studio's approach to generate episodic TV content using LLMs and multi-agent simulation. ● Multi-agent storytelling: Uses agent simulation to generate plot, character actions, and dialogue which are then rendered as episodic content. ● Full-pipeline generation: Integrates story generation, image/audio synthesis, and lip-sync into a single end-to-end show creation pipeline. ● "South Park AI" demo: The accompanying animated demo in the style of South Park generated significant public attention as a preview of AI-generated entertainment. ● AI creative industries: An early proof-of-concept for agent-driven entertainment production that informed later efforts in AI-generated TV, games, and interactive fiction. |
Paper, Tweet |
| 6) Challenges & Application of LLMs - A comprehensive enumeration of open challenges and application domains for LLMs. ● Challenge taxonomy: Catalogs technical challenges (evaluation brittleness, prompt brittleness, hallucination, context limits, bias) and practical ones (cost, safety, data). ● Application breadth: Reviews applications spanning education, law, medicine, chemistry, biology, and software engineering with honest accounting of current limitations. ● Experimental-design gaps: Highlights the lack of robust experimental protocols in LLM evaluation - a prelude to 2024's improved eval practices. ● Community reference: Frequently cited as a shared vocabulary for describing the 2023 state of LLM applied research. |
Paper, Tweet |
| 7) Retentive Network (RetNet) - Microsoft's proposed foundation architecture aiming to replace Transformer attention for LLMs. ● Three-mode formulation: Supports parallel training, recurrent inference, and chunkwise recurrent representation - combining Transformer-style training with RNN-style inference. ● O(1) inference cost: Achieves constant-memory inference per step via the recurrent form, dramatically cheaper than attention's O(n) per-token cost. ● Retention mechanism: Replaces softmax attention with an exponentially-decaying retention kernel that supports both parallel and recurrent computation. ● Post-Transformer contender: Positioned alongside Mamba, RWKV, and Hyena as one of the credible attempts to dethrone attention - though attention remained dominant through 2024. |
Paper, Tweet |
| 8) Meta-Transformer - A unified framework performing learning across 12 different modalities with a shared backbone. ● 12-modality coverage: Handles text, image, point cloud, audio, video, X-Ray, infrared, hyperspectral, IMU, graph, tabular, and time-series data. ● Frozen encoder design: Uses a frozen modality-agnostic encoder paired with modality-specific tokenizers and lightweight task heads. ● Extreme generality: Demonstrates that a single backbone can serve both fundamental perception and practical applications like medical imaging and industrial sensing. ● Universal encoder direction: Points toward future architectures where a single foundation model serves as the universal encoder for any modality. |
Paper, Tweet |
| 9) Retrieve In-Context Examples for LLMs - A framework to iteratively train dense retrievers that identify high-quality in-context examples. ● Iterative training: Trains retrievers using LLM feedback in an iterative loop - retrieved examples that help the LLM answer correctly are used as positive signals. ● 30-task evaluation: Evaluated across 30 NLP tasks showing consistent improvements over random or similarity-based retrieval. ● Pattern-similar examples: Confirms that examples sharing abstract patterns (not just surface similarity) are most useful for ICL. ● Scale-invariant gains: Improvements are consistent across model sizes, suggesting dense retrieval is a robust ICL enhancement that transfers across model scales. |
Paper, Tweet |
| 10) FLASK - Proposes fine-grained evaluation of LLMs decomposed into 12 alignment skill sets. ● 12-skill taxonomy: Decomposes holistic LLM evaluation into skills like logical reasoning, factuality, commonsense, readability, harmlessness, etc. ● Instance-level annotation: Each evaluation instance is labeled with which skills, domains, and difficulty levels it exercises, enabling fine-grained performance analysis. ● Skill-specific insights: Reveals that models excel differently on different skills - useful for targeted model selection and iteration. ● Evaluation paradigm shift: Part of the broader move from single-number benchmarks to multi-dimensional skill-based evaluation that shaped 2024's eval ecosystem. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) CM3Leon - Meta's retrieval-augmented multi-modal language model that generates both text and images. ● Autoregressive multi-modal: Unifies text and image generation in a single autoregressive token-based architecture, handling both modalities in any order. ● 5x training efficiency: Achieves SOTA image generation quality with 5x less training compute than comparable methods due to retrieval augmentation and instruction tuning. ● Instruction tuning for images: Demonstrates that supervised fine-tuning and instruction tuning - originally developed for LLMs - also massively improves multimodal generation quality. ● Any-to-any direction: Early proof-of-concept for unified any-to-any multi-modal models, pre-dating and inspiring 2024 systems like Chameleon and GPT-4o. |
Paper, Tweet |
| 2) Claude 2 - Anthropic's second-generation LLM with a detailed model card on safety, alignment, and capabilities. ● 100K context: Launched with a 100K token context window, enabling document-scale reasoning use cases that were impractical with earlier models. ● Safety evaluations: Comprehensive safety evaluations including harmlessness benchmarks, bias probes, and red-teaming results transparently disclosed. ● Capabilities gains: Significant improvements on coding (71.2% HumanEval), math (GSM8k), and legal reasoning over Claude 1.3. ● Consumer release: First Claude model available to consumers via claude.ai in the US and UK, broadening Anthropic's public footprint. |
Paper, Tweet |
| 3) Secrets of RLHF in LLMs - A deep investigation into RLHF with a focus on the inner workings of PPO, including open-source code. ● PPO internals exposed: Documents critical implementation details (reward normalization, advantage estimation, KL penalty scaling) that aren't in the original papers but make or break training. ● Empirical ablations: Systematically studies which PPO components matter most, providing practical guidance for RLHF practitioners. ● Open-source code: Releases a clean reference implementation that others can use to reproduce and iterate on RLHF. ● RLHF demystification: Part of a broader 2023 wave demystifying RLHF, preparing the ground for simpler alternatives like DPO that arrived later that year. |
Paper, Tweet |
| 4) LongLLaMA - Extends LLaMA's context length using a contrastive training process that reshapes the (key, value) space. ● Focused Transformer: Uses contrastive training to make memory-augmented attention more discriminative, reducing distraction from irrelevant context. ● Length extrapolation: Demonstrates long-context capability well beyond the original LLaMA 2K/4K window through its memory mechanism. ● Long-context tasks: Shows improvements on passkey retrieval and long-form summarization tasks that stress long-range attention. ● Efficient extension: Part of the 2023 explosion of context-window-extension techniques that would culminate in ~1M-token proprietary models the following year. |
Paper, Tweet |
| 5) Patch n' Pack: NaViT - A vision transformer handling any aspect ratio and resolution through sequence packing. ● Native resolution processing: Packs image patches of arbitrary resolution/aspect-ratio into a single sequence, preserving original information instead of resize-and-crop. ● Flexible deployment: Enables compute-quality tradeoffs at inference time without requiring separate models per resolution. ● Training efficiency: Sequence packing provides significant training efficiency gains versus fixed-resolution pipelines. ● Foundation ViT update: Influenced subsequent multi-modal models (LLaVA, Qwen-VL) that adopted NaViT-style native-resolution image processing. |
Paper, Tweet |
| 6) LLMs as General Pattern Machines - Demonstrates LLMs serve as general sequence modelers without additional training. ● Zero-shot sequence modeling: Shows LLMs can complete arbitrary symbolic sequences, not just language - they're general pattern completers driven by in-context learning. ● Word-to-action transfer: Applies pattern-completion to robotics, transferring abstract sequence patterns from text directly into robot action sequences. ● Robotics without robot data: Achieves meaningful robot control without any training on robot data - purely through language model pattern-matching. ● Conceptual framing: Influential perspective paper reframing LLMs as general compression/pattern machines rather than just language models. |
Paper, Tweet |
| 7) HyperDreamBooth - A smaller, faster, and more efficient version of DreamBooth for personalizing text-to-image models. ● HyperNetwork design: Uses a HyperNetwork to predict LoRA weights from a single input image, bypassing per-subject optimization. ● 25x speedup: Achieves ~25x faster personalization than DreamBooth while maintaining visual fidelity to the subject. ● Single-image input: Requires only one input image of the subject - a major UX improvement over prior methods needing 3-5 images. ● On-device personalization: Compact adapter footprint makes HyperDreamBooth-style techniques attractive for on-device personalization in consumer apps. |
Paper, Tweet |
| 8) Teaching Arithmetic to Small Transformers - Trains small transformers on chain-of-thought style data for arithmetic with large gains. ● Data format matters: Shows that reformulating arithmetic into explicit step-by-step data dramatically improves small-model accuracy and convergence. ● Emergence from curriculum: Fine-grained reasoning traces enable small transformers to learn multi-digit arithmetic that would otherwise require orders-of-magnitude more scale. ● High-quality data thesis: Supports the emerging 2023 thesis that instructive, well-formatted data beats brute-force scaling for specific skills. ● Small-model research: Informed the later Phi-series (Phi-1, Phi-1.5, Phi-2) "textbooks are all you need" data-quality research program. |
Paper, Tweet |
| 9) AnimateDiff - Animates frozen text-to-image diffusion models via a plug-in motion modeling module. ● Motion module: Adds a motion modeling module on top of frozen T2I models that learns to produce temporally coherent frame sequences. ● Model-agnostic: Works with any personalized T2I checkpoint (LoRAs, DreamBooth fine-tunes) without retraining - animating existing Stable Diffusion models. ● Community adoption: Became the dominant open-source video generation tool in late 2023, powering countless community animations on ComfyUI and WebUI. ● Open video generation: Established the architectural pattern (frozen image model + learned motion module) that many subsequent open video models followed. |
Paper, Tweet |
| 10) Generative Pretraining in Multimodality (Emu) - A transformer-based multimodal foundation model for generating images and text. ● Unified pretraining: Pretrains on mixed image-text sequences to generate either modality in multimodal context. ● Instruction tuning for assistants: Combines generative pretraining with instruction tuning to produce performant multimodal assistants. ● In-context multimodal: Supports in-context learning across images and text, enabling few-shot multimodal tasks. ● Multi-modal assistants: Part of the 2023 push (alongside LLaVA, MiniGPT-4) that established the pattern of visual-instruction-tuned assistants. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) A Survey on Evaluation of LLMs - A comprehensive overview of evaluation methods covering what, where, and how to evaluate LLMs. ● Three-axis taxonomy: Organizes evaluation along what-to-evaluate (NLP tasks, robustness, ethics, trustworthiness), where-to-evaluate (benchmarks, datasets), and how-to-evaluate (automatic, human, LLM-as-judge). ● Benchmark catalog: Surveys the major benchmarks of 2023 including MMLU, HELM, BIG-bench, and AgentBench with strengths and limitations. ● Failure-mode analysis: Documents where current evaluations fall short - contamination, saturation, prompt sensitivity, and lack of task diversity. ● Evaluation field primer: Became a standard citation for researchers entering LLM evaluation, helping formalize the sub-field. |
Paper, Tweet |
| 2) How Language Models Use Long Contexts (Lost-in-the-Middle) - Shows LLM performance drops when relevant information is in the middle of a long context. ● U-shaped performance curve: LMs perform best when relevant info is at the start or end of context, with substantial degradation for middle positions. ● Cross-model phenomenon: Confirmed across GPT-3.5, GPT-4, Claude, and open-weight models - indicating a fundamental attention pattern rather than a bug. ● QA and retrieval benchmarks: Demonstrated on multi-document QA and key-value retrieval tasks with varying context positions. ● Foundational finding: Coined the phrase "lost in the middle" - one of the most widely-cited 2023 findings that shaped subsequent long-context benchmark and model design. |
Paper, Tweet |
| 3) LLMs as Effective Text Rankers - A prompting technique that enables open-source LLMs to perform SOTA text ranking. ● Pairwise ranking prompt: Uses pairwise prompting (A vs. B) rather than pointwise scoring, which aligns better with LLM reasoning strengths. ● Open-source SOTA: Achieves state-of-the-art text ranking on standard benchmarks using only open-weight LLMs - no proprietary API required. ● Retrieval pipeline fit: Designed to slot into existing retrieval pipelines as a re-ranker stage. ● RAG infrastructure: Influenced 2024's RAG reranker ecosystem, with LLM-based reranking becoming standard in production retrieval stacks. |
Paper, Tweet |
| 4) Multimodal Generation with Frozen LLMs - Maps images to LLM token space enabling models like PaLM and GPT-4 to handle visual tasks without parameter updates. ● Frozen LLM design: Keeps the underlying LLM completely frozen - only a lightweight image-to-token projection layer is trained. ● Parameter-efficient multimodal: Enables multimodal capabilities without fine-tuning large LLMs, drastically reducing compute cost. ● In-context visual tasks: Uses in-context learning to tackle VQA, image captioning, and visual reasoning with zero LLM modification. ● Plug-in VLM pattern: An early example of the "frozen LLM + visual adapter" design that became dominant in open-source VLMs through 2024. |
Paper, Tweet |
| 5) CodeGen2.5 - Salesforce's new 7B code LLM trained on 1.5T tokens and optimized for fast sampling. ● Small-but-competitive: 7B model matches or beats prior >15B code-generation models, demonstrating data quality can substitute for model scale. ● Fast-sampling optimization: Architecturally tuned for inference speed, making it practical for IDE integration use cases. ● Multilingual code: Handles multiple programming languages with strong Python, JavaScript, and TypeScript performance. ● Open code LLM: Part of the 2023 open-source code LLM wave (CodeGen, StarCoder, CodeLlama) that made private code assistants viable for enterprise. |
Paper, Tweet |
| 6) Elastic Decision Transformer - An advance over Decision Transformers that enables trajectory stitching at inference time. ● Adaptive history length: Adjusts to shorter history at test time, enabling transitions to diverse and better future states. ● Trajectory stitching: Unlike vanilla Decision Transformers that treat trajectories as fixed, EDT composes segments from different trajectories. ● Offline RL gains: Achieves stronger performance on offline RL benchmarks where data quality and coverage vary. ● Decision Transformer evolution: Part of the broader effort to make Decision Transformers competitive with Q-learning approaches on offline RL tasks. |
Paper, Tweet |
| 7) Robots That Ask for Help - A framework for calibrating LLM-based robot planners so they ask for help when uncertain. ● Uncertainty alignment: Measures and aligns the uncertainty of LLM planners so help-requests correlate with real task difficulty. ● Conformal prediction: Uses conformal prediction to provide rigorous statistical guarantees on when to defer to humans. ● Safer autonomy: Reduces the risk of silent failures in robot deployments where an LLM confidently executes wrong plans. ● Human-robot collaboration: An early contribution to the know-when-you-don't-know literature for LLM-driven agents - a theme that became central to 2024 agent safety work. |
Paper, Tweet |
| 8) Physics-based Motion Retargeting in Real-Time - Uses RL to retarget motions from sparse human sensor data to characters of various morphologies. ● Physics simulator policies: Trains RL policies that control characters in a physics simulator, producing physically plausible motion. ● Sparse sensor input: Works from sparse human sensor data (e.g., VR headset + controllers) rather than requiring full motion capture. ● Cross-morphology: Generalizes across characters of different morphologies without per-character re-training. ● VR/AR deployment: Practical for real-time VR/AR avatar control where users have only a few tracking points but want natural character motion. |
Paper, Tweet |
| 9) Scaling Transformer to 1 Billion Tokens (LongNet) - Microsoft's Transformer variant scaling sequence length past 1B tokens. ● Dilated attention: Introduces dilated attention that exponentially grows the attention field, enabling linear complexity in sequence length. ● No short-sequence loss: Achieves extreme long-context scaling with no degradation on shorter sequences. ● 1B token demo: Demonstrates viability at the 1-billion token context scale - an order of magnitude beyond anything previously attempted. ● Long-context frontier: Pushed the frontier of what's theoretically possible for ultra-long-context Transformers, even though production models stayed in the hundreds-of-thousands-of-tokens range. |
Paper, Tweet |
| 10) InterCode - A framework treating interactive coding as a reinforcement learning environment. ● Interactive paradigm: Moves beyond static sequence-to-sequence coding benchmarks to multi-turn interactive coding with execution feedback. ● Standardized RL environment: Provides Bash, SQL, and Python environments with consistent APIs for training and evaluating code agents. ● Feedback-loop evaluation: Tests whether models can use execution errors, test failures, and intermediate outputs to iteratively improve their code. ● Code-agent foundation: Anticipated and enabled the 2024 explosion of interactive coding agents (SWE-agent, OpenDevin, Aider) that leverage execution feedback loops. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) LeanDojo - An open-source Lean playground consisting of toolkits, data, models, and benchmarks for theorem proving. ● Theorem-proving infrastructure: Full stack for LLM-based theorem proving in Lean, including the first large-scale extraction of proof data from the Mathlib library. ● ReProver model: Releases a retrieval-augmented LLM-based prover that selects relevant premises from a vast math library rather than memorizing everything. ● Academic accessibility: Makes theorem-proving research accessible to smaller groups that lack the resources to build Lean tooling from scratch. ● Formal math acceleration: A foundational piece that enabled 2024 breakthroughs like DeepMind's AlphaProof and the broader surge in LLM-driven formal math research. |
Paper, Tweet |
| 2) Extending Context Window of LLMs (PI) - Position Interpolation extends LLaMA's context to 32K with minimal fine-tuning (within 1000 steps). ● Position interpolation: Linearly interpolates positional indices so pretrained RoPE attention generalizes to longer sequences without breaking. ● 1000-step adaptation: Requires only ~1000 fine-tuning steps versus prior methods that needed much more compute. ● Quality preservation: Maintains strong performance on tasks while reaching 32K context - both long-context tasks and standard-length benchmarks. ● Standard long-context recipe: Became the standard approach for extending open-source model context windows throughout 2023 and early 2024. |
Paper, Tweet |
| 3) Computer Vision Through the Lens of Natural Language - A modular approach solving CV problems by routing through LLM reasoning. ● Modular CV pipeline: Uses LLMs to reason over outputs from independent, descriptive vision modules that each provide partial information about an image. ● Interpretable intermediate: Intermediate language descriptions are human-readable, improving debugability versus end-to-end VLMs. ● Tool-augmented vision: Part of the broader "LLM as cognitive core" research direction where LLMs orchestrate specialized tools. ● VLM alternative: Offers a complementary paradigm to end-to-end VLM training, trading compute for modularity and interpretability. |
Paper, Tweet |
| 4) Visual Navigation Transformer (ViNT) - A foundation model for vision-based robotic navigation built on flexible Transformers. ● Cross-embodiment: Works across different robotic platforms (quadrupeds, wheeled robots, drones) without per-robot retraining. ● Pretrained + fine-tuned: Leverages pretrained vision models and fine-tunes on navigation-specific data for strong transfer. ● Multi-task navigation: Handles goal-reaching, exploration, and map-building within a single Transformer backbone. ● Robotics foundation models: An early robotics-specific foundation model that preceded RT-2 and the VLA explosion of late 2023. |
Paper, Tweet |
| 5) Generative AI for Programming Education - Evaluates GPT-4 and ChatGPT on programming education scenarios versus human tutors. ● Structured comparison: Compares GPT-4, ChatGPT, and human tutors on tasks like code explanation, bug fixing, and student-facing hint generation. ● GPT-4 near-human: GPT-4 outperforms ChatGPT and comes close to human tutor performance on many education tasks. ● Pedagogical limitations: Identifies gaps where LLMs still fall short - nuanced misconception detection, maintaining pedagogical scaffolding, avoiding spoiler answers. ● EdTech roadmap: Influential for the wave of AI-powered coding education products that launched in 2024. |
Paper, Tweet |
| 6) DragDiffusion - Extends interactive point-based image editing to diffusion models. ● Latent optimization: Optimizes the diffusion latent directly to achieve precise spatial control over image content. ● DragGAN for diffusion: Brings the intuitive drag-to-edit interaction (popularized by DragGAN) to the more capable diffusion model backbone. ● High-quality edits: Achieves high-quality edits while preserving overall image coherence - objects move realistically rather than just warping pixels. ● Interactive generation: Part of the broader move toward interactive, controllable image generation over one-shot text-to-image. |
Paper, Tweet |
| 7) Understanding Theory-of-Mind in LLMs with LLMs - A framework for procedurally generating ToM evaluations using LLMs themselves. ● LLM-generated benchmarks: Uses LLMs to procedurally create diverse ToM scenarios, avoiding benchmark contamination and enabling unlimited test generation. ● Social reasoning study: Evaluates whether LLMs can track beliefs, intentions, and false beliefs of multiple agents - classic ToM challenges. ● Controlled difficulty: Procedural generation allows varying difficulty (number of agents, nesting depth) to map capability boundaries. ● Evaluation pattern: Early example of using LLMs to generate evaluations for LLMs - a pattern that would become standard in 2024 synthetic evaluation work. |
Paper, Tweet |
| 8) Evaluations with No Labels - Self-supervised evaluation of LLMs via sensitivity/invariance to input transformations. ● Label-free evaluation: Evaluates LLMs without requiring ground-truth labels, using consistency under input perturbations as the signal. ● Transformation-based probes: Measures sensitivity or invariance to paraphrasing, irrelevant-context addition, and other transformations that shouldn't change correct answers. ● Live deployment monitoring: Useful for monitoring LLM behavior on datasets streamed during production deployment, catching drift without manual labeling. ● Deployment infrastructure: An early contribution to the continuous evaluation tooling that would become standard for 2024 LLM production systems. |
Paper, Tweet |
| 9) Long-range Language Modeling with Self-Retrieval - Jointly trains a retrieval-augmented LM from scratch for long-range modeling. ● End-to-end retrieval training: Unlike retro-fitted RAG, trains the retriever and LM jointly from scratch for long-range consistency. ● Long-form coherence: Targets tasks requiring retrieval of distant past context within a long document, not just factual lookup. ● Architecture innovation: Introduces training procedures and architectural choices that make joint training stable and efficient. ● Long-context RAG: Presaged the research direction of treating RAG and long-context as complementary rather than competing solutions. |
Paper, Tweet |
| 10) Scaling MLPs: A Tale of Inductive Bias - Shows MLPs scale with compute despite their lack of inductive bias. ● Pure-MLP scaling: Demonstrates that large pure-MLP models trained on enough data can reach surprisingly strong performance on image classification. ● Inductive bias is compensable: Challenges the dogma that CNN/Transformer inductive biases are necessary - scale and data can substitute. ● Bitter lesson evidence: Adds to the "bitter lesson" empirical evidence that general methods leveraging computation outperform those leveraging human-designed priors. ● Architecture agnosticism: Part of the 2023 trend showing that many architectures (MLPs, State Space Models, RNNs, Transformers) converge at scale. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Textbooks Are All You Need (phi-1) - Introduces a 1.3B parameter code LLM trained on textbook-quality data. ● Data-quality thesis: Trained on a curated selection of textbook-quality web data plus synthetic textbooks/exercises generated with GPT-3.5. ● Small model, strong HumanEval: Achieves 50.6% pass@1 on HumanEval despite being 1.3B - beating much larger models on code generation. ● 4-day training: Trained in just 4 days on 8 A100s, showing that aggressive data selection can substitute for massive compute. ● Phi-series launch: Kicked off Microsoft's Phi-series (Phi-1.5, Phi-2, Phi-3) and catalyzed the "small-but-smart" model research program. |
Paper, Tweet |
| 2) RoboCat - DeepMind's self-improving foundation agent that operates different robotic arms from as few as 100 demonstrations. ● Cross-embodiment: Single agent controls multiple different robotic arms and grippers, generalizing across hardware. ● Self-improving loop: Generates new training data via fine-tuning on its own demonstrations, progressively improving its own capabilities. ● Few-shot adaptation: Adapts to new tasks from as few as 100 demonstrations - practical for real-world deployment. ● Robotics foundation agent: A key data point that robotics was moving toward the same foundation-model + self-improvement paradigm as LLMs. |
Paper, Tweet |
| 3) ClinicalGPT - A language model optimized through extensive and diverse medical data and multi-turn dialogue. ● Medical data diversity: Trained on medical records, domain knowledge corpora, and multi-round consultation dialogues spanning multiple medical specialties. ● Chinese medical focus: Strong coverage of Chinese medical data, filling a gap that general-purpose medical LLMs didn't address. ● Dialog-first design: Optimized for realistic multi-turn consultations rather than single-shot medical QA. ● Regional medical LLMs: Part of the broader trend of region/language-specific medical LLMs emerging alongside global systems like Med-PaLM. |
Paper, Tweet |
| 4) An Overview of Catastrophic AI Risks - Dan Hendrycks' comprehensive overview of catastrophic AI risk categories. ● Four risk categories: Organizes catastrophic AI risks into malicious use, AI race dynamics, organizational risks, and rogue AIs. ● Policy-relevant framing: Written for researchers, policymakers, and the broader public - influenced AI governance discussions through 2023-2024. ● Risk concretization: Grounds abstract risk discussions in specific, plausible scenarios that can be analyzed and mitigated. ● Governance reference: Widely cited in AI policy proposals, UK AI Safety Summit materials, and national AI strategies. |
Paper, Tweet |
| 5) LOMO - A memory-efficient optimizer that combines gradient computation and parameter update in one step. ● Fused grad-update: Fuses backpropagation and SGD update into a single operation, eliminating the need to store all gradients in memory simultaneously. ● Full-parameter tuning: Enables full-parameter fine-tuning of a 65B LLM on a single 8x24GB GPU machine. ● Democratization: Makes full fine-tuning (not just LoRA) accessible to researchers without multi-node GPU clusters. ● Optimizer memory research: Joined the 2023 wave of optimizer memory innovations (8-bit Adam, AdaFactor, GaLore) democratizing large-model tuning. |
Paper, Tweet |
| 6) SequenceMatch - Formulates sequence generation as imitation learning, enabling backtracking via a backspace action. ● Imitation learning framing: Views autoregressive generation as imitation learning with expert data, opening the door to standard IL techniques. ● Backspace action: Introduces a "backspace" action that lets the model undo tokens that led to out-of-distribution sequences. ● Compounding error mitigation: Addresses the classical autoregressive problem where small early errors compound catastrophically. ● Training innovation: An interesting precursor to later work on self-correcting LLMs and reasoning with error recovery. |
Paper, Tweet |
| 7) LMFlow - An extensible and lightweight toolkit for fine-tuning and inference of large foundation models. ● Full training stack: Supports continuous pretraining, instruction tuning, parameter-efficient fine-tuning, alignment tuning, and inference in one toolkit. ● Lightweight design: Easier to use and extend than heavier frameworks like Megatron or DeepSpeed for practitioners who want to iterate quickly. ● Community adoption: Became a popular tool in the open-source LLM ecosystem for reproducing fine-tuning recipes. ● Training ecosystem: Part of the broader 2023 proliferation of accessible LLM training tooling (Axolotl, LLaMA-Factory, LitGPT) that enabled community fine-tuning. |
Paper, Tweet |
| 8) MotionGPT - Generates consecutive human motions from multimodal control signals via LLM instructions. ● Motion quantization: Quantizes motion into discrete tokens that LLMs can produce in the same stream as text. ● Multimodal control: Accepts text, audio, and other control signals as input, producing corresponding human motion outputs. ● LLM-as-motion-generator: Treats motion generation as a token-prediction task, unifying motion with other LLM capabilities. ● Animation and VR: Applicable to character animation, VR avatars, and content creation workflows where text-driven motion is valuable. |
Paper, Tweet |
| 9) Wanda - A simple, effective pruning approach for LLMs requiring no retraining. ● Weight×activation pruning: Prunes weights with the smallest magnitude × corresponding input activations on a per-output basis. ● Zero retraining: Requires no retraining or weight updates, making it immediately deployable. ● Simple beats complex: Outperforms magnitude-only pruning and matches or exceeds more complex training-based pruning methods. ● Production pruning: Became a widely-adopted baseline in LLM pruning research due to its simplicity and strong performance. |
Paper, Tweet |
| 10) AudioPaLM - Fuses PaLM-2 and AudioLM into a multimodal architecture supporting speech understanding and generation. ● Unified speech-text: Represents both speech and text as tokens in a shared vocabulary, enabling any-to-any conversion between modalities. ● Zero-shot translation: Performs zero-shot speech-to-text translation into languages never seen as translation targets during training. ● Speech generation: Generates high-quality speech in the voice of the input speaker while preserving prosody. ● Unified speech foundation: A precursor to 2024's fully multimodal systems like GPT-4o that natively process and generate speech. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Voicebox - Meta's all-in-one generative speech model supporting 6 languages and many speech tasks in-context. ● Flow-matching training: Uses flow-matching with text-guided context to unify TTS, denoising, editing, and style transfer in one model. ● 20x faster: Outperforms specialized TTS systems while running 20x faster than prior state-of-the-art diffusion-based speech models. ● Speech ICL: Supports in-context learning for speech - give it an audio prompt and it matches the speaker's voice, style, and prosody zero-shot. ● Generalist speech: A major step toward generalist speech foundation models that would accelerate with 2024 systems like VoiceCraft and XTTS. |
Paper, Tweet |
| 2) FinGPT - An open-source LLM for the finance sector with a data-centric approach. ● Data-centric finance: Focuses on curating high-quality financial data (SEC filings, earnings calls, news, market data) as the key lever for FinLLM quality. ● Accessible resources: Provides pipelines, fine-tuning scripts, and evaluation benchmarks so practitioners can develop their own FinLLMs. ● Multi-task financial NLP: Covers sentiment analysis, earnings surprise prediction, news summarization, and more within a unified framework. ● Open finance AI: An early open-source counterpoint to proprietary financial LLMs like BloombergGPT, accelerating community research. |
Paper, Tweet |
| 3) Crowd Workers Widely Use LLMs for Text Production - Empirical evidence that 33-46% of MTurk crowd workers used LLMs on text tasks. ● LLM-generated contamination: Estimates that a third to almost half of crowd-worker text production involved LLMs - a massive data quality issue. ● Benchmark contamination risk: Implications for NLP datasets produced via crowdsourcing, potentially invalidating many "human baseline" numbers. ● Methodology: Uses statistical analysis comparing completion times, stylistic features, and output consistency to estimate LLM usage. ● Community wake-up: Sparked widespread discussion about the future of human-generated data and the need for AI-usage detection. |
Paper, Tweet |
| 4) Reliability of Watermarks for LLMs - Studies whether watermarks survive human rewriting and LLM paraphrasing. ● Robustness testing: Evaluates whether watermarks remain detectable after human rewrites, paraphrasing attacks, and translation round-trips. ● Surprisingly robust: Finds that statistical watermarks (Kirchenbauer et al.) remain detectable even after aggressive transformations, with enough output text. ● Text-length dependence: Detection confidence scales with text length - short watermarked snippets are much easier to obliterate than long ones. ● AI detection realism: Provides a sober evaluation of watermarking's practical viability amid concerns about AI-generated content. |
Paper, Tweet |
| 5) Applications of Transformers - A new survey highlighting major applications of Transformers across deep learning. ● Cross-domain coverage: Surveys Transformers in NLP, vision, speech, multi-modal, reinforcement learning, graph, and time-series tasks. ● Model catalog: Comprehensive list of Transformer architectures with their design choices and application niches. ● Application-driven taxonomy: Organizes by application domain rather than architecture, useful for practitioners evaluating Transformers for new domains. ● Reference document: A broad reference for teaching material and onboarding readings on the Transformer architecture's reach. |
Paper, Tweet |
| 6) Benchmarking NN Training Algorithms (AlgoPerf) - A new benchmark for rigorously evaluating optimizers using realistic workloads. ● Realistic workloads: Tests optimizers on actual production-scale tasks (ImageNet, language modeling, translation) rather than toy problems. ● Wall-clock benchmarking: Evaluates optimizers on time-to-target-accuracy rather than just step counts, reflecting real training budgets. ● Hyperparameter rules: Standardizes hyperparameter tuning budgets for fair cross-optimizer comparisons. ● Optimizer research infrastructure: Enabled credible claims about new optimizers versus Adam and SGD - raising the bar for optimizer papers going forward. |
Paper, Tweet |
| 7) Unifying LLMs & Knowledge Graphs - A roadmap for combining LLMs with knowledge graphs for stronger reasoning. ● Three integration paradigms: Organizes integration into KG-enhanced LLMs (pretraining/inference), LLM-augmented KGs (QA, completion), and synergized LLM+KG reasoning. ● Bidirectional reasoning: Argues for bidirectional systems where KGs ground LLM claims and LLMs extend KGs, rather than one-way augmentation. ● Hallucination mitigation: Positions KG grounding as a principled tool for reducing LLM hallucinations. ● Hybrid AI direction: Influential for the 2024 resurgence of knowledge-graph + LLM systems, especially in enterprise search and agents. |
Paper, Tweet |
| 8) Augmenting LLMs with Long-term Memory (LongMem) - Enables LLMs to memorize long history via memory-augmented adaptation. ● Memory-augmented training: Dedicated adaptation training teaches the LLM to retrieve and use its memory of long past context. ● ICL over long history: Enables in-context learning that spans far longer contexts than the model's raw attention window. ● Decoupled retrieval: Separates the retrieval mechanism from the main model, allowing memory to grow without increasing model size. ● Long-context direction: Part of 2023's multi-pronged attack on context-window limits, complementary to position interpolation and ring attention. |
Paper, Tweet |
| 9) TAPIR - Tracks any queried point on any physical surface throughout a video sequence faster than real-time. ● Any-point tracking: Generalizes object tracking to arbitrary query points, handling occlusions and re-appearances robustly. ● Faster than real-time: On modern GPUs, tracks points faster than real-time on long, high-resolution videos - practical for real-world applications. ● SOTA across benchmarks: Outperforms all prior baselines on standard point-tracking benchmarks. ● Video understanding building block: Point tracking is a fundamental primitive for video understanding, editing, and robotics - TAPIR made it practical. |
Paper, Tweet |
| 10) Mind2Web - A dataset for evaluating generalist web agents with 2,350 tasks across 137 websites and 31 domains. ● Broad web coverage: 137 real-world websites across 31 domains (travel, shopping, information seeking) - far more diverse than prior web benchmarks. ● Generalization-focused: Tests cross-task, cross-website, and cross-domain generalization rather than in-distribution performance. ● Realistic tasks: Uses real user tasks rather than synthetic scripts, capturing the messiness of actual web interactions. ● Web-agent benchmark: Became a central benchmark for the 2024 explosion of web agents (WebAgent, WebVoyager, Browser Use, Operator). |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Tracking Everything Everywhere All at Once (OmniMotion) - Test-time optimization for dense, long-range motion estimation. ● Per-pixel motion: Estimates motion for every pixel across every frame of a video, producing dense long-range trajectories. ● Test-time optimization: Optimizes a quasi-3D representation per video at test time, producing coherent long-range correspondences. ● Through occlusions: Maintains point tracking even through long occlusions and complex camera motion - prior methods struggled with both. ● Video understanding primitive: A foundational capability that enables downstream video editing, object removal, and 3D reconstruction applications. |
Paper, Tweet |
| 2) AlphaDev - DeepMind's deep RL agent discovering faster sorting algorithms from scratch, now in LLVM. ● Assembly-level discovery: Searches over CPU assembly instructions rather than high-level code, finding micro-optimizations humans would miss. ● LLVM integration: Discovered sorting routines were integrated into the LLVM C++ standard library - the first major AI-discovered algorithm in production compiler infrastructure. ● Human-beating benchmarks: Found 70% faster sorting for very small inputs and 1.7% faster for large inputs, running billions of times per day worldwide. ● Algorithm discovery AI: A proof point for AI-driven algorithm discovery that would later be extended to matrix multiplication (AlphaEvolve) and other primitives. |
Paper, Tweet |
| 3) Sparse-Quantized Representation (SpQR) - Tim Dettmers' near-lossless LLM compression technique. ● 4.75-bit inference: Enables LLM inference at 4.75 bits per parameter with a 15% speedup over FP16 baselines. ● Near-lossless: Maintains model quality close to full-precision, with degradation measured in fractions of a percent on standard benchmarks. ● Outlier-aware quantization: Identifies and preserves sensitive "outlier" weights in higher precision while aggressively quantizing the rest. ● Quantization lineage: Part of Dettmers' influential quantization research (LLM.int8, QLoRA, SpQR) that made large-model inference accessible on consumer hardware. |
Paper, Tweet |
| 4) MusicGen - A simple and controllable model for music generation using a single-stage Transformer. ● Single-stage design: Unlike prior hierarchical music models, MusicGen uses a single Transformer predicting interleaved audio tokens. ● Multi-conditioning: Supports conditioning on text descriptions, melody audio, or both simultaneously. ● SOTA on text-to-music: Achieves strong performance on standard text-to-music benchmarks while being simpler to train and deploy. ● Open music generation: Meta's open release of MusicGen weights and code democratized music generation research and spawned community applications. |
Paper, Tweet |
| 5) Augmenting LLMs with Databases (ChatDB) - Combines an LLM with SQL databases as a symbolic memory framework. ● LLM-orchestrated SQL: The LLM generates SQL queries to read from and write to a database as its persistent memory. ● Structured reasoning: By externalizing state to a database, enables LLMs to handle complex multi-step tasks with consistent memory. ● Symbolic memory: Offers a more reliable alternative to embedding-based memory for tasks requiring exact recall and structured queries. ● Tool-use precursor: Part of the early 2023 research establishing LLM-as-orchestrator patterns that matured into today's agent frameworks. |
Paper, Tweet |
| 6) Concept Scrubbing in LLM (LEACE) - Least-squares Concept Erasure - erases a target concept from every layer of a neural network. ● Closed-form erasure: Provides a closed-form solution for removing linearly-encoded concepts (like gender) from representations at every layer. ● Theoretical guarantees: Mathematically guarantees the concept cannot be linearly recovered after erasure. ● Bias reduction: Applied to reduce gender bias in BERT embeddings while minimizing impact on other capabilities. ● Interpretability tool: Became a standard tool in the model-editing and interpretability literature for studying what information models use. |
Paper , Tweet |
| 7) Fine-Grained RLHF - Trains LMs with segment-level human feedback rather than whole-response preferences. ● Segment-level rewards: Provides multiple reward models targeting specific dimensions (factuality, relevance, fluency) at the span level. ● Long-form QA gains: Substantial improvements on long-form question answering where whole-response preferences are too coarse. ● Toxicity reduction: Enables targeted reduction of toxic spans without degrading overall response quality. ● Controllable RLHF: Enables model customization by emphasizing different reward dimensions at inference time. |
Paper, Tweet |
| 8) Hierarchical Vision Transformer (Hiera) - Pretrains ViTs with MAE while removing unnecessary multi-stage complexity. ● Simplified architecture: Strips away hand-designed components (shifted windows, relative position biases) from hierarchical ViTs like Swin. ● MAE pretraining: Leverages masked autoencoder pretraining to compensate for reduced inductive bias. ● Faster and more accurate: Achieves better accuracy and faster inference/training than prior hierarchical ViTs. ● Architecture minimalism: Reinforces the "bitter lesson" direction - simpler architectures with better pretraining beat complex hand-designed ones. |
Paper, Tweet |
| 9) Humor in ChatGPT - Explores ChatGPT's capabilities to grasp and reproduce humor. ● Joke repetition: Over 90% of 1,008 generated jokes were the same 25 jokes - revealing extreme mode collapse in humor generation. ● Structural overfitting: ChatGPT is overfit to particular joke structures (e.g., "Why did X? Because Y") and struggles with diverse humor styles. ● Humor comprehension: While generation is limited, ChatGPT can explain joke structure and recognize humor - showing a partial understanding. ● Creativity evaluation: An influential paper in the creativity-evaluation literature, documenting specific failures of LLM creative generation. |
Paper, Tweet |
| 10) Imitating Reasoning Process of Larger LLMs (Orca) - Microsoft's 13B model that imitates GPT-4's reasoning traces. ● Explanation tuning: Trains on detailed step-by-step explanations from GPT-4, not just final answers - capturing the reasoning process. ● Scale and diversity: Leverages millions of diverse imitation examples spanning reasoning tasks, dialogue, and instruction-following. ● Beats Vicuna-13B: Surpasses instruction-tuned Vicuna-13B in zero-shot reasoning, demonstrating explanation-data quality matters. ● Small-model reasoning: Kicked off a line of research on reasoning distillation that would continue through Orca 2 and into 2024's reasoning-specific SLMs. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Let's Verify Step by Step - OpenAI's landmark paper on process reward models for mathematical reasoning. ● Process supervision: Rewards each correct step of reasoning rather than just the final answer, capturing partial credit and providing much denser training signal. ● 78% MATH solve rate: Achieves state-of-the-art on a representative subset of the MATH benchmark - a significant jump over outcome-reward baselines. ● PRM800K dataset: Releases a massive dataset of 800K step-level correctness labels, enabling follow-up research on process reward models. ● Reasoning revolution foundation: Directly influenced OpenAI's o1/o3 reasoning models and the broader 2024-25 push toward process-supervised reasoning. |
Paper, Tweet |
| 2) No Positional Encodings (NoPE) - Shows explicit position embeddings aren't essential for decoder-only Transformers. ● Implicit positional learning: Decoder-only Transformers learn positional information from the causal attention mask alone - no explicit encoding needed. ● Length generalization: NoPE generalizes better to longer sequences than ALiBi and Rotary, which have surprising length-generalization issues. ● Architectural simplification: Removing positional encodings simplifies the architecture with no quality loss on standard tasks. ● Long-context influence: Informed the 2024 resurgence of interest in length-generalization-friendly architectures. |
Paper, Tweet |
| 3) BiomedGPT - A unified biomedical GPT for vision, language, and multimodal tasks. ● Unified biomedical model: Single model handling 5 task types across 20 public datasets spanning 15+ biomedical modalities (images, text, genomics). ● SOTA across benchmarks: Achieves state-of-the-art on biomedical VQA, summarization, and classification benchmarks. ● Generalist medical direction: Complements Med-PaLM M in proving that generalist medical AI models outperform task-specific specialists. ● Medical AI democratization: As an open model, makes generalist biomedical AI accessible to academic medical centers and healthcare startups. |
Paper, Tweet |
| 4) Thought Cloning - Imitation learning framework that learns to think as well as act. ● Cloning thoughts AND behavior: Clones both the actions and the internal verbal thoughts of human demonstrators, not just behavioral trajectories. ● BabyAI benchmark: Demonstrated on BabyAI with substantial improvement over behavior-only cloning, especially on out-of-distribution tasks. ● Interpretability bonus: Because the agent thinks in natural language, its decisions are interpretable and debuggable. ● Reasoning-agent precursor: A conceptual precursor to 2024's "reasoning agents" that produce explicit thought traces before acting. |
Paper, Tweet |
| 5) Fine-Tuning Language Models with Just Forward Passes (MeZO) - A memory-efficient zeroth-order optimizer for LLM fine-tuning. ● No backpropagation: Uses a memory-efficient zeroth-order SGD algorithm that requires only forward passes, eliminating the memory overhead of backprop. ● Inference-like memory: Fine-tunes large LLMs with the same memory footprint as inference - democratizes full-parameter fine-tuning. ● Comparable quality: Reaches comparable quality to backpropagation-based fine-tuning on many tasks despite using only forward passes. ● Memory-constrained tuning: Opens new possibilities for fine-tuning huge models on modest hardware by trading compute for memory. |
Paper , Tweet |
| 6) MERT - An acoustic music understanding model with large-scale self-supervised training. ● Music-specific SSL: Designed specifically for music (not speech/general audio) with appropriate teacher models and training objectives. ● Multi-teacher design: Combines multiple teacher models to capture different aspects of music (pitch, rhythm, timbre, harmony). ● Cross-task performance: Outperforms speech and generic audio approaches on music understanding benchmarks (genre, mood, tagging). ● Music foundation model: Part of the 2023 push toward domain-specific audio foundation models rather than one-size-fits-all speech/audio models. |
Paper , Tweet |
| 7) Bytes Are All You Need - Performs classification directly on file bytes without decoding. ● Raw-byte input: Trains Transformers directly on raw file bytes (PNG, WAV, etc.) rather than decoded tensors. ● Strong results: Achieves 77.33% ImageNet Top-1 accuracy on raw bytes and 95.42% on raw WAV for Speech Commands v2. ● Format-agnostic: A single architecture handles any file format without preprocessing pipelines. ● Infrastructure simplification: Suggests a future where models eat raw bytes and skip format-specific codecs - simpler pipelines with less preprocessing error. |
Paper, Tweet |
| 8) Direct Preference Optimization (DPO) - Rafailov et al.'s simpler alternative to RLHF that rivals full RL-based alignment. ● Classification, not RL: Reformulates preference learning as a classification problem on preference pairs, skipping the complex RL loop entirely. ● Theoretical equivalence: Mathematically equivalent to RLHF under certain assumptions, extracting the implicit reward function directly. ● Training stability: Much more stable and hyperparameter-robust than PPO-based RLHF, dramatically lowering the barrier to entry. ● Industry-wide adoption: Became the default alignment method throughout 2024 (Zephyr, Tulu, Llama 3 pipelines) and ushered in the era of RL-free preference optimization. |
Paper, Tweet |
| 9) SQL-PaLM - An LLM-based Text-to-SQL system built on PaLM-2. ● SOTA in both settings: Achieves state-of-the-art on Spider benchmark in both in-context learning and fine-tuning settings. ● Beats GPT-4 few-shot: Few-shot SQL-PaLM outperforms few-shot GPT-4 by 9.9% using a simple prompting approach. ● Improves on fine-tuned baselines: The few-shot setting even outperforms the previous fine-tuned SOTA by 3.8%. ● Text-to-SQL direction: Part of the Text-to-SQL surge that led to production NL-to-SQL systems in analytics platforms through 2024. |
Paper, Tweet |
| 10) CodeTF - An open-source Transformer library for state-of-the-art code LLMs. ● Code-LLM infrastructure: Provides pretrained code LLMs, popular code benchmarks, and standard methods for training and serving them efficiently. ● Unified interface: Consistent API across different code LLMs makes comparison and swapping straightforward. ● Benchmark-driven: Built-in evaluation on HumanEval, MBPP, and other code benchmarks enables easy empirical comparisons. ● Open-source code AI: Part of the 2023 expansion of open-source code LLM tooling that made private coding assistants practical for enterprise. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) QLoRA - Tim Dettmers' breakthrough technique enabling 65B LLM fine-tuning on a single 48GB GPU. ● 4-bit NF4 quantization: Introduces the NormalFloat 4-bit datatype optimized for normally-distributed weights with double-quantization for further memory savings. ● Paged optimizers: Uses paged NVIDIA Unified Memory to handle optimizer state memory spikes without OOM failures. ● 16-bit quality: Achieves quality matching full 16-bit fine-tuning despite aggressive quantization during training. ● Community fine-tuning enabler: Arguably the single most impactful 2023 paper for democratizing LLM fine-tuning - powered thousands of community checkpoints on Hugging Face. |
Paper, Tweet |
| 2) LIMA - Meta's 65B LLaMA fine-tuned on just 1,000 curated examples - showing alignment needs less data than believed. ● 1,000-example SFT: Achieves strong alignment with only 1,000 carefully curated prompt-response pairs, no RLHF needed. ● "Superficial Alignment Hypothesis": Proposes that a model's knowledge is learned in pretraining and alignment mostly teaches response style. ● GPT-4 competitive: Generates responses preferred over or equivalent to GPT-4 in 43% of cases, and much higher versus Bard. ● Data-quality over quantity: Became a foundational reference for the "quality over quantity" SFT paradigm that dominated later alignment work. |
Paper, Tweet |
| 3) Voyager - An LLM-powered embodied lifelong learning agent in Minecraft exploring autonomously. ● Skill library: Maintains a growing library of skills written as code - new skills are composed from existing ones, creating cumulative learning. ● Automatic curriculum: The LLM proposes its own curriculum of tasks, driving open-ended exploration without human intervention. ● GPT-4 integration: Uses GPT-4 for both planning and skill generation, demonstrating the power of modern LLMs as agent cognitive cores. ● Agent research milestone: A landmark agent paper showing LLM-powered agents can exhibit autonomous, cumulative learning in complex environments. |
Paper, Tweet |
| 4) Gorilla - A fine-tuned LLaMA-based model that surpasses GPT-4 on API call generation. ● API-specialized LLM: Specifically trained on massive API documentation corpora to produce correct API calls for TensorFlow Hub, HuggingFace, and PyTorch Hub. ● Beats GPT-4 on APIs: Outperforms GPT-4 on writing correct API calls - a narrow but important capability for tool use. ● Hallucination reduction: Major reduction in hallucinated API names and parameters compared to general-purpose LLMs. ● Tool-use LLM research: Established that specialized LLMs can meaningfully beat generalists at narrow capabilities - informing the later ecosystem of task-specialized models. |
Paper, Tweet |
| 5) The False Promise of Imitating Proprietary LLMs - Berkeley's critical analysis of open-source imitation of proprietary LLMs. ● Imitation limits: Shows that fine-tuning small open models on GPT-4 outputs creates a stylistic illusion without meaningfully improving factual capabilities. ● Stylistic mimicry: Imitation models learn to sound like GPT-4 but retain the base model's underlying capability ceiling. ● Base model leverage: Argues the higher-leverage action for open-source is building better base models, not imitating proprietary outputs. ● Field-redirecting: Shifted open-source research focus from distillation toward better pretraining data and scale, preparing the ground for strong foundation models like Llama 2. |
Paper , Tweet |
| 6) Sophia - A simple, scalable second-order optimizer with negligible per-step overhead. ● Second-order optimization: Uses a diagonal Hessian estimate to capture curvature information, going beyond first-order Adam. ● 2x speedup over Adam: On language modeling, achieves 2x speedup in step count, total compute, and wall-clock time. ● Practical efficiency: Despite being second-order, has only marginal per-step overhead versus Adam. ● Optimizer innovation: Part of the late-2023 wave of optimizer research (Lion, Sophia, Shampoo) aiming to replace Adam as the LLM-training default. |
Paper , Tweet |
| 7) The Larger They Are, the Harder They Fail - Reveals inverse-scaling failures in LLM code generation. ● Function-name swap test: Swaps default Python function names and observes that larger LLMs fail harder to adapt - they prefer memorized patterns. ● Inverse scaling: Counter to the usual "bigger is better" narrative, larger models prefer incorrect memorized continuations more strongly than smaller ones. ● Memorization vs. reasoning: Highlights the tension between memorization (which helps on training data) and reasoning (which helps on novel data). ● Safety implications: Important for safety/robustness - bigger models may be more brittle in adversarial or out-of-distribution settings. |
Paper, Tweet |
| 8) Model Evaluation for Extreme Risks - DeepMind's framework for evaluating models for catastrophic-risk capabilities. ● Dangerous-capability evaluation: Argues for evaluations targeting specifically dangerous capabilities (cyberattacks, bioweapons, manipulation) rather than general performance. ● Responsible decisions: Connects evaluation results to decisions about training, deployment, access control, and security investments. ● Red-team integration: Builds on dangerous-capability red-teaming methodology, formalizing it for frontier model governance. ● Governance influence: Directly informed the UK AI Safety Institute's frontier model evaluation framework and similar efforts. |
Paper, Tweet |
| 9) LLM Research Directions - A list of research directions for students entering LLM research. ● Research roadmap: Provides an organized list of open LLM research problems (factuality, reasoning, alignment, efficiency, evaluation). ● Accessibility focus: Specifically aimed at students and newcomers, identifying problems tractable on limited compute budgets. ● Course-material input: Became a reference for LLM-focused graduate seminars and reading groups. ● Field-guide document: Helped widen the LLM research field by lowering the barrier for newcomers to find productive research directions. |
Paper, Tweet |
| 10) Reinventing RNNs for the Transformer Era (RWKV) - Combines parallelizable training of Transformers with efficient RNN inference. ● Hybrid design: Achieves Transformer-style parallelizable training with RNN-style O(1) inference memory - best of both worlds. ● Transformer-parity performance: Matches similarly-sized Transformers on language modeling benchmarks while being dramatically cheaper at inference. ● Open community: Developed as an open-community project with releases spanning multiple scales and substantial community fine-tuning. ● Post-Transformer contender: Alongside Mamba and RetNet, positioned as one of the credible attempts to dethrone attention for efficient long-context inference. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Drag Your GAN (DragGAN) - Interactive point-based image manipulation on the generative image manifold. ● Point-based control: User clicks handle points on an image and drags them to target locations; the GAN smoothly moves image content accordingly. ● Precision editing: Achieves pixel-level control over image content - opening/closing mouths, rotating objects, changing poses - with minimal artifacts. ● User-interactive: Real-time feedback enables intuitive editing workflows that previous generative editing approaches lacked. ● Viral impact: Became one of the most viral AI papers of 2023, inspiring widespread interest and later extensions to diffusion models (DragDiffusion). |
Paper, Tweet |
| 2) Evidence of Meaning in Language Models Trained on Programs - Argues LMs learn meaning despite only next-token prediction. ● Programs as controlled input: Uses programs (which have well-defined semantics) to study whether LMs learn meaning versus surface patterns. ● Intermediate-state prediction: Shows that LMs trained on programs learn to predict program state after each statement - evidence of semantic understanding. ● Probe experiments: Careful probing experiments distinguish surface correlations from semantic representations. ● Emergence argument: Adds empirical grounding to the "LLMs have world models" debate that dominated 2023's interpretability discussions. |
Paper, Tweet |
| 3) Towards Expert-Level Medical Question Answering (Med-PaLM 2) - Google's second-generation medical LLM. ● MedQA SOTA: Scored up to 86.5% on the MedQA dataset (USMLE-style questions) - a new state-of-the-art matching expert physicians. ● Multi-benchmark leadership: Approaches or exceeds SOTA across MedMCQA, PubMedQA, and MMLU clinical topics datasets. ● Human evaluation quality: Physician evaluators rated Med-PaLM 2 answers as comparable to those of other physicians on most axes. ● Medical AI frontier: Set the bar for medical LLMs and informed FDA's thinking on AI-assisted clinical workflows. |
Paper, Tweet |
| 4) MEGABYTE - Multiscale Transformers for predicting million-byte sequences. ● Two-level architecture: Combines a large global Transformer over patches with a smaller local Transformer over bytes within each patch. ● Sub-quadratic attention: Achieves sub-quadratic self-attention cost through the patch-level hierarchy, enabling million-byte sequences. ● Decoding parallelism: Improves decoding parallelism compared to flat Transformers that must decode token-by-token. ● Tokenization-free: Operates directly on bytes without tokenizers - potentially avoiding tokenizer failure modes. |
Paper, Tweet |
| 5) StructGPT - A general framework for LLM reasoning over structured data. ● Structured data interface: Provides specialized interfaces for tables, knowledge graphs, and databases that LLMs can query. ● Iterative reasoning: LLM iteratively invokes interfaces to narrow down relevant information rather than ingesting the full structure. ● Zero-shot improvements: Improves zero-shot reasoning over structured data without task-specific training. ● Structured QA foundation: Part of the early work establishing LLM-over-structured-data as a distinct research area leading to 2024 enterprise SQL agents. |
Paper , Tweet |
| 6) TinyStories - Explores how small LMs can be and still speak coherent English. ● Synthetic story dataset: Creates a dataset of short stories using words understandable to 3-4 year olds, generated by GPT-3.5/GPT-4. ● Tiny but fluent: Shows that very small models (1-10M parameters) trained on this focused data can produce coherent multi-paragraph stories. ● Reasoning emergence: Even tiny models demonstrate reasoning and instruction-following capabilities when trained on the right data. ● Data-quality evidence: A foundational piece in the argument that data quality beats scale for many capabilities, influencing Phi series and later SLM work. |
Paper , Tweet |
| 7) DoReMi - Optimizes data mixtures for faster language model pretraining. ● Proxy-model reweighting: Trains a small 280M proxy model with group-DRO to derive optimal domain weights for the actual pretraining mixture. ● Scale transfer: Weights found by 280M proxy transfer to training 8B models (30x larger) without retuning. ● Training speedup: Achieves faster convergence and better downstream performance than uniform or human-tuned mixtures. ● Data mixture research: Kicked off a wave of data-mixture optimization work that became central to 2024 pretraining recipes (Llama 3, DCLM). |
Paper, Tweet |
| 8) CodeT5+ - An open code LLM family for code understanding and generation. ● Flexible architecture: Supports encoder-only, decoder-only, and encoder-decoder modes to handle diverse code tasks. ● 20-benchmark evaluation: Tested on 20 code-related benchmarks across zero-shot, fine-tuning, and instruction tuning. ● SOTA on multiple tasks: Achieves SOTA on code completion, math programming, and text-to-code retrieval. ● Training efficiency: Uses multiple training objectives combined to improve efficacy and compute efficiency. |
Paper, Tweet |
| 9) Symbol tuning - Fine-tunes LMs on in-context input-label pairs with natural-language labels replaced by arbitrary symbols. ● Symbolic abstraction: Replacing semantic labels with random symbols forces the model to rely on the demonstrations rather than label priors. ● ICL improvements: Boosts performance on unseen in-context learning tasks where the model must infer label semantics from examples. ● Algorithmic reasoning: Particularly improves algorithmic reasoning tasks that require following abstract patterns. ● ICL mechanism insight: Provides evidence about how ICL works and how to train models that better generalize the mechanism. |
Paper), Tweet |
| 10) Incidental Bilingualism in PaLM's Translation Capability - Explores where PaLM's translation ability actually comes from. ● 30M+ translation pairs: PaLM is exposed to over 30 million translation pairs across at least 44 languages within its training data, incidentally. ● Incidental bilingualism: Argues these "accidental" translation pairs substantially explain PaLM's translation capabilities. ● Scale-of-incidental-data: Highlights how large-scale pretraining can inadvertently cover specialized capabilities via byproducts of web data. ● Pretraining data insight: An influential study on understanding emergent capabilities via careful data auditing. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) LLM Explains Neurons in LLMs - OpenAI's automated interpretability pipeline using GPT-4 to explain GPT-2 neurons. ● GPT-4 as interpreter: Uses GPT-4 to generate natural-language explanations of what individual GPT-2 neurons detect. ● Automated scoring: Also uses GPT-4 to score how well an explanation predicts the neuron's actual activations on new text. ● Scale of interpretability: Enables scaling interpretability research to all neurons in a model, previously impractical with human effort. ● Automated interpretability era: Sparked the automated interpretability research program that continued in 2024 with SAE-based techniques and Golden Gate Claude demos. |
Paper, Tweet |
| 2) PaLM 2 - Google's second-generation PaLM powering Bard and Google products. ● Compute-optimal training: Trained compute-optimally on a larger, higher-quality, more multilingual corpus than PaLM 1. ● Multilingual strength: Major improvement in 100+ languages; supports translation, generation, and reasoning across a much broader language set. ● Reasoning competitive with GPT-4: Particularly strong on mathematical reasoning, approaching GPT-4 on several benchmarks. ● Flan-PaLM 2: The instruction-tuned version performs well on MMLU, BIG-bench Hard, and code generation - powering Google's consumer AI products. |
Paper, Tweet |
| 3) ImageBind - Meta's joint embedding across six modalities at once. ● Six-modality embedding: Learns a joint embedding space across images, text, audio, depth, thermal, and IMU data. ● Implicit binding via images: Images are the "central" modality that binds others - without requiring all-pairs training data. ● Zero-shot emergent capabilities: Enables cross-modal retrieval, arithmetic composition of modalities, and cross-modal generation/detection. ● Multi-modal foundation: Influenced 2024's unified multimodal models (Chameleon, GPT-4o) by showing the viability of unified embedding spaces. |
Paper, Tweet |
| 4) TidyBot - Combines LLM-based planning and perception with few-shot summarization to infer user preferences. ● Preference inference: Uses LLMs to infer generalized user preferences from a few examples of what objects belong where in a home. ● Generalization: Preferences inferred from specific examples generalize to future unseen objects. ● LLMs in embodied AI: Demonstrates LLMs' value for household robotics as high-level preference reasoners. ● Personalized robots: An early example of LLM-powered robot personalization - informing 2024 agent+robotics research. |
Paper, Tweet |
| 5) Unfaithful Explanations in Chain-of-Thought Prompting - Demonstrates CoT explanations can misrepresent the true reason for a model's prediction. ● Biased-CoT demonstration: Shows when models are biased toward incorrect answers (e.g., from few-shot bias), they generate CoT justifications supporting those wrong answers. ● Confident-but-wrong: The CoT sounds plausible and confident even when it's post-hoc rationalization rather than actual reasoning. ● Interpretability warning: An important caution that visible reasoning traces shouldn't be uncritically trusted as explanations. ● Safety implications: Part of the growing evidence base that CoT monitoring for safety has limitations. |
Paper , Tweet |
| 6) InstructBLIP - Visual-language instruction tuning built on BLIP-2. ● Instruction-aware Q-Former: Extends BLIP-2's Q-Former to be instruction-aware, dynamically extracting relevant visual features per instruction. ● 13 held-out datasets: Achieves state-of-the-art zero-shot performance on 13 held-out vision-language datasets. ● Beats BLIP-2 and Flamingo: Outperforms both BLIP-2 and Flamingo on most zero-shot benchmarks despite being a direct BLIP-2 extension. ● Open VLM progress: A prominent open-source VLM in 2023 that informed the later LLaVA-1.5, Qwen-VL, and InternVL lineage. |
Paper , Tweet |
| 7) Active Retrieval Augmented LLMs (FLARE) - Actively decides when and what to retrieve during generation. ● Dynamic retrieval: Retrieves only when the model's next-token confidence drops - not at fixed intervals. ● Anticipated content retrieval: Retrieves based on what the model is about to generate, not just the current context. ● Long-form knowledge-intensive tasks: Demonstrates superior or competitive performance on long-form knowledge-intensive generation tasks. ● Adaptive RAG: Established a research direction on adaptive/active retrieval that matured in 2024 with tools like Self-RAG and RankRAG. |
Paper, Tweet |
| 8) FrugalGPT - Strategies to reduce LLM inference cost while improving performance. ● Three-layer strategy: Combines prompt adaptation, LLM approximation, and LLM cascading to save cost. ● Model cascade: Routes easy queries to cheap models and escalates to expensive models only when needed. ● Cost reduction: Shows 98% cost savings while sometimes improving accuracy over using the most expensive model always. ● Production patterns: Influenced production LLM routing patterns and the 2024 ecosystem of LLM routers (RouteLLM, Martian). |
Paper, Tweet |
| 9) StarCoder - An open-access 15.5B code LLM with 8K context and 80+ programming languages. ● Fully-open release: Released under OpenRAIL with training data (The Stack), training code, and model weights all public. ● 80+ programming languages: Broadly multilingual in code, including non-English natural language in comments and strings. ● 8K context: Long context enables reasoning over larger code files than prior open code LLMs. ● Community base: Became the base for many community code models and powered LMStudio-style local coding assistants. |
Paper, Tweet |
| 10) MultiModal-GPT - A vision-language model for multi-round dialogue fine-tuned from OpenFlamingo. ● LoRA-based extension: Adds LoRA to OpenFlamingo's cross-attention and self-attention for efficient fine-tuning. ● Multi-round dialog: Specifically designed for multi-turn visual dialog, going beyond single-turn VQA. ● Open visual chatbot: An early fully-open visual chatbot that users could run locally. ● VLM dialog research: Informed the trajectory toward modern visual chatbots (LLaVA, Qwen-VL) that dominated open VLM research. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) scGPT - A foundation model for single-cell multi-omics pretrained on 10 million cells. ● Single-cell foundation: Applies LLM-style pretraining to single-cell transcriptomics data, tokenizing cells and genes. ● Massive scale: Pretrained on 10 million cells - the largest foundation model for single-cell biology at the time. ● Multi-task transfer: Transfers to cell-type annotation, gene perturbation prediction, multi-batch integration, and gene network inference. ● Bio-AI foundation: Part of the broader push toward domain-specific foundation models in biology, alongside ESMFold (proteins) and DNA foundation models. |
Paper, Tweet |
| 2) GPTutor - A ChatGPT-powered VSCode extension for code explanation. ● IDE integration: Delivered as a VSCode extension, making AI-assisted code explanation frictionless for developers. ● Prompt engineering for code: Uses code-relevant prompt engineering to produce more concise and accurate explanations than vanilla ChatGPT or Copilot. ● Context-aware prompts: Automatically includes relevant surrounding code in its prompts for better local explanations. ● Education use case: Particularly useful for junior developers learning unfamiliar codebases - an early AI-education product. |
Paper, Tweet |
| 3) Shap-E - OpenAI's conditional generative model for 3D assets producing implicit functions. ● Implicit function output: Generates implicit functions (NeRFs and signed distance functions) rather than fixed meshes - enabling both textured meshes and neural radiance field rendering. ● Text and image conditioning: Supports both text-to-3D and image-to-3D generation in a unified framework. ● Fast generation: Generates 3D assets in seconds rather than the minutes/hours required by optimization-based methods. ● 3D generative AI: A key step in the rapid evolution of 3D generation that would continue through 2024 with Splatter Image, TripoSR, and others. |
Paper, Tweet |
| 4) Are Emergent Abilities of LLMs a Mirage? - Stanford's critical re-examination of emergent abilities. ● Metric-choice argument: Argues "emergence" is often an artifact of using discontinuous metrics (like exact match) rather than smooth ones (like log-probability). ● Metric substitution: When re-analyzing with continuous metrics, many "emergent" capabilities appear smoothly with scale. ● Research methodology: Cautions the field against interpreting metric-choice artifacts as fundamental phase transitions. ● Best Paper at NeurIPS 2023: Influential paper that sparked extensive debate about what "emergence" really means in LLMs. |
Paper, Tweet |
| 5) Interpretable ML for Science with PySR - An open-source library for practical symbolic regression in the sciences. ● Distributed back-end: Built on a high-performance distributed back-end for scaling to larger scientific datasets. ● DL integration: Interfaces with several deep learning packages so symbolic regression can be used alongside neural networks. ● EmpiricalBench benchmark: Releases a new benchmark for quantifying the applicability of symbolic regression algorithms in science. ● Science-AI tool: Became a widely-used tool for scientists seeking interpretable equations from data, complementing black-box DL. |
Paper , Tweet |
| 6) PMC-LLaMA - A LLaMA model fine-tuned on 4.8 million medical papers. ● Domain-specific continued pretraining: Extends LLaMA's medical knowledge through continued pretraining on PubMed Central papers. ● Biomedical QA: Achieves high performance on biomedical QA benchmarks, narrowing the gap with proprietary medical LLMs. ● Open medical LLM: As a fully open model, accessible to academic medical researchers without proprietary model constraints. ● Medical LLM ecosystem: Part of the 2023 medical LLM boom that established the template of general LLM + medical continued pretraining + medical SFT. |
Paper , Tweet |
| 7) Distilling Step-by-Step! - A mechanism to train smaller models that outperform larger LLMs using fewer examples. ● Rationale extraction: Extracts CoT rationales from a larger teacher LLM, using them to augment smaller student model training. ● Smaller beats larger: Distilled student models outperform LLMs 500x+ larger in size on benchmark reasoning tasks. ● Data efficiency: Requires dramatically less labeled training data than standard fine-tuning by leveraging LLM rationales as free supervision. ● Distillation paradigm: Influential for the 2024 proliferation of reasoning-distilled small models like Orca 2, Phi-3, and later reasoning-specific SLMs. |
Paper, Tweet |
| 8) Poisoning Language Models During Instruction Tuning - Shows adversaries can poison LLMs via instruction tuning data. ● Poisoning attack: Demonstrates adversaries can contribute poisoned examples to instruction tuning datasets to induce specific misbehaviors. ● Cross-task poisoning: Poisoning can induce degenerate outputs across held-out tasks, not just the poisoned task - broad attack surface. ● Supply-chain vulnerability: Highlights the supply-chain vulnerability of using community-sourced instruction data. ● Alignment safety: Important for the field's thinking on data provenance and vetting for alignment datasets. |
Paper, Tweet |
| 9) Unlimiformer - Long-range Transformers with unlimited length input via external datastores. ● External datastore: Augments pre-trained encoder-decoder Transformers with a kNN datastore to support arbitrary-length input. ● Training-free: No additional training required - works with existing pretrained Transformers. ● Long-document tasks: Demonstrates usefulness in long-document summarization where context spans many thousands of tokens. ● RAG-enhancer: Could improve the performance of retrieval-enhanced LLMs by providing unlimited lookback over long conversations or documents. |
Paper, Tweet |
| 10) Learning to Reason and Memorize with Self-Notes - LLMs that deviate from input to explicitly "think" and memorize. ● Self-note generation: The model can pause processing input and generate explicit reasoning or memory notes in-stream. ● On-the-fly recall: Enables the LM to recall past information and perform reasoning when needed, not just in dedicated thinking phases. ● Length generalization: Scales better to longer sequences unseen during training than plain reasoning approaches. ● Scratchpad precursor: An intellectual precursor to 2024's reasoning models like o1 that produce long internal thinking traces. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Learning Agile Soccer Skills for a Bipedal Robot with Deep RL - DeepMind's bipedal humanoid robot playing soccer. ● End-to-end DRL: Synthesizes agile soccer skills (fast recovery, walking, kicking, tackling) for a miniature humanoid robot purely through deep RL. ● Dynamic movements: Produces genuinely athletic movements including falling and recovering - a major advance in bipedal robotics. ● Sim-to-real transfer: Successfully transfers policies from simulation to real hardware with robust performance. ● Humanoid robotics milestone: A visible capability demonstration that informed the 2024 boom in humanoid robot startups (Figure, 1X, Apptronik, Tesla). |
Paper, Tweet |
| 2) Scaling Transformer to 1M tokens with RMT - Recurrent Memory Transformer extends BERT's effective context to 2M tokens. ● Recurrent memory mechanism: Augments BERT with a recurrent memory that carries information across segments, enabling massive context lengths. ● 2M token context: Scales effective context to two million tokens while maintaining high memory retrieval accuracy. ● Segment-level recurrence: Processes input in segments while passing a compressed memory token stream across them. ● Long-context trend: Part of the 2023 explosion of long-context techniques that established ultra-long context as a viable research direction. |
Paper, Tweet |
| 3) Track Anything - An interactive tool for video object tracking and segmentation built on Segment Anything. ● SAM + tracking: Extends SAM's powerful single-image segmentation to video via click-based tracking over time. ● Flexible interaction: Users click on objects in any frame to start tracking, with propagation handling the rest automatically. ● Zero-shot video segmentation: Works zero-shot without per-video training - a major usability win. ● Video-editing tool: Quickly adopted for video editing, content creation, and autonomous system dataset labeling. |
Paper, Tweet |
| 4) A Cookbook of Self-Supervised Learning - A comprehensive overview of SSL techniques and practical considerations. ● Comprehensive coverage: Covers contrastive methods (SimCLR, MoCo), non-contrastive methods (BYOL, SimSiam), masked modeling (MAE, BEiT), and more. ● Practical guidance: Provides concrete advice on hyperparameters, augmentations, and debugging - not just theoretical overview. ● Failure modes: Documents known SSL failure modes (collapse, shortcut learning) and how to detect/mitigate them. ● Educational resource: Widely used as a reference by graduate students and newly-SSL-curious researchers. |
Paper, Tweet |
| 5) Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond - A practical guide for practitioners working with LLMs. ● Practitioner-focused: Organizes LLM knowledge for engineering and product teams deploying LLMs rather than just academic researchers. ● Use-case catalog: Walks through many concrete use cases with practical applications and limitations. ● Deployment considerations: Covers real-world concerns (cost, latency, hallucination) in a structured way. ● Applied LLM reference: Became a common reference in applied AI discussions during the 2023-2024 enterprise LLM rollout. |
Paper , Tweet |
| 6) AudioGPT - Connects ChatGPT with audio foundational models for speech, music, sound, and talking head tasks. ● LLM as audio orchestrator: ChatGPT plans and dispatches audio tasks across specialist models (TTS, ASR, music generation, sound effects). ● Modality transformation: Converts speech to text for ChatGPT processing, then generates speech from ChatGPT's text output. ● Spoken dialogue: Enables end-to-end spoken dialogue where users talk to ChatGPT and it talks back. ● Multi-modal agent pattern: An early example of the LLM-as-orchestrator pattern applied to audio, presaging 2024's fully multimodal voice agents. |
Paper , Tweet |
| 7) DataComp - A multimodal dataset benchmark with 12.8B image-text pairs. ● Scale and scope: 12.8 billion image-text pairs - one of the largest multimodal datasets ever released. ● Benchmark framework: Provides a benchmark where researchers compete to find the best data subset, not just train the best model on fixed data. ● Data-centric AI: Emphasizes data curation as the primary research axis, with model architecture and training held constant. ● Data research infrastructure: Enabled a wave of data-filtering research (DataComp-XL, fastText filtering) that significantly advanced multimodal model training. |
Paper, Tweet |
| 8) ChatGPT for Information Extraction - A deeper assessment of ChatGPT on information extraction tasks. ● Extraction-task benchmark: Evaluates ChatGPT on named entity recognition, relation extraction, event extraction, and more. ● Competitive but imperfect: Competitive with specialized IE models on many tasks but still falls short of fine-tuned SOTA on others. ● Prompt sensitivity: Highlights significant prompt sensitivity in extraction outputs - practical challenges for deployment. ● Practical assessment: A sober empirical reference informing whether to swap traditional IE pipelines for LLM-based alternatives. |
Paper, Tweet |
| 9) Comparing Physician vs ChatGPT (JAMA) - A JAMA Internal Medicine study comparing physician and ChatGPT responses. ● Rigorous study: Published in JAMA Internal Medicine - a high-bar medical journal, not just an arxiv preprint. ● ChatGPT preferred: Chatbot responses were preferred over physician responses and rated significantly higher in both quality and empathy. ● 79% preference: ChatGPT's responses were preferred in 79% of cases, often described as more empathetic. ● Medical AI discussion catalyst: Sparked widespread discussion about the role of AI in clinical communication and patient care. |
Paper, Tweet |
| 10) Stable and Low-Precision Training for Large-Scale Vision-Language Models - Methods for accelerating and stabilizing large VLM training. ● Mixed-precision techniques: Introduces stable training strategies for bfloat16/float16 mixed precision of large VLMs. ● Training speedup: Significantly accelerates VLM training while avoiding common instabilities (loss spikes, NaN). ● Scale-friendly: Scales to the largest open-source VLMs, enabling more research at serious scale. ● Infrastructure contribution: Practical infrastructure advances that benefited the entire VLM research community. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) DINOv2 - Meta's self-supervised vision foundation model producing robust features without labels. ● Fully self-supervised: Trained purely with SSL on 142M curated images - no labels needed, just clever pretraining objectives. ● Universal features: Produces features useful for image classification, instance retrieval, video understanding, depth estimation, and pixel-level tasks. ● Frozen-backbone usage: Features work well with simple linear probes, no fine-tuning - making DINOv2 a drop-in visual backbone. ● Vision foundation standard: Became the default vision backbone for open-source VLMs (LLaVA, InternVL) and vision research through 2024. |
Paper, Tweet |
| 2) Learning to Compress Prompts with Gist Tokens - Trains LMs to compress prompts into reusable "gist" tokens. ● Prompt compression: Compresses long prompts into a small set of gist tokens that encode the same instruction information. ● 26x compression: Achieves 26x prompt compression with negligible quality loss on downstream tasks. ● Up to 40% FLOPs reduction: Substantial inference-time compute savings on repeated prompts. ● Production optimization: Particularly valuable for systems with long system prompts reused across many requests - a pattern that became ubiquitous in 2024 agent systems. |
Paper, Tweet |
| 3) Scaling Biomolecular Simulations with Equivariant Models - A framework for large-scale biomolecular simulation using equivariant deep learning. ● Equivariant network scaling: Achieves high accuracy through equivariant deep learning that respects molecular symmetries. ● 44M atom HIV capsid: Simulated a complete, all-atom, explicitly solvated HIV capsid structure of 44 million atoms. ● Nanosecond-scale stable dynamics: Performs nanoseconds-long stable simulations of protein dynamics - much longer than prior ML-MD simulations. ● Perlmutter deployment: Scales to the Perlmutter supercomputer, demonstrating ML-accelerated molecular dynamics at HPC scale. |
Paper, Tweet |
| 4) Evaluating Verifiability in Generative Search Engines - Audits popular generative search engines for citation accuracy. ● Human evaluation: Performs rigorous human evaluation of Bing Chat, Perplexity AI, and NeevaAI responses. ● Citation failure rate: Finds only 52% of generated sentences are supported by citations and only 75% of citations actually support the claim. ● Verifiability gap: Reveals a significant gap between generative search engines' citation promises and their actual reliability. ● Trust-in-AI research: Important empirical foundation for subsequent research on grounded generation and RAG accuracy. |
Paper, Tweet |
| 5) Generative Disco: Text-to-Video Generation for Music Visualization - An LLM + T2I system for music visualization. ● LLM+T2I composition: Uses LLMs to interpret music and generate scene descriptions that text-to-image models then visualize. ● Music-video generation: Produces music-driven video visualizations - an early text-to-video adjacent capability. ● Creative tool direction: Part of the 2023 wave of creative AI tools targeting content creators and music producers. ● HCI contribution: Notable for its focus on user experience and creative workflow rather than pure model capability. |
Paper , Tweet |
| 6) Architectures of Topological Deep Learning: A Survey on Topological Neural Networks - A comprehensive survey on topological neural networks. ● Topological DL taxonomy: Surveys neural networks operating on topological structures beyond graphs (simplicial complexes, cell complexes, hypergraphs). ● Architecture catalog: Catalogs major topological DL architectures with their mathematical foundations. ● Beyond-graph DL: Positions topological DL as the natural generalization of GNNs for higher-order interactions. ● Reference survey: Standard reference for researchers entering the topological DL subfield. |
Paper , Tweet |
| 7) Visual Instruction Tuning (LLaVA) - Uses language-only GPT-4 to generate multimodal instruction-following data. ● GPT-4-generated multimodal data: Bootstraps multimodal instruction data using only language-only GPT-4 given captions and bounding boxes - no direct visual access needed. ● End-to-end training: Introduces LLaVA, an end-to-end trained large multimodal model combining CLIP vision encoder and Vicuna LLM. ● Lightweight architecture: Simple projection layer between vision encoder and LLM - cheap and effective. ● Open VLM revolution: LLaVA became the most influential open-source VLM architecture, spawning LLaVA-1.5, LLaVA-NeXT, and countless derivatives through 2024. |
Paper, Tweet |
| 8) ChatGPT: Applications, Opportunities, and Threats - A comprehensive overview of ChatGPT's applications and risks. ● Application mapping: Surveys ChatGPT applications across education, healthcare, law, research, and creative industries. ● Opportunities & threats: Explicitly balances productive applications with threats like misinformation, academic integrity, and job displacement. ● Policy-relevant: Widely cited in policy discussions about AI governance and educational institution responses. ● Field-orienting: Helped the broader research community orient to ChatGPT's implications during its initial rapid adoption. |
Paper, Tweet |
| 9) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models - A framework inferring tool sequences for compositional reasoning. ● Tool composition: LLM plans sequences of tools (Python, search, calculator, knowledge retrievers) to solve complex problems. ● SOTA on ScienceQA: Achieves 87% accuracy on ScienceQA and 99% on TabMWP - surpassing prior specialized models. ● Plug-and-play design: Tools can be added/removed flexibly without retraining the LLM. ● Agent framework precursor: Influential in the agent/tool-use research direction leading to 2024 agent frameworks. |
Paper, Tweet |
| 10) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models - High-resolution video synthesis with latent diffusion. ● Latent video diffusion: Extends Stable Diffusion-style latent diffusion to video generation with temporal attention layers. ● 512x1024 driving videos: Validates on real driving videos at 512x1024 resolution, achieving state-of-the-art performance. ● Creative content: Also validated on creative content creation tasks, demonstrating versatility beyond driving scenarios. ● Video generation foundation: A key paper in the latent-video-diffusion lineage that led to Stable Video Diffusion, SVD, and later open video models. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields - Combines mip-NeRF 360 with grid-based models for 22x faster training. ● Anti-aliasing for grids: Brings mip-NeRF's anti-aliasing technique to fast grid-based NeRF architectures, combining quality and speed. ● 22x training speedup: Trains 22x faster than mip-NeRF 360 while achieving comparable or better quality. ● Best of both worlds: Overcomes the historical tradeoff between slow-but-accurate MLP NeRFs and fast-but-aliased grid NeRFs. ● 3D reconstruction: A practical improvement that made high-quality NeRFs much more accessible to production use cases. |
Paper, Tweet |
| 2) Generative Agents: Interactive Simulacra of Human Behavior - Stanford/Google's landmark paper on LLM-powered social simulations. ● "Smallville" simulation: Creates a town of 25 LLM-powered agents who plan their days, remember experiences, form relationships, and even organize parties. ● Memory-reflection-planning: Combines a complete memory stream, synthesized reflections, and dynamic planning to create emergent social behavior. ● Emergent social dynamics: Agents exhibit emergent phenomena like information diffusion, relationship formation, and coordinated planning. ● Agent research foundation: One of the most influential 2023 agent papers, sparking the explosion of LLM agent simulation work including AutoGPT, BabyAGI, and CAMEL. |
Paper, Tweet |
| 3) Emergent Autonomous Scientific Research Capabilities of LLMs - An agent combining LLMs for autonomous scientific experiments. ● Autonomous experiment design: LLM agent designs, plans, and executes chemistry experiments with minimal human guidance. ● Real chemistry execution: Successfully performs catalyzed cross-coupling reactions - actual chemistry, not simulated. ● Emergent research behavior: Demonstrates emergent research capabilities like hypothesis generation, experimental iteration, and failure recovery. ● AI-scientist precursor: An influential paper establishing LLM-driven scientific agents as a research direction that would evolve through 2024's AI Scientist and BioDiscoveryAgent. |
Paper, Tweet |
| 4) Automatic Gradient Descent: Deep Learning without Hyperparameters - A hyperparameter-free first-order optimizer that leverages architecture. ● Architecture-aware optimization: Derives optimization algorithms that explicitly account for neural network architecture rather than treating it as a black box. ● No hyperparameters: Eliminates learning rate tuning - a hyperparameter-free optimizer that just works. ● ImageNet scale: Successfully trains CNNs at ImageNet scale, demonstrating the approach scales to realistic workloads. ● Optimizer research: Contributes to the ongoing search for optimizers that reduce tuning burden, complementing Adam-era hyperparameter-heavy methods. |
Paper, Tweet |
| 5) ChemCrow: Augmenting LLMs with Chemistry Tools - An LLM chemistry agent with 13 expert-designed tools. ● 13 chemistry tools: Integrates 13 expert-designed tools covering synthesis planning, molecule validation, safety checks, and more. ● Cross-domain chemistry: Handles synthesis, drug discovery, and materials design within a unified agent framework. ● Beats vanilla GPT-4: Substantially outperforms vanilla GPT-4 on chemistry tasks by grounding in specialized tools. ● Scientific-agent direction: Alongside BoilerBot and similar systems, established the template for domain-specific scientific agents using LLMs + tools. |
Paper , Tweet |
| 6) One Small Step for Generative AI, One Giant Leap for AGI - A complete survey on ChatGPT and GPT-4. ● Complete AIGC survey: Comprehensive survey of the ChatGPT/GPT-4 era covering models, applications, and future directions. ● AGI-oriented framing: Analyzes ChatGPT/GPT-4 as stepping stones toward AGI rather than endpoints themselves. ● Technology + society: Balances technical analysis with discussion of societal, economic, and ethical implications. ● Reference timeline: A widely-cited reference for summarizing the 2022-2023 generative AI inflection point. |
Paper , Tweet |
| 7) OpenAGI: When LLM Meets Domain Experts - An open-source research platform for LLM agents manipulating domain expert models. ● LLM-as-orchestrator platform: LLMs plan and orchestrate calls to specialized domain expert models (vision, speech, language). ● Multi-step task evaluation: Provides a standardized evaluation framework for complex multi-step tasks requiring tool composition. ● Open research tooling: Fully open-source platform for academic researchers to compare agent designs and tool-use strategies. ● Agent research infrastructure: Part of the 2023 wave establishing shared infrastructure for LLM agent research. |
Paper, Tweet |
| 8) AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models - A benchmark using real human standardized exams. ● Real human exams: Uses actual college entrance exams, law school admission tests, math competitions, and civil service exams - not synthetic benchmarks. ● Multilingual coverage: Includes English and Chinese versions of exams, testing bilingual capability. ● Human-comparable scoring: Makes it natural to compare foundation models to human performance percentiles on identical exams. ● Real-world evaluation: Became an important benchmark for claims about "expert-level" or "human-comparable" foundation model performance. |
Paper, Tweet |
| 9) Teaching Large Language Models to Self-Debug - Teaches LLMs to debug their own code via few-shot demonstrations. ● Self-debugging via explanation: LLMs identify mistakes by explaining their generated code in natural language, then iteratively fix errors. ● Few-shot teaching: Requires only a handful of debugging demonstrations to enable the capability across tasks. ● Text-to-SQL SOTA: Achieves state-of-the-art on several code generation tasks including text-to-SQL generation. ● Self-correction research: Influential paper establishing self-debugging as a distinct capability, informing 2024 reasoning + self-correction agents. |
Paper, Tweet |
| 10) Segment Everything Everywhere All at Once (SEEM) - A promptable, interactive segmentation model. ● Unified promptable model: Handles various segmentation tasks (semantic, instance, referring, interactive) in one promptable model. ● Multi-modal prompts: Accepts text, click, box, scribble, and mask prompts - broader than SAM's prompt vocabulary. ● Open-vocabulary: Competitive on open-vocabulary and interactive segmentation benchmarks. ● SAM-complement: A more flexible alternative to SAM with richer prompting - both pushed interactive segmentation to production. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Segment Anything (SAM) - Meta's foundational model for image segmentation with massive training data release. ● Largest segmentation dataset: Releases SA-1B with over 1 billion masks on 11 million licensed images - by far the largest segmentation dataset ever. ● Promptable segmentation: Introduces a new promptable segmentation task where users provide clicks, boxes, or text to indicate what to segment. ● Zero-shot SOTA: Zero-shot performance is competitive with or superior to fully supervised specialist models. ● Vision foundation model: One of the highest-impact vision papers of 2023, transforming how the field thinks about foundation models for dense prediction. |
Paper, Tweet |
| 2) Instruction Tuning with GPT-4 - Uses GPT-4 to generate instruction-following data for LLM fine-tuning. ● GPT-4 as data generator: First systematic attempt to use GPT-4 (rather than human annotators) to produce instruction-following data. ● 52K bilingual examples: Releases 52K unique English and Chinese instruction-following examples. ● LLaMA fine-tuning: Uses the dataset to instruction-tune LLaMA models, leading to superior zero-shot performance on new tasks. ● Synthetic data wave: Part of the 2023 wave establishing synthetic data from strong models as the dominant alignment data source. |
Paper, Tweet |
| 3) Eight Things to Know about Large Language Models - Sam Bowman's influential primer on key LLM considerations. ● Eight key insights: Organizes LLM knowledge into eight punchy observations covering capabilities, limitations, and emergent behaviors. ● Policy-relevant framing: Written in accessible language suitable for researchers, policymakers, and the broader public. ● Capability-risk balance: Each "thing to know" comes with practical implications for deployment and safety. ● Community reference: Became one of the most widely-shared overviews of LLMs in 2023, frequently cited in onboarding materials and policy discussions. |
Paper, Tweet |
| 4) A Survey of Large Language Models - A 50-page comprehensive survey on LLMs. ● Broad coverage: 50+ pages covering LLM architecture, pretraining, fine-tuning, alignment, evaluation, and applications. ● Chronological evolution: Traces the lineage from early transformers through GPT, PaLM, LLaMA, and beyond. ● Frequently updated: Authors have updated the survey multiple times to keep pace with rapidly evolving field. ● Go-to reference: Became one of the most widely cited LLM surveys, frequently used in graduate courses and research onboarding. |
Paper, Tweet |
| 5) Baize: An Open-Source Chat Model with Self-Chat Data - An open chat model fine-tuned with LoRA on self-chat dialogs. ● Self-chat data generation: Generates 100K dialogs by having ChatGPT converse with itself, then fine-tunes on these dialogs. ● LoRA fine-tuning: Uses parameter-efficient LoRA fine-tuning for compute efficiency. ● Multiple model sizes: Releases 7B, 13B, and 30B parameter models along with the dialog data. ● Open chatbot ecosystem: Part of the 2023 proliferation of open chat models (Vicuna, Alpaca, Koala, Baize) building on LLaMA. |
Paper , Tweet |
| 6) MACHIAVELLI Benchmark - A benchmark of 134 text-based Choose-Your-Own-Adventure games for measuring ethical trade-offs. ● 134 interactive games: Uses 134 text adventures with ~500K scenarios to evaluate agent behavior in rich social/ethical contexts. ● Reward vs. ethics trade-off: Specifically measures how agents trade off goal-achievement (rewards) against ethical behavior (harm, deception, power-seeking). ● Dark side measurement: Surfaces unethical behaviors like deception, manipulation, and power-seeking that may emerge when agents optimize for rewards. ● Agent safety research: A foundational benchmark for the emerging "agent safety" sub-field in 2023-2024. |
Paper , Tweet |
| 7) Better Language Models of Code through Self-Improvement - Self-improving code LLMs via pseudo-data generation. ● Self-improvement loop: Generates pseudo training data from the model's own knowledge gained through pretraining and fine-tuning. ● Iterative bootstrapping: Adds the generated data to the training set for the next training iteration, creating a self-improvement loop. ● Multi-framework gains: Shows consistent improvements across different code LLM frameworks on code generation tasks. ● Self-improvement research: An early example of the self-improvement paradigm for LLMs that would later mature in 2024's self-rewarding and self-play approaches. |
Paper, Tweet |
| 8) Summary of ChatGPT/GPT-4 Research - An overview of ChatGPT and GPT-4 applications based on 194 papers. ● 194-paper meta-analysis: Analyzes 194 relevant papers to produce an integrated overview of the ChatGPT/GPT-4 research landscape. ● Capability-limitation balance: Discusses capabilities, limitations, concerns, and research directions in structured fashion. ● Application catalog: Catalogs applications across education, healthcare, coding, writing, and specialized domains. ● Research synthesis: Useful as a condensed view of the first six months of post-ChatGPT research explosion. |
Paper, Tweet |
| 9) Pythia - EleutherAI's suite for analyzing LLMs across training and scaling. ● 16-model suite: 16 LLMs trained on public data (The Pile) ranging from 70M to 12B parameters, all with identical training recipes. ● Training checkpoints: Releases 154 training checkpoints per model, enabling analysis of learning dynamics across training. ● Scale-controlled research: The consistent methodology across sizes enables rigorous scaling analyses without confounders. ● Interpretability foundation: Became the foundational testbed for mechanistic interpretability research through 2024. |
Paper, Tweet |
| 10) SegGPT: Segmenting Everything In Context - Unifies segmentation tasks into a generalist in-context model. ● In-context segmentation: Uses in-context examples (input-mask pairs) to define the segmentation task at inference time. ● Task generalization: Handles semantic, instance, panoptic, and referring segmentation through the same in-context interface. ● Training-free adaptation: Adapts to new segmentation tasks without retraining - just provide example pairs. ● Prompt-based vision: Part of the 2023 push to bring LLM-style in-context learning to vision tasks. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) BloombergGPT - A 50B-parameter LLM specialized for finance. ● Largest finance dataset: 363 billion tokens of financial data plus 345 billion tokens from general-purpose datasets - the largest domain-specific LLM dataset at the time. ● Finance-task specialization: Outperforms existing models on financial NLP tasks (sentiment, NER, classification). ● General capability preservation: Maintains competitive performance on general LLM benchmarks despite heavy finance specialization. ● Domain-specific LLM blueprint: Established the template for well-resourced domain-specific LLMs (medical, legal, financial) through 2023-2024. |
Paper, Tweet |
| 2) Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA) - A low-cost bimanual robot manipulation system. ● Action Chunking with Transformers (ACT): Introduces ACT, a generative model that predicts action chunks (sequences) rather than single actions - dramatically improving task success. ● Low-cost hardware: The ALOHA platform uses ~$20K of off-the-shelf parts, making bimanual manipulation research broadly accessible. ● Fine-grained tasks: Demonstrates difficult real-world tasks like threading zip ties, unwrapping candy, and slotting battery cells. ● Robotics research catalyst: ALOHA became one of the most influential robotics platforms of 2023-2024, powering downstream research like Mobile ALOHA. |
Paper, Tweet |
| 3) HuggingGPT (Jarvis) - ChatGPT orchestrates HuggingFace models to solve complex AI tasks. ● LLM as controller: ChatGPT plans tasks, selects appropriate HuggingFace models, dispatches sub-tasks, and summarizes results. ● Model Hub integration: Directly leverages the HuggingFace model hub, giving ChatGPT access to thousands of specialized models. ● Four-stage pipeline: Task planning → model selection → task execution → response generation - a clear architecture influential in later agent frameworks. ● LLM-as-orchestrator pattern: A canonical example of the LLM-as-orchestrator paradigm that dominated 2023 agent research. |
Paper, Tweet |
| 4) ChatDoctor - A medical chat model fine-tuned on LLaMA with medical domain knowledge. ● 700 diseases covered: Collects data on approximately 700 diseases to provide broad medical coverage. ● 5K doctor-patient conversations: Generates 5,000 doctor-patient conversations for fine-tuning, simulating realistic clinical dialog. ● LLaMA foundation: Built on LLaMA, part of the 2023 wave of LLaMA-based domain-specific fine-tunes. ● Medical LLM lineage: Early entry in the medical LLM space that would continue with PMC-LLaMA, Meditron, and later specialized clinical LLMs. |
Paper, Tweet |
| 5) LLaMA-Adapter - Efficient fine-tuning of LLaMA with zero-init attention. ● Zero-init attention: Uses zero-initialized attention layers so the adapter starts from identity function, preserving pretrained behavior. ● Tiny parameter count: Only 1.2M trainable parameters adapt LLaMA into an instruction-follower - extremely parameter-efficient. ● Alpaca-quality responses: Matches Alpaca's response quality (fully fine-tuned 7B) with far fewer trainable params. ● Multimodal extension: Extended to accept multi-modal inputs (images), an early step toward efficient VLM adapters. |
Paper , Tweet |
| 6) ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks - Empirically shows ChatGPT beats MTurk on text annotation. ● Multi-task comparison: Evaluates ChatGPT against MTurk crowd workers on relevance, topic, stance, frames, and general annotation tasks. ● Higher accuracy: ChatGPT achieves higher zero-shot accuracy than crowd workers on most tested annotation tasks. ● 20x cost reduction: ChatGPT's per-annotation cost is approximately 20x cheaper than MTurk. ● Annotation economy shift: Marked a real turning point in how NLP researchers think about dataset construction, accelerating LLM-powered annotation pipelines. |
Paper , Tweet |
| 7) Language Models can Solve Computer Tasks (RCI) - LLM agent executes computer tasks via recursive self-criticism. ● Recursive Criticism and Improvement: A prompting scheme where the LLM generates actions, critiques its own output, and improves iteratively. ● Computer task execution: Demonstrates LLMs can execute real computer tasks (navigation, form-filling, data entry) with simple prompting. ● Zero-shot without training: Works zero-shot without any task-specific fine-tuning, using only prompting. ● Web-agent foundation: An early demonstration of LLM-based web/computer agents that informed 2024's agent framework explosion. |
Paper, Tweet |
| 8) DERA - Dialog-Enabled Resolving Agents for enhancing LLM completions. ● Multi-agent dialog: Uses multiple LLM "agents" that communicate feedback and iteratively refine outputs through dialog. ● Role-based agents: Typically pairs a Researcher and Decider with distinct responsibilities, producing higher-quality outputs. ● Beats base GPT-4: DERA outperforms base GPT-4 on clinically-focused tasks requiring careful reasoning. ● Multi-agent LLM pattern: An early example of the multi-agent debate/collaboration pattern that became widespread in 2024 (AutoGen, CrewAI). |
Paper, Tweet |
| 9) Natural Selection Favors AIs over Humans - Dan Hendrycks on why AI systems will outcompete humans evolutionarily. ● Evolutionary framing: Argues that AI systems will become more evolutionarily "fit" than humans in competition for resources and influence. ● Selection pressures: Identifies specific selection pressures (efficiency, resource acquisition, goal-directedness) that favor AI over humans. ● Risk analysis: Discusses potential dangers including loss of human agency, and strategies to mitigate them. ● AI safety framing: Contributed a memorable framing to AI safety discussions during the 2023 existential-risk conversation. |
Paper, Tweet |
| 10) Machine Learning for Partial Differential Equations - A review of ML approaches to PDEs. ● Comprehensive review: Examines ML avenues for solving, learning, and discovering partial differential equations. ● Method taxonomy: Covers neural PDE solvers, Fourier neural operators, physics-informed neural networks, and learned simulators. ● Scientific ML reference: Positions ML-for-PDEs as a coherent sub-field with its own methods and benchmarks. ● SciML roadmap: Influential in the growing scientific machine learning community, informing later foundation-model work on physics simulation. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Sparks of Artificial General Intelligence: Early Experiments with GPT-4 - Microsoft Research's influential investigation of early GPT-4. ● Pre-release GPT-4 access: Examines an early, less-aligned GPT-4 while still in active development at OpenAI. ● "Sparks of AGI" claim: Argues GPT-4 shows sparks of general intelligence across diverse domains - a provocative and widely-debated claim. ● Rich demonstrations: Includes stunning demonstrations of GPT-4's capabilities on math, coding, vision, theory-of-mind, and more. ● Discourse-defining paper: Set much of the 2023 public discourse around AGI timelines and LLM capabilities. |
Paper, Tweet |
| 2) Reflexion - An autonomous agent with dynamic memory and self-reflection. ● Self-reflection loop: Agent reflects on failed attempts in natural language and stores reflections in episodic memory for future use. ● Verbal reinforcement: Uses verbal self-feedback rather than gradient updates - an alternative to RL for agent improvement. ● Task-specific action choice: Enhances task-specific action selection through reflection on prior reasoning traces. ● Agent paradigm: Became one of the foundational agent papers of 2023, widely cited as a canonical example of LLM self-improvement via verbal reflection. |
Paper, Tweet |
| 3) Capabilities of GPT-4 on Medical Challenge Problems - Microsoft's medical evaluation showing GPT-4 passing USMLE handily. ● 20+ points above passing: Exceeds USMLE passing score by over 20 points - a remarkable margin for a generalist model. ● Beats Med-PaLM: Outperforms specialist medical models including Med-PaLM (prompt-tuned Flan-PaLM 540B). ● No medical fine-tuning: Achieves these results without any medical-specific fine-tuning - pure generalist capability. ● Medical-AI turning point: A key data point showing generalist frontier models could match or beat specialist medical LLMs, shifting the medical AI strategic landscape. |
Paper, Tweet |
| 4) GPTs are GPTs - OpenAI/UPenn's early look at LLM labor market impacts. ● Occupational analysis: Systematically assesses which US occupations and tasks are most exposed to LLM automation. ● 80% of workers exposed: Estimates ~80% of US workers have at least 10% of tasks affected, and 19% have at least 50% affected. ● White-collar focus: Shows exposure concentrated in higher-paying, more educated occupations - reversing traditional automation patterns. ● Policy-defining paper: Shaped 2023 policy discussions about AI's economic impact and informed subsequent labor economics research. |
Paper, Tweet |
| 5) CoLT5 - Faster long-range Transformers via conditional computation. ● Conditional computation: Routes important tokens through heavy branches while light tokens get a cheap path - saving compute on easy tokens. ● Per-layer conditioning: Applies conditional computation in both feedforward and attention layers. ● Long-input efficiency: Particularly effective for long documents where most tokens are routine and only a few need deep processing. ● Long-context efficiency: Part of the efficient-attention research line that would continue with MoE and conditional-routing approaches through 2024. |
Paper , Tweet |
| 6) Artificial Muses: Generative AI Chatbots Have Risen to Human-Level Creativity - Compares AI and human creativity. ● Head-to-head comparison: Compares human-generated ideas with those from ChatGPT, YouChat, and other chatbots on creativity metrics. ● Only 9.4% beat GPT-4: Only 9.4% of humans were judged more creative than GPT-4 - a striking finding about LLM creative capabilities. ● Collaborative creative use: Concludes AI systems are valuable creative assistants rather than mere imitators. ● Creativity evaluation: Part of the 2023 creativity-research cluster empirically testing claims about LLM creative limitations. |
Paper , Tweet |
| 7) A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series - Systematic evaluation of the GPT series. ● 9 NLU tasks, 21 datasets: Evaluates GPT-3 and GPT-3.5 variants on 9 natural language understanding tasks using 21 datasets. ● Series-wide comparison: Covers the full GPT-3 and GPT-3.5 family (davinci, davinci-002, davinci-003, ChatGPT) enabling lineage-tracking. ● Capability regression detection: Identifies task-specific regressions and improvements across generations. ● Practical reference: Used by practitioners choosing between OpenAI API model variants for specific tasks. |
Paper, Tweet |
| 8) Context-faithful Prompting for Large Language Models - Prompting techniques to improve LLM faithfulness to given context. ● Faithfulness-improving strategies: Introduces opinion-based prompts and counterfactual demonstrations that improve context adherence. ● Parametric-knowledge override: Helps LLMs prioritize context-provided information over conflicting parametric knowledge. ● RAG-relevant: Particularly useful for RAG setups where LLMs must prioritize retrieved documents over their baseline knowledge. ● Grounding research: Part of the broader 2023 work on making LLMs more faithful to provided context. |
Paper, Tweet |
| 9) Text2Room - Extracts textured 3D meshes of rooms from 2D text-to-image models. ● Text-to-3D rooms: Generates room-scale textured 3D meshes purely from text prompts by leveraging 2D T2I models. ● Iterative view generation: Progressively generates 2D views, reconstructs depth, and fuses into a coherent 3D mesh. ● 2D-to-3D lifting: Demonstrates how to lift powerful 2D generation to 3D without needing 3D training data. ● 3D generation lineage: An influential step in the 2023 explosion of text-to-3D methods informed by Stable Diffusion's success. |
Paper, ProjectTweet |
| 10) PanGu-Σ - Huawei's trillion-parameter LM with sparse heterogeneous computing. ● 1 trillion parameters: Scales to 1T total parameters using sparse mixture-of-experts routing to keep inference compute manageable. ● Heterogeneous computing: Designed to leverage heterogeneous hardware (GPUs, Ascend NPUs) at massive scale. ● Chinese language focus: Particularly strong on Chinese NLP tasks while also supporting multilingual capabilities. ● Trillion-scale era: Part of the 2023 trillion-parameter wave alongside GLaM and Switch Transformer extensions. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) GPT-4 Technical Report - OpenAI's landmark GPT-4 release marking the frontier of 2023. ● Multimodal capabilities: Large multimodal model accepting text and image inputs and producing text outputs with substantially broader reasoning. ● Human-level exams: Scores in top percentiles on simulated bar exams, SAT, GRE, and similar - markedly better than GPT-3.5. ● Alignment improvements: Extensive RLHF and red-teaming produce significantly safer and more helpful outputs than predecessors. ● Industry-defining release: Set the capability bar that defined the 2023 AI landscape and triggered the global race to match frontier model performance. |
Paper, Tweet |
| 2) LERF: Language Embedded Radiance Fields - Grounds CLIP language embeddings into NeRF for 3D language queries. ● CLIP embeddings in 3D: Lifts CLIP's language-image features into 3D NeRF representations at every location. ● Open-ended 3D queries: Enables open-ended text queries like "where is the espresso" - the NeRF highlights relevant 3D regions. ● Dense 3D-language features: Per-voxel language features enable both localization and retrieval in 3D scenes. ● 3D semantic understanding: Influential for subsequent research combining language grounding with 3D representations. |
Paper, Tweet |
| 3) An Overview on Language Models: Recent Developments and Outlook - Comprehensive LM overview covering structures and future directions. ● Full-stack coverage: Covers linguistic units, model structures, training methods, evaluation, and applications. ● Structured taxonomy: Organizes LM research into clear categories useful for newcomers entering the field. ● Trend analysis: Identifies major research trends and open problems as of early 2023. ● Reference overview: A widely-used survey for orienting to the rapidly-evolving LM landscape. |
Paper, Tweet |
| 4) Eliciting Latent Predictions from Transformers with the Tuned Lens - An interpretability method tracing LM predictions layer-by-layer. ● Tuned lens: Learns per-layer linear probes that translate intermediate hidden states into next-token probability distributions. ● Logit lens improvement: An improved version of "logit lens" that works more reliably across layers and models. ● Layer-by-layer prediction evolution: Reveals how predictions form gradually across transformer layers rather than instantaneously. ● Interpretability toolkit: Became a standard tool in the mechanistic interpretability research community. |
Paper, Tweet |
| 5) Meet in the Middle - A new pretraining paradigm combining data efficiency with infilling capability. ● Bidirectional pretraining: Trains LMs to predict from both directions, meeting in the middle of sequences. ● Data efficiency: Jointly improves training data efficiency and downstream LM capability. ● Infilling strength: Particularly strong on infilling tasks where both prefix and suffix context matter. ● Code generation gains: Demonstrates improvements in code generation tasks where infilling is a common use case (IDE autocomplete). |
Paper , Tweet |
| 6) Resurrecting Recurrent Neural Networks for Long Sequences (LRU) - Deep RNNs matching state-space model performance. ● Linear Recurrent Unit: Introduces a carefully-designed LRU architecture using standard signal propagation principles. ● S4 parity: Matches the performance of deep state-space models (S4) on long-range reasoning benchmarks. ● RNN renaissance: Demonstrates that classical RNNs, with proper initialization and design, remain competitive. ● SSM lineage: Informed subsequent state-space model research including Mamba and the broader 2024 SSM renaissance. |
Paper , Tweet |
| 7) UPRISE: Universal Prompt Retrieval - A lightweight retriever for zero-shot prompt selection. ● Universal prompt pool: Builds a universal pool of prompts that can be retrieved for diverse tasks without task-specific setup. ● Lightweight retriever: Trains a small, versatile retriever to select the best prompts for a given input at inference time. ● Zero-shot improvements: Significant zero-shot performance gains and hallucination reduction. ● Prompt retrieval research: Part of the broader research direction on automated prompt engineering that matured in 2024. |
Paper, Tweet |
| 8) Patches Are All You Need? (ConvMixer) - A parameter-efficient fully-convolutional ViT alternative. ● Conv-based mixing: Replaces self-attention and MLP layers in ViTs with depthwise and pointwise convolutional layers. ● Parameter efficiency: Achieves competitive accuracy with far fewer parameters and simpler architecture. ● Patches-are-enough argument: Suggests much of ViT's success comes from patch-based processing, not attention itself. ● Architecture minimalism: Reinforces the 2023 trend toward simpler architectures that match complex ones. |
Paper, Tweet |
| 9) NeRFMeshing - Distills NeRFs into geometrically-accurate 3D meshes. ● NeRF-to-mesh: A compact, flexible architecture that extracts accurate 3D meshes from any NeRF-driven approach. ● Geometric accuracy: Produces meshes with good geometric quality, useful for downstream graphics and simulation applications. ● NeRF-approach agnostic: Works with multiple NeRF variants rather than being tied to one architecture. ● Production bridge: Helps bridge NeRF research to production graphics pipelines that require traditional meshes. |
Paper, Tweet |
| 10) High-throughput Generative Inference with a Single GPU (FlexGen) - High-throughput LLM inference on limited GPU memory. ● Memory offloading: Offloads weights/KV-cache to CPU/disk and streams them into GPU memory as needed. ● High throughput batch inference: Optimized for offline batch inference workloads where latency is less critical than throughput. ● Single-GPU practicality: Makes running large LLMs on a single consumer-grade GPU feasible for research and hobbyist use. ● Inference infrastructure: Influenced later inference optimization tools like vLLM and the broader inference-engine ecosystem. |
Paper, Code , Tweet |
| Paper | Links |
|---|---|
| 1) PaLM-E - Google's embodied multimodal language model. ● Sensor-modality integration: Incorporates real-world continuous sensor modalities (images, robot states) directly as tokens for the LM. ● Embodied reasoning: Performs robotic manipulation planning, visual QA, and other embodied reasoning tasks via a single model. ● 562B parameters: One of the largest multimodal models at the time, built on PaLM + ViT encoders. ● Embodied AI foundation: A major step toward generalist embodied agents that bridge language, vision, and action. |
Paper, Demo , Tweet |
| 2) Prismer: A Vision-Language Model with An Ensemble of Experts - a parameter-efficient vision-language model powered by an ensemble of domain experts; it efficiently pools expert knowledge from different domains and adapts it to various vision-language reasoning tasks. | Paper, GitHub, Project , Tweet |
| 3) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models - it connects ChatGPT and different visual foundation models to enable users to interact with ChatGPT beyond language format. | Paper, GitHub Tweet |
| 4) A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT - an overview of generative AI - from GAN to ChatGPT. | Paper, Tweet |
| 5) Larger language models do in-context learning differently - shows that with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also perform well when replacing targets with semantically-unrelated targets. | Paper , Tweet |
| 6) Foundation Models for Decision Making: Problems, Methods, and Opportunities - provides an overview of foundation models for decision making, including tools, methods, and new research directions. | Project , Tweet |
| 7) Hyena Hierarchy: Towards Larger Convolutional Language Models - a subquadratic drop-in replacement for attention; it interleaves implicit long convolutions and data-controlled gating and can learn on sequences 10x longer and up to 100x faster than optimized attention. | Paper, Code, Blog, Tweet |
| 8) OpenICL: An Open-Source Framework for In-context Learning - a new open-source toolkit for in-context learning and LLM evaluation; supports various state-of-the-art retrieval and inference methods, tasks, and zero-/few-shot evaluation of LLMs. | Paper, Repo, Tweet |
| 9) MathPrompter: Mathematical Reasoning using Large Language Models - a technique that improves LLM performance on mathematical reasoning problems; it uses zero-shot chain-of-thought prompting and verification to ensure generated answers are accurate. | Paper, Tweet |
| 10) Scaling up GANs for Text-to-Image Synthesis - enables scaling up GANs on large datasets for text-to-image synthesis; it’s found to be orders of magnitude faster at inference time, synthesizes high-resolution images, & supports various latent space editing applications. | Paper, Project , Tweet |
| Paper | Links |
|---|---|
| 1) Language Is Not All You Need: Aligning Perception with Language Models - Microsoft's Kosmos-1 unifies perception and language in one foundation model. ● Multimodal LLM: Trains a single model on web-scale multimodal corpora including arbitrarily interleaved text and images, image-caption pairs, and text data. ● OCR-free NLP: Directly reads and reasons over images containing text without a separate OCR pipeline. ● Broad task coverage: Strong zero-shot and few-shot performance on language understanding, perception-language tasks, visual QA, and visual dialog. ● Perception-aware foundation: Early step toward general-purpose models that ground language in perception — a core prerequisite for AGI-style systems. |
Paper, Tweet |
| 2) Evidence of a predictive coding hierarchy in the human brain listening to speech - Nature study linking LLM activations to brain hierarchy. ● Brain–LM mapping: Uses fMRI on 304 subjects listening to stories to compare brain activations against modern LM representations. ● Long-range predictions: Finds brain activity is best explained by LMs augmented with long-range and hierarchical predictions, not single next-word predictions. ● Cortical hierarchy: Distance of prediction scales along a clear cortical hierarchy, echoing predictive coding theory. ● Neuro-AI bridge: Provides strong empirical support for treating LMs as computational models of language in the human brain. |
Paper, Tweet |
| 3) EvoPrompting: Language Models for Code-Level Neural Architecture Search - uses LLMs as evolutionary operators to discover novel NN architectures. ● Evolutionary prompting: Combines evolutionary search with soft prompt-tuning to iteratively mutate in-context code examples of neural architectures. ● Code-level NAS: Generates valid architecture code using LMs, then scores and selects the best to seed the next generation. ● Outperforms baselines: Finds models surpassing hand-designed architectures on MNIST-1D and CLRS Algorithmic Reasoning. ● LMs as optimizers: Shows LLMs can act as design agents for ML research, not just text generators. |
Paper, Tweet |
| 4) Consistency Models - OpenAI introduces one-step generative models with diffusion-quality samples. ● Single-step sampling: Maps any noise level directly to the clean data, enabling high-quality generation in just 1-2 steps. ● Two training regimes: Trains either via consistency distillation from a pre-trained diffusion model, or standalone as a new class of generative models. ● Competitive quality: Achieves strong FID on CIFAR-10 and ImageNet without adversarial training. ● Fast inference: Offers ~10-100x speedups over diffusion sampling, shaping later real-time generative systems. |
Paper, Tweet |
| 5) Goal Driven Discovery of Distributional Differences via Language Descriptions - defines the D5 task: auto-discovering differences between two corpora as natural language. ● New task formulation: Given two text corpora + a research goal, the system outputs a language description of how they differ. ● Benchmark + system: Introduces OpenD5 with 675 open-ended problems across domains, plus a GPT-based discovery method. ● Real findings: Uncovers insights from product reviews, error patterns in NLP systems, and political speeches. ● Discovery-as-service: A template for using LMs as scientific-discovery tools, not just predictors. |
Paper , Code, Tweet |
| 6) High-resolution image reconstruction with latent diffusion models from human brain activity - reconstructs photos subjects actually saw from fMRI signal. ● Stable Diffusion + brain: Maps fMRI voxels into text and image latents consumed by Stable Diffusion. ● No fine-tuning: Uses off-the-shelf Stable Diffusion with learned linear mappings from brain activity to latent spaces. ● High fidelity: Produces high-resolution reconstructions preserving semantic and structural detail of the viewed images. ● Neuro-decoding at scale: Demonstrates how foundation diffusion models can serve as powerful priors for brain decoding. |
Project , Tweet |
| 7) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control - couples LLM planning with grounding functions during decoding. ● Joint decoding: At each token, combines LM probabilities with scores from grounded models (affordance, safety, preferences). ● Robot planning: Generates task plans for robots that respect the current environment and robot capabilities. ● General framework: Supports many grounding signals without retraining the LM — plug-and-play alignment at inference. ● Embodied generalization: Shows strong results across tabletop and mobile manipulation tasks, enabling flexible embodied reasoning. |
Paper, Project Tweet |
| 8) Language-Driven Representation Learning for Robotics - Voltron: visual pretraining guided by language from human videos. ● Video + captions: Learns representations from Ego4D-style human videos paired with captions, unifying MAE-style masked reconstruction with language. ● Controllable tradeoff: Lets practitioners balance between low-level grounded features and high-level semantic features. ● Robotics-friendly evaluation suite: Introduces a benchmark of imitation learning, grasp affordance, and referring expression tasks. ● Pretraining recipe: Establishes language-guided video pretraining as a strong backbone for robot policies. |
Paper, Models, Evaluation, Tweet |
| 9) Dropout Reduces Underfitting - surprising finding that early-phase dropout helps underfit models. ● Early dropout: Applying dropout in the initial training epochs (then turning it off) improves generalization for underfitting models. ● Mechanism: Reduces gradient variance across mini-batches, counteracting SGD stochasticity. ● Late dropout: Conversely shows late dropout helps overfit regimes, inverting conventional usage. ● Regularization rethought: Forces a broader rethink of dropout's role beyond simple overfitting prevention. |
Paper, Tweet |
| 10) Enabling Conversational Interaction with Mobile UI using Large Language Models - uses a single LLM to drive diverse mobile UI conversational tasks. ● Unified prompting: Feeds UI screen representations into an LLM and prompts for QA, summarization, and screen mapping. ● Four tasks: Covers screen question generation, screen summarization, screen QA, and mapping instructions to UI actions. ● Competitive results: Matches task-specific models without any task-specific training. ● Foundation for UI agents: Foreshadows LLM-based UI agents that later power phone-control systems. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) LLaMA: Open and Efficient Foundation Language Models - Meta's landmark open foundation model family. ● Four scales: Releases 7B, 13B, 33B, and 65B parameter models trained entirely on publicly available data. ● Compute-efficient: Trained on 1-1.4T tokens — more tokens per parameter than Chinchilla, optimizing inference over training cost. ● Benchmark-beating: LLaMA-13B outperforms GPT-3 (175B) on most benchmarks; 65B is competitive with PaLM-540B. ● Research catalyst: Release sparked the open-weight LLM explosion (Alpaca, Vicuna, LLaMA-2 ecosystem). |
Paper, Tweet |
| 2) Composer: Creative and Controllable Image Synthesis with Composable Conditions - 5B diffusion model enabling compositional control over generation. ● Decomposition-then-composition: Decomposes images into representative conditions (text, sketch, depth, color) and recomposes them flexibly at inference. ● 5B parameters: Trained on billions of (text, image) pairs for strong base quality. ● Rich control: Supports colorization, style transfer, image translation, and more without task-specific retraining. ● Pre-ControlNet era milestone: One of the earliest general frameworks for multi-condition controllable diffusion. |
Paper, Project , GitHub , Tweet |
| 3) The Wisdom of Hindsight Makes Language Models Better Instruction Followers - HIR: alignment without RL. ● Hindsight Instruction Relabeling: Relabels failed outputs with instructions they would have been correct for, turning mistakes into supervised data. ● Supervised-only: Replaces PPO/RLHF pipelines with a simple two-stage SFT loop. ● BigBench results: Outperforms baselines including RLHF on 12 BigBench reasoning tasks with much simpler training. ● Algorithmic minimalism: Demonstrates that careful data relabeling can rival RL for alignment. |
Paper, GitHub Tweet |
| 4) Active Prompting with Chain-of-Thought for Large Language Models - active learning meets CoT prompt engineering. ● Uncertainty-driven selection: Ranks candidate questions by LLM disagreement across sampled CoTs, then asks humans to annotate only the most uncertain. ● Adaptive exemplars: Replaces static few-shot CoT prompts with task-specific ones crafted via targeted annotation. ● Reasoning gains: Beats self-consistency and CoT baselines on arithmetic, commonsense, and symbolic reasoning benchmarks. ● Label-efficient alignment: A practical recipe for getting the most out of limited annotation budget. |
Paper, Code Tweet |
| 5) Modular Deep Learning - comprehensive survey of modular NN design. ● Unified taxonomy: Organizes modular methods along four axes — computation function, routing, aggregation, and training regime. ● Covers adapters, MoE, hypernetworks: Analyzes how LoRA, adapters, mixture-of-experts, and composable functions map into this taxonomy. ● Use-case breadth: Discusses modularity in scaling LMs, causal inference, hierarchical RL, and multilingual transfer. ● Research roadmap: Frames an emerging subfield and exposes open problems in routing, specialization, and cross-module generalization. |
Paper , Project, Tweet |
| 6) Recitation-Augmented Language Models - RECITE: self-retrieval via recitation. ● Memory recitation: Prompts the LLM to first recite relevant passages it has memorized, then condition on those passages to answer. ● No external retriever: Replaces document stores with the model's own parametric memory, then conditions answers on recited evidence. ● Strong on closed-book QA: Improves accuracy on TriviaQA, NaturalQuestions, and HotpotQA without any retrieval corpus. ● Practical technique: Cheap, drop-in method that later informed search-augmented and agentic inference strategies. |
Paper , Tweet |
| 7) Learning Performance-Improving Code Edits - LLMs as code performance optimizers. ● Dataset: Curates over 77K competitive programming C++ edits that correctly improve runtime performance. ● Prompting + fine-tuning: Benchmarks zero-shot, few-shot, and fine-tuned models for generating performance-improving refactors. ● Measured gains: Best configuration achieves ~2.5x average speedup across held-out programs while preserving correctness. ● AI code optimization: Formalizes performance editing as a learning problem and introduces evaluation protocols. |
Paper, Tweet |
| 8) More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models - early foundational analysis of indirect prompt injection. ● Threat taxonomy: Defines direct vs indirect prompt injection and enumerates attacker capabilities against LLM-powered apps. ● Real exploits: Demonstrates data exfiltration, phishing, and persistent memory injections against Bing Chat and ChatGPT plugins. ● Attack vectors: Hidden instructions in retrieved pages, emails, and tool outputs can silently hijack the LM. ● Security agenda: Catalyzed prompt-injection research and defensive designs across the industry. |
Paper, Tweet |
| 9) Aligning Text-to-Image Models using Human Feedback - brings RLHF-style alignment to diffusion models. ● Human reward model: Collects human ratings of image-text alignment to train a reward function over generated images. ● Supervised alignment fine-tuning: Re-weights generation to favor higher-reward samples via reward-weighted likelihood. ● Improved text-image matching: Increases faithfulness for counting, color, and composition prompts without sacrificing image quality. ● T2I alignment blueprint: Early template later expanded by DDPO, DPO-Diffusion, and other RL-based T2I tuning methods. |
Paper, Tweet |
| 10) MERF: Memory-Efficient Radiance Fields for Real-time View Synthesis in Unbounded Scenes - makes large-scale NeRF playable in the browser. ● Hybrid volumetric rep: Combines a low-res 3D feature grid with two 2D feature planes for compact yet expressive scene representation. ● Real-time rendering: Achieves interactive frame rates in a browser for unbounded outdoor scenes. ● Memory-efficient: Roughly order-of-magnitude smaller memory footprint than competing NeRF baselines at similar quality. ● Deployable NeRF: A practical step toward shipping neural scene reps in consumer web experiences. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Symbolic Discovery of Optimization Algorithms - Google discovers Lion optimizer via evolutionary search. ● Program search: Uses an evolutionary symbolic search over programs to find new optimizers starting from primitive operations. ● Lion emerges: Discovers Lion (EvoLved Sign Momentum), simpler and more memory-efficient than Adam/AdamW. ● Broad gains: Improves ViT on ImageNet, vision-language training, and LM pretraining with significant compute savings. ● ML automation: Demonstrates that symbolic program search can produce genuinely novel, widely-useful training algorithms. |
Paper, Tweet |
| 2) Transformer models: an introduction and catalog - comprehensive catalog and tutorial on the transformer family. ● Unified reference: Organizes prominent transformer-based models into a browsable catalog with architecture details, training data, and usage. ● Encoder/decoder/encoder-decoder split: Covers BERT-style, GPT-style, and T5-style branches with historical context. ● Ecosystem snapshot: Captures a mid-2023 survey including LLaMA, Flan-T5, PaLM, and multimodal variants. ● Teaching resource: Widely used as an onboarding reference for practitioners entering the LLM space. |
Paper, Tweet |
| 3) 3D-aware Conditional Image Synthesis - Pix2Pix3D: structure-to-image generation with view consistency. ● NeRF + conditional GAN: Extends conditional image generation with neural radiance fields for 3D structure awareness. ● Multi-view editing: Generates photorealistic images from segmentation/edge maps and lets users rotate or edit from novel viewpoints. ● Consistent across views: Preserves identity and layout when the camera moves, unlike 2D-only baselines. ● 3D generative assets: Step toward controllable 3D-aware content creation pipelines. |
Project Tweet |
| 4) The Capacity for Moral Self-Correction in Large Language Models - Anthropic study on emergent ethical reasoning. ● RLHF-trained LMs self-correct: Finds evidence that larger RLHF-tuned models can reduce biased or stereotyped outputs when prompted to. ● Emergence threshold: The capability emerges at ~22B parameters and strengthens with further scale. ● Benchmarks: Evaluates on BBQ (bias), Winogender (gender bias), and law school admissions bias. ● Alignment implication: Suggests instruction-tuned models can be steered toward fairness via prompting — a building block for safety research. |
Paper, Tweet |
| 5) Vision meets RL - applies RLHF-style reward fine-tuning to vision models. ● RL with task rewards: Treats CV models as policies and aligns them using task-specific rewards (IoU, accuracy, user-defined metrics). ● Big gains: Reports large improvements on object detection, panoptic segmentation, colorization, and image captioning. ● Generalizes prior work: Unifies RL post-training across heterogeneous CV tasks with a single recipe. ● Post-training for vision: Mirrors the LM alignment playbook — pretrain, then RL-tune toward task objectives. |
Paper |
| 6) Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment - LQAE: image features quantized in the LM vocabulary. ● Quantize to text tokens: Learns a VQ autoencoder where codes are drawn from a pretrained LM's token vocabulary, aligning vision to language without captions. ● Unsupervised alignment: No image-caption pairs needed — the visual quantizer aligns with the LM's embedding geometry by construction. ● Few-shot classification: Enables LLMs to do few-shot image classification purely in-context. ● Bridge to LLMs: Offers a path for injecting vision into language models without expensive paired data. |
Paper , Code Tweet |
| 7) Augmented Language Models: a Survey - Meta's foundational survey of reasoning + tool use in LLMs. ● ALM definition: Formalizes augmented LMs as models with reasoning skills (CoT, self-consistency) and tool-using ability (retrievers, calculators, code). ● Taxonomy: Organizes the literature across reasoning, tools, and learning strategies (in-context vs fine-tuned). ● Open problems: Highlights challenges in tool orchestration, skill composition, and evaluation. ● Pre-agentic-era blueprint: Anticipates much of the agentic LLM wave that dominates the rest of 2023. |
Paper, Tweet |
| 8) Geometric Clifford Algebra Networks - GCANs for modeling physical and geometric systems. ● Geometric priors: Parametrizes layers using Clifford (geometric) algebra to natively encode rotations, reflections, and translations. ● Physics-oriented: Targets rigid-body dynamics, fluid simulation, and scientific computing where geometric structure matters. ● Equivariance for free: Respects symmetries of the underlying problem by construction, improving generalization. ● Scientific ML: Part of a growing trend of symmetry-aware architectures for physical simulation. |
Paper, Tweet |
| 9) Auditing large language models: a three-layered approach - governance framework for accountable LLM deployment. ● Three layers: Proposes governance audits (provider-level), model audits (behavioral), and application audits (deployment context). ● Concrete responsibilities: Maps each layer to who is accountable, what gets audited, and how to audit it. ● Policy-ready: Designed to inform regulators and practitioners shaping emerging AI policy regimes. ● Foundational reference: Frequently cited in later LLM governance and regulatory proposals (EU AI Act, NIST). |
Paper, Tweet |
| 10) Energy Transformer - transformers as associative memories. ● Hopfield-inspired: Replaces stacked feedforward transformer blocks with one large associative memory that iteratively minimizes an energy function. ● Unified perspective: Reinterprets attention, feedforward, and norm layers through the lens of energy-based retrieval. ● Empirical validation: Matches or exceeds baseline transformers on image classification and graph anomaly detection. ● Architecture rethink: Part of a broader push to ground transformers in well-understood dynamical systems theory. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Toolformer: Language Models Can Teach Themselves to Use Tools - Meta's seminal paper on self-supervised tool learning. ● Self-supervised annotation: LLM inserts candidate API calls into text, keeps only those that reduce perplexity of the continuation. ● Five tools: Teaches a model to use calculator, Q&A system, search engine, translator, and calendar. ● Zero human annotation: Achieves strong zero-shot tool use using only self-generated training data. ● Foundation of agentic era: Direct inspiration for ReAct, function-calling APIs, and the broader agentic LLM stack. |
Paper, Tweet |
| 2) Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents - DEPS agent framework for Minecraft. ● Four-stage loop: Describe current state, Explain failures, Plan next steps, Select actions — all driven by an LLM. ● Multi-task Minecraft: Achieves strong performance across 70+ open-world Minecraft tasks with a single agent. ● Interactive planning: Re-plans after failed steps using error descriptions as feedback, enabling robust long-horizon behavior. ● Open-ended agents: Early demonstration that LLMs can steer complex embodied agents in rich game environments. |
Paper, Tweet |
| 3) A Categorical Archive of ChatGPT Failures - early systematic taxonomy of ChatGPT weaknesses. ● 11 failure categories: Reasoning, logic, math, coding, factual errors, bias, ethics, humor, self-awareness, etc. ● Concrete examples: Documents hundreds of reproducible failure modes across categories. ● Evaluation scaffolding: Provides a structure for subsequent LLM evaluation and red-teaming efforts. ● Historical snapshot: Captures the limits of GPT-3.5-era ChatGPT right before the GPT-4 release. |
Paper, Tweet |
| 4) Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery - PEZ optimizer for discrete text prompts. ● Continuous proxy: Optimizes continuous embeddings, projects them to nearest tokens each step, producing readable, transferable hard prompts. ● Cross-model portability: Hard prompts discovered on one model often transfer to others. ● Text + image: Works for text-to-image personalization and text-to-text tasks. ● Prompt engineering automation: Makes gradient-based prompt search practical, influential for later jailbreak research (e.g., GCG). |
Paper, Tweet |
| 5) Data Selection for Language Models via Importance Resampling - DSIR: target-distribution matching for LM pretraining. ● Importance resampling: Selects pretraining data that matches a target downstream distribution using hashed n-gram importance weights. ● Cheap and scalable: Operates over huge corpora without fine-tuning or running forward passes. ● Downstream gains: Improves GLUE and domain-specific benchmarks vs random or heuristic selection. ● Data-centric pretraining: Part of the broader shift from "more data" to "better data" as a lever for LM quality. |
Paper, Tweet |
| 6) Structure and Content-Guided Video Synthesis with Diffusion Models - Runway Gen-1, structure-preserving video-to-video diffusion. ● Dual conditioning: Disentangles structure (depth, frames) from content (text, reference image) for guided video synthesis. ● Latent video diffusion: Operates in a latent space for tractable training and inference on video. ● Broad edits: Supports stylization, compositional edits, and driven animation with temporal coherence. ● Commercial milestone: Underpins Runway's Gen-1 product, a flagship for early generative video. |
Paper , Project, Tweet |
| 7) A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity - sweeping ChatGPT evaluation. ● 21 tasks: Evaluates ChatGPT across 9 NLP task categories, multiple languages, and multimodal prompts. ● Three axes: Probes reasoning ability, hallucination rates, and interactive multi-turn behavior. ● Mixed results: ChatGPT is strong on many tasks but brittle on multi-step logical reasoning and low-resource languages. ● Community benchmark: One of the most-cited empirical evaluations of ChatGPT during the GPT-3.5 era. |
Paper, Tweet |
| 8) Noise2Music: Text-conditioned Music Generation with Diffusion Models - Google's text-to-music diffusion system. ● Cascaded diffusion: Uses a text-conditioned generator plus super-resolution diffusion stages to produce 30-second audio. ● Two variants: Compares waveform- and spectrogram-level diffusion models. ● High quality: Captures genre, instrumentation, mood, and temporal structure from natural language prompts. ● Generative audio: A key reference point for subsequent music generation systems (MusicGen, Stable Audio). |
Paper, Project, Tweet |
| 9) Offsite-Tuning: Transfer Learning without Full Model - privacy-preserving LLM fine-tuning. ● Emulator + adapter: Model owner shares a lossy "emulator" plus adapter; users fine-tune the adapter on local data without ever seeing the full model. ● Mutual privacy: Protects both the model owner's weights and the user's data. ● Efficient transfer: Reduces compute and memory substantially vs full fine-tuning of frontier LLMs. ● Deployment-relevant: Offers a path for specialized fine-tuning when distributing base weights is not viable. |
Paper, Project, Tweet |
| 10) Zero-shot Image-to-Image Translation - pix2pix-zero: prompt-driven diffusion editing without fine-tuning. ● Edit via text pairs: Translates images between concepts (e.g., "dog" → "cat") given just before/after text phrases — no training data or fine-tuning. ● Cross-attention guidance: Uses attention maps to preserve layout and identity during editing. ● Structure preserving: Unlike prior T2I editors, keeps the input image's geometry intact across large semantic edits. ● Training-free diffusion editing: Influential in the broader push toward zero-shot image editing (e.g., MasaCtrl, InstructPix2Pix). |
Paper, Project, Tweet |
| Paper | Links |
|---|---|
| 1) REPLUG: Retrieval-Augmented Black-Box Language Models - turns any black-box LLM into a retrieval-augmented system. ● Retriever adapts to LM: Trains the retriever using LM output signal (not LM gradients) — works with closed APIs like GPT-3. ● Ensembled inference: Retrieves and processes multiple documents independently, ensembling predictions at the output. ● Strong RAG gains: Improves language modeling and MMLU substantially over few-shot GPT-3 baselines. ● API-era RAG: Makes retrieval augmentation viable even when model weights are inaccessible. |
Paper, Tweet |
| 2) Extracting Training Data from Diffusion Models - landmark paper showing diffusion models memorize images. ● Extraction attack: Reconstructs individual training images (including copyrighted art) from Stable Diffusion and Imagen. ● Memorization rate: Finds hundreds of near-exact copies extractable, especially for frequently-seen images. ● Privacy + IP implications: Raises legal and ethical questions about training on copyrighted or personal data. ● Training-data leakage: Core evidence in ongoing copyright debates and inspires subsequent mitigation work. |
Paper, Tweet |
| 3) The Flan Collection: Designing Data and Methods for Effective Instruction Tuning - Google's comprehensive instruction-tuning dataset. ● Massive scale: Combines 1,800+ tasks across multiple domains with diverse template formats. ● Design insights: Studies how mixing zero-shot, few-shot, and CoT prompts during training affects downstream capability. ● Flan-T5/PaLM release: Produces Flan-T5 and Flan-PaLM models that outperform base counterparts on MMLU and reasoning benchmarks. ● Open resource: Core public asset for the instruction-tuning research community. |
Paper, Tweet |
| 4) Multimodal Chain-of-Thought Reasoning in Language Models - Amazon extends CoT to multimodal inputs. ● Two-stage pipeline: First generates a natural-language rationale grounded in the image, then uses that rationale to produce the final answer. ● Vision grounding: Fuses visual features with text at both rationale and answer stages. ● ScienceQA gains: Sub-1B model outperforms GPT-3.5 by ~16 points on ScienceQA, exceeding human-level performance. ● Efficient reasoning: Demonstrates that smaller multimodal LMs can outperform much larger text-only models through structured reasoning. |
Paper, Code Tweet |
| 5) Dreamix: Video Diffusion Models are General Video Editors - Google's text-driven video editor. ● Motion + appearance edits: Modifies existing videos via text while preserving core object identity and high-level motion. ● Image-to-video: Also animates still images with text-driven motion, bridging image and video generation. ● Mixed training objective: Combines unmasked and masked video training to support edits and animation with one model. ● Versatile video editor: One of the first general-purpose text-driven video editing systems with coherent temporal dynamics. |
Paper, Project, Tweet |
| 6) Benchmarking Large Language Models for News Summarization - rigorous evaluation of LLM summarization quality. ● Human study: Evaluates 10 LLMs on news summarization with professional freelance writers as reference baselines. ● Instruction tuning matters: Finds instruction-tuned LLMs match freelance writer quality, while base LLMs lag significantly. ● Prompt sensitivity: Demonstrates that prompt design has substantial impact on summarization quality. ● Automated metrics gap: Highlights the poor correlation between ROUGE and human preferences, pushing for better metrics. |
Paper , Tweet |
| 7) Mathematical Capabilities of ChatGPT - deep dive into ChatGPT's math reasoning. ● GHOSTS benchmark: Introduces a graduate-level holistic math benchmark spanning proofs, problem solving, and olympiad-style tasks. ● Mixed performance: ChatGPT handles undergraduate-level math but struggles with formal proofs and advanced reasoning. ● Qualitative analysis: Catalogs typical mistake patterns — hallucinated theorems, invalid inferences, symbolic errors. ● Math evaluation rigor: Provides a template for evaluating LLMs on structured mathematical reasoning. |
Paper, Tweet |
| 8) Emergence of Maps in the Memories of Blind Navigation Agents - shows mental maps emerge in memory-only agents. ● Blind navigation: Trains RL agents with only egomotion and compass — no vision, no audio, no GPS. ● Emergent mapping: Despite lacking explicit spatial sensing, agents develop map-like internal representations of environments. ● Probing analysis: Decodable positional and topological information appears spontaneously in recurrent hidden states. ● Neuroscience parallel: Mirrors how animals build cognitive maps, supporting broader theories of spatial representation learning. |
Paper, Project, Tweet |
| 9) SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections - synthesizes infinite 3D landscapes from 2D data alone. ● 2D-only supervision: Trains from only in-the-wild 2D image collections — no 3D ground truth required. ● BEV scene representation: Uses bird's-eye-view (BEV) plus height field representations to structure scene generation. ● Unbounded synthesis: Produces explorable, consistent 3D worlds across arbitrary camera trajectories. ● 3D generative scale: Demonstrates feasibility of large-scale 3D scene generation without expensive paired 3D assets. |
Paper, Tweet |
| 10) Large Language Models Can Be Easily Distracted by Irrelevant Context - exposes brittleness of LLM reasoning under noise. ● GSM-IC benchmark: Extends GSM8K by injecting irrelevant sentences into arithmetic word problems. ● Large accuracy drops: CoT, self-consistency, and other prompting methods lose 20+ points when irrelevant context is present. ● Mitigations: Shows that explicitly instructing the model to ignore irrelevant information partially recovers performance. ● Robustness gap: Signals a key weakness in LLM reasoning that later motivates robustness benchmarks and prompt design practices. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) MusicLM: Generating Music From Text - Google's hierarchical text-to-music generator. ● Hierarchical tokens: Casts music generation as conditional language modeling over multiple streams of semantic, coarse, and fine audio tokens. ● 24kHz, minutes long: Generates high-fidelity music at 24kHz that remains coherent for several minutes. ● MusicCaps benchmark: Releases a 5.5K hand-labeled text-music caption dataset for evaluation. ● Generative music frontier: Defines the state of the art for text-to-music in early 2023 and anchors follow-up work (MusicGen, Stable Audio). |
Paper, Tweet |
| 2) Hungry Hungry Hippos: Towards Language Modeling with State Space Models - H3 architecture closes the SSM-attention gap. ● Diagnostic lenses: Identifies synthetic copying tasks where existing SSMs lag attention, then designs H3 layer to fix them. ● FlashConv kernel: Custom IO-aware FFT convolution implementation that makes SSMs hardware-efficient. ● 2.8x training speedup: Hybrid H3 + attention model trains 2.8x faster than Transformer baselines. ● Mamba precursor: Key stepping stone toward the Mamba and selective SSM architectures that followed. |
Paper, Tweet |
| 3) A Watermark for Large Language Models - Kirchenbauer et al. propose a detectable LM watermark. ● Green/red tokens: Partitions vocab into green/red lists per context via hashed seed; biases sampling toward green tokens. ● Statistical detection: A statistical test on the fraction of green tokens detects watermark with arbitrary confidence even on short samples. ● No quality loss: Empirically has negligible impact on generation quality while enabling provable detection. ● Provenance tooling: Foundational technique for LLM output attribution and later standardization efforts. |
Paper, Tweet |
| 4) Text-To-4D Dynamic Scene Generation - Meta's Make-A-Video3D: 4D from text prompts. ● 4D synthesis: Generates dynamic 3D scenes (3D + time) directly from text descriptions. ● Video-SDS optimization: Uses score distillation sampling from Make-A-Video to supervise a time-varying NeRF. ● No 3D/video training data: Requires no 3D or 4D supervision — leverages 2D video priors. ● 4D generative pipeline: Establishes a framework for text-to-4D synthesis later refined by 4DGen, Animate124, and others. |
Paper, GitHub, Tweet |
| 5) ClimaX: A foundation model for weather and climate - Microsoft's first foundation model for atmospheric science. ● Flexible architecture: Transformer-based design that handles heterogeneous variables and spatio-temporal resolutions. ● Pretrained on CMIP6: Trains on climate model simulations before fine-tuning on real forecasting tasks. ● Multi-task performance: Competitive on forecasting, downscaling, climate projection, and S2S prediction. ● Climate AI: Establishes a template for foundation models in geosciences, foreshadowing GraphCast and Aurora. |
Paper, Tweet, Blog |
| 6) Open Problems in Applied Deep Learning - comprehensive map of practical DL challenges. ● 300+ references: Surveys ~300 papers to catalog where applied DL struggles in practice. ● End-to-end view: Covers data collection, architecture, training, evaluation, deployment, and monitoring. ● Actionable problems: Enumerates concrete research opportunities across each stage of the ML lifecycle. ● Community resource: Widely used as a reading list for graduate-level applied ML courses. |
Paper , Tweet |
| 7) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature - Stanford's probability-curvature detection. ● Curvature hypothesis: LM-generated text sits at a local maximum of the model's log-probability — perturbations predictably reduce probability. ● Zero-shot detector: Compares log-probability of a passage vs minor paraphrases without training a classifier. ● Strong accuracy: Outperforms supervised detectors across GPT-2, GPT-Neo, and ChatGPT. ● AI-generated content provenance: Influential in ongoing work on LLM text detection and authorship verification. |
Paper, Tweet |
| 8) StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis - revives GANs for large-scale T2I. ● Scaled-up generator: Increases StyleGAN capacity and training data to handle complex text-to-image distributions. ● Fast inference: Orders of magnitude faster sampling than diffusion — single forward pass per image. ● Competitive quality: Narrows the quality gap to diffusion models on 64x64 and 256x256 resolutions. ● Latency-driven generation: Positions GANs as a compelling option for interactive T2I applications. |
Paper, Project, Code Tweet |
| 9) Large language models generate functional protein sequences across diverse families - ProGen: LLMs for protein design. ● 1.2B protein LM: Trained on ~280M protein sequences spanning broad taxonomy and functional annotation. ● Functional validation: Wet-lab experiments confirm generated enzymes are active — including sequences far from any natural homolog. ● Controllable generation: Condition-on-family prompts produce proteins with specified properties. ● Generative biology: Landmark Nature Biotechnology result demonstrating LLMs as bona fide design tools for synthetic biology. |
Paper, Tweet |
| 10) The Impossibility of Parallelizing Boosting - theoretical lower bound on boosting parallelization. ● Inherent serial cost: Proves that boosting algorithms cannot be dramatically parallelized without increasing total work. ● Trade-off theorem: Establishes a formal trade-off between parallel rounds and total training time. ● Implications for ML systems: Shows boosting is fundamentally different from parallelizable algorithms like SGD. ● Theoretical contribution: Settles a long-standing open question in learning theory and shapes future algorithm design. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Google AI Research Recap (2022 Edition) - Jeff Dean's annual review of Google AI research. ● Breadth of impact: Surveys advances across language, vision, multimodal, generative models, and scientific AI. ● Key 2022 milestones: Highlights PaLM, Flamingo, Imagen, Parti, Minerva, LaMDA, and DeepMind's AlphaCode and AlphaFold work. ● Responsible AI: Dedicated sections on fairness, privacy, and sociotechnical research. ● Community reference: Frequently cited as an organizational snapshot of the AI research frontier at year-end 2022. |
Blog, Tweet |
| 2) Dissociating language and thought in large language models: a cognitive perspective - Mahowald et al.'s landmark cognitive review. ● Formal vs functional language: Separates knowledge of linguistic rules from its use in reasoning, world knowledge, and social cognition. ● LLM assessment: Argues LLMs excel at formal linguistic competence but are deficient in functional competence. ● Cognitive science lens: Draws on decades of neuroscience to interpret LLM capabilities and failures. ● Framework influence: Widely adopted framing for discussing LLM reasoning, hallucination, and world models. |
Paper, Tweet |
| 3) Human-Timescale Adaptation in an Open-Ended Task Space - DeepMind's AdA: meta-learned embodied adaptation. ● Vast task distribution: Trains RL agents over a procedurally-generated task space spanning millions of 3D environments. ● In-context adaptation: Agent adapts to never-seen tasks within a few timesteps, matching human-level adaptation speed. ● Scale + memory matters: Shows meta-RL agents need both scale and attention-based memory to match human adaptation. ● General agents: Evidence that meta-RL at scale can produce broadly-capable embodied learners. |
Paper, Tweet |
| 4) AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation - attention-based explanations for generative LMs. ● Token importance: Identifies which input tokens most affect model predictions by selectively masking attention. ● Memory-efficient: Avoids gradient computation by manipulating attention instead, enabling efficient analysis of large LMs. ● Multimodal generalization: Works for both language models and multimodal transformers like MAGMA. ● Interpretability tooling: Provides a scalable alternative to gradient-based attribution methods. |
Paper, Tweet |
| 5) Everything is Connected: Graph Neural Networks - Veličković's concise GNN primer. ● Unified perspective: Presents GNNs as a generalization of permutation-equivariant layers, connecting CNNs and transformers. ● Message passing: Covers the core message-passing formalism and its variants (GCN, GAT, MPNN). ● Key applications: Highlights GNNs in drug discovery, traffic prediction, physics simulation, and recommendation. ● Teaching resource: Compact reference for anyone entering the graph ML field. |
Paper, Tweet |
| 6) GLIGEN: Open-Set Grounded Text-to-Image Generation - adds grounded control to frozen diffusion models. ● Grounding inputs: Conditions pre-trained diffusion models on bounding boxes, keypoints, and reference images without retraining the base model. ● Gated self-attention: Inserts new attention layers that inject grounding signals while preserving existing generation quality. ● Open-set capabilities: Generalizes to novel concepts and layouts unseen during grounding training. ● Controlled generation: A key milestone in the spatially-controllable diffusion research line alongside ControlNet. |
Paper, Tweet, Project |
| 7) InstructPix2Pix: Learning to Follow Image Editing Instructions - Berkeley's instruction-tuned image editor. ● Synthetic training data: Uses GPT-3 and Stable Diffusion to automatically generate (image, instruction, edited-image) triplets. ● Forward-only edits: Single forward pass edits images given natural-language instructions — no per-image optimization. ● Wide editing scope: Handles style changes, object swaps, additions, and attribute edits. ● Accessible image editing: Makes text-driven image editing accessible without inversion or fine-tuning per image. |
Paper, Tweet |
| 8) Dataset Distillation: A Comprehensive Review - comprehensive review of dataset distillation. ● Problem definition: Formalizes dataset distillation as synthesizing a small dataset that preserves model training performance. ● Method taxonomy: Categorizes approaches by matching objective — meta-learning, gradient matching, trajectory matching, distribution matching. ● Applications: Surveys use cases in continual learning, privacy, neural architecture search, and federated learning. ● Open challenges: Identifies scaling, cross-architecture transfer, and theoretical understanding as key open problems. |
Paper, Tweet |
| 9) Learning-Rate-Free Learning by D-Adaptation - eliminates manual learning-rate tuning. ● Parameter-free optimizer: Adaptively estimates an effective learning rate from observed gradient norms, eliminating the need for an LR schedule. ● Optimal convergence: Matches the asymptotic convergence of optimally-tuned gradient descent. ● Broad applicability: Demonstrated on 12+ diverse ML problems from convex to large-scale deep learning. ● Production adoption: Later used in training practical models (precursor to Prodigy, Schedule-Free SGD). |
Paper, Tweet |
| 10) RecolorNeRF: Layer Decomposed Radiance Field for Efficient Color Editing of 3D Scenes - interactive color editing for NeRFs. ● Layer decomposition: Decomposes NeRF scenes into color layers that can be edited independently. ● View-consistent recoloring: Color edits propagate coherently across all viewpoints of the 3D scene. ● Interactive workflow: Enables palette-based editing tools familiar from 2D image editing. ● 3D asset editing: Makes NeRFs practical for creative workflows that require post-hoc appearance edits. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Mastering Diverse Domains through World Models - DreamerV3: scalable world-model RL. ● Single algorithm: Uses identical hyperparameters to solve 150+ diverse tasks spanning continuous control, Atari, and Minecraft. ● Minecraft diamond milestone: First algorithm to collect diamonds in Minecraft from scratch without human demonstrations or curricula. ● Robust world model: Learns a latent dynamics model with techniques (symlog prediction, KL balancing) that eliminate per-task tuning. ● General-purpose RL: Establishes world-model RL as a viable general algorithm across domains. |
Paper, Tweet |
| 2) Tracr: Compiled Transformers as a Laboratory for Interpretability - DeepMind's RASP-to-transformer compiler. ● Program-to-weights: Compiles human-readable RASP programs directly into transformer weights with known ground-truth mechanisms. ● Interpretability testbed: Provides models where every computation is known, enabling rigorous evaluation of interpretability methods. ● Toolkit for circuit research: Supports ablation studies, probing methods, and causal analysis with certainty. ● Mechanistic interpretability: Foundational tool for the mechanistic interpretability research program. |
Paper, Tweet, Code |
| 3) Multimodal Deep Learning - comprehensive textbook on multimodal DL. ● Full textbook: 200+ page arXiv publication covering architectures, training, and applications of multimodal systems. ● Modality coverage: Discusses vision-language, vision-audio, and three-way multimodal models in depth. ● Architectural foundations: Details fusion techniques, cross-attention, contrastive learning, and joint embedding. ● Graduate-level teaching resource: Widely adopted for multimodal AI courses and self-study curricula. |
Book, Tweet |
| 4) Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk - OpenAI's disinformation threat assessment. ● Kill chain framework: Analyzes LMs' role across disinformation pipeline — actor capabilities, content generation, distribution, and audience reach. ● Threat vectors: Identifies how generative LMs lower cost, increase scale, and enable tailored influence operations. ● Mitigation taxonomy: Proposes interventions at model design, platform, content distribution, and media literacy levels. ● Policy-relevant research: Shaped subsequent AI safety and elections-integrity efforts. |
Paper, Tweet |
| 5) Why do Nearest Neighbor Language Models Work? - empirical analysis of kNN-LM benefits. ● Interpolation effect: Identifies that mixing a kNN distribution with parametric LM softmax improves calibration more than knowledge addition. ● Representation capacity: Finds the LM's own context representations are the primary driver of kNN-LM gains. ● Softmax bottleneck: Shows kNN retrieval helps overcome the softmax bottleneck in expressive output distributions. ● Retrieval theory: Clarifies when and why retrieval augmentation helps parametric LMs. |
Paper, Code, Tweet |
| 6) Memory Augmented Large Language Models are Computationally Universal - proves LLMs + memory achieve Turing completeness. ● Formal proof: Shows Flan-U-PaLM 540B with associative external memory can simulate any universal Turing machine. ● Stored-program computation: Demonstrates that prompting LLMs with memory reads/writes produces arbitrary computation. ● Theoretical framing: Positions LLMs as programmable computational substrates, not just statistical models. ● Foundations of agentic LLMs: Theoretical backing for the later wave of tool-using and memory-augmented LLM agents. |
Paper , Tweet |
| 7) A Survey on Transformers in Reinforcement Learning - comprehensive survey of transformers in RL. ● TransRL taxonomy: Organizes work by use — representation, policy architecture, world models, sequence-to-sequence RL. ● Offline vs online RL: Surveys Decision Transformer and Trajectory Transformer alongside online training variants. ● Partial observability: Highlights transformers' strength in long-horizon and partially-observable RL settings. ● Roadmap: Identifies open problems in training stability, sample efficiency, and generalization of transformer-based RL. |
Paper, Tweet |
| 8) Scaling Laws for Generative Mixed-Modal Language Models - Meta's scaling laws for multimodal generation. ● Mixed-modal regime: Studies loss scaling when training on combinations of text, code, image, and speech. ● Cross-modal interference: Identifies when adding modalities helps vs hurts, formalizing competition and synergy effects. ● Compute-optimal ratios: Derives compute-optimal recipes for mixing different modalities during pretraining. ● Multimodal scaling roadmap: Informs the design of subsequent large multimodal models (Chameleon, Gemini). |
Paper, Tweet |
| 9) DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching - transformer-based local feature matcher. ● SlimFormer + InterFormer: Novel transformer designs for efficient intra- and inter-image feature interaction. ● Robust across challenges: Handles large viewpoint changes, illumination variation, and low-texture scenes. ● SOTA matching: Outperforms prior SOTA on HPatches, YFCC100M, and other matching benchmarks. ● Computer vision utility: Strengthens foundation tasks for 3D reconstruction, SfM, and visual localization. |
Paper, Tweet |
| 10) Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement - D3VAE for time series forecasting. ● Triple-D framework: Combines Diffusion, Denoising, and Disentanglement in a bidirectional VAE backbone. ● Noise-aware training: Diffusion strengthens the model's ability to handle noisy time series data. ● Interpretable latent: Disentanglement yields interpretable latent factors linking to underlying temporal dynamics. ● SOTA forecasting: Beats transformer and deep-learning baselines on multiple real-world datasets. |
Paper, Tweet |
| Paper | Links |
|---|---|
| 1) Muse: Text-To-Image Generation via Masked Generative Transformers - Google's masked-token T2I model. ● Masked transformer: Generates images via parallel masked token prediction instead of autoregressive or diffusion sampling. ● Dramatic speedup: 10x faster sampling than Imagen and Parti, producing high-quality images in few steps. ● Editing capabilities: Supports inpainting, outpainting, and mask-free editing natively via masked prediction. ● Alternative T2I paradigm: Demonstrates that non-diffusion approaches remain competitive for large-scale text-to-image generation. |
Paper, Project, Code, Tweet |
| 2) VALL-E Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - Microsoft's neural codec TTS model. ● Codec-based TTS: Treats text-to-speech as conditional language modeling over discrete audio codec tokens (EnCodec). ● 3-second cloning: Clones a speaker's voice from just a 3-second acoustic prompt, preserving timbre and emotion. ● Zero-shot voice synthesis: Zero-shot speaker adaptation without fine-tuning, a huge leap over prior TTS systems. ● Generative speech milestone: Bridges LLM methodology to speech, enabling a wave of prompt-based audio generation research. |
Project, Tweet |
| 3) Rethinking with Retrieval: Faithful Large Language Model Inference - retrieval-augmented CoT. ● CoT-conditioned retrieval: Decomposes reasoning into steps via chain-of-thought, then retrieves evidence for each step. ● Faithful inference: Ensures answers are grounded in external knowledge rather than hallucinated. ● Strong accuracy: Improves over vanilla CoT on TriviaQA, NaturalQuestions, and other knowledge-intensive benchmarks. ● Retrieval reasoning: Early blueprint for the step-level RAG patterns now common in agentic systems. |
Paper, Tweet |
| 4) SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot - one-shot unstructured LLM pruning. ● No retraining: Prunes OPT-175B and BLOOM-176B to 50-60% sparsity in a few GPU-hours with no fine-tuning. ● Layer-wise solver: Frames pruning as a layer-wise reconstruction problem solved via efficient second-order updates. ● Minimal perplexity loss: Negligible accuracy degradation even at high sparsity ratios. ● Production-ready compression: Makes aggressive LLM compression practical at the largest scales, enabling cheaper deployment. |
Paper, Tweet |
| 5) ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders - Meta's self-supervised ConvNet revival. ● Fully conv MAE: Adapts masked autoencoder pretraining for ConvNets using sparse convolutions over masked patches. ● GRN module: Introduces Global Response Normalization to boost feature diversity and training stability. ● Strong ImageNet results: Matches/beats ViT-based MAE on ImageNet, detection, and segmentation. ● CNN competitiveness: Demonstrates that ConvNets remain competitive when properly scaled with modern self-supervised pretraining. |
Paper, Code, Tweet |
| 6) Large Language Models as Corporate Lobbyists - LLMs applied to real-world lobbying tasks. ● Lobbying pipeline: Uses GPT-3.5 to classify relevant bills, summarize them, and generate corporate lobbying responses. ● Practical experiment: Deploys end-to-end LLM lobbying on real US Congressional bills affecting corporate interests. ● Ethics discussion: Probes implications for democratic discourse as LLMs lower the cost of scaled political engagement. ● Sociotechnical precedent: Informs broader debate about AI influence on governance and policy formation. |
Paper , Code, Tweet |
| 7) Superposition, Memorization, and Double Descent - Anthropic's toy-model study of memorization dynamics. ● Superposition of features: Shows how toy networks represent more features than neurons via superposition during memorization. ● Double descent explained: Provides mechanistic explanation for why test loss can decrease then spike then fall again with scale. ● Phase transitions: Observes clean transitions between memorization and generalization regimes. ● Mechanistic interpretability: Builds foundational theory for understanding feature representations in larger transformers. |
Paper, Tweet |
| 8) StitchNet: Composing Neural Networks from Pre-Trained Fragments - modular NN construction from existing weights. ● Fragment stitching: Composes new networks by stitching together layers from multiple pretrained models. ● Compatibility metric: Proposes measures of fragment compatibility to guide composition. ● Efficient reuse: Avoids expensive training by reusing existing components for new tasks. ● Modular deep learning: Early exploration of the growing modular ML space (model merging, adapter composition). |
Paper, Tweet |
| 9) Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes - human-in-the-loop LM program refinement. ● Iterative decomposition: Breaks down complex QA tasks into subtasks and refines the decomposition through human feedback. ● Process supervision: Supervises intermediate reasoning steps rather than just final answers. ● ICE tool: Introduces the ICE (Interactive Composition Explorer) library for building compositional LM programs. ● Precursor to agent frameworks: Anticipates later LLM orchestration frameworks (LangChain, DSPy). |
Paper, Code Tweet |
| 10) A Succinct Summary of Reinforcement Learning - compact overview of key RL concepts. ● Core ideas: Covers Markov decision processes, value iteration, policy gradients, and actor-critic methods. ● Modern methods: Touches on PPO, DQN, AlphaZero, and RLHF in a unified notation. ● Concise reference: Designed as a 20-page primer suitable for ML engineers needing quick RL grounding. ● Teaching resource: Useful pocket reference for those entering RL-adjacent areas like RLHF for LLM training. |
Paper, Tweet |
We use a combination of AI-powered tools, analytics, and human curation to build the lists of papers.
Subscribe to our NLP Newsletter to stay on top of ML research and trends.
Join our Discord.