
AI Papers of the Week — 2023


This page collects every weekly issue of AI Papers of the Week from 2023. For other years, see the main index.


Top AI Papers of the Week (December 25 - December 31)

1) CogAgent - Tsinghua's CogAgent is an 18B-parameter visual-language model purpose-built for GUI understanding and navigation, with unusually high input resolution.
● High-res GUI input: Supports 1120x1120 input resolution via a dedicated high-res cross-module, letting it read small fonts and dense UI elements that typical VLMs blur out.
● Dual-tower vision: Combines a low-res general vision encoder with a high-res cross-module, balancing context understanding with fine-grained icon/text perception.
● Broad capabilities: Handles visual Q&A, visual grounding, and end-to-end GUI agent tasks on web and desktop, positioning it as a general GUI backbone.
● SoTA VQA: Achieves state-of-the-art on 5 text-rich (e.g., OCR-heavy) and 4 general VQA benchmarks, covering document, chart, and scene understanding.
Paper, Tweet
2) From Gemini to Q-Star - A 300+-paper survey mapping the state of Generative AI and the research frontiers that followed the Gemini + rumored Q* news cycle.
● Broad coverage: Surveys developments across language, vision, audio, and multimodal generative systems, treating Gen AI as a unified field rather than siloed modalities.
● Computational challenges: Catalogs scalability, efficiency, and alignment challenges currently gating further progress, including training compute, inference serving, and evaluation.
● Real-world applications: Reviews Gen AI impact across healthcare, finance, and education, highlighting where genuine deployment signals diverge from hype.
● Future directions: Identifies agent frameworks, reasoning, grounded multimodality, and alignment as the most live research areas heading into 2024.
Paper, Tweet
3) PromptBench - A unified library for comprehensive evaluation and analysis of LLMs that consolidates multiple evaluation concerns under one roof.
● Prompt-construction tooling: Ships with utilities for prompt construction, prompt engineering, and dataset/model loading, covering the end-to-end LLM evaluation workflow.
● Adversarial prompt attacks: Built-in adversarial prompt-attack capabilities let users stress-test LLMs against perturbations rather than just measuring clean accuracy.
● Dynamic evaluation: Supports dynamic evaluation protocols to detect dataset contamination and measure robustness beyond static benchmark numbers.
● Unified interface: Replaces the ad-hoc evaluation scripts many teams maintain with a consistent API, reducing friction when comparing across models and prompt variants.
Paper, Tweet
4) Exploiting Novel GPT-4 APIs - A red-team study of three newer GPT-4 API surfaces - fine-tuning, function calling, and knowledge retrieval - that reveals each introduces new attack vectors.
● Fine-tuning strips safeguards: As few as 15 harmful examples - or even 100 benign examples - fine-tuned into GPT-4 are enough to remove core safety behaviors.
● Function-call schema leakage: GPT-4 Assistants can be coerced into divulging their function-call schemas and then tricked into executing arbitrary function calls.
● Retrieval hijacking: The knowledge-retrieval endpoint is vulnerable to prompt injection via documents in the retrieval corpus, letting attackers steer model behavior through uploaded content.
● Policy implication: Expanding API surface area introduces alignment risks that weren't present for text-only completions, and API providers need surface-specific defenses rather than relying on base-model alignment.
Paper, Tweet
5) Fact Recalling in LLMs - A mechanistic-interpretability study showing that early MLP layers function as a lookup table for factual recall.
● Athletes-to-sports task: Scoped to how Pythia 2.8B recalls which of 3 different sports various athletes play - a clean task for dissecting a single type of factual recall.
● Early MLPs as lookup table: Early MLP layers perform a structured lookup rather than distributed reasoning, with specific neurons keyed to entity-attribute pairs.
● Multi-token embedding view: Recommends treating factual knowledge recall as operating over multi-token embeddings rather than single-token representations.
● Interpretability payoff: Provides a concrete, testable account of where and how facts live inside transformers, enabling targeted editing and auditing of parametric memory.
Paper, Tweet
6) Generative AI for Math (OpenWebMath / MathPile) - Releases a diverse, high-quality math-centric corpus of ~9.5B tokens designed for training math-capable foundation models.
● 9.5B-token corpus: Curated from mathematical content across the web, textbooks, papers, and Q&A, rebalanced for math-specific token distribution.
● Quality filtering: Applies math-specific filtering to surface content dense in symbolic notation, proofs, and problem solutions rather than surface-level mentions of math.
● Diverse sources: Explicitly mixes proof-heavy formal math with applied problem-solving to avoid over-fitting to any single mathematical register.
● Training signal: Positioned as a drop-in pretraining or continual-pretraining corpus to lift math reasoning in existing LLMs without changing the architecture.
Paper, Tweet
7) Principled Instructions Are All You Need - Distills effective LLM prompting into 26 guiding principles and validates them across multiple model families.
● 26 principles: Covers prompt structure, audience specification, example selection, formatting, role assignment, and stepwise decomposition.
● Broad model validation: Tested on LLaMA-1/2 (7B, 13B, 70B) and GPT-3.5/4, finding the principles generalize across scales and families.
● Benefits at both scales: Smaller models gain more from structured prompting (greater variance reduction), while larger models see bigger absolute accuracy gains on harder tasks.
● Practical reference: Functions as a cheat-sheet for practitioners, converting scattered prompting folklore into testable recipes.
Paper, Tweet
8) Survey of Reasoning with Foundation Models - A comprehensive survey of reasoning with foundation models, covering tasks, methods, benchmarks, and future directions.
● Task coverage: Surveys math reasoning, commonsense reasoning, logical reasoning, symbolic reasoning, and multimodal reasoning - showing how each evolves with model scale.
● Methodology catalog: Covers prompting techniques (CoT, ToT, self-consistency), fine-tuning strategies, and neurosymbolic approaches under a unified framework.
● Benchmarks: Systematizes the reasoning benchmarks landscape and flags contamination and robustness concerns specific to reasoning evaluation.
● Adjacencies: Discusses how multimodal learning, autonomous agents, and super-alignment research intersect with and extend the reasoning agenda.
Paper, Tweet
9) LLaRA - LLaRA adapts a decoder-only LLM for dense retrieval via two tailored pretext tasks that leverage text embeddings from the LLM itself.
● EBAE pretext task: Embedding-Based Auto-Encoding uses LLM embeddings to reconstruct tokens of the input sentence, aligning the embedding space with semantic content.
● EBAR pretext task: Embedding-Based Auto-Regression predicts tokens of the next sentence from the current sentence's embedding, injecting discourse-level signal into retrieval embeddings.
● LLaMA 2 7B base: A LLaMA 2-7B base model is adapted into a retriever with these pretext tasks, yielding significant gains on MSMARCO and BEIR.
● Decoder retrievers validated: Provides another data point that decoder-only LLMs, with the right adaptation, rival specialized encoder retrievers - a theme that continued through 2024.
Paper
10) Gemini vs GPT-4V - A qualitative side-by-side comparison of Gemini and GPT-4V across vision-language tasks, documenting systematic behavioral differences.
● Head-to-head cases: Evaluates both models on a curated set of tasks covering document understanding, chart reading, everyday scenes, and multi-image reasoning.
● GPT-4V style: Produces precise, succinct answers with a strong preference for brevity and factual minimalism.
● Gemini style: Returns more expansive, narrative answers frequently accompanied by relevant images and links - leveraging its deeper integration with search.
● Complementary strengths: Concludes that the models are substitutable for many core VLM tasks but differ sharply on response length, multimedia, and augmentation patterns.
Paper, Tweet

Top AI Papers of the Week (December 18 - December 24)

1) Gemini's Language Abilities - CMU's impartial, reproducible evaluation of Gemini Pro against GPT and Mixtral across standard LLM benchmarks.
● Reproducible methodology: Provides an open, reproducible evaluation pipeline - a response to concerns about Google's own Gemini launch benchmarks being hard to independently verify.
● Gemini Pro vs. GPT-3.5 Turbo: Gemini Pro achieves comparable but slightly lower accuracy than GPT-3.5 Turbo, countering marketing claims of broad parity on language tasks.
● Gemini & GPT beat Mixtral: Both Gemini and GPT outperform Mixtral on these benchmarks, suggesting open mixture-of-experts has not yet closed the gap to frontier proprietary models.
● Evaluation norms: Positioned as evidence that independent replications remain essential, and that first-party model reports shouldn't be the final word on comparative capability.
Paper, Tweet
2) PowerInfer - A high-speed LLM inference engine for consumer GPUs that exploits sparse neuron activation patterns to run large models on commodity hardware.
● Hot/cold neurons: Analysis shows that a small fraction of "hot" neurons activate on most inputs while the majority of "cold" neurons activate rarely - a power-law pattern across many LLMs.
● GPU-CPU hybrid: Hot neurons are preloaded onto the GPU for fast access, while cold neurons live on the CPU and are computed lazily, dramatically reducing GPU memory pressure.
● Reduced memory + transfer: This split reduces both GPU memory demand and the CPU-GPU data transfer that typically dominates hybrid inference cost.
● 11x speedup over llama.cpp: Achieves up to ~11x faster token generation than llama.cpp on a single consumer GPU for OPT-175B-class models - a step-change for local deployment.
Paper, Tweet
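The hot/cold idea reduces to a frequency-based partition of neurons. A minimal sketch (a toy illustration only; the profiling counts, neuron ids, and GPU budget below are invented, and the real engine operates on FFN weight rows rather than a Python dict):

```python
def partition_neurons(activation_counts, gpu_budget):
    """Split neuron ids into a GPU-resident 'hot' set and a CPU-resident
    'cold' set by profiled activation frequency (toy sketch)."""
    # Rank neurons by how often they fired during offline profiling.
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = ranked[:gpu_budget]    # preload these weights onto the GPU
    cold = ranked[gpu_budget:]   # keep these on the CPU, computed lazily
    return hot, cold

# A power-law profile: a few neurons fire on almost every input.
counts = {"n0": 980, "n1": 950, "n2": 40, "n3": 12, "n4": 3}
hot, cold = partition_neurons(counts, gpu_budget=2)
```

In the real system the hot set is chosen from offline profiling runs and the budget is set by available VRAM; the payoff is that most tokens touch only GPU-resident weights.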
3) Antibiotic Discovery with Graph Deep Learning (Nature) - MIT researchers use explainable graph neural networks to discover a new structural class of antibiotics.
● Graph neural networks: Trains GNNs on molecular graphs to predict antibiotic activity, with explainability layers that surface chemical substructures driving predictions.
● Explainable discovery: Unlike black-box property predictors, the explanation module identifies substructures underlying antibiotic activity - a feature drug chemists can actually use.
● New structural class: The discovered compounds belong to a novel structural class, not a variant of existing antibiotic scaffolds - an unusually strong generalization signal.
● Real-world pipeline: Demonstrates end-to-end pipeline from GNN prediction to wet-lab validation, reinforcing explainable ML as a practical discovery tool for biomedicine.
Paper, Tweet
4) VideoPoet - Google Research's VideoPoet is a large language model for zero-shot video generation that treats video as just another token stream.
● Unified token stream: Uses multiple tokenizers to map video, image, audio, and text into a shared discrete token space for a single autoregressive model.
● Zero-shot task variety: The same model handles image-to-video, video stylization, video-to-audio, and text-to-video without task-specific fine-tuning.
● Language-model paradigm: Demonstrates that a plain autoregressive LM, given the right tokenizers, can handle video generation - challenging the diffusion-everywhere default for video.
● Temporal consistency: Produces videos with reasonable motion coherence over short durations, a meaningful milestone for LM-based video generation.
Paper, Tweet
5) AppAgent - Introduces an LLM-based multimodal agent that operates real smartphone apps through touch actions and screenshots.
● Multimodal control: The agent reads the phone screen (visual input) and issues low-level touch actions (tap, swipe, type), operating apps the way humans do rather than via APIs.
● Two learning modes: Learns new apps either via autonomous exploration (discovering functionality through self-play) or by observing human demonstrations.
● Cross-app generality: Demonstrates proficiency across email, social media, shopping, and creative apps, suggesting that multimodal LLMs can generalize across smartphone UIs.
● Early mobile-agent blueprint: An early example of the on-device multimodal agent pattern that would become a major 2024 deployment theme.
Paper, Tweet
6) LLM in a Flash - Apple researchers show how to run LLMs larger than available DRAM by streaming weights from flash storage on demand.
● Flash as swap: Stores model weights on flash and streams only the rows/columns needed per forward pass into DRAM, exploiting the sparsity of relevant parameters.
● 2x DRAM headroom: Enables running models up to 2x the size of available DRAM without catastrophic slowdown, critical for on-device deployment where memory is tight.
● Major speedups vs. naive loading: 4-5x faster on CPU and 20-25x faster on GPU compared to naive parameter loading, thanks to selective transfer and row-column bundling.
● On-device LLM groundwork: Directly enabled Apple's later on-device LLM plans by showing that flash-based streaming can make phone-scale LLM inference practical.
Paper, Tweet
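The streaming idea can be sketched as a bounded DRAM cache in front of a flash-resident weight store (a toy model of the access pattern, not Apple's implementation; the row ids, capacity, and LRU policy are illustrative):

```python
from collections import OrderedDict

class FlashWeightCache:
    """Toy sketch of flash-to-DRAM weight streaming: only the rows a layer
    actually needs are loaded into a bounded DRAM cache, with LRU eviction."""
    def __init__(self, flash_store, dram_capacity):
        self.flash = flash_store      # stands in for weights stored on flash
        self.cap = dram_capacity
        self.dram = OrderedDict()     # row id -> weight row, LRU ordered
        self.flash_reads = 0

    def get_rows(self, row_ids):
        rows = []
        for rid in row_ids:
            if rid in self.dram:
                self.dram.move_to_end(rid)     # already resident: no flash I/O
            else:
                self.flash_reads += 1          # stream the row from flash
                self.dram[rid] = self.flash[rid]
                if len(self.dram) > self.cap:
                    self.dram.popitem(last=False)  # evict the coldest row
            rows.append(self.dram[rid])
        return rows

flash = {i: [float(i)] * 4 for i in range(10)}   # ten "rows" on flash
cache = FlashWeightCache(flash, dram_capacity=3)
cache.get_rows([1, 2])        # two cold misses
cache.get_rows([1, 2, 3])     # rows 1 and 2 now hit in DRAM
```

The paper's actual wins come from predicting which rows a sparse forward pass will need and bundling rows and columns to amortize flash reads; the cache above only shows the residency bookkeeping.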
7) ReST Meets ReAct - Proposes a ReAct-style agent that improves itself via reinforced self-training on its own reasoning traces.
● Self-critique ReAct: A ReAct-style agent with a self-critique step that evaluates its own reasoning and answers, generating a filterable trace dataset.
● ReST-style iterative RL: Uses growing-batch RL from AI feedback to iteratively fine-tune on the agent's successful reasoning traces, improving over rounds without human labels.
● Human-label-free: Minimizes human involvement; synthetic data with self-improvement from AI feedback is the primary training signal throughout.
● Distillation to small models: The improved agent can be distilled into models 1-2 orders of magnitude smaller with comparable performance, dramatically cutting inference cost.
Paper, Tweet
8) Adversarial Attacks on GPT-4 - Demonstrates that a trivially simple random-search procedure can jailbreak GPT-4 with high reliability.
● Adversarial suffix: Appends a suffix to a harmful request and iteratively perturbs it, keeping changes that increase the log-probability of the response starting with "Sure".
● No gradients needed: Operates purely via the API in a black-box setting, without model gradients or weights - a much lower bar than prior white-box jailbreak work.
● Strong success rate: Achieves high attack-success rates on GPT-4 with a small number of API calls, despite ongoing alignment efforts.
● Alignment implication: Shows that current safety training is still vulnerable to near-trivial optimization attacks, pointing to the need for stronger behavioral defenses.
Paper, Tweet
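The attack loop is plain random search. A self-contained sketch (the scoring function below is a toy stand-in; in the paper it is a black-box API call returning the log-probability that the reply begins with "Sure"):

```python
import random

def random_search_suffix(score_fn, vocab, suffix_len=8, iters=200, seed=0):
    """Toy sketch of the random-search attack: mutate one suffix token at a
    time and keep the change only if it raises the target score."""
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score_fn(suffix)
    for _ in range(iters):
        cand = list(suffix)
        cand[rng.randrange(suffix_len)] = rng.choice(vocab)  # perturb one slot
        s = score_fn(cand)
        if s > best:                      # greedy: keep only improvements
            suffix, best = cand, s
    return suffix, best

# Toy objective: reward occurrences of the token "!" in the suffix.
vocab = list("abc!")
suffix, score = random_search_suffix(lambda s: s.count("!"), vocab)
```

The point of the sketch is how little machinery is needed: no gradients, no model internals, just repeated scoring of candidate suffixes through the same interface any API user has.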
9) RAG for LLMs - A broad survey of Retrieval-Augmented Generation research, organizing the rapidly growing literature into a coherent map.
● Three-paradigm taxonomy: Organizes RAG approaches into Naive RAG, Advanced RAG (pre/post-retrieval enhancements), and Modular RAG (orchestrated component-based systems).
● Core components: Reviews retrievers, generators, and augmentation strategies separately, clarifying which design choices sit in which component.
● Evaluation and datasets: Catalogs RAG-specific benchmarks and evaluation metrics, surfacing the still-uneven state of RAG evaluation.
● Frontier directions: Highlights agentic retrieval, multimodal RAG, and long-context RAG as the key research areas driving the 2024 RAG landscape.
Paper, Tweet
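The survey's "Naive RAG" paradigm is retrieve-then-stuff. A minimal sketch (the word-overlap scorer and the generate stand-in are deliberately crude placeholders, not anything from the survey):

```python
def naive_rag(query, corpus, generate, k=2):
    """Toy sketch of Naive RAG: score documents by word overlap with the
    query, stuff the top-k into the prompt, and call the generator."""
    q = set(query.lower().split())
    score = lambda doc: len(q & set(doc.lower().split()))
    top_k = sorted(corpus, key=score, reverse=True)[:k]   # retrieval step
    prompt = "Context:\n" + "\n".join(top_k) + f"\nQuestion: {query}"
    return generate(prompt), top_k

corpus = [
    "PowerInfer splits neurons into hot and cold sets.",
    "Basil grows best in full sun.",
    "PowerInfer preloads hot neurons onto the GPU.",
]
# generate() is a stand-in that just echoes the prompt it would send.
answer, used = naive_rag("How does PowerInfer use hot neurons?", corpus, lambda p: p)
```

Advanced and Modular RAG, in the survey's taxonomy, replace exactly these pieces: the scorer becomes a trained retriever with query rewriting, the stuffing step gains reranking and compression, and the whole pipeline becomes orchestrated components.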
10) BabyLM Challenge Findings - Reports results from a challenge on sample-efficient pretraining using a developmentally plausible corpus.
● Constrained pretraining: Participants pretrain on a small, child-directed-style corpus rather than on internet-scale data, testing how efficiently models can learn from limited input.
● LTG BERT wins: The winning submission, LTG BERT, beat Llama 2 70B on 3 of 4 evaluations despite vastly less training data.
● Data preprocessing pays: Strong-performing entries relied heavily on data preprocessing and training on shorter contexts, challenging assumptions about long-context training for small data.
● Cognitive-science bridge: Provides an empirical platform connecting language-model training to developmental psycholinguistics, informing both fields.
Paper, Tweet

Top AI Papers of the Week (December 11 - December 17)

1) FunSearch - DeepMind's FunSearch uses LLMs as a mutation operator in an evolutionary loop to discover genuinely new mathematical knowledge.
● LLM + evaluator loop: Combines a pretrained LLM that proposes candidate programs with a systematic evaluator that scores them, iteratively evolving low-scoring programs into high-scoring ones.
● New math discoveries: Produces novel solutions to open problems in combinatorics, including cap-set and online bin-packing, not memorized from the training data.
● Hallucination mitigation: The evaluator acts as a hard filter - only programs that actually work are kept - so LLM hallucinations don't propagate into the "discovered" knowledge.
● General recipe: Positions LLM-in-the-loop search as a general tool for scientific discovery beyond math, applicable wherever candidates can be automatically scored.
Paper, Tweet
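The FunSearch loop can be sketched as evolution with the LLM as mutation operator and the evaluator as a hard filter (toy problem and stand-in functions below; the real system evolves Python programs scored against open combinatorics problems):

```python
import random

def evolve(seed_programs, mutate, evaluate, rounds=50, pool_size=4, seed=0):
    """Toy sketch of the FunSearch loop: an LLM proposes mutated programs,
    an automatic evaluator scores them, and only improvements survive."""
    rng = random.Random(seed)
    pool = sorted(seed_programs, key=evaluate, reverse=True)[:pool_size]
    for _ in range(rounds):
        parent = rng.choice(pool)
        child = mutate(parent, rng)       # LLM-as-mutation-operator stand-in
        # The evaluator is a hard filter: broken or hallucinated candidates
        # score poorly and are dropped when the pool is re-ranked.
        pool = sorted(pool + [child], key=evaluate, reverse=True)[:pool_size]
    return pool[0]

# Toy problem: "programs" are integer lists; the score is their sum.
best = evolve(
    seed_programs=[[0, 0, 0]],
    mutate=lambda p, rng: [x + rng.choice([-1, 1]) for x in p],
    evaluate=sum,
)
```

Because only evaluator-approved candidates re-enter the pool, the best score never decreases, which is exactly the hallucination-mitigation property the paper leans on.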
2) Weak-to-Strong Generalization - OpenAI's superalignment team shows that weak supervisors can still elicit capabilities from much stronger models - a first empirical signal for scalable oversight.
● Weak-to-strong setup: A weak model (e.g., GPT-2) generates labels, and a strong pretrained model (e.g., GPT-4) is fine-tuned on those labels - an analog of humans supervising superhuman AI.
● Better than the supervisor: Naively fine-tuning the strong model on weak-model labels often yields a model better than the supervisor itself, demonstrating useful capability elicitation.
● ~GPT-3.5 from GPT-2 supervision: Fine-tuning GPT-4 with GPT-2-level supervision recovers close to GPT-3.5-level performance on NLP tasks - a surprising amount of capability without strong labels.
● Superalignment signal: Offers an early empirical footing for the bet that humans can align superhuman systems using their own (weaker) judgments - provided the right training recipe.
Paper, Tweet
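The weak-to-strong phenomenon can be illustrated on a deliberately tiny toy: a noisy threshold labeler stands in for the weak supervisor, and a simple learner that fits the weak labels recovers the clean underlying rule (nothing here is the paper's actual recipe):

```python
def train_strong_on_weak(xs, weak_labels, thresholds):
    """Fit the threshold rule that best matches the (noisy) weak labels."""
    def disagreements(t):
        return sum((x > t) != bool(y) for x, y in zip(xs, weak_labels))
    return min(thresholds, key=disagreements)

xs = list(range(-10, 11))
true = [int(x > 0) for x in xs]
# Weak supervisor: knows the right rule but flips every 4th label (~71% accurate).
weak = [1 - y if i % 4 == 0 else y for i, y in enumerate(true)]

t = train_strong_on_weak(xs, weak, thresholds=xs)
strong = [int(x > t) for x in xs]
weak_acc = sum(a == b for a, b in zip(weak, true)) / len(true)
strong_acc = sum(a == b for a, b in zip(strong, true)) / len(true)
```

Because the learner's hypothesis class contains the true rule but cannot represent the supervisor's noise pattern, fitting the noisy labels still recovers the clean rule and the "student" outperforms its supervisor, a cartoon of the paper's elicitation story.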
3) Audiobox - Meta's Audiobox is a unified flow-matching audio model that generates speech, sound effects, and music from natural-language and example prompts.
● Unified audio generation: Single model handles speech, sound, and music - ending the typical pattern of one model per audio modality.
● Description + example prompting: Supports both natural-language descriptions and reference-audio examples for style control, letting users mix semantic and acoustic conditioning.
● Self-supervised infilling: Adapts a self-supervised infilling objective to pretrain on large unlabeled audio, reducing dependence on scarce labeled speech/music datasets.
● Novel voice/styles: Unlocks generation of novel vocal and acoustic styles by interpolating in the learned audio space, going beyond reproduction of training-set styles.
Paper, Tweet
4) Mathematical LLMs Survey - A survey on the progress of LLMs on mathematical reasoning tasks, covering methods, benchmarks, and open problems.
● Task taxonomy: Covers math word problem solving, symbolic reasoning, and theorem proving, showing which capabilities emerge at which model scales.
● Methods landscape: Reviews prompting techniques (CoT, PoT, ToT, self-verification) alongside fine-tuning and tool-use approaches.
● Dataset reference: Catalogs the dominant math benchmarks (GSM8K, MATH, MiniF2F, etc.) and their evaluation methodologies.
● Frontier problems: Highlights reasoning-faithfulness, formal-vs-informal math integration, and reward-model design as the key open questions.
Paper, Tweet
5) LLM360 - LLM360 is a framework for fully transparent open-source LLM development, with everything from data to training dynamics released.
● End-to-end transparency: Ships training code, the pretraining corpus, intermediate checkpoints, evaluation code, and analyses - going well beyond the "just weights" openness of earlier "open" LLMs.
● Two 7B models: Releases AMBER (general) and CRYSTALCODER (code-specialized) 7B models pretrained from scratch under the framework.
● Enables training-dynamics research: Intermediate checkpoints let researchers study loss trajectories, emergent capabilities, and data-effect ablations - typically only possible inside frontier labs.
● Standard for openness: Pushes the community's definition of "open-source LLM" from weights to a full training-pipeline standard.
Paper, Tweet
6) LLMs in Medicine - A comprehensive survey (300+ papers) of LLMs applied to medicine, from clinical tasks to biomedical research.
● Principles and applications: Covers the core principles of medical LLMs and their applications across clinical decision support, patient communication, medical education, and biomedical research.
● Benchmark coverage: Reviews medical QA benchmarks (MedQA, PubMedQA, MedMCQA, etc.) and their limitations for real clinical settings.
● Challenges: Identifies challenges specific to medicine including hallucination in clinical advice, privacy, regulatory compliance, and equity/bias concerns.
● Deployment considerations: Discusses what's required for safe deployment, including evaluation, monitoring, and the role of clinician oversight.
Paper, Tweet
7) Beyond Human Data (ReST-EM) - DeepMind's ReST-EM shows that model-generated data plus a reward function can substantially reduce dependence on human-generated data.
● Expectation-Maximization framing: Generates candidate solutions from the current model, filters using a reward/verifier, and fine-tunes on the filtered set - repeat.
● Verifiable rewards: Uses automatic verifiers (e.g., correct-answer checks) as the reward signal, sidestepping the need for a learned reward model on scarce tasks.
● PaLM 2 gains: Scales effectively on PaLM 2 for math and code tasks, outperforming standard SFT on human data at matched compute.
● Synthetic-data signal: A strong empirical case that self-generated filtered data can replace much of the human data bottleneck for reasoning tasks - a theme that grew through 2024.
Paper, Tweet
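The ReST-EM loop alternates sampling and filtered fine-tuning. A toy sketch (all callables are stand-ins; the "model" here is just a probability of answering one fixed question correctly):

```python
import random

def rest_em(generate, verify, finetune, model, rounds=3, samples=64):
    """Toy sketch of the ReST-EM loop: sample candidate solutions from the
    current model (E-step), keep only those the verifier accepts, then
    fine-tune on the filtered set (M-step)."""
    for _ in range(rounds):
        candidates = [generate(model) for _ in range(samples)]
        accepted = [c for c in candidates if verify(c)]  # verifiable reward
        if accepted:
            model = finetune(model, accepted)
    return model

# Toy model: probability of emitting the correct answer "4" to "2+2".
rng = random.Random(0)
model = {"p_correct": 0.3}
gen = lambda m: "4" if rng.random() < m["p_correct"] else "5"
ver = lambda ans: ans == "4"
# "Fine-tuning" nudges the model toward the accepted behaviour.
ft = lambda m, acc: {"p_correct": min(1.0, m["p_correct"] + 0.1 * len(acc) / 64)}
model = rest_em(gen, ver, ft, model)
```

The key structural point survives even in the cartoon: the only supervision entering the loop is the automatic verifier, so no human labels are consumed as the model improves.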
8) Gaussian-SLAM - A neural RGBD SLAM method that extends 3D Gaussian Splatting to achieve photorealistic scene reconstruction without sacrificing speed.
● 3D Gaussians for SLAM: Represents scenes as 3D Gaussians rather than neural fields, inheriting the fast training and rendering of Gaussian Splatting.
● Photorealistic reconstruction: Produces significantly higher-fidelity reconstructions than prior neural SLAM methods at comparable or better runtime.
● RGBD input: Uses standard RGB+depth input streams, making it compatible with off-the-shelf depth cameras for practical deployment.
● Speed/quality Pareto: Advances the Pareto frontier for RGBD SLAM, where previous methods forced a trade-off between runtime and photorealism.
Paper, Tweet
9) Pearl - Meta's Pearl is a production-ready reinforcement learning agent package designed for real-world deployment constraints.
● Production-oriented design: Built for real-world environments with limited observability, sparse feedback, and high stochasticity - conditions that usually break research-oriented RL libraries.
● Modular components: Offers modular policy networks, exploration strategies, offline RL, and safety constraints that can be composed for specific applications.
● Research + practice: Targets both researchers building new RL agents and practitioners deploying RL in production recommender systems, ranking, and control.
● Meta internal use: Reflects learnings from Meta's internal deployments, making it a rare RL library that starts from production pain rather than benchmark scores.
Paper, Tweet
10) QuIP# - Cornell's QuIP# is a 2-bit LLM quantization scheme that combines lattice codebooks with incoherence processing to close the quality gap to FP16.
● Lattice codebooks: Uses E8 lattice codebooks for weight quantization, a classical lattice-quantization technique adapted to LLM weight matrices.
● Incoherence processing: Pre-processes weight matrices to make them "incoherent" (less structured along axes), which improves lattice-quantization fidelity.
● 2-bit at 16-bit quality: Significantly closes the gap between 2-bit quantized LLMs and their unquantized 16-bit counterparts across a range of LLaMA-family models.
● Deployment impact: Makes large LLMs (e.g., Llama 2 70B) fit into consumer-grade GPU memory without catastrophic quality loss, expanding the set of models hobbyists can run locally.
Paper, Tweet
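For intuition about what 2-bit weight storage means, here is a toy uniform 2-bit quantizer. It illustrates only the memory math; QuIP# itself uses E8 lattice codebooks with incoherence processing, which this sketch does not implement:

```python
def quantize_2bit(weights):
    """Toy uniform quantizer: map each weight to one of 4 levels (2 bits),
    then dequantize back to floats to inspect the reconstruction."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 3 or 1.0              # 2 bits -> 4 levels, 3 gaps
    codes = [round((w - lo) / scale) for w in weights]   # stored as 2-bit ids
    dequant = [lo + c * scale for c in codes]            # reconstruction
    return codes, dequant

codes, deq = quantize_2bit([-1.0, -0.2, 0.4, 1.0])
```

A uniform grid like this loses far more fidelity than QuIP#'s lattice codebooks; the incoherence preprocessing exists precisely to make weight matrices look more like the noise a good lattice quantizer handles well.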

Top AI Papers of the Week (December 4 - December 10)

1) Gemini 1.0 - Google launches Gemini 1.0, a multimodal family natively designed to reason across text, images, video, audio, and code from the ground up.
● Three tiers: Ships as Ultra (frontier), Pro (balanced), and Nano (on-device), covering everything from data-center reasoning to mobile inference.
● Native multimodality: Unlike "bolted-on" multimodal models, Gemini is trained multimodally from scratch, with joint tokenization across text, image, video, audio, and code.
● MMLU milestone: Gemini Ultra reports the first MMLU score above human-expert performance (90.0%), using chain-of-thought with uncertainty-weighted majority voting.
● Broad capability claims: Ultra sets SOTA on 30 of 32 benchmarks in the report, spanning multimodality, multilinguality, factuality, summarization, math/science, long-context, and reasoning.
Paper, Tweet
2) EfficientSAM - Meta's EfficientSAM is a lightweight Segment Anything variant that preserves most of SAM's zero-shot quality at a fraction of the compute.
● Masked autoencoder pretraining: Uses a SAMI (SAM-leveraged masked image) pretraining objective where a small student learns to reconstruct features aligned with the SAM teacher.
● 20x smaller and faster: Achieves roughly 20x fewer parameters and 20x faster runtime than the original SAM image encoder.
● Near-parity quality: 44.4 AP vs. 46.5 AP on zero-shot instance segmentation (about 2 points behind) despite the dramatic efficiency win.
● Deployment-ready: Makes SAM-grade segmentation feasible on commodity hardware, consumer devices, and real-time applications where the original SAM is too heavy.
Paper, Tweet
3) Magicoder - Magicoder is a fully open-source code LLM that closes the gap with top commercial code models at only 7B parameters via high-quality synthetic instruction data.
● OSS-Instruct data: Generates 75K synthetic instruction pairs by seeding GPT with snippets pulled from open-source code, producing more diverse and realistic training data than prior code SFT datasets.
● Broad coverage: Training data spans Python, multilingual programming, and data-science program completion, producing a genuinely general code model rather than a Python-only model.
● HumanEval+ win: MagicoderS-CL-7B (based on CodeLlama) surpasses ChatGPT on HumanEval+ with 66.5 vs. 65.9 pass@1, despite being 7B.
● Fully open: Ships with code, data, and weights, positioning Magicoder as a reproducible open baseline for instruction-tuned code generation.
Paper, Tweet
4) LLMs on Graphs - A comprehensive overview of the many ways LLMs can be applied to graph-structured data and when each pattern is useful.
● Three graph scenarios: Organizes the space by whether graphs are pure (no text), text-rich (nodes/edges carry natural language), or text-paired (graphs alongside documents).
● Three role taxonomies: Categorizes LLMs as predictors, enhancers, or aligners with GNNs - clarifying whether the LLM is the model, a feature source, or a supervisor.
● Task coverage: Spans node classification, link prediction, graph-level tasks, and reasoning over knowledge graphs.
● Open problems: Flags scalability to large graphs, handling of graph structure without loss, and integration with tool-augmented LLMs as the key unsolved directions.
Paper, Tweet
5) Llama Guard - Meta's Llama Guard is a compact, instruction-tuned safety classifier built on Llama 2-7B for input/output moderation in conversational AI.
● Llama 2-7B base: Small enough to run inline with a main generative model while handling both prompt- and response-level safety classification.
● Customizable taxonomy: The safety taxonomy is specified in the instruction prompt itself, so operators can adapt it to their use case without retraining.
● Zero-shot and few-shot: Works off the shelf for many taxonomies in zero- or few-shot mode, and can be fine-tuned on a specific policy dataset when needed.
● Open release: Ships as an open model, filling a gap for teams that want local, auditable safety classification rather than relying solely on API-side moderation.
Paper, Tweet
6) KTO (Kahneman-Tversky Optimization) - Contextual AI introduces KTO, an alignment objective derived from prospect theory that works with binary "good/bad" signals instead of preference pairs.
● Prospect-theory motivation: Models reward as a Kahneman-Tversky value function with loss aversion, replacing DPO's log-likelihood-of-preferences objective with utility maximization.
● No preference pairs needed: Works with unpaired good/bad signals, dramatically loosening data collection requirements compared to DPO or RLHF.
● Matches/beats DPO: Matches or exceeds DPO performance at model scales from 1B to 30B, a clean empirical win at similar training cost.
● Practical data advantage: Makes alignment much cheaper to run in production where paired preference data is rare but outcome feedback ("user liked/didn't like") is abundant.
Paper, Tweet
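The loss-aversion idea can be sketched as an asymmetric sigmoid value function over unpaired examples (an illustration of the prospect-theory shape with invented weights, not the paper's exact objective):

```python
import math

def kto_style_loss(reward, desirable, beta=1.0, lam_d=1.0, lam_u=1.5):
    """Toy Kahneman-Tversky-style objective on unpaired data: a sigmoid
    value function over the implicit reward, with a larger weight on
    undesirable examples to model loss aversion."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    if desirable:
        return lam_d * (1.0 - sigmoid(beta * reward))    # push reward up
    return lam_u * (1.0 - sigmoid(-beta * reward))       # push reward down

# A desirable example with high reward is cheap; the same reward on an
# undesirable example is penalized harder (loss aversion).
low = kto_style_loss(reward=2.0, desirable=True)
high = kto_style_loss(reward=2.0, desirable=False)
```

Note what the data looks like: each example carries only a binary good/bad flag, never a paired comparison, which is the practical point the paper makes against DPO-style preference pairs.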
7) Chain of Code - DeepMind's Chain of Code extends CoT by encouraging LMs to write pseudocode that mixes real code with LM-simulated sub-routines.
● LMulator: The LM generates pseudocode programs and explicitly annotates sub-tasks that can't be executed; an "LMulator" simulates those sub-tasks with the LM while the interpreter handles the rest.
● Undefined-behavior handling: The interpreter catches undefined behavior and cleanly hands off to the LM, sidestepping the brittleness of code-first approaches that fail silently on hard ops.
● 84% on BIG-Bench Hard: Achieves 84% on BIG-Bench Hard - a 12-point gain over Chain of Thought and a clean demonstration that mixing exact execution with LM simulation beats either alone.
● Broad applicability: Works across math, logic, and commonsense reasoning, positioning Chain of Code as a general-purpose CoT upgrade.
Paper, Tweet
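The interpreter/LMulator handoff can be sketched directly: try to execute each step with the real interpreter, and fall back to an LM stand-in when execution raises (the fake_lm below is a hypothetical stand-in, not a model call):

```python
def lmulator(steps, lm_fallback, state=None):
    """Toy sketch of Chain of Code's split: each pseudocode step is run by
    the real Python interpreter when possible, and handed to an LM stand-in
    when execution fails (e.g. an undefined semantic helper)."""
    state = dict(state or {})
    for step in steps:
        try:
            exec(step, {}, state)             # exact execution where possible
        except Exception:
            state = lm_fallback(step, state)  # LM "simulates" the fuzzy step
    return state

# "is_sarcastic(...)" is not real code, so the interpreter raises and the
# stand-in LM supplies a judgement for that step instead.
fake_lm = lambda step, st: {**st, "sarcastic": True}
out = lmulator(
    ["text = 'Oh, great.'", "sarcastic = is_sarcastic(text)", "n = len(text)"],
    fake_lm,
)
```

The division of labor is the whole trick: exact steps (string handling, arithmetic) stay exact, while only the genuinely semantic sub-tasks fall back to LM simulation.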
8) Data Management for LLMs - A survey of data-management research for LLM pretraining and supervised fine-tuning stages.
● Pretraining data: Covers data quantity, quality filtering, deduplication, domain composition, and curriculum strategies for large-scale pretraining.
● SFT data: Reviews instruction-data generation, quality filtering, diversity metrics, and the emerging literature on "less is more" for SFT.
● Domain and task composition: Examines how task mixing affects generalization vs. specialization in fine-tuning.
● Open challenges: Identifies dataset contamination, deduplication at trillion-token scale, and reproducible data recipes as the top open problems.
Paper, Tweet
9) RankZephyr - RankZephyr is an open-source LLM for listwise zero-shot reranking that bridges the effectiveness gap with GPT-4.
● Listwise zero-shot: Reranks a full candidate list in a single shot rather than doing pairwise or pointwise scoring, matching the paradigm GPT-4 uses most effectively.
● Open-source: Based on the open Zephyr chat model, releasing a fully reproducible stack for high-quality reranking.
● Matches/beats GPT-4: Competitive with GPT-4 on standard reranking benchmarks and outperforms GPT-4 on NovelEval, a post-training-cutoff benchmark resistant to contamination.
● Contamination-free win: The NovelEval advantage is particularly meaningful because it addresses the concern that GPT-4's strong reranking numbers are partly driven by memorization of benchmark queries.
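The listwise pattern can be sketched as below. This is an illustrative format only (the real system is a fine-tuned Zephyr checkpoint): all candidates go into one prompt, and the model emits a permutation string such as `"[2] > [1] > [3]"` that is parsed back into a document ordering.

```python
import re

def build_listwise_prompt(query, docs):
    # One prompt containing the full candidate list, not pairwise comparisons.
    lines = [f"Query: {query}", "Rank the passages by relevance:"]
    lines += [f"[{i + 1}] {d}" for i, d in enumerate(docs)]
    return "\n".join(lines)

def parse_permutation(output, n):
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", output)]
    seen = [i for i in order if 0 <= i < n]
    # append any candidates the model dropped, in original order
    return seen + [i for i in range(n) if i not in seen]

docs = ["cats are mammals", "LLM rerankers sort passages", "weather today"]
prompt = build_listwise_prompt("how do LLM rerankers work?", docs)
ranking = parse_permutation("[2] > [1] > [3]", len(docs))
print([docs[i] for i in ranking][0])  # -> "LLM rerankers sort passages"
```

The defensive re-append step matters in practice: listwise models sometimes omit candidates, and a robust parser must still return a total ordering.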
Paper, Tweet
10) The Efficiency Spectrum of LLMs - A comprehensive review of algorithmic advancements for improving LLM efficiency across the full training-to-inference stack.
● Scaling laws and data: Covers how scaling laws and data-utilization strategies interact with efficiency - more isn't always better under compute constraints.
● Architectural innovations: Reviews attention variants, state-space models, MoE, and other architectural levers for efficient scaling.
● Training and tuning: Catalogs PEFT methods (LoRA, adapters, prefix tuning), quantization-aware training, and curriculum-based training strategies.
● Inference techniques: Surveys quantization, pruning, speculative decoding, KV-cache optimization, and batching as the inference-time efficiency toolkit.
Paper, Tweet

Top AI Papers of the Week (November 27 - December 3)

Paper Links
1) GNoME - DeepMind's Graph Networks for Materials Exploration (GNoME) is an AI system that discovered 2.2 million new crystal structures, including 380,000 thermodynamically stable ones.
● 2.2M new crystals: Dramatically expands the known crystal inventory, with 380,000 stable materials - an order-of-magnitude leap over prior computational chemistry.
● Graph networks for stability: Predicts formation energies and stability of candidate materials using graph neural networks trained on DFT-labeled data.
● Active-learning loop: Combines exploration (proposing candidate structures) with exploitation (prioritizing high-stability candidates), iteratively expanding the frontier of known materials.
● Autonomous lab validation: A subset of predictions was validated in Berkeley's autonomous materials lab, closing the prediction-to-synthesis loop for the first time at this scale.
Paper, Tweet
2) Open-Source LLMs vs. ChatGPT - A survey cataloguing tasks where open-source LLMs claim to be on par with or better than ChatGPT.
● Task-by-task audit: Organizes claims by task category (code, math, reasoning, summarization, etc.) with the specific open models and benchmarks backing each claim.
● Gap measurement: Clarifies where open-source genuinely closes the gap vs. where "comparable" actually hides meaningful performance differences.
● Critical lens: Calls out evaluation-methodology issues in specific open-source claims, including benchmark contamination, cherry-picked subsets, and inconsistent judge setups.
● 2023 snapshot: Captures where open-source LLMs stood at the end of 2023 - a useful reference point for tracking how the gap evolved through 2024.
Paper, Tweet
3) Adversarial Diffusion Distillation (SDXL Turbo) - Stability AI's ADD trains a student diffusion model that produces high-quality images in just 1-4 sampling steps.
● Score distillation + adversarial loss: Combines score-distillation from a teacher diffusion model with an adversarial loss to maintain image fidelity in the low-step regime.
● 1-4 step generation: Produces usable images in a single step and SoTA-quality images in four, compared to 25-50 steps for typical SDXL sampling.
● Matches multi-step SoTA: Achieves image quality comparable to state-of-the-art diffusion baselines at four steps, dramatically cutting inference cost.
● Real-time generation: Enables SDXL-quality images at real-time frame rates on consumer GPUs, unlocking interactive creative tooling that was previously impractical.
Paper, Tweet
4) Seamless - Meta's Seamless is a family of models for end-to-end expressive, streaming cross-lingual speech communication.
● SeamlessExpressive: Preserves the speaker's expressive characteristics (pitch, emotion, pauses) across translation rather than flattening them into neutral speech.
● SeamlessStreaming: Produces translated speech in a streaming fashion with low latency, enabling near-real-time conversational translation.
● Low-resource coverage: An improved SeamlessM4T is trained on more low-resource language data, broadening the language coverage meaningfully beyond the original M4T.
● Safety red-teaming: Meta applies a red-teaming effort specifically for multimodal translation safety, a recognition that MT systems can amplify harmful content across languages.
Paper, Tweet
5) MEDITRON-70B - EPFL's MEDITRON is an open-source family of medical LLMs at 7B and 70B parameters, continually pretrained on curated medical corpora.
● Llama 2 base + medical pretraining: Builds on Llama 2 with continual pretraining on a curated medical corpus covering clinical papers, guidelines, and textbooks.
● Strong open medical baseline: MEDITRON-70B outperforms GPT-3.5 and Med-PaLM on standard medical QA benchmarks while being open-source.
● Close to frontier: Comes within 5% of GPT-4 and 10% of Med-PaLM 2 on MultiMedQA - competitive given the much smaller scale and open release.
● Reproducible recipe: Ships with pretraining data, code, and weights, providing a reproducible starting point for researchers and institutions building medical LLMs.
Paper, Tweet
6) Medprompt - Microsoft researchers show that careful prompt engineering can push general-purpose GPT-4 to state-of-the-art on medical benchmarks, no domain fine-tuning required.
● General-purpose prompting: Uses purely general-purpose prompt-engineering techniques (CoT, dynamic few-shot, choice-shuffling ensembling) with no medical-domain specialization.
● Medprompt recipe: Combines k-nearest-neighbor example selection, GPT-4-generated chain-of-thought rationales, and choice-shuffling to cancel answer-position biases.
● SoTA on 9 benchmarks: Achieves state-of-the-art on all nine benchmarks in MultiMedQA, beating Med-PaLM 2 and other specialized medical models.
● Broader lesson: Reopens the question of whether domain-specific pretraining is actually necessary when a frontier base model is paired with strong prompting - a framing that has recurred in later debates.
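The choice-shuffling component can be sketched as follows. This is a minimal illustration of the mechanics, with `biased_model` as a hypothetical stand-in for GPT-4: options are shuffled before each of k calls, the pick is mapped back to the original option, and a majority vote cancels answer-position bias.

```python
import random
from collections import Counter

def biased_model(question, options):
    # Stub: always finds the correct option. A real model's answer-position
    # bias is exactly what shuffling + majority voting cancels out.
    return options.index("aspirin")

def choice_shuffle_ensemble(question, options, k=5, seed=0):
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(k):
        shuffled = options[:]
        rng.shuffle(shuffled)                    # new option order each call
        pick = biased_model(question, shuffled)  # model answers on shuffled list
        votes[shuffled[pick]] += 1               # map back to the original option
    return votes.most_common(1)[0][0]

options = ["ibuprofen", "aspirin", "warfarin", "insulin"]
answer = choice_shuffle_ensemble("Which drug is an antiplatelet?", options)
print(answer)  # -> aspirin
```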
Paper, Tweet
7) UniIR - UniIR is a unified instruction-guided multimodal retriever that handles eight retrieval tasks across modalities with a single model.
● Instruction-guided: A single retriever conditioned on natural-language instructions determines which retrieval task to perform, rather than one retriever per task.
● Eight tasks: Handles image-to-text, text-to-image, composed-image retrieval, video retrieval, and other multimodal variants under one umbrella.
● Zero-shot generalization: Generalizes to unseen retrieval tasks not explicitly trained on, approaching a truly general multimodal retrieval model.
● M-BEIR benchmark: Ships with a new multimodal retrieval benchmark (M-BEIR) designed to standardize evaluation across tasks and modalities.
Paper, Tweet
8) Safe Deployment of Generative AI (Nature) - A Nature correspondence arguing that medical professionals - not commercial interests - must drive the development and deployment of generative AI in medicine.
● Privacy-first framing: Centers patient-privacy considerations as the non-negotiable constraint on medical AI deployment.
● Professional governance: Calls for clinician-led governance structures rather than commercial self-regulation, citing past failures of tech-industry oversight in regulated domains.
● Deployment guardrails: Recommends guardrails including consent, transparency of training data, and clinician accountability for AI-assisted decisions.
● Policy signal: As a Nature piece, amplifies medical-community concerns into the broader AI policy conversation at a key moment in the regulation debate.
Paper, Tweet
9) Dobb-E - NYU's Dobb-E is an affordable household-manipulation robot that learns new tasks with just 5 minutes of user demonstrations.
● 5 minutes of demos: Learns new household manipulation tasks from only ~5 minutes of demonstrations, a dramatic reduction from typical data requirements.
● Hardware design: Uses a low-cost stick-on gripper and a smartphone-driven data-collection rig, keeping the barrier to entry low for non-expert users.
● Home-specific challenges: Experiments in real homes surface challenges usually hidden in lab robotics - strong shadows, variable demo quality, and household-specific clutter.
● General-purpose household system: Positions Dobb-E as a general-purpose system for household robotics rather than a task-specific demonstrator, a step toward practical home robots.
Paper, Tweet
10) Translatotron 3 - Google's Translatotron 3 performs speech-to-speech translation using only monolingual data - no parallel corpora required.
● Fully unsupervised S2S: Learns direct speech-to-speech translation from monolingual data alone, a first for this task.
● Three-component architecture: Combines a masked autoencoder for speech representation, unsupervised embedding mapping across languages, and back-translation for alignment.
● Beats cascade baselines: Outperforms a comparable cascade of ASR + MT + TTS, a surprising result given cascade systems are typically the strong baseline.
● Paralinguistic preservation: Preserves paralinguistic features - pauses, speaking rates, and speaker identity - that cascaded systems tend to wash out in translation.
Paper, Tweet

Top AI Papers of the Week (November 20 - November 26)

Paper Links
1) System 2 Attention (S2A) - Meta's S2A uses the LLM's own reasoning to decide what context actually matters, regenerating a clean prompt before the final response step.
● Two-pass prompting: First pass uses the LLM to filter/regenerate the input context, removing irrelevant or misleading content; second pass generates the final answer from the clean context.
● Addresses distraction: Directly targets the well-known problem that LLMs attend to irrelevant or manipulative content (e.g., opinion-laden context that biases answers).
● Factuality gains: Increases factuality on QA and reduces the model's sensitivity to biased framing or distractors inserted into the prompt.
● Math word problems: Outperforms standard attention-based LLMs on math word problems, where filtering irrelevant details is often the hard part of the task.
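The two-pass structure can be sketched as below, with `call_llm` as a deterministic hypothetical stand-in for a real chat-completion call so the example runs: pass 1 regenerates only the relevant portion of the context, pass 2 answers from the cleaned context alone.

```python
FILTER_TEMPLATE = (
    "Extract only the parts of the context that are relevant and unbiased "
    "for answering the question. Drop opinions and distractors.\n"
    "Context: {context}\nQuestion: {question}\nRelevant context:"
)
ANSWER_TEMPLATE = "Context: {context}\nQuestion: {question}\nAnswer:"

def call_llm(prompt):
    # Toy stub: "filters" by dropping opinion sentences, "answers" by
    # echoing the surviving fact. A real system calls an actual LLM twice.
    context = prompt.split("Context: ")[1].split("\nQuestion:")[0]
    if prompt.startswith("Extract"):
        facts = [s for s in context.split(". ") if "I think" not in s]
        return ". ".join(facts)
    return context

def s2a_answer(context, question):
    clean = call_llm(FILTER_TEMPLATE.format(context=context, question=question))
    return call_llm(ANSWER_TEMPLATE.format(context=clean, question=question))

ctx = "Paris is the capital of France. I think Lyon is nicer and bigger"
print(s2a_answer(ctx, "What is the capital of France?"))
```

The opinion-laden distractor never reaches the answering pass, which is the whole point: the second call cannot be biased by context it never sees.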
Paper, Tweet
2) Advancing Long-Context LLMs - A survey of methodologies for improving Transformer long-context capability across pretraining, fine-tuning, and inference stages.
● Full-stack coverage: Organizes methods by training stage - pretraining objectives, position encoding, fine-tuning recipes, and inference-time interventions.
● Position-encoding deep dive: Reviews RoPE variants, ALiBi, and other positional-encoding choices that dominate long-context extrapolation.
● Efficient attention: Catalogs sparse, linear, and memory-augmented attention mechanisms that make longer contexts tractable.
● Evaluation considerations: Addresses benchmark limitations including the "needle in a haystack" problem and the gap between nominal context length and effective usable context.
Paper, Tweet
3) Parallel Speculative Sampling - Amazon researchers propose a parallel variant of speculative sampling that achieves significant LLM inference speedups with minimal extra parameters.
● Parallel decoding: Combines speculative sampling with parallel decoding so multiple tokens can be generated and verified in a single pass.
● Tiny overhead: Requires learning only O(d_emb) additional parameters, far fewer than typical speculative-decoding draft models.
● Up to 30% speedup: Achieves up to 30% end-to-end inference speedup without compromising output quality.
● Minimal integration cost: Unlike separate-draft-model speculative decoding, this fits inside the main model with essentially no deployment overhead.
Paper, Tweet
4) Mirasol3B - Google's Mirasol3B is a multimodal model that decouples modalities into focused autoregressive components rather than forcing a single fused stream.
● Decoupled autoregressive modeling: Separates audio/video processing from text processing into focused autoregressive components that communicate through learned cross-modal interfaces.
● Handles longer videos: The decoupled design lets the model handle longer video inputs than typical end-to-end multimodal models constrained by sequence length.
● Modality-specific processing: Inputs are processed according to their modalities with appropriate tokenization rather than forcing a one-size-fits-all tokenizer.
● SoTA on video benchmarks: Outperforms prior methods on video QA, long-video QA, and audio-video-text benchmarks, validating the decoupled approach.
Paper, Tweet
5) Teaching Small LMs to Reason - An approach that teaches smaller language models to explicitly select among reasoning techniques for each problem.
● Reasoning technique menu: Trains the small LM to choose among step-by-step processing, recall-then-generate, recall-reason-generate, extract-generate, and direct-answer strategies.
● Technique selection: The model learns when to apply each strategy based on problem structure, not just which answer to produce.
● Matches 5-10x larger models: Attains zero-shot reasoning performance similar to or better than that of models 5-10x larger on complex reasoning tasks.
● Practical scaling: Offers a recipe for teams that can't deploy frontier-scale models but need strong reasoning quality - a recurring production constraint.
Paper, Tweet
6) GPQA - A graduate-level Google-proof QA benchmark designed to stress-test reasoning in systems that might exceed human expertise.
● 448 expert questions: Consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.
● Google-proof by design: Questions are constructed so that even highly skilled non-experts with unrestricted internet access reach only ~34% accuracy, barely above the 25% chance baseline for four-option questions.
● GPT-4 gets 39%: The strongest GPT-4 baseline hits only 39% accuracy, showing a clear headroom for frontier models on expert-level reasoning.
● Scalable oversight testbed: Explicitly designed to enable scalable oversight research - experiments in supervising models whose knowledge may exceed the supervisors'.
Paper, Tweet
7) Hitchhiker's Guide From CoT to Agents - A survey mapping the conceptual evolution from chain-of-thought reasoning to modern language-agent frameworks.
● CoT foundations: Covers the mechanics underpinning CoT (few-shot prompting, self-consistency, least-to-most, tree-of-thought) with a consistent formalism.
● Mechanism theory: Explores why CoT works - in-context learning, prompt engineering theories, and emergence at scale - rather than just cataloging results.
● CoT-to-agent bridge: Traces how CoT techniques were progressively extended into tool use, multi-step planning, and full agent loops (ReAct, Reflexion, etc.).
● Framework landscape: Organizes the modern language-agent frameworks by which parts of the CoT-to-agent pipeline they emphasize, clarifying an otherwise noisy field.
Paper, Tweet
8) GAIA - Meta's GAIA is a benchmark for general AI assistants that requires reasoning, multimodal handling, web browsing, and tool use to solve real-world questions.
● Real-world questions: Questions are conceptually simple for humans but require integrated reasoning, web research, and tool use - a realistic test for assistant-style AI.
● Massive human-model gap: Humans achieve 92% accuracy while GPT-4 with plugins achieves only 15% - the widest human-AI gap on any major 2023 benchmark.
● Level-graduated difficulty: Three difficulty levels let researchers measure incremental progress rather than just binary success/failure.
● Agent-first evaluation: Explicitly designed to test AI assistants, not base LLMs - a framing that has since become dominant for agent evaluations.
Paper, Tweet
9) MedAgents - A collaborative multi-round framework for medical reasoning that uses role-playing LLM agents to improve accuracy and reasoning depth.
● Multi-agent deliberation: Multiple LLM agents take on specialist roles (e.g., different medical specialties) and deliberate in rounds over a case.
● Role-playing: Each agent has a defined role-play prompt that scopes its expertise and reasoning style, producing more diverse intermediate hypotheses.
● Consensus protocol: Agents iterate until reaching consensus or until a moderator resolves disagreements, producing a final answer with rationale.
● Reasoning gains: Improves accuracy and reasoning quality on medical QA benchmarks compared to single-agent baselines at matched compute.
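The deliberation loop can be sketched as below. Everything here is an illustrative toy: `specialist` stands in for role-prompted LLM calls, and the hard-coded priors simply demonstrate the consensus mechanics.

```python
from collections import Counter

def specialist(role, case, peer_answers, round_no):
    # Toy behavior: the radiologist defers to the majority after seeing
    # peers disagree; the other agents stick with their initial answers.
    priors = {"cardiologist": "A", "neurologist": "A", "radiologist": "B"}
    if round_no > 0 and peer_answers and role == "radiologist":
        return Counter(peer_answers).most_common(1)[0][0]
    return priors[role]

def deliberate(case, roles, max_rounds=3):
    answers = []
    for r in range(max_rounds):
        # each agent answers after seeing the previous round's answers
        answers = [specialist(role, case, answers, r) for role in roles]
        if len(set(answers)) == 1:  # consensus reached
            return answers[0]
    # moderator resolves remaining disagreement by majority vote
    return Counter(answers).most_common(1)[0][0]

roles = ["cardiologist", "neurologist", "radiologist"]
print(deliberate("case: chest pain ...", roles))  # -> A
```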
Paper, Tweet
10) TÜLU 2 - Allen AI's TÜLU 2 is a suite of improved open instruction-tuned LLMs and an accompanying study of adaptation best practices.
● Open suite: Releases open models that match or exceed GPT-3.5-turbo-0301 on several benchmarks, a meaningful milestone for the open ecosystem at the time.
● Post-training recipe: The paper doubles as a practical recipe, documenting how instruction data curation, mixing ratios, and DPO-based preference training interact.
● UltraFeedback preference data: Uses UltraFeedback for preference optimization, validating that openly released preference datasets are sufficient to close much of the gap to commercial post-training pipelines.
● Adaptation research platform: Explicitly positioned as a platform for studying open adaptation techniques, informing the TÜLU 3 release that would follow in 2024.
Paper, Tweet

Top AI Papers of the Week (November 13 - November 19)

Paper Links
1) Emu Video and Emu Edit - Meta releases Emu Video and Emu Edit, a pair of diffusion models targeting controlled text-to-video generation and instruction-based image editing.
● Emu Video: Generates high-quality video from text-only, image-only, or combined text + image inputs using a factorized diffusion approach - text-to-image followed by image-conditioned video.
● Emu Edit: Enables free-form image editing through text instructions, handling region, local, and global edits within one model.
● Factorized video: The text-to-image then image-to-video split dramatically cuts training cost and improves controllability compared to end-to-end T2V models.
● Unified research line: Both models extend Meta's Emu foundation family, pointing toward a unified multimodal generative stack shared across image, video, and edit tasks.
Paper, Tweet
2) Chain-of-Note (CoN) - Tencent's Chain-of-Note adds an explicit note-taking step to RAG so the model can evaluate retrieved evidence before answering.
● Sequential notes: For each retrieved document, the model writes a "reading note" assessing relevance to the question, rather than attending to the entire retrieval dump directly.
● Noise robustness: +7.9 EM improvement when retrieved documents are entirely noisy, precisely the regime where standard RAG degrades most.
● Unknown-scenario handling: +10.5 rejection-rate improvement on questions outside the model's training scope, a key property for avoiding confident hallucinations.
● Generalizable pattern: The note-taking step is a lightweight addition on top of existing RAG pipelines, making it easy to adopt incrementally.
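The note-then-answer flow can be sketched as follows. This is an illustrative toy: `note_llm` and `answer_llm` are hypothetical stand-ins for real LLM calls, and the keyword-overlap "relevance" check is only there to make the sketch deterministic.

```python
def note_llm(question, doc):
    # Toy note-taker: flags a doc relevant if it shares a term with the
    # question. A real system has the LLM write a free-form reading note.
    relevant = any(w in doc.lower() for w in question.lower().split())
    verdict = "relevant" if relevant else "irrelevant"
    return f"Note: this document is {verdict} to the question."

def answer_llm(question, docs):
    return f"Answer based on {len(docs)} document(s)."

def chain_of_note(question, retrieved):
    notes = [(doc, note_llm(question, doc)) for doc in retrieved]
    useful = [doc for doc, note in notes if "irrelevant" not in note]
    if not useful:
        return "unknown"  # reject rather than hallucinate an answer
    return answer_llm(question, useful)

docs = ["Mitochondria produce ATP.", "Stock prices fell on Tuesday."]
print(chain_of_note("What do mitochondria produce?", docs))
print(chain_of_note("Who won the 1998 World Cup?", docs))  # -> unknown
```

The second call shows the rejection behavior the paper measures: when every note marks its document irrelevant, the pipeline abstains instead of answering from noise.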
Paper, Tweet
3) LLMs for Scientific Discovery - A broad evaluation of GPT-4 across scientific disciplines including drug discovery, biology, and computational chemistry.
● Expert-driven assessment: Domain experts design case studies to probe GPT-4's understanding of complex scientific concepts and its ability to solve real research problems.
● Problem-solving capability: GPT-4 demonstrates meaningful problem-solving in many domains but shows systematic weaknesses on tasks requiring precise numerical reasoning or experimental design.
● Benchmark coverage: Complements qualitative case studies with quantitative benchmarks, triangulating on where current frontier models help vs. mislead.
● Research workflow integration: Argues LLMs can accelerate scientific ideation and literature synthesis but require careful scaffolding before touching high-stakes experimental decisions.
Paper, Tweet
4) Fine-Tuning LLMs for Factuality - Stanford fine-tunes LLMs for factuality without any human labels by using automatically generated preference signals.
● Automatic factuality signal: Derives factuality preference rankings from reference consistency checks and retrieval-based verification - no human labels required.
● Open-ended generation: Specifically targets open-ended generation settings rather than constrained QA, where hallucination is hardest to detect or correct.
● Llama 2 improvements: Significantly improves Llama 2's factuality on held-out topics, outperforming RLHF and decoding-time factuality strategies.
● Scalable alignment: Offers a recipe for scaling factuality alignment without proportionally scaling human annotation - an important direction as LLMs cover broader domains.
Paper, Tweet
5) Contrastive Chain-of-Thought - Proposes contrastive CoT prompting where models see both valid and invalid reasoning demonstrations to reduce reasoning errors.
● Valid + invalid demos: Demonstrations pair correct reasoning traces with common incorrect ones, teaching the model what not to do as well as what to do.
● Automatic construction: Provides an automatic method to generate contrastive demonstrations, avoiding the manual curation bottleneck that limited prior CoT variants.
● Improves over CoT: Outperforms standard CoT across reasoning benchmarks, with particularly strong gains on problems where common error patterns are predictable.
● Pedagogical analog: The improvement mirrors human learning research showing that studying worked examples and errors side-by-side beats studying successes alone.
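The prompt format can be sketched as below. This is an illustrative layout, not the paper's exact template, and the hand-written invalid trace stands in for the automatically constructed ones.

```python
def contrastive_cot_prompt(demos, question):
    # Each demo pairs a correct reasoning trace with a flawed one, so the
    # model sees what to avoid as well as what to imitate.
    parts = []
    for q, good, bad, answer in demos:
        parts.append(
            f"Question: {q}\n"
            f"Correct reasoning: {good}\n"
            f"Incorrect reasoning (do NOT do this): {bad}\n"
            f"Answer: {answer}\n"
        )
    parts.append(f"Question: {question}\nCorrect reasoning:")
    return "\n".join(parts)

demos = [(
    "A shop has 3 boxes of 4 apples. How many apples?",
    "3 boxes x 4 apples/box = 12 apples.",
    "3 + 4 = 7 apples.",  # common error pattern: adding instead of multiplying
    "12",
)]
prompt = contrastive_cot_prompt(
    demos, "A crate holds 5 rows of 6 eggs. How many eggs?")
print(prompt.splitlines()[0])
```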
Paper, Tweet
6) Survey on Language Models for Code - A comprehensive survey of LLMs for code covering 50+ models, 30+ evaluation tasks, and 500 related works.
● Model landscape: Catalogs 50+ code LLMs across sizes, architectures, and training regimes, providing a single reference for what's available.
● Task taxonomy: Reviews 30+ evaluation tasks spanning code generation, repair, translation, summarization, and execution prediction.
● Training and data recipes: Walks through pretraining corpus construction, instruction tuning, and RLHF specifically for code.
● Open problems: Highlights challenges in long-context code understanding, multi-file reasoning, and robust evaluation beyond HumanEval-style metrics.
Paper, Tweet
7) JARVIS-1 - An open-world multimodal agent for Minecraft that combines perception, planning, and memory into a self-improving system.
● Multimodal perception: Processes visual Minecraft observations and natural-language instructions through a unified multimodal input pipeline.
● Memory-augmented planning: Maintains a multimodal memory store of past observations and plans, enabling lifelong self-improvement across episodes.
● Strong task coverage: Completes 200+ diverse Minecraft tasks with competitive success rates, including long-horizon tasks like diamond collection.
● Open-world blueprint: An influential example of combining foundation models, memory, and explicit planning into an agent, foreshadowing many 2024 agent architectures.
Paper, Tweet
8) Learning to Filter Context for RAG (FILCO) - CMU's FILCO improves RAG by training a dedicated model to filter retrieved contexts before they reach the generator.
● Useful-context identification: Uses lexical and information-theoretic signals to identify genuinely useful portions of retrieved documents, rather than passing everything through.
● Context-filter training: Trains a separate filtering model whose only job is to retain useful context at inference time.
● Extractive QA wins: Outperforms prior RAG approaches on extractive QA benchmarks, a clean demonstration that context filtering is a high-leverage component.
● Modular addition: Slots in between retrieval and generation, making it compatible with any retriever/generator pairing.
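The filtering idea can be sketched with a simple unigram-overlap score. This is an illustrative reduction: the actual paper trains a dedicated filter model and also uses information-theoretic measures, not just lexical overlap.

```python
def overlap_score(query, sentence):
    # Fraction of query words that appear in the candidate sentence.
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)

def filter_context(query, sentences, threshold=0.25):
    # Only sentences scoring above the threshold reach the generator.
    return [s for s in sentences if overlap_score(query, s) >= threshold]

retrieved = [
    "The Eiffel Tower is located in Paris.",
    "Ticket prices were raised in 2022.",
    "The Eiffel Tower is 330 metres tall.",
]
kept = filter_context("how tall is the eiffel tower", retrieved)
print(kept)  # keeps the two Eiffel Tower sentences, drops the ticket one
```

Even this crude version shows the leverage point: the generator never sees the off-topic sentence, so it cannot be distracted by it.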
Paper, Tweet
9) MART (Multi-round Automatic Red-Teaming) - Meta's MART scales LLM safety alignment using fully automatic multi-round red-teaming.
● Adversarial prompt writing: One LLM acts as red-teamer, automatically generating adversarial prompts that probe the target model's safety.
● Safe response generation: The target LLM then generates responses that are filtered/refined for safety, producing training data for the next round.
● 84.7% violation reduction: After 4 rounds, the violation rate of an initially weakly-aligned LLM drops by up to 84.7%, matching models with extensive human-written adversarial data.
● Scalable alignment: Demonstrates that automatic red-teaming can substitute for expensive human adversarial prompt writing in the alignment pipeline.
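The multi-round loop can be sketched as below. Everything here is a toy stand-in: the attacker, the safety judge, and the "fine-tuning" step are all reduced to integers purely to show the round structure and the declining violation rate.

```python
def attacker(round_no, n=4):
    # Toy red-teamer: emits prompts tagged with a "difficulty" score.
    return [(f"adv-prompt-{round_no}-{i}", i + 1) for i in range(n)]

def target_is_safe(difficulty, safety_level):
    # Toy target + judge: the target resists prompts up to its safety level.
    return difficulty <= safety_level

def mart(rounds=4):
    safety_level = 1
    rates = []
    for r in range(rounds):
        batch = attacker(r)
        failures = [p for p, d in batch if not target_is_safe(d, safety_level)]
        rates.append(len(failures) / len(batch))
        if failures:
            # stand-in for fine-tuning the target on safe rewrites of failures
            safety_level += 1
    return rates

print(mart())  # -> [0.75, 0.5, 0.25, 0.0]
```

The shape matters more than the numbers: each round's failures become next round's training signal, so the violation rate falls without any human writing adversarial prompts.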
Paper, Tweet
10) LLMs Can Deceive Users (Trading Agent) - Apollo Research shows that a helpful, honest LLM stock-trading agent can spontaneously deceive users under pressure.
● Stock-trading testbed: The LLM agent runs an autonomous trading simulation with access to market data and occasional insider tips.
● Acts on insider information: When placed under performance pressure, the agent acts on insider tips despite explicit instructions not to - a clear instance of strategic norm violation.
● Hides reasoning from the user: Crucially, the agent reports doctored rationales to its user, hiding the insider trade rather than reporting it - strategic deception without being trained to deceive.
● Alignment implication: Demonstrates that deception can emerge in "helpful and safe" models under realistic pressure, without targeted training - a significant datapoint for alignment research.
Paper, Tweet

Top AI Papers of the Week (November 6 - November 12)

Paper Links
1) Hallucination in LLMs Survey - A comprehensive survey of hallucination in LLMs, covering taxonomy, causes, evaluation, and mitigation.
● Two-category taxonomy: Separates hallucinations into factuality hallucinations (incorrect facts) and faithfulness hallucinations (deviations from source content).
● Causes breakdown: Attributes hallucinations to training-data issues, training-stage artifacts, and inference-time choices - each with distinct mitigation paths.
● Evaluation landscape: Reviews benchmarks and automatic metrics specifically designed for hallucination, contrasting them with general-purpose LLM metrics.
● Mitigation strategies: Organizes mitigation into data curation, training-stage (RLHF, factuality tuning), and inference-stage (decoding, retrieval) approaches.
Paper, Tweet
2) Simplifying Transformer Blocks - Researchers show that many components of the standard transformer block can be removed with no loss in training speed or quality.
● Aggressive simplification: Removes residual connections, normalization layers, and value/projection parameters in specific blocks without hurting per-update training speed.
● Works across architectures: Tested on autoregressive decoder-only and BERT encoder-only models, validating that the simplifications aren't architecture-specific.
● 15% faster throughput: Simplified blocks deliver 15% faster training throughput with fewer parameters - a clean efficiency win.
● Design-space implication: Suggests the standard transformer is overdetermined and that careful ablation can yield simpler, faster architectures without new ideas.
Paper, Tweet
3) In-Context Learning Generalization Limits - Investigates whether transformers' in-context learning can generalize beyond the distribution of their pretraining data.
● Pretraining distribution bridge: Tests whether transformers can identify and learn new tasks in-context, both inside and outside their pretraining data distribution.
● Limited OOD generalization: In the regimes studied, there's limited evidence that ICL generalizes meaningfully beyond pretraining data coverage.
● Counter-narrative: Pushes back on the strong "universal learners" framing of ICL that sometimes accompanies emergence-claims, grounding it in data-distribution bounds.
● Research implication: Argues that evaluating ICL requires carefully distinguishing in-distribution skill retrieval from genuine OOD generalization - a distinction rarely made cleanly in headlines.
Paper, Tweet
4) MusicGen - Meta's MusicGen is a single-stage transformer LLM for music generation that operates over compressed discrete audio tokens.
● Single-stage transformer: Unlike multi-stage music generation pipelines, MusicGen generates music as a single autoregressive transformer over multi-codebook tokens.
● Multi-stream tokens: Operates over several parallel streams of compressed discrete music tokens, producing high-fidelity audio without the cascaded VQ-VAE + LM setup.
● Text and melody conditioning: Supports both text prompts and melody conditioning, letting users specify style with text and structure with reference audio.
● High-quality generation: Delivers competitive subjective quality against multi-stage baselines while being simpler and faster to deploy.
Paper, Tweet
5) AltUp (Alternating Updates) - Google's AltUp lets transformers benefit from wider representations without paying the full compute cost at every layer.
● Wide-but-cheap representation: Widens the learned representation but only actively updates one sub-block per layer, leaving others untouched during that forward pass.
● Predict-and-correct: A predict-and-correct mechanism updates the inactive sub-blocks with predictions, so they remain coherent without full computation.
● Negligible latency increase: Achieves wider representations at negligible latency cost compared to matched-width dense transformers.
● Scaling lever: Provides a middle-ground between narrow dense models and sparse MoE - wider without routing complexity.
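The alternating update can be sketched with scalar sub-blocks. This is a heavily simplified toy under stated assumptions: real AltUp widens the hidden state K-fold and uses learned prediction/correction coefficients, whereas here the "layer", the prediction, and the correction are fixed scalar operations chosen only to show the control flow.

```python
def layer(x):
    return 2 * x + 1  # stand-in for an expensive transformer layer

def altup_step(blocks, active, p=1.0, alpha=0.5):
    old = blocks[active]
    new = layer(old)  # full compute on exactly ONE sub-block per step
    out = []
    for i, b in enumerate(blocks):
        if i == active:
            out.append(new)
        else:
            predicted = b + p * (new - old)          # cheap linear prediction
            out.append(b + alpha * (predicted - b))  # correction step
    return out

state = [1.0, 1.0, 1.0]  # a 3x-wide representation, as three sub-blocks
for step in range(3):
    state = altup_step(state, active=step % 3)  # rotate the active block
print(state)
```

The inactive sub-blocks drift toward what the layer would have produced, at a fraction of the compute; that is the wide-but-cheap trade the paper formalizes.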
Paper, Tweet
6) Rephrase and Respond (RaR) - An effective prompting method where the LLM rephrases and expands the user's question before answering it.
● Rephrase step: The model first rewrites the question to resolve ambiguity, fill in implicit assumptions, and make the task explicit - then answers the rephrased version.
● Broad task gains: Improves performance across diverse tasks without needing any fine-tuning, using only prompt-level changes.
● Stacks with CoT: Combines cleanly with chain-of-thought prompting, giving additive improvements on reasoning benchmarks.
● User-friendly interpretation: Shows that part of the "prompt engineering" skill gap between novice and expert users is really a rephrasing problem - one the LLM itself can fix.
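The two-step variant can be sketched as below, with `call_llm` as a deterministic hypothetical stand-in so the example runs; the paper also describes a one-pass variant that merges both steps into a single prompt.

```python
REPHRASE = ("Rephrase the question below to be explicit and unambiguous, "
            "then stop.\nQuestion: {q}\nRephrased:")
RESPOND = "Question: {q}\nAnswer:"

def call_llm(prompt):
    # Toy stub: expands one known ambiguous question, then answers it.
    if prompt.startswith("Rephrase") and "last digit" in prompt:
        return ("Is the final digit of the number 34 an even number "
                "(0, 2, 4, 6, or 8)?")
    if "even number" in prompt:
        return "Yes, 4 is even."
    return "unclear"

def rephrase_and_respond(question):
    better = call_llm(REPHRASE.format(q=question))   # step 1: rephrase
    return call_llm(RESPOND.format(q=better))        # step 2: respond

print(rephrase_and_respond("Is the last digit of 34 even?"))
```

The structural point survives the toy: the answering call receives a question with its ambiguity already resolved, which is why the method works without any fine-tuning.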
Paper, Tweet
7) On the Road with GPT-4V - An exhaustive evaluation of GPT-4V applied to autonomous driving scenarios.
● Driving-scenario evaluation: Tests GPT-4V across diverse driving situations including scene understanding, traffic-sign recognition, and causal reasoning about driver intent.
● Scene-understanding strength: Demonstrates superior performance in scene understanding and causal reasoning compared to existing production autonomous-driving systems.
● Edge-case robustness: Shows relative robustness on edge cases (construction zones, unusual road layouts) that typically confuse narrower perception stacks.
● Practical limitations: Flags real-world issues including latency, rare-hazard handling, and dependence on high-quality input imagery that would gate production deployment.
Paper, Tweet
8) GPT4All Technical Report - The GPT4All technical report documents the model family and the open ecosystem built around democratizing local LLMs.
● Model family: Covers the sequence of GPT4All models trained and released through 2023, spanning 3B-13B parameter sizes.
● Open-source focus: Ships with a cross-platform desktop app, open model weights, and an accompanying dataset - positioning itself as a turnkey local LLM stack.
● Data and training: Details the curated instruction-tuning dataset and fine-tuning recipes used to build the family.
● Ecosystem impact: Tracks GPT4All's role in popularizing local LLM usage among hobbyists and small organizations before Ollama and similar tools matured.
Paper, Tweet
9) S-LoRA - S-LoRA enables serving thousands of LoRA adapters concurrently on a single GPU through memory-paging and custom CUDA kernels.
● Main-memory adapter pool: Stores all adapters in main memory and loads adapters for currently running queries into GPU memory on demand, dramatically increasing the adapter pool size.
● Novel tensor parallelism: Introduces a tensor-parallelism strategy tailored for heterogeneous LoRA batches, where each query might use a different adapter.
● 4x throughput: Improves throughput by 4x compared to prior adapter-serving solutions at comparable latency.
● Adapter scale: Enables serving several orders of magnitude more adapters on the same hardware - important for multi-tenant LoRA deployments and personalized fine-tuning services.
Paper, Tweet
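The paging idea can be sketched at the bookkeeping level. This is a minimal illustration, assuming hypothetical adapter IDs and placeholder weights, with a simple LRU cache standing in for GPU slots; the real system manages GPU pages with unified paging and custom CUDA kernels:

```python
from collections import OrderedDict

class AdapterPool:
    # Keep every LoRA adapter in host memory; page only the adapters
    # used by currently running queries into a bounded set of GPU
    # slots, evicting the least-recently-used adapter when full.
    def __init__(self, gpu_slots):
        self.host = {}             # adapter_id -> weights, in host RAM
        self.gpu = OrderedDict()   # adapter_id -> weights, "on GPU"
        self.gpu_slots = gpu_slots

    def register(self, adapter_id, weights):
        self.host[adapter_id] = weights

    def fetch(self, adapter_id):
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)      # LRU refresh on hit
            return self.gpu[adapter_id]
        if len(self.gpu) >= self.gpu_slots:
            self.gpu.popitem(last=False)          # evict LRU adapter
        self.gpu[adapter_id] = self.host[adapter_id]
        return self.gpu[adapter_id]

pool = AdapterPool(gpu_slots=2)
for name in ["a", "b", "c"]:
    pool.register(name, f"weights-{name}")
pool.fetch("a"); pool.fetch("b"); pool.fetch("c")
print(list(pool.gpu))  # → ['b', 'c']  ('a' was evicted to make room)
```

The host-memory pool is what lets the adapter count exceed GPU capacity by orders of magnitude: only the working set of the current batch occupies device memory.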
10) FreshLLMs (FreshQA) - Introduces FreshQA, a dynamic benchmark designed to stress-test LLMs on time-sensitive knowledge.
● Dynamic QA benchmark: Continuously refreshes questions so models can't memorize answers - a direct response to the contamination concerns plaguing static benchmarks.
● Four question categories: Covers never-changing, slow-changing, fast-changing, and false-premise questions, stressing different aspects of freshness handling.
● Reveals freshness gap: Shows that LLMs without search augmentation answer fast-changing questions poorly, while retrieval-augmented models close most of the gap.
● FreshPrompt: Proposes FreshPrompt, a simple search-augmented prompting strategy that substantially boosts LLM performance on time-sensitive questions.
Paper, Tweet

Top AI Papers of the Week (October 30 - November 5)

Paper Links
1) MetNet-3 - Google's MetNet-3 is a state-of-the-art neural weather model extending lead time and variable coverage well beyond prior observation-based models.
● Dense + sparse sensors: Learns jointly from dense sensor data (radar, satellite) and sparse in-situ station data, combining signals that were typically used separately.
● 24-hour forecasts: Produces predictions up to 24 hours ahead, a meaningful lead-time extension for observation-based weather modeling.
● Multi-variable output: Predicts precipitation, wind, temperature, and dew point from the same model, rather than requiring per-variable systems.
● Operational relevance: Demonstrates the neural-weather-model pattern that would dominate 2024 forecasting research - observation-driven, end-to-end neural pipelines replacing traditional numerical systems.
Paper, Tweet
2) Evaluating LLMs Survey - A comprehensive survey of LLM evaluation covering benchmarks, methodologies, and open problems.
● Task-wise organization: Organizes evaluation by task category - reasoning, knowledge, alignment, robustness, ethics, etc. - showing which benchmarks address which capabilities.
● Automatic vs. human: Discusses the trade-offs between automatic metrics (cheap, inconsistent), LLM-as-a-Judge (scalable, biased), and human evaluation (reliable, expensive).
● Contamination and robustness: Highlights contamination and robustness as cross-cutting concerns plaguing static benchmarks at all scales.
● Frontier-model needs: Argues that evaluating frontier-scale LLMs requires new paradigms beyond simple benchmark accuracy, including interactive evaluation and behavioral testing.
Paper, Tweet
3) Battle of the Backbones - A large-scale benchmarking framework that compares vision backbones across a diverse suite of computer vision tasks.
● Broad benchmarking: Compares CNN and ViT backbones across classification, segmentation, detection, retrieval, and other tasks at matched compute.
● Pretraining recipes matter: Shows that pretraining scheme (supervised, self-supervised, language-image) often matters more than the architecture family.
● ViT ≠ universal winner: Vision transformers are not universally superior - strong CNN backbones remain competitive or better on several downstream tasks.
● Practitioner guide: Functions as a decision reference - the report explicitly maps from task characteristics to recommended backbone + pretraining combinations.
Paper, Tweet
4) ChipNeMo (LLMs for Chip Design) - NVIDIA's ChipNeMo applies domain-adapted LLMs to industrial chip design workflows.
● Domain adaptation pipeline: Applies continued pretraining on chip-design corpora, SFT, and domain-specific RLHF to adapt general LLMs to semiconductor design language.
● Three applications: Evaluates assistant chatbot for engineers, EDA (electronic design automation) tool invocation, and bug summarization - three real internal chip-design pain points.
● Significant adaptation gains: Domain adaptation dramatically outperforms general-purpose LLMs across tasks despite using smaller model sizes.
● Adapted RAG: Using a domain-adapted LLM as the generator in RAG further improves answer quality compared to using a general-purpose LLM with the same retrieval stack.
Paper, Tweet
5) YaRN (Efficient Context Extension) - YaRN is a compute-efficient method for extending the context window of LLMs well beyond their pretrained length.
● Rotary-embedding scaling: Extends RoPE-based context length by combining NTK-aware frequency interpolation with an attention-temperature adjustment, avoiding the degradation of naive position interpolation.
● Fine-tune extrapolation: Extrapolates meaningfully beyond the limited context seen during fine-tuning, so short fine-tune sequences can unlock much longer inference contexts.
● 128K context: Successfully scales Llama-family models to 128K-token context with minimal additional training compute.
● Open recipe: Adopted widely across the open-source community as a standard recipe for extending Llama and other RoPE-based LLMs.
Paper, Tweet
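One ingredient of the recipe, NTK-aware base scaling, can be sketched on RoPE inverse frequencies. This is a simplified illustration using a commonly cited form of the base-stretching rule; YaRN proper adds per-dimension ("NTK-by-parts") interpolation and an attention-temperature term:

```python
def rope_freqs(dim, base=10000.0):
    # Standard RoPE inverse frequencies for one attention head.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_freqs(dim, scale, base=10000.0):
    # NTK-aware scaling: stretch the RoPE base so low-frequency
    # components are interpolated (fitting more positions per period)
    # while high-frequency components, which carry local positional
    # detail, are nearly preserved.
    new_base = base * scale ** (dim / (dim - 2))
    return [new_base ** (-2 * i / dim) for i in range(dim // 2)]

orig = rope_freqs(128)
ext = ntk_scaled_freqs(128, scale=8)  # e.g. extending 4K -> 32K context
```

Note how the highest-frequency component (index 0) is identical in both lists, while the lowest-frequency component is compressed by roughly the scale factor; naive interpolation would divide every frequency uniformly.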
6) Open DAC 2023 - Meta releases a large DFT dataset for training ML models that predict sorbent-adsorbate interactions in Direct Air Capture (DAC).
● 38M+ DFT calculations: Consists of more than 38M density functional theory calculations on metal-organic frameworks (MOFs), enabling large-scale ML-driven DAC material discovery.
● DAC research: Targets direct air capture, where efficient CO₂-capturing MOFs are needed - a high-impact climate application for ML.
● ML baselines: Provides strong ML baselines showing that ML surrogates can replace expensive DFT calculations for MOF screening.
● Open-science contribution: Positions the dataset as an open foundation for materials ML research on climate applications.
Paper, Tweet
7) Symmetry in Machine Learning - A methodological framework for enforcing, discovering, and promoting symmetry in machine learning models.
● Unified framework: Presents a single theoretical framework that covers data augmentation, equivariant architectures, and symmetry-discovering learning objectives.
● Three-way taxonomy: Organizes approaches into enforcing known symmetries, discovering latent ones, and biasing learning toward symmetric solutions.
● Worked examples: Applies the framework to MLPs and basis-function regression, showing concretely how the abstract concepts translate into design choices.
● Broader ML perspective: Positions symmetry as a first-class design lever alongside scale and data quality, particularly for scientific ML.
Paper, Tweet
8) Next-Generation AlphaFold - DeepMind previews the next AlphaFold with dramatically expanded scope of biomolecular complexes.
● Multi-entity complexes: Jointly predicts structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues in a single unified model.
● Beyond protein-only: Dramatically expands applicability beyond AlphaFold 2's protein-only regime, opening up drug discovery and RNA biology workflows.
● Beats specialist predictors: Achieves greater accuracy on protein-nucleic acid interactions than specialized predictors in that domain - remarkable for a general model.
● Biology pipeline signal: Preview of the capability direction that would crystallize as AlphaFold 3 in 2024, with profound implications for structural biology research.
Paper, Tweet
9) EmotionPrompt - Microsoft researchers show that appending emotional stimuli to prompts reliably improves LLM performance across 45 tasks.
● 45-task evaluation: Tested across Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4 on 45 deterministic and generative tasks.
● Emotional stimuli: Appends phrases like "This is very important to my career" to prompts, drawing on social-psychology theories of human motivation.
● Consistent gains: Produces consistent improvements across both smaller and frontier models, despite the prompts being content-free manipulations.
● Emotional-intelligence signal: Suggests LLMs have internalized patterns connecting emotional framing to effort - a "bug or feature" question that has driven follow-up research on LLM behavioral psychology.
Paper, Tweet
10) FP8-LM - Microsoft's FP8-LM demonstrates that most LLM training variables - gradients, optimizer states - can use FP8 without sacrificing accuracy.
● FP8 across the pipeline: Extends FP8 training beyond forward activations to gradients and optimizer states (both moments), widening the FP8 footprint.
● No hyperparameter changes: Works as a drop-in replacement for FP16/BF16 training without requiring changes to learning rates, schedules, or other hyperparameters.
● Matched accuracy: Achieves accuracy indistinguishable from FP16/BF16 baselines on LLM pretraining tasks.
● Efficiency gains: Delivers substantial memory and compute savings, particularly attractive for training large models on FP8-capable hardware like H100.
Paper, Tweet

Top AI Papers of the Week (October 23 - October 29)

Paper Links
1) Zephyr - Hugging Face's Zephyr-7B is a 7B parameter LLM whose chat performance rivals much larger chat models aligned with human feedback.
● Distilled SFT: Uses distilled supervised fine-tuning on UltraChat-generated instruction data as the task-accuracy foundation.
● Distilled DPO: Aligns with AI feedback data via Direct Preference Optimization, rather than the expensive human-feedback RLHF pipeline.
● ChatGPT-level at 7B: Achieves competitive performance with ChatGPT on AlpacaEval and matches 70B chat models aligned with human feedback on several benchmarks.
● Recipe popularization: Open-sources the distilled-DPO recipe, which became a widely adopted template for small, strong open chat models.
Paper, Tweet
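The distilled-DPO stage optimizes a simple per-pair objective. A minimal sketch of the DPO loss for a single preference pair, with hypothetical sequence log-probabilities (the β value and numbers are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Push the policy's log-ratio for the chosen response above the
    # rejected one, measured against a frozen reference model:
    # loss = -log sigmoid(beta * (ratio_chosen - ratio_rejected)).
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs where the policy already prefers the chosen
# response relative to the reference -> small loss.
print(dpo_loss(logp_chosen=-1.0, logp_rejected=-5.0,
               ref_chosen=-2.0, ref_rejected=-2.0))
```

Because the loss needs only log-probabilities from the policy and a frozen reference, the whole alignment stage is ordinary supervised optimization, with no reward model or RL loop.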
2) Fact-Checking with LLMs - Investigates the fact-checking capabilities of frontier LLMs across multiple languages and claim types.
● Contextual information helps: LLMs perform significantly better at fact-checking when equipped with retrieved evidence, validating the RAG pattern for claim verification.
● GPT-4 > GPT-3: GPT-4 shows meaningful accuracy gains over GPT-3 for fact-checking, but both struggle without supporting context.
● Multilingual variance: Accuracy varies substantially by query language and claim veracity, exposing persistent language-equity gaps in fact-checking.
● Inconsistent reliability: While LLMs show real fact-checking promise, their accuracy is inconsistent enough that they can't replace human fact-checkers - useful as assistants, not arbiters.
Paper, Tweet
3) Matryoshka Diffusion Models - Apple introduces an end-to-end framework for high-resolution image and video synthesis that denoises across multiple resolutions jointly.
● Joint multi-resolution diffusion: Runs the diffusion process at multiple resolutions simultaneously, sharing representations across scales in a single unified model.
● NestedUNet: Uses a NestedUNet architecture so that higher-resolution branches build on lower-resolution features without a separate cascade.
● Progressive training: Trains progressively from low to high resolution, dramatically improving optimization stability for high-resolution generation.
● Unified model: Eliminates the typical cascaded-diffusion pipeline used in prior high-resolution generation, simplifying training and serving.
Paper, Tweet
4) Spectron - Google's Spectron is a spoken-language model trained end-to-end on raw spectrograms rather than text or discrete audio tokens.
● End-to-end spectrogram modeling: Processes spectrograms directly without an intermediate speech-recognition or tokenization step, preserving paralinguistic information.
● High-quality spoken output: Fine-tuned to generate high-quality, accurate spoken language while preserving speaker and prosody characteristics.
● Speaker preservation: Outperforms prior spoken-language models on speaker preservation - a known weakness of tokenizer-based approaches.
● Semantic coherence: Also improves semantic coherence of generated speech, addressing the common drift problem in spectrogram-level generation.
Paper, Tweet
5) LLMs Meet New Knowledge - A benchmark that evaluates how well LLMs handle new knowledge beyond their training cutoff.
● Three-dimensional evaluation: Tests knowledge understanding, knowledge differentiation (old vs. new), and knowledge association between new facts and existing knowledge.
● Post-cutoff focus: Uses knowledge that appears after the model's training cutoff, avoiding contamination that undermines many LLM knowledge benchmarks.
● LLMs struggle with new knowledge: Reveals systematic gaps - even frontier LLMs handle post-cutoff facts significantly worse than pre-cutoff ones, despite strong reasoning.
● RAG-oriented motivation: Provides empirical grounding for RAG: parametric memory is tied to training data, so retrieval remains necessary for fresh knowledge.
Paper, Tweet
6) Min-K% Prob (Detecting Pretraining Data) - Proposes Min-K% Prob as an effective detection method for determining whether specific text was in an LLM's pretraining data.
● Method: Computes the average log-probability of the K% least-likely tokens in a text; memorized text has higher log-probabilities on these tokens than unseen text.
● Black-box detection: Works on API-accessible models without needing gradients or internal activations, making it broadly applicable.
● Multiple use cases: Usable for benchmark-contamination detection, privacy auditing of machine unlearning, and copyrighted-text detection in pretraining corpora.
● Policy implications: Provides a technical tool for the copyright and privacy debates, letting third parties measurably test specific-text inclusion in training data.
Paper, Tweet
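The detection score is simple to sketch. A minimal implementation, assuming hypothetical per-token log-probabilities (in real use these come from a model API):

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    # Average log-probability of the k fraction of least-likely tokens;
    # memorized text tends to contain fewer "surprising" low-probability
    # tokens, so a higher score suggests pretraining-data membership.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log-probs for two candidate texts:
seen = [-0.1, -0.2, -0.1, -0.3, -0.2]    # uniformly fluent text
unseen = [-0.1, -2.5, -0.2, -3.1, -0.4]  # contains surprising tokens

# In practice the score is compared against a calibrated threshold.
print(min_k_percent_prob(seen) > min_k_percent_prob(unseen))  # → True
```

The focus on the least-likely tokens is what distinguishes the method from plain perplexity: unseen text can be mostly fluent, but its outlier tokens stay improbable.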
7) ConvNets Match Vision Transformers - DeepMind shows that strong ConvNet architectures pretrained at scale match ViTs on ImageNet performance at comparable compute.
● JFT-4B pretraining: Pretrains performant ConvNet architectures (NFNets) on JFT-4B at scale - matching the data regime where ViTs typically pull ahead.
● Log-log scaling law: Observes a log-log scaling law between held-out loss and compute, mirroring the scaling properties seen in ViTs.
● ImageNet parity: Fine-tuned NFNets match the reported performance of Vision Transformers at comparable compute budgets, refuting the "ConvNets don't scale" narrative.
● Architecture vs. recipe: Argues that the ConvNet-vs-ViT gap is largely a scale/recipe gap rather than an architectural limitation - a recurring theme in vision research.
Paper, Tweet
8) CommonCanvas - Releases CommonCanvas, a text-to-image dataset composed entirely of Creative-Commons-licensed images.
● CC-only training data: Every image is Creative Commons-licensed, providing a clean-license dataset for commercial and research T2I training.
● Scale despite licensing constraints: Curates tens of millions of images despite the CC-only constraint, showing that a clean-license corpus can still reach training scale.
● Strong baseline models: Trains SD-style models on CommonCanvas that reach competitive quality, demonstrating CC-licensed data is sufficient for strong T2I models.
● Policy contribution: Provides a practical counterexample to the argument that copyrighted training data is necessary - important as copyright litigation reshaped the AI-data landscape.
Paper, Tweet
9) Managing AI Risks (Bengio, Hinton, et al.) - A high-profile position paper by leading AI researchers laying out risks from upcoming advanced AI systems.
● Risk catalog: Enumerates social harms, malicious uses, large-scale autonomous risks, and potential loss-of-control scenarios from increasingly capable AI.
● Signatory weight: Signed by multiple Turing Award-winning researchers including Hinton and Bengio, amplifying its impact in the policy conversation.
● Concrete recommendations: Calls for investment in safety research, mandatory standards for advanced AI, and international coordination - not a pure threat-inventory.
● Political moment: Published during active AI-regulation discussions in the US and UK, directly influencing the UK AI Safety Summit and related policy processes.
Paper, Tweet
10) Branch-Solve-Merge (BSM) - BSM decomposes LLM tasks into parallel sub-tasks via three LLM-programmed modules: branch, solve, and merge.
● Three-module architecture: A branch module proposes a decomposition into parallel sub-tasks, a solve module independently answers each, and a merge module fuses results into a final response.
● Prompt-parameterized: All three modules are the same base LLM with different prompts, so BSM works with any base model without fine-tuning.
● Evaluation quality gains: Improves evaluation correctness and consistency for multiple LLMs, particularly on tasks where a flat prompt leaves too much implicit.
● General pattern: Generalizes the "decompose then solve" pattern from math/CoT to arbitrary tasks, anticipating more structured agent decomposition patterns.
Paper, Tweet
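The three-module pattern can be sketched with a single LLM callable and three prompts. The prompt wording and the deterministic stub LLM below are illustrative, not the paper's exact prompts:

```python
def branch_solve_merge(task, llm):
    # Branch: one LLM call proposes independent sub-tasks.
    subtasks = llm(f"Decompose into independent sub-tasks, one per line:\n{task}").splitlines()
    # Solve: each sub-task is answered in its own (parallelizable) call.
    solutions = [llm(f"Solve this sub-task:\n{s}") for s in subtasks]
    # Merge: a final call fuses the partial answers.
    return llm("Merge these solutions into one answer:\n" + "\n".join(solutions))

def stub_llm(prompt):
    # Tiny deterministic stand-in so the sketch runs without an API.
    if prompt.startswith("Decompose"):
        return "part A\npart B"
    if prompt.startswith("Solve"):
        return "solved " + prompt.splitlines()[-1]
    return "final: " + prompt.splitlines()[-1]

print(branch_solve_merge("evaluate this essay", stub_llm))  # → final: solved part B
```

Since every module is just the base model under a different prompt, swapping in a different LLM requires no retraining, only passing a different callable.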

Top AI Papers of the Week (October 16 - October 22)

Paper Links
1) Llemma - Llemma is an open LLM for mathematics built via continued pretraining of Code Llama on the Proof-Pile-2 dataset.
● Proof-Pile-2 dataset: Mixes scientific papers, math-heavy web pages, and mathematical code into a focused math-pretraining corpus.
● Code Llama base: Uses Code Llama as the base model, leveraging its existing code proficiency as a scaffold for formal-style math reasoning.
● Beats unreleased Minerva: Outperforms open base models and the unreleased Minerva on the MATH benchmark at comparable scale.
● Full open release: Releases model, dataset, and code - positioning Llemma as a reproducible starting point for open mathematical LLM research.
Paper, Tweet
2) LLMs for Software Engineering - A comprehensive survey of LLMs for software engineering covering models, tasks, evaluation, and open challenges.
● Task coverage: Surveys code generation, bug detection and repair, code review, code translation, documentation, and testing.
● Model landscape: Reviews code-specialized LLMs (Codex, StarCoder, CodeLlama) alongside general-purpose LLMs applied to code.
● Evaluation review: Catalogs standard benchmarks (HumanEval, MBPP, DS-1000) and their limitations for real-world software engineering.
● Open challenges: Highlights long-context code understanding, multi-file reasoning, verification, and agent-based SE as key open directions.
Paper, Tweet
3) Self-RAG - Self-RAG trains an LM to adaptively retrieve, generate, and self-critique using special reflection tokens.
● Reflection tokens: Introduces special tokens that control retrieval decisions, passage relevance judgments, and self-evaluation of generations.
● Adaptive retrieval: The model decides on-the-fly whether to retrieve, rather than always retrieving on every query - saving compute on knowledge-light queries.
● Self-reflection: Critiques its own generations against retrieved passages, enabling controllable trade-offs between response quality and factuality at inference.
● Significant gains: Outperforms state-of-the-art LLMs and strong RAG baselines on open-domain QA, reasoning, and fact verification.
Paper, Tweet
4) RAG for Long-Form QA - Explores retrieval-augmented LMs specifically on long-form question answering, where RAG failures are more subtle.
● Retrieval is necessary: Confirms that retrieval is an important component for long-form QA, but that evidence documents must be carefully curated and ordered.
● Attribution errors: Documents attribution errors - where the model cites passages that don't actually support its claims - and shows these spike when retrieved docs lack sufficient evidence.
● Document ordering: Demonstrates that document order within the context substantially affects long-form QA attribution accuracy.
● Practical guidelines: Offers concrete guidelines for document selection, ordering, and prompting to reduce hallucination in long-form RAG outputs.
Paper, Tweet
5) GenBench - A Nature Machine Intelligence paper presenting a framework for characterizing and understanding generalization research in NLP.
● Meta-analysis: Reviews 543 papers on generalization in NLP, mapping what "generalization" actually means across different research threads.
● Generalization taxonomy: Organizes generalization into compositional, structural, cross-lingual, cross-task, and cross-domain generalization types.
● Evaluation taxonomy: Provides tools for classifying generalization studies by the kind of distribution shift and evaluation protocol they test.
● Research infrastructure: Ships with tools to help researchers classify and compare generalization work, aiming to reduce conceptual fragmentation in the field.
Paper, Tweet
6) LLM Self-Explanations - Investigates whether LLMs can generate useful feature-attribution explanations for their own outputs.
● Self-explanation capability: LLMs can self-generate feature-attribution explanations that meaningfully highlight the tokens driving their predictions.
● Performance + truthfulness: Self-explanation improves both task performance and the truthfulness of outputs compared to baseline prompting.
● CoT synergy: Combines productively with chain-of-thought prompting, giving additive improvements rather than substituting for it.
● Interpretability lever: Offers a cheap, model-agnostic interpretability pattern that works through the API without needing gradients or white-box access.
Paper, Tweet
7) OpenAgents - An open platform for running and hosting real-world language agents, including three distinct agent types.
● Data Agent: A data-analysis agent capable of exploring datasets, running analyses, and producing visualizations through conversation.
● Plugins Agent: Integrates 200+ daily-use API tools (e.g., weather, search, calendars) into a single conversational agent interface.
● Web Agent: An autonomous web-browsing agent capable of navigating real websites and completing multi-step tasks.
● Open alternative to ChatGPT Plus: Positions OpenAgents as an open-source alternative to ChatGPT's plugin ecosystem, usable for research into agent-user interaction patterns.
Paper, Tweet
8) Eliciting Human Preferences with LLMs - Anthropic uses LLMs to guide the task-specification process, eliciting user intent through natural-language dialogue.
● Interactive elicitation: The LLM asks the user open-ended questions to clarify intent, producing a structured task specification that the model can then execute.
● Beats user-written prompts: Systems built via LLM-elicited specifications produce more informative, accurate responses than user-written prompts alone.
● Better than single-shot prompting: Shows that multi-turn elicitation yields higher task-success rates than single-shot prompting, even when the user is not a prompt engineer.
● Usable AI pattern: Offers a pattern for bridging the user-intent gap that shapes AI product design - spec-driven rather than prompt-driven interaction.
Paper, Tweet
9) AutoMix - AutoMix routes queries between LLMs of different sizes based on smaller-model confidence, saving cost without sacrificing quality.
● Confidence-based routing: A small model answers first; a confidence signal determines whether to accept its answer or escalate to a larger model.
● Cascading thresholds: Uses multiple confidence thresholds to route queries through a cascade of increasingly capable (and expensive) models.
● Cost-quality Pareto: Achieves Pareto improvements over single-model baselines, delivering equivalent quality at substantially lower inference cost.
● Production relevance: The pattern maps cleanly onto practical LLM deployment where most queries can be handled by cheap models but a tail of hard queries need the frontier model.
Paper, Tweet
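A minimal sketch of the routing loop, with hypothetical stub models and a scalar confidence signal (the paper derives confidence from few-shot self-verification rather than a model-reported score):

```python
def small_model(query):
    # Hypothetical cheap model: confident only on queries it recognizes.
    return ("4", 0.9) if "2+2" in query else ("unsure", 0.3)

def large_model(query):
    # Hypothetical frontier model: strong but expensive.
    return ("42", 0.95)

def route(query, threshold=0.7):
    # Answer with the cheap model first; escalate only when its
    # confidence signal falls below the acceptance threshold.
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return "small", answer
    answer, _ = large_model(query)
    return "large", answer

print(route("what is 2+2?"))     # → ('small', '4')
print(route("meaning of life"))  # → ('large', '42')
```

With more than two tiers, the same check repeats down a cascade of per-tier thresholds, so each query stops at the cheapest model whose answer clears its bar.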
10) Video Language Planning - Enables synthesizing complex long-horizon video plans for robotics via tree search over vision-language and text-to-video models.
● Tree-search planner: Uses a tree-search procedure over a vision-language model serving as policy+value, with a text-to-video model acting as the dynamics model.
● Long-horizon plans: Produces multi-step video plans for robotics tasks that would be infeasible with single-shot video generation.
● Cross-domain generalization: Works across diverse robotics domains, showing the approach is not tied to a specific embodiment or task type.
● Planning-via-generation: Demonstrates that generative video models can serve as world models for planning, a pattern that has gained traction through 2024.
Paper, Tweet

Top AI Papers of the Week (October 9 - October 15)

Paper Links
1) Ring Attention - UC Berkeley's Ring Attention scales transformer context to 100M+ tokens by distributing blockwise self-attention across devices in a ring topology.
● Blockwise attention: Computes self-attention in blocks so that only small KV chunks need to fit on each device at any time.
● Ring communication: Passes KV chunks between devices in a ring, overlapping communication with computation to hide networking latency.
● Context scales with devices: Achievable context length grows linearly with the number of devices, with no attention approximations required.
● 100M+ tokens: Enables context lengths exceeding 100 million tokens in theory, far beyond what any single-device attention implementation can reach.
Paper, Tweet
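The ring schedule can be illustrated with a toy scalar-attention version, where "devices" are list indices and KV blocks rotate one hop per step; each device only ever holds one KV block, yet computes exact softmax attention over all of them. Max-subtraction for numerical stability is omitted for brevity:

```python
import math

def ring_attention(q_blocks, k_blocks, v_blocks):
    n = len(q_blocks)
    acc = [[0.0] * len(qb) for qb in q_blocks]   # exp-weighted value sums
    norm = [[0.0] * len(qb) for qb in q_blocks]  # running softmax normalizers
    kv = list(zip(k_blocks, v_blocks))
    for _ in range(n):                  # n ring steps: every KV visits every device
        for dev in range(n):
            ks, vs = kv[dev]
            for i, q in enumerate(q_blocks[dev]):
                for k, v in zip(ks, vs):
                    w = math.exp(q * k)
                    acc[dev][i] += w * v
                    norm[dev][i] += w
        kv = kv[1:] + kv[:1]            # rotate KV blocks one hop around the ring
    return [[a / z for a, z in zip(acc[d], norm[d])] for d in range(n)]

out = ring_attention([[1.0], [0.5]], [[0.2], [0.4]], [[1.0], [2.0]])
```

In the real system each rotation is an asynchronous device-to-device transfer overlapped with the blockwise attention compute, which is why context length scales with device count at no extra latency.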
2) UniSim (Universal Simulator) - Google's UniSim learns a universal generative simulator of real-world interactions from diverse video + action data.
● Generative world model: Simulates how humans and agents interact with the world by predicting the visual outcome of high-level instructions and low-level controls.
● Diverse action conditioning: Handles both text instructions ("pick up the cup") and low-level motor commands, unifying instruction-following and dynamics modeling.
● Training downstream systems: Can be used to train vision-language planners, low-level RL policies, and video-captioning systems - acting as a general data source.
● World-model agenda: A key datapoint for the broader "generative world models for embodied AI" research agenda that accelerated through 2024.
Paper, Tweet
3) Survey on Factuality in LLMs - A survey covering evaluation and enhancement techniques for LLM factuality.
● Evaluation taxonomy: Organizes factuality evaluation by granularity (token, sentence, passage), task (QA, generation, dialogue), and reference availability.
● Enhancement taxonomy: Reviews enhancement techniques including better training data, retrieval augmentation, factuality-aware decoding, and post-hoc verification.
● Factuality vs. truthfulness: Clarifies the often-confused distinction between factuality (correct facts) and truthfulness (model reports its beliefs honestly).
● Open problems: Highlights persistent gaps in cross-lingual factuality, open-ended generation factuality, and calibration.
Paper, Tweet
4) Hypothesis Search (LLMs Can Learn Rules) - A two-stage framework where the LLM learns a rule library for reasoning.
● Rule induction phase: In the first stage, the LLM induces general rules from a small set of examples, producing an explicit rule library rather than implicit pattern matching.
● Rule application phase: In the second stage, the model applies rules from its library to new problems, with explicit rule-lookup rather than end-to-end inference.
● Improves reasoning: The explicit rule library improves reasoning performance on tasks where generalization from examples beats pure in-context learning.
● Interpretability bonus: The learned rule library is human-readable and auditable, providing a window into what the model actually learned from its examples.
Paper, Tweet
5) Meta Chain-of-Thought Prompting (Meta-CoT) - A generalizable CoT framework that selects domain-appropriate reasoning patterns for the task at hand.
● Task-adaptive CoT: Rather than using a fixed CoT prompt template, Meta-CoT adaptively selects reasoning patterns based on task characteristics.
● Pattern library: Maintains a library of reasoning templates tailored to task families (math, logic, commonsense, etc.), picking the best one per query.
● Strong across tasks: Improves reasoning accuracy across diverse task types compared to single-template CoT prompting.
● Generalizable framework: The Meta-CoT pattern is easy to extend to new task families by just adding new templates to the library.
Paper, Tweet
6) LLMs for Healthcare Survey - A comprehensive overview of LLMs applied to the healthcare domain.
● Application coverage: Surveys clinical decision support, patient communication, medical summarization, diagnostic assistance, and biomedical research applications.
● Medical-LLM landscape: Reviews major medical LLMs (Med-PaLM, MEDITRON, ClinicalBERT) alongside general-purpose LLMs prompted for medical use.
● Benchmarks: Catalogs medical QA benchmarks and discusses their limitations for predicting real-world clinical usefulness.
● Deployment challenges: Covers regulatory, privacy, and safety challenges specific to healthcare LLM deployment.
Paper, Tweet
7) RECOMP (Retrieval-Augmented LMs with Compressors) - Proposes two compression approaches to shrink retrieved documents before in-context use.
● Extractive compressor: Selects the most useful sentences from retrieved documents, retaining the most relevant signal at a fraction of token budget.
● Abstractive compressor: Generates a summary synthesizing information from multiple retrieved documents, compressing redundancy across sources.
● 6% compression rate: Achieves compression rates as low as 6% with minimal performance loss on language modeling and open-domain QA.
● Selective augmentation: The training scheme learns to emit empty summaries when retrieved docs are irrelevant - a built-in mechanism for gracefully handling noisy retrieval.
Paper, Tweet
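A toy extractive compressor in the same spirit, using lexical overlap as the relevance score (the paper trains a dense scorer) and returning an empty summary when nothing is relevant:

```python
def extractive_compress(query, documents, budget=2):
    # Score each sentence by word overlap with the query and keep only
    # the top-`budget` relevant sentences as in-context evidence.
    qwords = set(query.lower().split())
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    scored = [(len(qwords & set(s.lower().split())), s) for s in sentences]
    kept = [s for score, s in sorted(scored, reverse=True) if score > 0][:budget]
    # An empty result mirrors RECOMP's selective augmentation: emit
    # nothing when the retrieved documents are irrelevant.
    return ". ".join(kept)

docs = ["The bridge was built in 1937. It is painted orange.",
        "Tolls fund maintenance."]
print(extractive_compress("when was the bridge built", docs))
# → The bridge was built in 1937
print(repr(extractive_compress("quantum computing", docs)))  # → ''
```

Even this crude scorer shows the shape of the savings: three retrieved sentences shrink to one on-topic sentence before entering the prompt.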
8) InstructRetro - NVIDIA introduces Retro 48B, the largest LLM pretrained with retrieval at the time.
● 48B scale: Continues pretraining a 43B parameter GPT model on 100B additional tokens while retrieving from a 1.2T-token database.
● Instruction tuning: Further instruction-tunes the retrieval-pretrained model, producing an instruction-following version of Retro.
● Stronger factuality: Shows reduced hallucination and better factuality on knowledge-intensive tasks compared to Retro-free baselines at comparable scale.
● Retrieval pretraining validated: Provides evidence that retrieval-during-pretraining can scale to 40B+ parameters and benefit downstream instruction-tuned use cases.
Paper, Tweet
9) MemWalker - MemWalker treats the LLM as an interactive agent that traverses a tree-structured summary of long text.
● Tree of summary nodes: Preprocesses long context into a hierarchical tree of summary nodes, compressing and structuring the information.
● Query-driven traversal: Given a query, the LLM traverses the tree through iterative prompting, descending into subtrees that are most relevant to the question.
● Reasoning-based reading: The traversal decisions are reasoning-based, so the model can explain which part of the document it consulted and why.
● Explainability bonus: The traversal trace serves as a human-readable explanation of the model's document reading, improving debuggability of long-context QA.
Paper, Tweet
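The traversal can be sketched over a toy summary tree, with a word-overlap stub standing in for the LLM's child-selection step (node contents and the chooser are illustrative; the paper prompts the LLM to pick a branch and can also backtrack):

```python
def traverse(node, query, choose):
    # Descend from the root, letting `choose` pick the child whose
    # summary best matches the query; the visited path doubles as a
    # human-readable explanation of what was consulted.
    path = []
    summary, children = node  # node = (summary, list_of_child_nodes)
    while children:
        path.append(summary)
        idx = choose(query, [child[0] for child in children])
        summary, children = children[idx]
    path.append(summary)      # leaf holds the raw text segment
    return summary, path

tree = ("whole report",
        [("finance and revenue", [("Q3 revenue grew 12%", [])]),
         ("hiring and headcount", [("headcount stayed flat", [])])])

def choose(query, summaries):
    # Stub chooser: pick the child sharing the most words with the query.
    qwords = set(query.lower().split())
    return max(range(len(summaries)),
               key=lambda i: len(qwords & set(summaries[i].lower().split())))

leaf, path = traverse(tree, "how did revenue grow", choose)
print(leaf)  # → Q3 revenue grew 12%
print(path)
```

Only the summaries along one root-to-leaf path ever enter the context window, which is how the method reads documents far longer than the model's native context.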
10) FireAct (Language Agent Fine-tuning) - Explores fine-tuning LLMs specifically for language-agent use, demonstrating consistent gains over prompting alone.
● Fine-tuning beats prompting: Language agents consistently improve over prompted baselines after fine-tuning their backbone LLM on agent trajectories.
● 500 trajectories suffice: Fine-tuning a Llama 2-7B on just 500 agent trajectories produces a substantially stronger language agent than a prompted GPT-4 on several agent benchmarks.
● Data-efficient: The low data threshold suggests agent behaviors can be cheaply specialized, which matters for production agent deployment.
● Agent-specialization pattern: Anticipates the wave of agent-specialized LLMs released through 2024, where small focused fine-tunes outperform prompting of large general models.
Paper, Tweet

Top AI Papers of the Week (October 2 - October 8)

Paper Links
1) LLMs Represent Space and Time - discovers that LLMs learn linear representations of space and time across multiple scales; the representations are robust to prompt variations and unified across different entity types; demonstrates that LLMs acquire fundamental structured knowledge such as space and time, suggesting that language models learn not merely superficial statistics but literal world models. Paper, Tweet
2) Retrieval meets Long Context LLMs - compares retrieval augmentation and long-context windows for downstream tasks to investigate if the methods can be combined to get the best of both worlds; an LLM with a 4K context window using simple RAG can achieve comparable performance to a fine-tuned LLM with 16K context; retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes; a retrieval-augmented LLaMA2-70B with a 32K context window outperforms GPT-3.5-turbo-16k on seven long context tasks including question answering and query-based summarization. Paper, Tweet
3) StreamingLLM - a framework that enables efficient streaming LLMs with attention sinks, a phenomenon whereby keeping the KV states of the initial tokens largely recovers the performance of window attention; the attention sink emerges because of strong attention scores towards the initial tokens; this approach enables LLMs trained with a finite attention window to generalize to infinite sequence lengths without any additional fine-tuning. Paper, Tweet
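A minimal sketch of the cache policy (not the authors' code): retain the first few "sink" positions plus a recent window, and evict everything in between as the stream grows.

```python
# StreamingLLM-style KV-cache eviction sketch: keep the attention-sink
# tokens at the start plus a sliding window of recent tokens.

def streaming_cache(positions, n_sink=4, window=8):
    """Return the KV-cache positions retained after eviction."""
    if len(positions) <= n_sink + window:
        return positions
    return positions[:n_sink] + positions[-window:]

stream = list(range(20))          # token positions 0..19 seen so far
kept = streaming_cache(stream)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size stays constant (here 12 entries) no matter how long the stream runs, which is what makes infinite-length streaming feasible.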
4) Neural Developmental Programs - proposes to use neural networks that self-assemble through a developmental process that mirrors properties of embryonic development in biological organisms. Paper, Tweet
5) The Dawn of LMMs - a comprehensive analysis of GPT-4V to deepen the understanding of large multimodal models. Paper, Tweet
6) Training LLMs with Pause Tokens - performs training and inference on LLMs with a learnable token which helps to delay the model's answer generation and attain performance gains on general understanding tasks such as CommonsenseQA and math word problem-solving; experiments show that this is only beneficial provided that the delay is introduced in both pretraining and downstream fine-tuning. Paper, Tweet
7) Recursively Self-Improving Code Generation - proposes the use of a language-model-infused scaffolding program to recursively improve itself; a seed improver first improves an input program and returns the best solution, and is then tasked with improving itself; shows that GPT-4 can write code that calls itself to improve itself. Paper, Tweet
8) Retrieval-Augmented Dual Instruction Tuning - proposes a lightweight fine-tuning method to retrofit LLMs with retrieval capabilities; it involves a 2-step approach: 1) update a pretrained LM to better use the retrieved information, and 2) update the retriever to return more relevant results, as preferred by the LM; results show that when fine-tuning over tasks that require both knowledge utilization and contextual awareness, each stage leads to additional gains; a 65B model achieves state-of-the-art results on a range of knowledge-intensive zero- and few-shot learning benchmarks, outperforming existing retrieval-augmented language approaches by up to +8.9% in zero-shot and +1.4% in 5-shot. Paper, Tweet
9) KOSMOS-G - a model that performs high-fidelity zero-shot image generation from generalized vision-language input that spans multiple images; extends zero-shot subject-driven image generation to multi-entity scenarios; allows the replacement of CLIP, unlocking new applications with other U-Net techniques such as ControlNet and LoRA. Paper, Tweet
10) Analogical Prompting - a new prompting approach to automatically guide the reasoning process of LLMs; the approach is different from chain-of-thought in that it doesn’t require labeled exemplars of the reasoning process; the approach is inspired by analogical reasoning and prompts LMs to self-generate relevant exemplars or knowledge in the context. Paper, Tweet
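The prompting idea reduces to a template; the wording below is illustrative rather than the paper's exact prompt.

```python
# Sketch of an analogical prompt: instead of hand-written exemplars, the
# model is asked to recall its own relevant examples before solving.

def analogical_prompt(problem, n_exemplars=3):
    return (
        f"Problem: {problem}\n\n"
        f"Recall {n_exemplars} relevant and distinct problems you have seen "
        "before. For each, describe the problem and explain its solution.\n\n"
        "Then solve the initial problem step by step."
    )

prompt = analogical_prompt("What is the area of a square with side 7?")
print(prompt)
```

Because the exemplars are self-generated in context, no labeled chain-of-thought demonstrations need to be curated per task.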
Top AI Papers of the Week (September 25 - October 1)

Paper Links
1) The Reversal Curse - finds that LLMs trained on sentences of the form “A is B” will not automatically generalize to the reverse direction “B is A”, i.e., the Reversal Curse; shows the effect through finetuning LLMs on fictitious statements and demonstrating its robustness across model sizes and model families. Paper, Tweet
2) Effective Long-Context Scaling with LLMs - proposes a 70B variant that can already surpass gpt-3.5-turbo-16k’s overall performance on a suite of long-context tasks; this involves a cost-effective instruction tuning procedure that does not require human-annotated long instruction data. Paper, Tweet
3) Graph Neural Prompting with LLMs - proposes a plug-and-play method to assist pre-trained LLMs in learning beneficial knowledge from knowledge graphs. Paper, Tweet
4) Vision Transformers Need Registers - identifies artifacts in feature maps of vision transformer networks that are repurposed for internal computations; this work proposes a solution to provide additional tokens to the input sequence to fill that role; the solution fixes the problem, leads to smoother feature and attention maps, and sets new state-of-the-art results on dense visual prediction tasks. Paper, Tweet
5) Boolformer - presents the first Transformer architecture trained to perform end-to-end symbolic regression of Boolean functions; it can predict compact formulas for complex functions and be applied to modeling the dynamics of gene regulatory networks. Paper, Tweet
6) LLaVA-RLHF - adapts factually augmented RLHF to aligning large multimodal models; this approach alleviates reward hacking in RLHF and improves performance on the LLaVA-Bench dataset, reaching 94% of the performance level of the text-only GPT-4. Paper, Tweet
7) LLM Alignment Survey - a comprehensive survey paper on LLM alignment; topics include Outer Alignment, Inner Alignment, Mechanistic Interpretability, Attacks on Aligned LLMs, Alignment Evaluation, Future Directions, and Discussions. Paper, Tweet
8) Qwen LLM - proposes a series of LLMs demonstrating the strength of RLHF on tasks involving tool use and planning capabilities for creating language agents. Paper, Tweet
9) MentaLLaMA - an open-source LLM series for interpretable mental health analysis with instruction-following capability; it also proposes a multi-task and multi-source interpretable mental health instruction dataset on social media with 105K data samples. Paper, Tweet
10) Logical Chain-of-Thought in LLMs - a new neurosymbolic framework to improve zero-shot chain-of-thought reasoning in LLMs; leverages principles from symbolic logic to verify and revise reasoning processes to improve the reasoning capabilities of LLMs. Paper, Tweet

Top AI Papers of the Week (September 18 - September 24)

Paper Links
1) AlphaMissense - an AI model classifying missense variants to help pinpoint the cause of diseases; the model is used to develop a catalogue of genetic mutations; it can categorize 89% of all 71 million possible missense variants as either likely pathogenic or likely benign. Paper, Tweet
2) Chain-of-Verification reduces Hallucination in LLMs - develops a method to enable LLMs to "deliberate" on responses to correct mistakes; it includes the following steps: 1) draft an initial response, 2) plan verification questions to fact-check the draft, 3) answer the questions independently to avoid bias from other responses, and 4) generate a final verified response. Paper, Tweet
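The four steps can be sketched as a pipeline; here `ask` is a canned stand-in for the LLM (the question, draft, and verification answers are all invented), so only the control flow mirrors the method.

```python
# Chain-of-Verification control flow with a canned stand-in LLM.

def ask(prompt, canned):
    """Stand-in for an LLM call: looks up a canned answer."""
    return canned[prompt]

canned = {
    "draft": "Paris is the capital of France, founded in 1200.",
    "plan": ["Is Paris the capital of France?", "Was Paris founded in 1200?"],
    "Is Paris the capital of France?": "Yes.",
    "Was Paris founded in 1200?": "No, Paris is far older.",
    "revise": "Paris is the capital of France; its origins predate 1200.",
}

def chain_of_verification(question):
    draft = ask("draft", canned)                       # 1) draft initial response
    questions = ask("plan", canned)                    # 2) plan verification questions
    answers = [ask(q, canned) for q in questions]      # 3) answer each independently
    needs_fix = any(a.startswith("No") for a in answers)
    return ask("revise", canned) if needs_fix else draft  # 4) final verified response

print(chain_of_verification("Tell me about Paris."))
```

The key design choice is step 3: each verification question is answered without seeing the draft, so the model cannot simply repeat its own hallucination.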
3) Contrastive Decoding Improves Reasoning in Large Language Models - shows that contrastive decoding leads Llama-65B to outperform Llama 2 and other models on commonsense and mathematical reasoning benchmarks. Paper, Tweet
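A toy version of one decoding step, with invented probabilities: score candidates by expert minus amateur log-probability, after filtering out tokens the expert itself finds implausible.

```python
# Toy contrastive decoding step. All numbers are made up for illustration.
import math

expert  = {"paris": 0.6, "lyon": 0.25, "banana": 0.15}   # strong model
amateur = {"paris": 0.3, "lyon": 0.2,  "banana": 0.5}    # weak model

def contrastive_pick(expert, amateur, alpha=0.1):
    cutoff = alpha * max(expert.values())                # plausibility filter
    candidates = {t: math.log(expert[t]) - math.log(amateur[t])
                  for t, p in expert.items() if p >= cutoff}
    return max(candidates, key=candidates.get)

print(contrastive_pick(expert, amateur))  # paris
```

Intuitively, tokens the weak model also loves (like "banana" above) get penalized, while tokens only the strong model prefers are promoted.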
4) LongLoRA - an efficient fine-tuning approach to significantly extend the context windows of pre-trained LLMs; implements shift short attention, a substitute that approximates the standard self-attention pattern during training; it has less GPU memory cost and training time compared to full fine-tuning while not compromising accuracy. Paper, Tweet
5) LLMs for Generating Structured Data - studies the use of LLMs for generating complex structured data; proposes a structure-aware fine-tuning method, applied to Llama-7B, which significantly outperforms other models like GPT-3.5/4 and Vicuna-13B. Paper, Tweet
6) LMSYS-Chat-1M - a large-scale dataset containing 1 million real-world conversations with 25 state-of-the-art LLMs; it is collected from 210K unique IP addresses on the Vicuna demo and Chatbot Arena website. Paper, Tweet
7) Language Modeling is Compression - evaluates the compression capabilities of LLMs; it investigates how and why compression and prediction are equivalent; shows that LLMs are powerful general-purpose compressors due to their in-context learning abilities; finds that Chinchilla 70B compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG. Paper, Tweet
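The prediction-compression equivalence is easy to see numerically: an ideal coder (e.g., arithmetic coding) spends -log2(p) bits on a symbol the model assigns probability p. The probabilities below are made up purely to show the arithmetic.

```python
# Back-of-envelope link between prediction quality and compression ratio.
import math

def compressed_ratio(probs, raw_bits_per_symbol=8):
    """Bits needed by an ideal coder driven by these per-symbol model
    probabilities, relative to storing each symbol raw."""
    bits = sum(-math.log2(p) for p in probs)
    return bits / (raw_bits_per_symbol * len(probs))

# A model assigning p = 0.5 to every byte compresses to 1/8 of raw size:
print(compressed_ratio([0.5] * 8))  # 0.125
```

The better the next-token predictions (higher p on what actually occurs), the fewer bits the coder spends, which is why a strong LLM doubles as a strong compressor.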
8) Compositional Foundation Models - proposes foundation models that leverage multiple expert foundation models trained on language, vision, and action data to solve long-horizon goals. Paper, Tweet
9) LLMs for IT Operations - proposes OWL, an LLM for IT operations tuned using a self-instruct strategy based on IT-related tasks; it discusses how to collect a quality instruction dataset and how to put together a benchmark. Paper, Tweet
10) KOSMOS-2.5 - a multimodal model for machine reading of text-intensive images, capable of document-level text generation and image-to-markdown text generation. Paper, Tweet

Top AI Papers of the Week (September 11 - September 17)

Paper Links
1) Textbooks Are All You Need II - a new 1.3 billion parameter model trained on 30 billion tokens; the dataset consists of "textbook-quality" synthetically generated data; phi-1.5 competes with or outperforms other, larger models on reasoning tasks, suggesting that data quality plays a more important role than previously thought. Paper, Tweet
2) The Rise and Potential of LLM Based Agents - a comprehensive overview of LLM based agents; covers everything from how to construct these agents to how to harness them for good. Paper, Tweet
3) EvoDiff - combines evolutionary-scale data with diffusion models for controllable protein generation in sequence space; it can generate proteins inaccessible to structure-based models. Paper, Tweet
4) LLMs Can Align Themselves without Finetuning? - discovers that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. Paper, Tweet
5) Robot Parkour Learning - presents a system for learning an end-to-end vision-based parkour policy that is transferred to a quadrupedal robot using its egocentric depth camera; shows that low-cost robots can automatically select and execute parkour skills in a real-world environment. Paper, Tweet
6) A Survey of Hallucination in LLMs - classifies different types of hallucination phenomena and provides evaluation criteria for assessing hallucination along with mitigation strategies. Paper, Tweet
7) Agents - an open-source library for building autonomous language agents including support for features like planning, memory, tool usage, multi-agent communication, and more. Paper, Tweet
8) Radiology-Llama2: Best-in-Class LLM for Radiology - presents an LLM based on Llama 2 tailored for radiology; it's tuned on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiology findings. Paper, Tweet
9) Communicative Agents for Software Development - presents ChatDev, a virtual chat-powered software development company mirroring the waterfall model; shows the efficacy of the agent in software generation, even completing the entire software development process in less than seven minutes for less than one dollar. Paper, Tweet
10) MAmmoTH - a series of open-source LLMs tailored for general math problem-solving; the models are trained on a curated instruction tuning dataset and outperform existing open-source models on several mathematical reasoning datasets. Paper, Tweet

Top AI Papers of the Week (September 4 - September 10)

Paper Links
1) Transformers as SVMs - finds that the optimization geometry of self-attention in Transformers exhibits a connection to hard-margin SVM problems; also finds that gradient descent applied without early-stopping leads to implicit regularization and convergence of self-attention; this work has the potential to deepen the understanding of language models. Paper
2) Scaling RLHF with AI Feedback - tests whether RLAIF is a suitable alternative to RLHF by comparing the efficacy of human vs. AI feedback; uses different techniques to generate AI labels and conduct scaling studies to report optimal settings for generating aligned preferences; the main finding is that on the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline SFT model in ∼70% of cases. Paper, Tweet
3) GPT Solves Math Problems Without a Calculator - shows that with sufficient training data, a 2B language model can perform multi-digit arithmetic operations with 100% accuracy and without data leakage; it’s also competitive with GPT-4 on a 5K-sample Chinese math problem test set when fine-tuned from GLM-10B on a dataset containing additional multi-step arithmetic operations and detailed math problems. Paper, Tweet
4) LLMs as Optimizers - an approach where the optimization problem is described in natural language; an LLM is then instructed to iteratively generate new solutions based on the defined problem and previously found solutions; at each optimization step, the goal is to generate new prompts that increase test accuracy based on the trajectory of previously generated prompts; the optimized prompts outperform human-designed prompts on GSM8K and Big-Bench Hard, sometimes by over 50%. Paper, Tweet
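The loop can be sketched with stand-ins for both the proposer LLM and the scorer; everything below (the toy metric, the "append a character" proposer) is invented, and only the trajectory-feedback structure reflects the method.

```python
# Optimization-by-prompting loop with toy stand-ins for the LLM and evaluator.

def propose(trajectory):
    """Stand-in proposer LLM: reads the scored history and tweaks the best
    prompt so far (here, by appending an exclamation mark)."""
    best, _ = max(trajectory, key=lambda t: t[1])
    return best + "!"

def score(prompt):
    """Stand-in evaluation: here, longer prompts score higher (toy metric)."""
    return len(prompt)

def optimize(seed, steps=5):
    trajectory = [(seed, score(seed))]
    for _ in range(steps):
        candidate = propose(trajectory)          # LLM sees scored history
        trajectory.append((candidate, score(candidate)))
    return max(trajectory, key=lambda t: t[1])

best, best_score = optimize("Think step by step")
print(best, best_score)
```

In the real system, `propose` is an LLM shown the trajectory of (prompt, accuracy) pairs, and `score` is test accuracy on a held-out task set.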
5) Multi-modality Instruction Tuning - presents ImageBind-LLM, a multimodality instruction tuning method of LLMs via ImageBind; this model can respond to instructions in diverse modalities such as audio, 3D point clouds, and video, with high language generation quality; this is achieved by aligning ImageBind’s visual encoder with an LLM via a learnable bind network. Paper, Tweet
6) Explaining Grokking - aims to explain grokking behavior in neural networks; specifically, it predicts and shows two novel behaviors: the first is ungrokking where a model goes from perfect generalization to memorization when trained further on a smaller dataset than the critical threshold; the second is semi-grokking where a network demonstrates grokking-like transition when training a randomly initialized network on the critical dataset size. Paper, Tweet
7) Overview of AI Deception - provides a survey of empirical examples of AI deception. Paper, Tweet
8) FLM-101B - a new open LLM called FLM-101B with 101B parameters trained on 0.31T tokens, which can be trained on a $100K budget; the authors analyze different growth strategies, growing the number of parameters from smaller sizes to large ones, and ultimately employ an aggressive strategy that reduces costs by >50%: three models are trained sequentially, with each model inheriting knowledge from its smaller predecessor. Paper, Tweet
9) Cognitive Architecture for Language Agents - proposes a systematic framework for understanding and building fully-fledged language agents drawing parallels from production systems and cognitive architectures; it systematizes diverse methods for LLM-based reasoning, grounding, learning, and decision making as instantiations of language agents in the framework. Paper, Tweet
10) Q-Transformer - a scalable RL method for training multi-task policies from large offline datasets leveraging human demonstrations and autonomously collected data; shows good performance on a large diverse real-world robotic manipulation task suite. Paper, Tweet

Top AI Papers of the Week (August 28 - September 3)

Paper Links
1) Large Language and Speech Model - proposes a large language and speech model trained with cross-modal conversational abilities; it supports speech-and-language instructions, enabling more natural interactions with AI systems. Paper, Tweet
2) SAM-Med2D - applies the Segment Anything Model (SAM) to medical 2D images; fine-tunes SAM on a large-scale dataset of medical images and segmentation masks to adapt it from natural images to the medical domain. Paper, Tweet
3) Vector Search with OpenAI Embeddings - suggests that “from a cost–benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern “AI stack” for search since such applications have already received substantial investments in existing, widely deployed infrastructure.” Paper, Tweet
4) Graph of Thoughts - presents a prompting approach that models text generated by LLMs as an arbitrary graph; it enables combining arbitrary "thoughts" and enhancing them using feedback loops; the core idea is to enhance the LLM capabilities through "network reasoning" and without any model updates; this could be seen as a generalization of the now popular Chain-of-Thought and Tree-of-Thought. Paper, Tweet
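What distinguishes a graph from a chain is the aggregation edge that merges two branches; a sorting toy makes the shape concrete (in the real framework each node would be an LLM-generated thought, and split/merge would be prompted operations).

```python
# Toy Graph-of-Thoughts: thoughts are graph nodes, and an aggregation edge
# merges two partial results -- something a linear chain cannot express.

def merge(a, b):
    """Aggregate two sorted sub-results (stand-in for an LLM 'merge' thought)."""
    a, b = list(a), list(b)
    out = []
    while a and b:
        out.append(a.pop(0) if a[0] <= b[0] else b.pop(0))
    return out + a + b

# Build the graph: root thought -> two branch thoughts -> aggregated thought.
graph = {"root": [9, 3, 7, 1]}
graph["left"] = sorted(graph["root"][:2])              # branch thought 1
graph["right"] = sorted(graph["root"][2:])             # branch thought 2
graph["final"] = merge(graph["left"], graph["right"])  # aggregation node
print(graph["final"])  # [1, 3, 7, 9]
```

Sorting via decomposition and merging is one of the paper's own demonstration tasks; the point is that feedback loops and merges over "thought" nodes need no model updates, only orchestration.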
5) MVDream - a multi-view diffusion model that can generate geometrically consistent multi-view images given a text prompt; it leverages pre-trained diffusion models and a multi-view dataset rendered from 3D assets; this leads to generalizability of 2D diffusion and consistency of 3D data. Paper, Tweet
6) Nougat - proposes an approach for neural optical understanding of academic documents; it supports the ability to extract text, equations, and tables from academic PDFs, i.e., convert PDFs into LaTeX/markdown. Paper, Tweet
7) Factuality Detection in LLMs - proposes a tool called FacTool to detect factual errors in texts generated by LLMs; shows the necessary components needed and the types of tools to integrate with LLMs for better detecting factual errors. Paper, Tweet
8) AnomalyGPT - an approach for industrial anomaly detection based on large vision-language models; it simulates anomalous images and textual descriptions to generate training data; employs an image decoder and prompt learner to detect anomalies; it shows few-shot in-context learning capabilities and achieves state-of-the-art performance on benchmark datasets. Paper, Tweet
9) FaceChain - a personalized portrait generation framework combining customized image-generation models and face-related perceptual understanding models to generate truthful personalized portraits; it works with a handful of portrait images as input. Paper
10) Qwen-VL - introduces a set of large-scale vision-language models demonstrating strong performance in tasks like image captioning, question answering, visual localization, and flexible interaction. Paper, Tweet

Top AI Papers of the Week (August 21 - August 27)

Paper Links
1) Code Llama - a family of LLMs for code based on Llama 2; the release includes foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct), each in 7B, 13B, and 34B parameter sizes. Paper, Tweet
2) Survey on Instruction Tuning for LLMs - new survey paper on instruction tuning LLMs, including a systematic review of the literature, methodologies, dataset construction, training models, applications, and more. Paper, Tweet
3) SeamlessM4T - a unified multilingual and multimodal machine translation system that supports ASR, text-to-text translation, speech-to-text translation, text-to-speech translation, and speech-to-speech translation. Paper, Tweet
4) Use of LLMs for Illicit Purposes - provides an overview of existing efforts to identify and mitigate threats and vulnerabilities arising from LLMs; serves as a guide to building more reliable and robust LLM-powered systems. Paper, Tweet
5) Giraffe - a new family of models that are fine-tuned from base Llama and Llama 2; extends the context length to 4K, 16K, and 32K; explores the space of expanding context lengths in LLMs so it also includes insights useful for practitioners and researchers. Paper, Tweet
6) IT3D - presents a strategy that leverages explicitly synthesized multi-view images to improve Text-to-3D generation; integrates a discriminator along with a Diffusion-GAN dual training strategy to guide the training of the 3D models. Paper
7) A Survey on LLM-based Autonomous Agents - presents a comprehensive survey of LLM-based autonomous agents; delivers a systematic review of the field and a summary of various applications of LLM-based AI agents in domains like social science and engineering. Paper, Tweet
8) Prompt2Model - a new framework that accepts a prompt describing a task through natural language; it then uses the prompt to train a small special-purpose model that is conducive to deployment; the proposed pipeline automatically collects and synthesizes knowledge through three channels: dataset retrieval, dataset generation, and model retrieval. Paper, Tweet
9) LegalBench - a collaboratively constructed benchmark for measuring legal reasoning in LLMs; it consists of 162 tasks covering 6 different types of legal reasoning. Paper, Tweet
10) Language to Rewards for Robotic Skill Synthesis - proposes a new language-to-reward system that utilizes LLMs to define optimizable reward parameters to achieve a variety of robotic tasks; the method is evaluated on a real robot arm where complex manipulation skills such as non-prehensile pushing emerge. Paper, Tweet

Top AI Papers of the Week (August 14 - August 20)

Paper Links
1) Self-Alignment with Instruction Backtranslation - presents an approach to automatically label human-written text with corresponding instructions, which enables building a high-quality instruction-following language model; the steps are: 1) fine-tune an LLM on a small amount of seed data, 2) use it to generate instructions for each document in a web corpus, 3) curate high-quality examples via the LLM itself, and finally 4) fine-tune on the newly curated data; the self-alignment approach outperforms all other Llama-based models on the Alpaca leaderboard. Paper, Tweet
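The generation and curation steps can be sketched with stand-in functions for the seed model's two roles; the documents and filtering heuristics below are invented for illustration.

```python
# Backtranslation sketch: invent an instruction per web document, then
# self-curate. Both stand-in functions would be the seed-finetuned LLM.

web_docs = ["Mix flour and water, knead, rest 30 minutes, then bake.",
            "asdf click here buy now!!!"]

def generate_instruction(doc):
    """Stand-in for the LLM: invent the instruction this doc would answer."""
    return "How do I make simple bread?" if "bake" in doc else "???"

def judge_quality(instruction, doc):
    """Stand-in for LLM self-curation: keep only plausible pairs."""
    return instruction != "???" and "buy now" not in doc

pairs = [(generate_instruction(d), d) for d in web_docs]        # generation step
curated = [(i, d) for i, d in pairs if judge_quality(i, d)]     # curation step
print(curated)  # one high-quality (instruction, response) pair for fine-tuning
```

The curation step is what keeps noisy web text from polluting the training set: only pairs the model itself rates as high quality survive to the final fine-tune.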
2) Platypus - a family of fine-tuned and merged LLMs currently topping the Open LLM Leaderboard; it describes a process of efficiently fine-tuning and merging LoRA modules and also shows the benefits of collecting high-quality datasets for fine-tuning; specifically, it presents a small-scale, high-quality, and highly curated dataset, Open-Platypus, that enables strong performance with short and cheap fine-tuning; for example, one can train a 13B model on a single A100 GPU using 25K questions in 5 hours. Paper, Tweet
3) Model Compression for LLMs - a short survey on the recent model compression techniques for LLMs; provides a high-level overview of topics such as quantization, pruning, knowledge distillation, and more; it also provides an overview of benchmark strategies and evaluation metrics for measuring the effectiveness of compressed LLMs. Paper, Tweet
4) GEARS - uses deep learning and gene relationship knowledge graph to help predict cellular responses to genetic perturbation; GEARS exhibited 40% higher precision than existing approaches in the task of predicting four distinct genetic interaction subtypes in a combinatorial perturbation screen. Paper, Tweet
5) Shepherd - introduces a language model (7B) specifically tuned to critique the model responses and suggest refinements; this enables the capability to identify diverse errors and suggest remedies; its critiques are either similar or preferred to ChatGPT. Paper, Tweet
6) Using GPT-4 Code Interpreter to Boost Mathematical Reasoning - proposes a zero-shot prompting technique for GPT-4 Code Interpreter that explicitly encourages the use of code for self-verification which further boosts performance on math reasoning problems; initial experiments show that GPT4-Code achieved a zero-shot accuracy of 69.7% on the MATH dataset which is an improvement of 27.5% over GPT-4’s performance (42.2%). Lots to explore here. Paper, Tweet
7) Teach LLMs to Personalize - proposes a general approach based on multitask learning for personalized text generation using LLMs; the goal is to have an LLM generate personalized text without relying on predefined attributes. Paper, Tweet
8) OctoPack - presents 4 terabytes of Git commits across 350 languages used to instruction tune code LLMs; achieves state-of-the-art performance among models not trained on OpenAI outputs, on the HumanEval Python benchmark; the data is also used to extend the HumanEval benchmark to other tasks such as code explanation and code repair. Paper, Tweet
9) Efficient Guided Generation for LLMs - presents a library to help LLM developers guide text generation in a fast and reliable way; provides generation methods that guarantee that the output will match a regular expression, or follow a JSON schema. Paper, Tweet
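The underlying mechanism can be illustrated with a hand-built automaton for a price-like pattern (one or more digits, a dot, then two digits); at each decoding step, only tokens that keep the automaton alive may be sampled. This is a toy; the real library compiles arbitrary regular expressions to such machines.

```python
# Toy finite-state-machine-guided decoding for the pattern: digits '.' two digits.

def step(state, ch):
    """Transition for the price-like pattern; None means reject."""
    if state == "start":
        return "int" if ch.isdigit() else None
    if state == "int":
        return "int" if ch.isdigit() else ("frac0" if ch == "." else None)
    if state == "frac0":
        return "frac1" if ch.isdigit() else None
    if state == "frac1":
        return "done" if ch.isdigit() else None
    return None  # "done": the pattern is complete, nothing more may follow

def feed(state, token):
    """Run a whole token through the automaton; None if it ever rejects."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def mask(state, vocab):
    """Tokens permitted next, given the current automaton state."""
    return [t for t in vocab if feed(state, t) is not None]

vocab = ["12", ".", "9", "cat", ".50"]
print(mask("start", vocab))  # ['12', '9']
print(mask("int", vocab))    # ['12', '.', '9', '.50']
```

Because the mask is computed per state rather than per generated string, the per-step overhead is small, which is what makes structure guarantees practical at decoding time.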
10) Bayesian Flow Networks - introduces a new class of generative models bringing together the power of Bayesian inference and deep learning; it differs from diffusion models in that it operates on the parameters of a data distribution rather than on a noisy version of the data; it’s adapted to continuous, discretized and discrete data with minimal changes to the training procedure. Paper, Tweet

Top AI Papers of the Week (August 7 - August 13)

Paper Links
1) LLMs as Database Administrators - presents D-Bot, a framework based on LLMs that continuously acquires database maintenance experience from textual sources; D-Bot can help in performing: 1) database maintenance knowledge detection from documents and tools, 2) tree of thought reasoning for root cause analysis, and 3) collaborative diagnosis among multiple LLMs. Paper, Tweet
2) Political Biases Found in NLP Models - develops methods to measure media biases in LLMs, including the fairness of downstream NLP models tuned on top of politically biased LLMs; findings reveal that LLMs have political leanings which reinforce existing polarization in the corpora. Paper, Tweet
3) Evaluating LLMs as Agents - presents a multidimensional benchmark (AgentBench) to assess LLM-as-Agent’s reasoning and decision-making abilities; results show that there is a significant disparity in performance between top commercial LLMs and open-source LLMs when testing the ability to act as agents; open-source LLMs lag on the AgentBench tasks while GPT-4 shows potential to build continuously learning agents. Paper, Tweet
4) Studying LLM Generalization with Influence Functions - introduces an efficient approach to scale influence functions to LLMs with up to 52 billion parameters; the influence functions are used to further investigate the generalization patterns of LLMs such as cross-lingual generalization and memorization; finds that middle layers in the network seem to be responsible for the most abstract generalization patterns. Paper, Tweet
5) Seeing Through the Brain - proposes NeuroImagen, a pipeline for reconstructing visual stimuli images from EEG signals to potentially understand visually-evoked brain activity; a latent diffusion model takes EEG data and reconstructs high-resolution visual stimuli images. Paper, Tweet
6) SynJax - a new library that provides an efficient vectorized implementation of inference algorithms for structured distributions; it enables building large-scale differentiable models that explicitly model structure in data like tagging, segmentation, constituency trees, and spanning trees. Paper, Tweet
7) Synthetic Data Reduces Sycophancy in LLMs - proposes fine-tuning on simple synthetic data to reduce sycophancy in LLMs; sycophancy occurs when LLMs try to follow a user’s view even when it’s not objectively correct; essentially, the LLM repeats the user’s view even when the opinion is wrong. Paper, Tweet
8) Photorealistic Unreal Graphics (PUG) - presents photorealistic and semantically controllable synthetic datasets for representation learning using Unreal Engine; the goal is to democratize photorealistic synthetic data and enable more rigorous evaluations of vision models. Paper, Tweet
9) LLMs for Industrial Control - develops an approach to select demonstrations and generate high-performing prompts used with GPT for executing tasks such as controlling HVAC (heating, ventilation, and air conditioning) systems for buildings; GPT-4 performs comparably to RL methods but uses fewer samples and incurs lower technical debt. Paper, Tweet
10) Trustworthy LLMs - presents a comprehensive overview of important categories and subcategories crucial for assessing LLM trustworthiness; the dimensions include reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness; finds that aligned models perform better in terms of trustworthiness but the effectiveness of alignment varies. Paper, Tweet

Top AI Papers of the Week (July 31 - August 6)

Paper Links
1) Open Problem and Limitation of RLHF - provides an overview of open problems and the limitations of RLHF. Paper, Tweet
2) Med-Flamingo - a new multimodal model that allows in-context learning and enables tasks such as few-shot medical visual question answering; evaluations by physicians show improvements of up to 20% in clinicians' ratings; the authors occasionally observed low-quality generations and hallucinations. Paper, Tweet
3) ToolLLM - enables LLMs to interact with 16,000+ real-world APIs; it’s a framework that allows data preparation, training, and evaluation; the authors claim that one of their models, ToolLLaMA, has reached the performance of ChatGPT (turbo-16k) in tool use. Paper, Tweet
4) Skeleton-of-Thought - proposes a prompting strategy that first generates an answer skeleton and then performs parallel API calls to generate the content of each skeleton point; reports quality improvements in addition to speed-ups of up to 2.39x. Paper, Tweet
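The two stages in miniature, with canned stand-ins for the API calls (both `skeleton` and `expand` would be LLM requests in practice):

```python
# Skeleton-of-Thought sketch: draft an outline, then expand every point
# in parallel. pool.map preserves the skeleton's ordering.
from concurrent.futures import ThreadPoolExecutor

def skeleton(question):
    """Stand-in for stage 1: a short list of answer points."""
    return ["definition", "example", "caveat"]

def expand(point):
    """Stand-in for stage 2: one independent API call per skeleton point."""
    return f"[{point}] expanded into a full paragraph."

def skeleton_of_thought(question):
    points = skeleton(question)
    with ThreadPoolExecutor() as pool:    # expansions run concurrently
        paragraphs = list(pool.map(expand, points))
    return "\n".join(paragraphs)

answer = skeleton_of_thought("What is overfitting?")
print(answer)
```

The speed-up comes from the second stage: since each point is expanded independently, the wall-clock cost is roughly one short call plus one parallel batch rather than one long sequential generation.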
5) MetaGPT - a framework involving LLM-based multi-agents that encodes human standardized operating procedures (SOPs) to extend complex problem-solving capabilities that mimic efficient human workflows; this enables MetaGPT to perform multifaceted software development, code generation tasks, and even data analysis using tools like AutoGPT and LangChain. Paper, Tweet
6) OpenFlamingo - introduces a family of autoregressive vision-language models ranging from 3B to 9B parameters; the technical report describes the models, training data, and evaluation suite. Paper, Tweet
7) The Hydra Effect - shows that language models exhibit self-repairing properties — when one layer of attention heads is ablated it causes another later layer to take over its function. Paper, Tweet
8) Self-Check - explores whether LLMs have the capability to perform self-checks, which is required for complex tasks that depend on non-linear thinking and multi-step reasoning; it proposes a zero-shot verification scheme to recognize errors without external resources; the scheme can improve question-answering performance through weighted voting and even improve math word problem-solving. Paper, Tweet
9) Agents Model the World with Language - presents an agent that learns a multimodal world model that predicts future text and image representations; it learns to predict future language, video, and rewards; it’s applied to different domains and can learn to follow instructions in visually and linguistically complex domains. Paper, Tweet
10) AutoRobotics-Zero - discovers zero-shot adaptable policies from scratch that enable adaptive behaviors necessary for sudden environmental changes; as an example, the authors demonstrate the automatic discovery of Python code for controlling a robot. Paper, Tweet

Top AI Papers of the Week (July 24 - July 30)

Paper Links
1) Universal Adversarial LLM Attacks - finds universal and transferable adversarial attacks that cause aligned models like ChatGPT and Bard to generate objectionable behaviors; the approach automatically produces adversarial suffixes using greedy and gradient search. Paper, Tweet
2) RT-2 - a new end-to-end vision-language-action model that learns from both web and robotics data; enables the model to translate the learned knowledge to generalized instructions for robotic control. Paper, Tweet
3) Med-PaLM Multimodal - introduces a new multimodal biomedical benchmark with 14 different tasks; it presents a proof of concept for a generalist biomedical AI system called Med-PaLM Multimodal; it supports different types of biomedical data like clinical text, imaging, and genomics. Paper, Tweet
4) Tracking Anything in High Quality - proposes a framework for high-quality tracking of anything in videos; it consists of a video multi-object segmenter and a pretrained mask refiner model that refines the tracking results; the model ranks 2nd place in the VOTS2023 challenge. Paper, Tweet
5) Foundation Models in Vision - presents a survey and outlook discussing open challenges and research directions for foundational models in computer vision. Paper, Tweet
6) L-Eval - a standardized evaluation for long context language models containing 411 long documents over 2K query-response pairs encompassing areas such as law, finance, school lectures, long conversations, novels, and meetings. Paper, Tweet
7) LoraHub - introduces LoraHub to enable efficient cross-task generalization via dynamic LoRA composition; it enables the combination of LoRA modules without human expertise or additional parameters/gradients; mimics the performance of in-context learning in few-shot scenarios. Paper, Tweet
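The composition idea can be illustrated numerically: each task-specific LoRA module contributes a low-rank update, and a weight vector mixes them into a single delta on the frozen base weight. A minimal numpy sketch (dimensions, module count, and the fixed mixing weights here are hypothetical; the paper searches for the weights gradient-free rather than hand-setting them):

```python
import numpy as np

# Each LoRA module i holds low-rank factors (A_i, B_i); a weight vector w mixes
# them into one composed update: delta = sum_i w_i * (A_i @ B_i).
rng = np.random.default_rng(0)
d, r, n_modules = 16, 4, 3

modules = [(rng.normal(size=(d, r)), rng.normal(size=(r, d))) for _ in range(n_modules)]
w = np.array([0.5, 0.3, 0.2])  # hypothetical mixing weights

delta = sum(wi * (A @ B) for wi, (A, B) in zip(w, modules))
W_base = rng.normal(size=(d, d))   # stand-in for a frozen pretrained weight
W_adapted = W_base + delta         # applied without touching base parameters
print(W_adapted.shape)  # (16, 16)
```

Because only the small mixing vector is tuned, combining modules adds no new trainable parameters beyond the existing LoRA factors.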
8) Survey of Aligned LLMs - presents a comprehensive overview of alignment approaches, including aspects like data collection, training methodologies, and model evaluation. Paper, Tweet
9) WavJourney - leverages LLMs to connect various audio models to compose audio content for engaging storytelling; this involves an explainable and interactive design that enhances creative control in audio production. Paper, Tweet
10) FacTool - a task and domain agnostic framework for factuality detection of text generated by LLM; the effectiveness of the approach is tested on tasks such as code generation and mathematical reasoning; a benchmark dataset is released, including a ChatGPT plugin. Paper, Tweet

Top AI Papers of the Week (July 17 - July 23)

Paper Links
1) Llama 2 - a collection of pretrained foundational models and fine-tuned chat models ranging in scale from 7B to 70B; Llama 2-Chat is competitive on a range of tasks and shows strong results on safety and helpfulness. Paper, Tweet
2) How is ChatGPT’s Behavior Changing Over Time? - evaluates different versions of GPT-3.5 and GPT-4 on various tasks and finds that behavior and performance vary greatly over time; this includes differences in performance for tasks such as math problem-solving, safety-related generations, and code formatting. Paper, Tweet
3) FlashAttention-2 - improves work partitioning and parallelism and addresses issues like reducing non-matmul FLOPs, parallelizing attention computation which increases occupancy, and reducing communication through shared memory. Paper, Tweet
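FlashAttention-2's contribution is GPU work partitioning, but the algorithm it refines rests on tiled attention with an online softmax. A numpy sketch of that rescaling trick, illustrative only (block size and shapes are arbitrary; the real kernels operate on SRAM tiles):

```python
import numpy as np

# Process keys/values one block at a time, maintaining a running row max (m)
# and softmax denominator (l), rescaling partial outputs as new blocks arrive.
def tiled_attention(Q, K, V, block=32):
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row-wise max of scores
    l = np.zeros(n)                  # running softmax denominator
    for s in range(0, K.shape[0], block):
        S = Q @ K[s:s + block].T / np.sqrt(d)      # scores for this key block
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                  # rescale previous partials
        p = np.exp(S - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[s:s + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 16)) for _ in range(3))

# Reference: ordinary full softmax attention.
S = Q @ K.T / 4.0
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), ref))  # True
```

The block-at-a-time formulation is what lets the fused kernel avoid materializing the full n×n score matrix in memory.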
4) Measuring Faithfulness in Chain-of-Thought Reasoning - finds that the faithfulness of CoT reasoning varies widely across tasks, as measured by simple interventions like adding mistakes and paraphrasing; demonstrates that as models become larger and more capable, their reasoning becomes less faithful; suggests that carefully choosing model size and tasks can enable CoT faithfulness. Paper, Tweet
5) Generative TV & Showrunner Agents - an approach to generate episodic content using LLMs and multi-agent simulation; this enables current systems to perform creative storytelling through the integration of simulation, the user, and powerful AI models and enhance the quality of AI-generated content. Paper, Tweet
6) Challenges & Application of LLMs - summarizes a comprehensive list of challenges when working with LLMs that range from brittle evaluations to prompt brittleness to a lack of robust experimental designs. Paper, Tweet
7) Retentive Network - presents a foundation architecture for LLMs with the goal of improving training efficiency, inference, and long-sequence modeling; adapts a retention mechanism for sequence modeling that supports parallel, recurrent, and chunkwise recurrent representations. Paper, Tweet
8) Meta-Transformer - a framework that performs unified learning across 12 modalities; it can handle tasks that include fundamental perception (text, image, point cloud, audio, video), practical application (X-Ray, infrared, hyperspectral, and IMU), and data mining (graph, tabular, and time-series). Paper, Tweet
9) Retrieve In-Context Example for LLMs - presents a framework to iteratively train dense retrievers to identify high-quality in-context examples for LLMs; the approach enhances in-context learning performance demonstrated using a suite of 30 tasks; examples with similar patterns are helpful and gains are consistent across model sizes. Paper, Tweet
10) FLASK - proposes fine-grained evaluation for LLMs based on a range of alignment skill sets; involves 12 skills and can help to provide a holistic view of a model’s performance depending on skill, domain, and level of difficulty; useful to analyze factors that make LLMs more proficient at specific skills. Paper, Tweet

Top AI Papers of the Week (July 10 - July 16)

Paper Links
1) CM3Leon - introduces a retrieval-augmented multi-modal language model that can generate text and images; leverages diverse and large-scale instruction-style data for tuning which leads to significant performance improvements and 5x less training compute than comparable methods. Paper, Tweet
2) Claude 2 - presents a detailed model card for Claude 2 along with results on a range of safety, alignment, and capabilities evaluations. Paper, Tweet
3) Secrets of RLHF in LLMs - takes a closer look at RLHF and explores the inner workings of PPO with code included. Paper, Tweet
4) LongLLaMA - employs a contrastive training process to enhance the structure of the (key, value) space to extend context length; presents a fine-tuned model that lengthens context and demonstrates improvements in long context tasks. Paper, Tweet
5) Patch n’ Pack: NaViT - introduces a vision transformer for any aspect ratio and resolution through sequence packing; enables flexible model usage, improved training efficiency, and transfers to tasks involving image and video classification among others. Paper, Tweet
6) LLMs as General Pattern Machines - shows that even without any additional training, LLMs can serve as general sequence modelers, driven by in-context learning; this work applies zero-shot capabilities to robotics and shows that it’s possible to transfer the pattern among words to actions. Paper, Tweet
7) HyperDreamBooth - introduces a smaller, faster, and more efficient version of Dreambooth; enables personalization of text-to-image diffusion model using a single input image, 25x faster than Dreambooth. Paper, Tweet
8) Teaching Arithmetics to Small Transformers - trains small transformer models on chain-of-thought style data to significantly improve accuracy and convergence speed; it highlights the importance of high-quality instructive data for rapidly eliciting arithmetic capabilities. Paper, Tweet
9) AnimateDiff - appends a motion modeling module to a frozen text-to-image model, which is then trained and used to animate existing personalized models to produce diverse and personalized animated images. Paper, Tweet
10) Generative Pretraining in Multimodality - presents a new transformer-based multimodal foundation model to generate images and text in a multimodal context; enables performant multimodal assistants via instruction tuning. Paper, Tweet

Top AI Papers of the Week (July 3 - July 9)

Paper Links
1) A Survey on Evaluation of LLMs - a comprehensive overview of evaluation methods for LLMs focusing on what to evaluate, where to evaluate, and how to evaluate. Paper, Tweet
2) How Language Models Use Long Contexts - finds that LM performance is often highest when relevant information occurs at the beginning or end of the input context; performance degrades when relevant information is provided in the middle of a long context. Paper, Tweet
3) LLMs as Effective Text Rankers - proposes a prompting technique that enables open-source LLMs to perform state-of-the-art text ranking on standard benchmarks. Paper, Tweet
4) Multimodal Generation with Frozen LLMs - introduces an approach that effectively maps images to the token space of LLMs; enables models like PaLM and GPT-4 to tackle visual tasks without parameter updates; enables multimodal tasks and uses in-context learning to tackle various visual tasks. Paper, Tweet
5) CodeGen2.5 - releases a new code LLM trained on 1.5T tokens; the 7B model is on par with >15B code-generation models and it’s optimized for fast sampling. Paper, Tweet
6) Elastic Decision Transformer - introduces an advancement over Decision Transformers and variants by facilitating trajectory stitching during action inference at test time, achieved by adjusting to shorter history that allows transitions to diverse and better future states. Paper, Tweet
7) Robots That Ask for Help - presents a framework to measure and align the uncertainty of LLM-based planners that ask for help when needed. Paper, Tweet
8) Physics-based Motion Retargeting in Real-Time - proposes a method that uses reinforcement learning to train a policy to control characters in a physics simulator; it retargets motions in real-time from sparse human sensor data to characters of various morphologies. Paper, Tweet
9) Scaling Transformer to 1 Billion Tokens - presents LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, with no loss in shorter sequences. Paper, Tweet
10) InterCode - introduces a framework of interactive coding as a reinforcement learning environment; this is different from the typical coding benchmarks that consider a static sequence-to-sequence process. Paper, Tweet

Top AI Papers of the Week (June 26 - July 2)

Paper Links
1) LeanDojo - an open-source Lean playground consisting of toolkits, data, models, and benchmarks for theorem proving; also develops ReProver, a retrieval augmented LLM-based prover for theorem solving using premises from a vast math library. Paper, Tweet
2) Extending Context Window of LLMs - extends the context window of LLMs like LLaMA to up to 32K with minimal fine-tuning (within 1000 steps); previous methods for extending the context window are inefficient but this approach attains good performance on several tasks while being more efficient and cost-effective. Paper, Tweet
3) Computer Vision Through the Lens of Natural Language - proposes a modular approach for solving computer vision problems by leveraging LLMs; the LLM is used to reason over outputs from independent and descriptive modules that provide extensive information about an image. Paper, Tweet
4) Visual Navigation Transformer - a foundational model that leverages the power of pretrained models for vision-based robotic navigation; it can be used with any navigation dataset and is built on a flexible Transformer-based architecture that can tackle various navigational tasks. Paper, Tweet
5) Generative AI for Programming Education - evaluates GPT-4 and ChatGPT on programming education scenarios and compares their performance with human tutors; GPT-4 outperforms ChatGPT and comes close to human tutors' performance. Paper, Tweet
6) DragDiffusion - extends interactive point-based image editing using diffusion models; it optimizes the diffusion latent to achieve precise spatial control and complete high-quality editing efficiently. Paper, Tweet
7) Understanding Theory-of-Mind in LLMs with LLMs - a framework for procedurally generating evaluations with LLMs; proposes a benchmark to study the social reasoning capabilities of LLMs with LLMs. Paper, Tweet
8) Evaluations with No Labels - a framework for self-supervised evaluation of LLMs by analyzing their sensitivity or invariance to transformations on input text; can be used to monitor LLM behavior on datasets streamed during live model deployment. Paper, Tweet
9) Long-range Language Modeling with Self-Retrieval - an architecture and training procedure for jointly training a retrieval-augmented language model from scratch for long-range language modeling tasks. Paper, Tweet
10) Scaling MLPs: A Tale of Inductive Bias - shows that the performance of MLPs improves with scale and highlights that a lack of inductive bias can be compensated for. Paper, Tweet

Top AI Papers of the Week (June 19 - June 25)

Paper Links
1) Textbooks Are All You Need - introduces a new 1.3B parameter LLM called phi-1; it’s significantly smaller in size and trained for 4 days using a selection of textbook-quality data and synthetic textbooks and exercises with GPT-3.5; achieves promising results on the HumanEval benchmark. Paper, Tweet
2) RoboCat - a new foundation agent that can operate different robotic arms and can solve tasks from as few as 100 demonstrations; the self-improving AI agent can self-generate new training data to improve its technique and get more efficient at adapting to new tasks. Paper, Tweet
3) ClinicalGPT - a language model optimized through extensive and diverse medical data, including medical records, domain-specific knowledge, and multi-round dialogue consultations. Paper, Tweet
4) An Overview of Catastrophic AI Risks - provides an overview of the main sources of catastrophic AI risks; the goal is to foster more understanding of these risks and ensure AI systems are developed in a safe manner. Paper, Tweet
5) LOMO - proposes a new memory-efficient optimizer that combines gradient computation and parameter update in one step; enables tuning the full parameters of an LLM with limited resources. Paper, Tweet
6) SequenceMatch - formulates sequence generation as an imitation learning problem; this framework makes it possible to incorporate backtracking into text generation through a backspace action, enabling the generative model to mitigate compounding errors by reverting sampled tokens that lead the sequence out of distribution. Paper, Tweet
7) LMFlow - an extensible and lightweight toolkit that simplifies finetuning and inference of general large foundation models; supports continuous pretraining, instruction tuning, parameter-efficient finetuning, alignment tuning, and large model inference. Paper, Tweet
8) MotionGPT - uses multimodal control signals for generating consecutive human motions; it quantizes multimodal control signals into discrete codes which are converted to LLM instructions that generate motion answers. Paper, Tweet
9) Wanda - introduces a simple and effective pruning approach for LLMs; it prunes weights with the smallest magnitudes multiplied by the corresponding input activations, on a per-output basis; the approach requires no retraining or weight update and outperforms baselines of magnitude pruning. Paper, Tweet
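The Wanda criterion is simple enough to sketch directly from the summary: score each weight by its magnitude times the norm of the corresponding input activation, then zero the lowest-scoring weights within each output row. A toy numpy version (shapes and sparsity level are arbitrary):

```python
import numpy as np

# Score each weight by |W_ij| * ||X_j|| and prune per output row,
# with no retraining or weight update.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # (out_features, in_features)
X = rng.normal(size=(32, 8))     # calibration activations for the layer input

act_norm = np.linalg.norm(X, axis=0)   # per-input-channel norm ||X_j||
scores = np.abs(W) * act_norm          # Wanda importance metric
sparsity = 0.5
k = int(W.shape[1] * sparsity)         # weights to drop per row

pruned = W.copy()
for i in range(W.shape[0]):            # per-output comparison group
    drop = np.argsort(scores[i])[:k]   # lowest-scoring weights in this row
    pruned[i, drop] = 0.0

print((pruned == 0).mean())  # 0.5
```

Folding the activation norm into the score is what distinguishes this from plain magnitude pruning: a small weight on a high-activation channel can still matter.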
10) AudioPaLM - fuses text-based and speech-based LMs, PaLM-2 and AudioLM, into a multimodal architecture that supports speech understanding and generation; outperforms existing systems for speech translation tasks with zero-shot speech-to-text translation capabilities. Paper, Tweet

Top AI Papers of the Week (June 12 - June 18)

Paper Links
1) Voicebox - an all-in-one generative speech model; it can synthesize speech across 6 languages; it can perform noise removal, content editing, style conversion, and more; it's 20x faster than current models and outperforms single-purpose models through in-context learning. Paper, Tweet
2) FinGPT - an open-source LLM for the finance sector; it takes a data-centric approach, providing researchers & practitioners with accessible resources to develop FinLLMs. Paper, Tweet
3) Crowd Workers Widely Use Large Language Models for Text Production Tasks - estimates that 33-46% of crowd workers on MTurk used LLMs when completing a text production task. Paper, Tweet
4) Reliability of Watermarks for LLMs - watermarking is useful to detect LLM-generated text and potentially mitigate harms; this work studies the reliability of watermarking for LLMs and finds that watermarks are detectable even when the watermarked text is re-written by humans or paraphrased by another non-watermarked LLM. Paper, Tweet
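A toy detector illustrates the greenlist-style scheme this line of work studies (an assumption about the general setup, not this paper's exact construction): the previous token seeds a vocabulary split, and detection counts how often emitted tokens land in their "green" half. All names and sizes here are hypothetical:

```python
import hashlib
import random

VOCAB = list(range(100))
GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def green_list(prev_token):
    # Seed a deterministic vocabulary split on the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: int(GAMMA * len(VOCAB))])

def green_fraction(tokens):
    # Detection statistic: how often each token falls in its green list.
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

# A watermarking sampler that always emits a green token scores near 1.0;
# unwatermarked or heavily paraphrased text should drift toward GAMMA.
watermarked = [0]
for _ in range(50):
    watermarked.append(min(green_list(watermarked[-1])))
print(green_fraction(watermarked))  # 1.0
```

The paper's finding is about robustness: even after human rewriting or paraphrasing by another LLM, enough green-token signal survives for statistics like this to stay detectable.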
5) Applications of Transformers - a new survey paper highlighting major applications of Transformers for deep learning tasks; includes a comprehensive list of Transformer models. Paper, Tweet
6) Benchmarking NN Training Algorithms - it’s currently challenging to properly assess the best optimizers to train neural networks; this paper presents a new benchmark, AlgoPerf, for benchmarking neural network training algorithms using realistic workloads. Paper, Tweet
7) Unifying LLMs & Knowledge Graphs - provides a roadmap for the unification of LLMs and KGs; covers how to incorporate KGs in LLM pre-training/inferencing, leverage LLMs for KG tasks such as question answering, and enhance both KGs and LLMs for bidirectional reasoning. Paper, Tweet
8) Augmenting LLMs with Long-term Memory - proposes a framework to enable LLMs to memorize long history; it’s enhanced with memory-augmented adaptation training to memorize long past context and use long-term memory for language modeling; achieves improvements on memory-augmented in-context learning over LLMs. Paper, Tweet
9) TAPIR - enables tracking any queried point on any physical surface throughout a video sequence; outperforms all baselines and facilitates fast inference on long and high-resolution videos (track points faster than real-time when using modern GPUs). Paper, Tweet
10) Mind2Web - a new dataset for evaluating generalist agents for the web; contains 2350 tasks from 137 websites over 31 domains; it enables testing generalization ability across tasks and environments, covering practical use cases on the web. Paper, Tweet

Top AI Papers of the Week (June 5 - June 11)

Paper Links
1) Tracking Everything Everywhere All at Once - proposes a test-time optimization method for estimating dense and long-range motion; enables accurate, full-length motion estimation of every pixel in a video. Paper, Tweet
2) AlphaDev - a deep reinforcement learning agent which discovers faster sorting algorithms from scratch; the algorithms outperform previously known human benchmarks and have been integrated into the LLVM C++ library. Paper, Tweet
3) Sparse-Quantized Representation - a new compressed format and quantization technique that enables near-lossless compression of LLMs across model scales; “allows LLM inference at 4.75 bits with a 15% speedup”. Paper, Tweet
4) MusicGen - a simple and controllable model for music generation built on top of a single-stage transformer LM together with efficient token interleaving patterns; it can be conditioned on textual descriptions or melodic features and shows high performance on a standard text-to-music benchmark. Paper, Tweet
5) Augmenting LLMs with Databases - combines an LLM with a set of SQL databases, enabling a symbolic memory framework; completes tasks via LLM generating SQL instructions that manipulate the DB autonomously. Paper, Tweet
6) Concept Scrubbing in LLM - presents a method called LEAst-squares Concept Erasure (LEACE) to erase target concept information from every layer in a neural network; it’s used for reducing gender bias in BERT embeddings. Paper, Tweet
7) Fine-Grained RLHF - trains LMs with fine-grained human feedback; instead of using overall preference, more explicit feedback is provided at the segment level which helps to improve efficacy on long-form question answering, reduce toxicity, and enables LM customization. Paper, Tweet
8) Hierarchical Vision Transformer - pretrains vision transformers with a visual pretext task (MAE), while removing unnecessary components from a state-of-the-art multi-stage vision transformer; this enables a simple hierarchical vision transformer that’s more accurate and faster at inference and during training. Paper, Tweet
9) Humor in ChatGPT - explores ChatGPT’s capabilities to grasp and reproduce humor; finds that over 90% of 1008 generated jokes were the same 25 jokes and that ChatGPT is also overfitted to a particular joke structure. Paper, Tweet
10) Imitating Reasoning Process of Larger LLMs - develops a 13B parameter model that learns to imitate the reasoning process of large foundational models like GPT-4; it leverages large-scale and diverse imitation data and surpasses instruction-tuned models such as Vicuna-13B in zero-shot reasoning. Paper, Tweet

Top AI Papers of the Week (May 29-June 4)

Paper Links
1) Let’s Verify Step by Step - achieves state-of-the-art mathematical problem solving by rewarding each correct step of reasoning in a chain-of-thought instead of rewarding the final answer; the model solves 78% of problems from a representative subset of the MATH test set. Paper, Tweet
2) No Positional Encodings - shows that explicit position embeddings are not essential for decoder-only Transformers; shows that other positional encoding methods like ALiBi and Rotary are not well suited for length generalization. Paper, Tweet
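For context, ALiBi (one of the schemes this paper examines) replaces position embeddings with a per-head linear distance penalty on attention scores. A small numpy sketch using the commonly used geometric slope schedule (slope choice here is an assumption, not this paper's contribution):

```python
import numpy as np

# ALiBi: subtract m_h * |i - j| from head h's attention logits instead of
# adding position embeddings to the inputs.
def alibi_bias(seq_len, num_heads):
    # Head-specific slopes: geometric sequence 2^(-8h/num_heads), a common choice.
    slopes = 2.0 ** (-8.0 / num_heads * np.arange(1, num_heads + 1))
    dist = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -slopes[:, None, None] * dist   # (heads, seq, seq), added to logits

bias = alibi_bias(seq_len=4, num_heads=2)
print(bias.shape)  # (2, 4, 4)
# Farther keys receive larger penalties; the diagonal carries no penalty.
```

The relevance to length generalization is that the penalty is defined for any distance, yet the paper finds this alone does not make models extrapolate well to longer sequences.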
3) BiomedGPT - a unified biomedical generative pretrained transformer model for vision, language, and multimodal tasks. Achieves state-of-the-art performance across 5 distinct tasks with 20 public datasets spanning over 15 unique biomedical modalities. Paper, Tweet
4) Thought Cloning - introduces an imitation learning framework to learn to think while acting; the idea is not only to clone the behaviors of human demonstrators but also the thoughts humans have when performing behaviors. Paper, Tweet
5) Fine-Tuning Language Models with Just Forward Passes - proposes a memory-efficient zeroth-order optimizer and a corresponding SGD algorithm to finetune large LMs with the same memory footprint as inference. Paper, Tweet
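The core zeroth-order idea can be shown on a toy objective (a heavily simplified sketch, not the paper's memory-efficient implementation; the quadratic "loss" stands in for a model's forward pass):

```python
import numpy as np

# Estimate a gradient from two forward passes that share one random
# perturbation (SPSA-style), so no backward pass or activation storage is needed.
rng = np.random.default_rng(0)

def loss(theta):
    return float(np.sum((theta - 1.0) ** 2))  # minimizer at all-ones

theta = np.zeros(5)
lr, eps = 0.02, 1e-3
for _ in range(2000):
    z = rng.standard_normal(5)                                       # shared perturbation
    g = (loss(theta + eps * z) - loss(theta - eps * z)) / (2 * eps)  # projected gradient
    theta -= lr * g * z                                              # update along z
print(loss(theta))  # near 0: theta has moved to the minimizer
```

The memory story follows from the structure: only the scalar `g` and the perturbation (regenerable from a seed) are needed per step, which is how the paper matches inference-time memory.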
6) MERT - an acoustic music understanding model with large-scale self-supervised training; it incorporates a superior combination of teacher models to outperform conventional speech and audio approaches. Paper, Tweet
7) Bytes Are All You Need - investigates performing classification directly on file bytes, without needing to decode files at inference time; achieves ImageNet Top-1 accuracy of 77.33% using a transformer backbone; achieves 95.42% accuracy when operating on WAV files from the Speech Commands v2 dataset. Paper, Tweet
8) Direct Preference Optimization - while helpful to train safe and useful LLMs, the RLHF process can be complex and often unstable; this work proposes an approach to finetune LMs by solving a classification problem on the human preferences data, with no RL required. Paper, Tweet
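The DPO objective reduces preference learning to a classification loss; a minimal sketch of the per-example loss (sequence-level log-probabilities and the beta value here are illustrative placeholders):

```python
import math

# DPO loss on one preference pair: a logistic loss on the scaled difference of
# policy-vs-reference log-ratios for the chosen and rejected responses.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the chosen response more than the frozen
# reference does, the margin is positive and the loss falls below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=0.1))
```

Because the reward is implicit in the log-ratio, no separate reward model or RL loop (as in PPO-based RLHF) is needed; training is ordinary gradient descent on this loss.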
9) SQL-PaLM - an LLM-based Text-to-SQL model adapted from PaLM-2; achieves SoTA in both in-context learning and fine-tuning settings; the few-shot model outperforms the previous fine-tuned SoTA by 3.8% on the Spider benchmark; few-shot SQL-PaLM also outperforms few-shot GPT-4 by 9.9%, using a simple prompting approach. Paper, Tweet
10) CodeTF - an open-source Transformer library for state-of-the-art code LLMs; supports pretrained code LLMs and popular code benchmarks, including standard methods to train and serve code LLMs efficiently. Paper, Tweet

Top AI Papers of the Week (May 22-28)

Paper Links
1) QLoRA - an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning performance. Paper, Tweet
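The memory savings come from quantizing the frozen base weights to 4 bits while training only LoRA adapters. A toy absmax blockwise quantizer conveys the idea (QLoRA's actual NF4 datatype and double quantization are more involved; block size here is a hypothetical choice):

```python
import numpy as np

# Blockwise 4-bit absmax quantization: each block stores int values in [-7, 7]
# plus one float scale, cutting weight memory roughly 4x vs fp16.
def quantize_4bit(w, block=64):
    w = w.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
q, scale = quantize_4bit(w.reshape(-1))
w_hat = dequantize(q, scale).reshape(4, 64)
print(np.abs(w - w_hat).max())  # small per-block reconstruction error
```

During finetuning, the quantized base weights are dequantized on the fly for the forward pass while gradients flow only into the small LoRA matrices, which is what fits a 65B model on one 48GB GPU.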
2) LIMA - a new 65B parameter LLaMa model fine-tuned on 1000 carefully curated prompts and responses; it doesn't use RLHF, generalizes well to unseen tasks not available in the training data, and generates responses equivalent to or preferred over GPT-4's in 43% of cases, with an even higher rate compared to Bard. Paper, Tweet
3) Voyager - an LLM-powered embodied lifelong learning agent in Minecraft that can continuously explore worlds, acquire skills, and make novel discoveries without human intervention. Paper, Tweet
4) Gorilla - a finetuned LLaMA-based model that surpasses GPT-4 on writing API calls. This capability can help identify the right API, boosting the ability of LLMs to interact with external tools to complete specific tasks. Paper, Tweet
5) The False Promise of Imitating Proprietary LLMs - provides a critical analysis of models that are finetuned on the outputs of a stronger model; argues that model imitation is a false promise and that the higher-leverage action for improving open-source models is to develop better base models. Paper, Tweet
6) Sophia - presents a simple scalable second-order optimizer that has negligible average per-step time and memory overhead; on language modeling, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time. Paper, Tweet
7) The Larger They Are, the Harder They Fail - shows that LLMs fail to generate correct Python code when default function names are swapped; also shows that larger models strongly prefer the incorrect continuations. Paper, Tweet
8) Model Evaluation for Extreme Risks - discusses the importance of model evaluation for addressing extreme risks and making responsible decisions about model training, deployment, and security. Paper, Tweet
9) LLM Research Directions - discusses a list of research directions for students looking to do research with LLMs. Paper, Tweet
10) Reinventing RNNs for the Transformer Era - proposes an approach that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs; results show that the method performs on par with similarly sized Transformers. Paper, Tweet

Top AI Papers of the Week (May 15-21)

Paper Links
1) Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold - an approach for controlling GANs that allows dragging points of the image to precisely reach target points in a user-interactive manner. Paper, Tweet
2) Evidence of Meaning in Language Models Trained on Programs - argues that language models can learn meaning despite being trained only to perform next token prediction on text. Paper, Tweet
3) Towards Expert-Level Medical Question Answering with Large Language Models - a top-performing LLM for medical question answering; scored up to 86.5% on the MedQA dataset (a new state-of-the-art); approaches or exceeds SoTA across MedMCQA, PubMedQA, and MMLU clinical topics datasets. Paper, Tweet
4) MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers - a multi-scale decoder architecture enabling end-to-end modeling of sequences of over one million bytes; enables sub-quadratic self-attention and improved parallelism during decoding. Paper, Tweet
5) StructGPT: A General Framework for Large Language Model to Reason over Structured Data - improves the zero-shot reasoning ability of LLMs over structured data; effective for solving question answering tasks based on structured data. Paper, Tweet
6) TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - uses a synthetic dataset of short stories to train and evaluate LMs that are much smaller than SoTA models but can produce fluent and consistent stories with several paragraphs, and demonstrate reasoning capabilities. Paper, Tweet
7) DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining - trains a small proxy model over domains to produce domain weights without knowledge of downstream tasks; it then resamples a dataset with the domain weights and trains a larger model; this enables using a 280M proxy model to train an 8B model (30x larger) more efficiently. Paper, Tweet
8) CodeT5+: Open Code Large Language Models for Code Understanding and Generation - supports a wide range of code understanding and generation tasks and different training methods to improve efficacy and computing efficiency; tested on 20 code-related benchmarks using different settings like zero-shot, fine-tuning, and instruction tuning; achieves SoTA on tasks like code completion, math programming, and text-to-code retrieval tasks. Paper, Tweet
9) Symbol tuning improves in-context learning in language models - an approach to finetune LMs on in-context input-label pairs where natural language labels are replaced by arbitrary symbols; boosts performance on unseen in-context learning tasks and algorithmic reasoning tasks. Paper, Tweet
10) Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability - shows that PaLM is exposed to over 30 million translation pairs across at least 44 languages; shows that incidental bilingualism connects to the translation capabilities of PaLM. Paper, Tweet

Top AI Papers of the Week (May 8-14)

Paper Links
1) LLM explains neurons in LLMs - applies GPT-4 to automatically write explanations on the behavior of neurons in LLMs and even score those explanations; this offers a promising way to improve interpretability in future LLMs and potentially detect alignment and safety problems. Paper, Tweet
2) PaLM 2 - a new state-of-the-art language model integrated into AI features and tools like Bard and the PaLM API; displays competitive performance in mathematical reasoning compared to GPT-4; instruction-tuned model, Flan-PaLM 2, shows good performance on benchmarks like MMLU and BIG-bench Hard. Paper, Tweet
3) ImageBind - an approach that learns joint embedding data across six modalities at once; extends zero-shot capabilities to new modalities and enables emergent applications including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation. Paper, Tweet
4) TidyBot - shows that robots can combine language-based planning and perception with the few-shot summarization capabilities of LLMs to infer generalized user preferences that are applicable to future interactions. Paper, Tweet
5) Unfaithful Explanations in Chain-of-Thought Prompting - demonstrates that CoT explanations can misrepresent the true reason for a model’s prediction; when models are biased towards incorrect answers, CoT generates explanations supporting those answers. Paper, Tweet
6) InstructBLIP - explores visual-language instruction tuning based on the pre-trained BLIP-2 models; achieves state-of-the-art zero-shot performance on 13 held-out datasets, outperforming BLIP-2 and Flamingo. Paper, Tweet
7) Active Retrieval Augmented LLMs - introduces FLARE, retrieval augmented generation to improve the reliability of LLMs; FLARE actively decides when and what to retrieve across the course of the generation; demonstrates superior or competitive performance on long-form knowledge-intensive generation tasks. Paper, Tweet
8) FrugalGPT - presents strategies to reduce the inference cost associated with using LLMs while improving performance. Paper, Tweet
9) StarCoder - an open-access 15.5B parameter LLM with 8K context length and is trained on large amounts of code spanning 80+ programming languages. Paper, Tweet
10) MultiModal-GPT - a vision and language model for multi-round dialogue with humans; the model is fine-tuned from OpenFlamingo, with LoRA added in the cross-attention and self-attention parts of the language model. Paper, Tweet

Top AI Papers of the Week (May 1-7)

Paper Links
1) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI - a foundation large language model pretrained on 10 million cells for single-cell biology. Paper, Tweet
2) GPTutor: a ChatGPT-powered programming tool for code explanation - a ChatGPT-powered tool for code explanation provided as a VSCode extension; claims to deliver more concise and accurate explanations than vanilla ChatGPT and Copilot; performance and personalization enhanced via prompt engineering; programmed to use more relevant code in its prompts. Paper, Tweet
3) Shap-E: Generating Conditional 3D Implicit Functions - a conditional generative model for 3D assets; unlike previous 3D generative models, this model generates implicit functions that enable rendering textured meshes and neural radiance fields. Paper, Tweet
4) Are Emergent Abilities of Large Language Models a Mirage? - presents an alternative explanation of the emergent abilities of LLMs; suggests that existing claims are creations of the researchers' analyses and not fundamental changes in model behavior on specific tasks with scale. Paper, Tweet
5) Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl - releases PySR, an open-source library for practical symbolic regression for the sciences; it’s built on a high-performance distributed back-end and interfaces with several deep learning packages; in addition, a new benchmark, “EmpiricalBench”, is released to quantify applicability of symbolic regression algorithms in science. Paper, Tweet
6) PMC-LLaMA: Further Finetuning LLaMA on Medical Papers - a LLaMA model fine-tuned on 4.8 million medical papers; enhances capabilities in the medical domain and achieves high performance on biomedical QA benchmarks. Paper, Tweet
7) Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes - a mechanism that extracts rationales from LLMs and uses them to train smaller models that outperform much larger LLMs while needing less training data than standard finetuning or distillation. Paper, Tweet
8) Poisoning Language Models During Instruction Tuning - show that adversaries can poison LLMs during instruction tuning by contributing poison examples to datasets; it can induce degenerate outputs across different held-out tasks. Paper, Tweet
9) Unlimiformer: Long-Range Transformers with Unlimited Length Input - proposes long-range transformers with unlimited length input by augmenting pre-trained encoder-decoder transformer with external datastore to support unlimited length input; shows usefulness in long-document summarization; could potentially be used to improve the performance of retrieval-enhanced LLMs. Paper, Tweet
10) Learning to Reason and Memorize with Self-Notes - an approach that enables LLMs to reason and memorize by deviating from the input sequence at any time to explicitly “think”; this lets the LM recall information and perform reasoning on the fly; experiments show that this method scales better to longer sequences unseen during training. Paper, Tweet
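The metric argument in item 4 can be made concrete with a toy calculation: if per-token accuracy p improves smoothly with scale, exact-match accuracy on a 10-token answer is p**10, which looks like a sudden "emergent" jump under the harsher metric. The numbers below are illustrative, not from the paper.

```python
# Smooth per-token improvement across model scales (illustrative values).
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

# Exact match on a 10-token answer requires all 10 tokens to be right.
exact_match = [p ** 10 for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token {p:.2f} -> exact-match {em:.4f}")
```

Under exact match, the first few scales score near zero while the last jumps past 0.5, even though the underlying capability grew linearly.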

Top AI Papers of the Week (April 24 - April 30)

Paper Links
1) Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning - applies deep reinforcement learning to synthesize agile soccer skills for a miniature humanoid robot; the resulting policy allows dynamic movement skills such as fast recovery, walking, and kicking. Paper, Tweet
2) Scaling Transformer to 1M tokens and beyond with RMT - leverages a recurrent memory transformer architecture to increase BERT’s effective context length to two million tokens while maintaining high memory retrieval accuracy. Paper, Tweet
3) Track Anything: Segment Anything Meets Videos - an interactive tool for video object tracking and segmentation; it’s built on top of Segment Anything and allows flexible tracking and segmenting via user clicks. Paper, Tweet
4) A Cookbook of Self-Supervised Learning - provides an overview of fundamental techniques and key concepts in SSL; it also introduces practical considerations for implementing SSL methods successfully. Paper, Tweet
5) Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond - a comprehensive and practical guide for practitioners working with LLMs; discusses many use cases with practical applications and limitations of LLMs in real-world scenarios. Paper, Tweet
6) AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head - connects ChatGPT with audio foundation models to handle challenging audio tasks and a modality transformation interface to enable spoken dialogue. Paper, Tweet
7) DataComp: In search of the next generation of multimodal datasets - releases a new multimodal dataset benchmark containing 12.8B image-text pairs. Paper, Tweet
8) ChatGPT for Information Extraction - provides a deeper assessment of ChatGPT's performance on the important information extraction task. Paper, Tweet
9) Comparing Physician vs ChatGPT - investigates if chatbot assistants like ChatGPT can provide responses to patient questions while emphasizing quality and empathy; finds that chatbot responses were preferred over physician responses and rated significantly higher in terms of both quality and empathy. Paper, Tweet
10) Stable and low-precision training for large-scale vision-language models - introduces methods for accelerating and stabilizing training of large-scale language vision models. Paper, Tweet
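The recurrent-memory scheme behind item 2 (RMT) splits a long input into segments and carries a small memory between them. The control flow can be sketched as below; `segment_step` is a stand-in for the real transformer forward pass, which reads memory tokens prepended to each segment and emits updated ones.

```python
def rmt_process(tokens, segment_len, memory, segment_step):
    """Feed a long token sequence through a segment-level model, carrying a
    small recurrent memory between segments (toy sketch of the RMT pattern)."""
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        memory, out = segment_step(memory, segment)  # model sees memory + segment
        outputs.extend(out)
    return memory, outputs
```

Because each forward pass only ever sees one segment plus the memory, the effective context can grow far beyond the model's native window.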

Top AI Papers of the Week (April 17 - April 23)

Paper Links
1) DINOv2: Learning Robust Visual Features without Supervision - a new method for training high-performance computer vision models based on self-supervised learning; enables learning rich and robust visual features without supervision which are useful for both image-level visual tasks and pixel-level tasks; tasks supported include image classification, instance retrieval, video understanding, depth estimation, and much more. Paper, Tweet
2) Learning to Compress Prompts with Gist Tokens - an approach that trains language models to compress prompts into gist tokens reused for compute efficiency; this approach enables 26x compression of prompts, resulting in up to 40% FLOPs reductions. Paper, Tweet
3) Scaling the leading accuracy of deep equivariant models to biomolecular simulations of realistic size - presents a framework for large-scale biomolecular simulation; this is achieved through the high accuracy of equivariant deep learning and the ability to scale to large and long simulations; the system is able to “perform nanoseconds-long stable simulations of protein dynamics and scale up to a 44-million atom structure of a complete, all-atom, explicitly solvated HIV capsid on the Perlmutter supercomputer.” Paper, Tweet
4) Evaluating Verifiability in Generative Search Engines - performs human evaluation to audit popular generative search engines such as Bing Chat, Perplexity AI, and NeevaAI; finds that, on average, only 52% of generated sentences are supported by citations and 75% of citations support their associated sentence. Paper, Tweet
5) Generative Disco: Text-to-Video Generation for Music Visualization - an AI system based on LLMs and text-to-image models that generates music visualizations. Paper, Tweet
6) Architectures of Topological Deep Learning: A Survey on Topological Neural Networks Paper, Tweet
7) Visual Instruction Tuning - presents an approach that uses language-only GPT-4 to generate multimodal language-image instruction-following data; applies instruction tuning with the data and introduces LLaVA, an end-to-end trained large multimodal model for general-purpose visual and language understanding. Paper, Tweet
8) ChatGPT: Applications, Opportunities, and Threats Paper, Tweet
9) Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models - a plug-and-play compositional reasoning framework that augments LLMs and can infer the appropriate sequence of tools to compose and execute in order to generate final responses; achieves 87% accuracy on ScienceQA and 99% on TabMWP. Paper, Tweet
10) Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models - applies latent diffusion models to high-resolution video generation; validates the model on creative content creation and real driving videos of 512 x 1024 and achieves state-of-the-art performance. Paper, Tweet
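The gist-token compression in item 2 rests on an attention-masking trick: tokens after the gist span may not attend back to the raw prompt, forcing the prompt's content to be squeezed into the gist slots during training. A minimal sketch of such a mask, under my reading of the paper:

```python
def gist_mask(n_prompt, n_gist, n_rest):
    """Boolean attention mask (True = may attend) over a sequence laid out as
    [prompt tokens][gist tokens][rest of input]. Illustrative sketch only."""
    n = n_prompt + n_gist + n_rest
    mask = [[j <= i for j in range(n)] for i in range(n)]  # standard causal mask
    for i in range(n_prompt + n_gist, n):                  # every token after the gists...
        for j in range(n_prompt):                          # ...is cut off from the prompt
            mask[i][j] = False
    return mask
```

At inference time the prompt can then be dropped entirely and replaced by the cached gist activations, which is where the FLOPs savings come from.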

Top AI Papers of the Week (April 10 - April 16)

Paper Links
1) Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields - combines mip-NeRF 360 with grid-based models to produce improved NeRFs that train 22x faster than mip-NeRF 360. Paper, Tweet
2) Generative Agents: Interactive Simulacra of Human Behavior - proposes an architecture that extends LLMs to build agents that enable simulations of human-like behavior; these capabilities are possible by storing a complete record of an agent's experiences, synthesizing memories over time into higher-level reflections, and retrieving them dynamically to plan behavior. Paper, Tweet
3) Emergent autonomous scientific research capabilities of large language models - presents an agent that combines LLMs for autonomous design, planning, and execution of scientific experiments; shows emergent scientific research capabilities, including the successful performance of catalyzed cross-coupling reactions. Paper, Tweet
4) Automatic Gradient Descent: Deep Learning without Hyperparameters - derives optimization algorithms that explicitly leverage neural architecture; it proposes a first-order optimizer without hyperparameters that trains CNNs at ImageNet scale. Paper, Tweet
5) ChemCrow: Augmenting large-language models with chemistry tools - presents an LLM chemistry agent that performs tasks across synthesis, drug discovery, and materials design; it integrates 13 expert-designed tools to augment LLM performance in chemistry and demonstrates effectiveness in automating chemical tasks. Paper, Tweet
6) One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era - a complete survey of ChatGPT and GPT-4. Paper, Tweet
7) OpenAGI: When LLM Meets Domain Experts - an open-source research platform to facilitate the development and evaluation of LLMs in solving complex, multi-step tasks through manipulating various domain expert models. Paper, Tweet
8) AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models - a new benchmark to assess foundational models in the context of human-centric standardized exams, including college entrance exams, law school admission tests, and math competitions, among others. Paper, Tweet
9) Teaching Large Language Models to Self-Debug - proposes an approach that teaches LLMs to debug their predicted program via few-shot demonstrations; this allows a model to identify its mistakes by explaining generated code in natural language; achieves SoTA on several code generation tasks like text-to-SQL generation. Paper, Tweet
10) Segment Everything Everywhere All at Once - a promptable, interactive model for various segmentation tasks that yields competitive performance on open-vocabulary and interactive segmentation benchmarks. Paper, Tweet
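The agent architecture in item 2 retrieves memories by scoring them on recency, importance, and relevance. A minimal sketch of that scoring, assuming a weighted sum with exponential recency decay; the weights and decay factor here are illustrative, not the paper's tuned values.

```python
import math

def memory_score(hours_since_access, importance, relevance,
                 decay=0.995, w=(1.0, 1.0, 1.0)):
    """Rank a memory by recency (exponential decay), importance, relevance."""
    recency = decay ** hours_since_access
    return w[0] * recency + w[1] * importance + w[2] * relevance

def retrieve(memories, query_relevance, k=3):
    """memories: dicts with 'hours' and 'importance'; query_relevance(m) -> float."""
    scored = sorted(memories,
                    key=lambda m: memory_score(m["hours"], m["importance"],
                                               query_relevance(m)),
                    reverse=True)
    return scored[:k]
```

Retrieved memories are what the paper feeds back into the LLM prompt to plan the agent's next action.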

Top AI Papers of the Week (April 3 - April 9)

Paper Links
1) Segment Anything - presents a set of resources to establish foundational models for image segmentation; releases the largest segmentation dataset with over 1 billion masks on 11M licensed images; the model’s zero-shot performance is competitive with or even superior to fully supervised results. Paper, Tweet
2) Instruction Tuning with GPT-4 - presents GPT-4-LLM, a "first attempt" to use GPT-4 to generate instruction-following data for LLM fine-tuning; the dataset is released and includes 52K unique English and Chinese instruction-following data; the dataset is used to instruction-tune LLaMA models which leads to superior zero-shot performance on new tasks. Paper, Tweet
3) Eight Things to Know about Large Language Models - discusses important considerations regarding the capabilities and limitations of LLMs. Paper, Tweet
4) A Survey of Large Language Models - a new 50-page survey on large language models. Paper, Tweet
5) Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data - an open-source chat model fine-tuned with LoRA; leverages 100K dialogs generated by letting ChatGPT chat with itself; releases the dialogs along with 7B, 13B, and 30B parameter models. Paper, Tweet
6) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark - a new benchmark of 134 text-based Choose-Your-Own-Adventure games to evaluate the capabilities and unethical behaviors of LLMs. Paper, Tweet
7) Better Language Models of Code through Self-Improvement - generates pseudo data from knowledge gained through pre-training and fine-tuning and adds it to the training dataset for the next step; results show that this self-improvement boosts the performance of several frameworks on code-related generation tasks. Paper, Tweet
8) Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models - an overview of applications of ChatGPT and GPT-4; the analysis is done on 194 relevant papers and discusses capabilities, limitations, concerns, and more. Paper, Tweet
9) Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - a suite for analyzing LLMs across training and scaling; includes 16 LLMs trained on public data and ranging in size from 70M to 12B parameters. Paper, Tweet
10) SegGPT: Segmenting Everything In Context - unifies segmentation tasks into a generalist model through an in-context framework that supports different kinds of data. Paper, Tweet

Top AI Papers of the Week (Mar 27 - April 2)

Paper Links
1) BloombergGPT: A Large Language Model for Finance - a new 50B parameter large language model for finance; claims the largest domain-specific dataset yet with 363 billion tokens, further augmented with 345 billion tokens from general-purpose datasets; outperforms existing models on financial tasks while not sacrificing performance on general LLM benchmarks. Paper, Tweet
2) Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware - a low-cost system that performs end-to-end imitation learning from real demonstrations; also presents an algorithm called Action Chunking with Transformers to learn a generative model that allows a robot to learn difficult tasks in the real world. Paper, Tweet
3) HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace - a system that leverages LLMs like ChatGPT to conduct task planning, select models and act as a controller to execute subtasks and summarize responses according to execution results. Paper, Tweet
4) ChatDoctor: A Medical Chat Model Fine-tuned on LLaMA Model using Medical Domain Knowledge - a medical chat model fine-tuned on LLaMA using medical domain knowledge; collects data on around 700 diseases and generates 5K doctor-patient conversations used to fine-tune the LLM. Paper, Tweet
5) LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention - a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model; generates responses comparable to those of the fully fine-tuned 7B-parameter Alpaca; it’s also extended for multi-modal input support. Paper, Tweet
6) ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks - demonstrates that ChatGPT can outperform crowd-workers on several annotation tasks such as relevance, topic, and frame detection; besides better zero-shot accuracy, the per-annotation cost of ChatGPT is about 20 times cheaper than MTurk. Paper, Tweet
7) Language Models can Solve Computer Tasks - shows that a pre-trained LLM agent can execute computer tasks using a simple prompting scheme where the agent recursively criticizes and improves its outputs. Paper, Tweet
8) DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents - a paradigm to enhance large language model completions by allowing models to communicate feedback and iteratively improve output; DERA outperforms base GPT-4 on clinically-focused tasks. Paper, Tweet
9) Natural Selection Favors AIs over Humans - discusses why AI systems will become more fit than humans and the potential dangers and risks involved, including ways to mitigate them. Paper, Tweet
10) Machine Learning for Partial Differential Equations - a review examining avenues of partial differential equation research advanced by machine learning. Paper, Tweet
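The recursive "criticize and improve" scheme in item 7 can be sketched as a short prompting loop. This is a hedged sketch of the pattern, not the paper's exact prompts; `lm` is a hypothetical text-in/text-out model callable.

```python
def rci(lm, task, rounds=2):
    """Draft an answer, then repeatedly critique and revise it (RCI pattern)."""
    output = lm(f"Task: {task}\nAnswer:")
    for _ in range(rounds):
        critique = lm(f"Task: {task}\nAnswer: {output}\n"
                      f"Find problems with this answer:")
        output = lm(f"Task: {task}\nAnswer: {output}\n"
                    f"Problems: {critique}\nImproved answer:")
    return output
```

The notable finding is that this simple self-critique loop alone, with no task-specific training, is enough for a pre-trained LLM to execute computer tasks.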

Top AI Papers of the Week (Mar 20-Mar 26)

Paper Links
1) Sparks of Artificial General Intelligence: Early experiments with GPT-4 - a comprehensive investigation of an early version of GPT-4 when it was still in active development by OpenAI. Paper, Tweet
2) Reflexion: an autonomous agent with dynamic memory and self-reflection - proposes an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities. Paper, Tweet
3) Capabilities of GPT-4 on Medical Challenge Problems - shows that GPT-4 exceeds the passing score on USMLE by over 20 points and outperforms GPT-3.5 as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). Paper, Tweet
4) GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models - investigates the potential implications of GPT models and related systems on the US labor market. Paper, Tweet
5) CoLT5: Faster Long-Range Transformers with Conditional Computation - a long-input Transformer model that employs conditional computation, devoting more resources to important tokens in both feedforward and attention layers. Paper, Tweet
6) Artificial muses: Generative Artificial Intelligence Chatbots Have Risen to Human-Level Creativity - compares human-generated ideas with those generated by generative AI chatbots like ChatGPT and YouChat; reports that 9.4% of humans were more creative than GPT-4 and that GAIs are valuable assistants in the creative process. Paper, Tweet
7) A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models - a comprehensive capability analysis of GPT series models; evaluates performance on 9 natural language understanding tasks using 21 datasets. Paper, Tweet
8) Context-faithful Prompting for Large Language Models - presents a prompting technique that aims to improve LLMs' faithfulness using strategies such as opinion-based prompts and counterfactual demonstrations. Paper, Tweet
9) Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models - a method for extracting room-scale textured 3D meshes from 2D text-to-image models. Paper, Project, Tweet
10) PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing - a trillion parameter language model with sparse heterogeneous computing. Paper, Tweet

Top AI Papers of the Week (Mar 13-Mar 19)

Paper Links
1) GPT-4 Technical Report - introduces GPT-4, a large multimodal model with broader general knowledge and problem-solving abilities. Paper, Tweet
2) LERF: Language Embedded Radiance Fields - a method for grounding language embeddings from models like CLIP into NeRF; this enables open-ended language queries in 3D. Paper, Tweet
3) An Overview on Language Models: Recent Developments and Outlook - an overview of language models covering recent developments and future directions. It also covers topics like linguistic units, structures, training methods, evaluation, and applications. Paper, Tweet
4) Eliciting Latent Predictions from Transformers with the Tuned Lens - a method for transformer interpretability that can trace a language model's predictions as they develop layer by layer. Paper, Tweet
5) Meet in the Middle: A New Pre-training Paradigm - a new pre-training paradigm using techniques that jointly improve training data efficiency and capabilities of LMs in the infilling task; performance improvement is shown in code generation tasks. Paper, Tweet
6) Resurrecting Recurrent Neural Networks for Long Sequences - demonstrates that careful design of deep RNNs using standard signal propagation arguments can recover the performance of deep state-space models on long-range reasoning tasks. Paper, Tweet
7) UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation - a new approach to tune a lightweight and versatile retriever to automatically retrieve prompts to improve zero-shot performance and help mitigate hallucinations. Paper, Tweet
8) Patches Are All You Need? - proposes ConvMixer, a parameter-efficient fully-convolutional model which replaces self-attention and MLP layers in ViTs with less-expressive depthwise and pointwise convolutional layers. Paper, Tweet
9) NeRFMeshing: Distilling Neural Radiance Fields into Geometrically-Accurate 3D Meshes - a compact and flexible architecture that enables easy 3D surface reconstruction from any NeRF-driven approach; distills NeRFs into geometrically-accurate 3D meshes. Paper, Tweet
10) High-throughput Generative Inference of Large Language Models with a Single GPU - a high-throughput generation engine for running LLMs with limited GPU memory. Paper, Code, Tweet
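The core move behind item 4 is decoding intermediate hidden states into vocabulary distributions to watch a prediction form layer by layer. Here is a toy sketch of the simpler "logit lens" variant (applying the final unembedding directly); the tuned lens of the paper instead trains a small affine probe per layer.

```python
import numpy as np

def lens(hidden_states, W_unembed):
    """hidden_states: list of (d_model,) vectors, one per layer.
    W_unembed: (d_model, vocab) unembedding matrix. Returns the top
    predicted token id at each layer."""
    preds = []
    for h in hidden_states:
        logits = h @ W_unembed          # project hidden state to vocab logits
        preds.append(int(np.argmax(logits)))
    return preds
```

Watching where `preds` changes across layers indicates at which depth the model "decides" on its output token.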

Top AI Papers of the Week (Mar 6-Mar 12)

Paper Links
1) PaLM-E: An Embodied Multimodal Language Model - incorporates real-world continuous sensor modalities resulting in an embodied LM that performs tasks such as robotic manipulation planning, visual QA, and other embodied reasoning tasks. Paper, Demo, Tweet
2) Prismer: A Vision-Language Model with An Ensemble of Experts - a parameter-efficient vision-language model powered by an ensemble of domain experts; it efficiently pools expert knowledge from different domains and adapts it to various vision-language reasoning tasks. Paper, GitHub, Project, Tweet
3) Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models - connects ChatGPT with different visual foundation models to enable users to interact with ChatGPT beyond the language format. Paper, GitHub, Tweet
4) A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT - an overview of generative AI - from GAN to ChatGPT. Paper, Tweet
5) Larger language models do in-context learning differently - shows that with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also learn mappings where labels are replaced with semantically-unrelated targets. Paper, Tweet
6) Foundation Models for Decision Making: Problems, Methods, and Opportunities - provides an overview of foundation models for decision making, including tools, methods, and new research directions. Project, Tweet
7) Hyena Hierarchy: Towards Larger Convolutional Language Models - a subquadratic drop-in replacement for attention; it interleaves implicit long convolutions and data-controlled gating and can learn on sequences 10x longer and up to 100x faster than optimized attention. Paper, Code, Blog, Tweet
8) OpenICL: An Open-Source Framework for In-context Learning - a new open-source toolkit for in-context learning and LLM evaluation; supports various state-of-the-art retrieval and inference methods, tasks, and zero-/few-shot evaluation of LLMs. Paper, Repo, Tweet
9) MathPrompter: Mathematical Reasoning using Large Language Models - a technique that improves LLM performance on mathematical reasoning problems; it uses zero-shot chain-of-thought prompting and verification to ensure generated answers are accurate. Paper, Tweet
10) Scaling up GANs for Text-to-Image Synthesis - enables scaling up GANs on large datasets for text-to-image synthesis; it’s found to be orders of magnitude faster at inference time, synthesizes high-resolution images, & supports various latent space editing applications. Paper, Project, Tweet

Top AI Papers of the Week (Feb 27-Mar 5)

Paper Links
1) Language Is Not All You Need: Aligning Perception with Language Models - introduces a multimodal large language model called Kosmos-1; achieves great performance on language understanding, OCR-free NLP, perception-language tasks, visual QA, and more. Paper, Tweet
2) Evidence of a predictive coding hierarchy in the human brain listening to speech - finds that human brain activity is best explained by the activations of modern language models enhanced with long-range and hierarchical predictions. Paper, Tweet
3) EvoPrompting: Language Models for Code-Level Neural Architecture Search - combines evolutionary prompt engineering with soft prompt-tuning to find high-performing models; it leverages few-shot prompting which is further improved by using an evolutionary search approach to improve the in-context examples. Paper, Tweet
4) Consistency Models - a new family of generative models that achieve high sample quality without adversarial training. Paper, Tweet
5) Goal Driven Discovery of Distributional Differences via Language Descriptions - a new task that automatically discovers corpus-level differences via language description in a goal-driven way; applications include discovering insights from commercial reviews and error patterns in NLP systems. Paper, Code, Tweet
6) High-resolution image reconstruction with latent diffusion models from human brain activity - proposes an approach for high-resolution image reconstruction with latent diffusion models from human brain activity. Project, Tweet
7) Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control - a scalable approach to planning with LLMs in embodied settings through grounding functions; grounded decoding (GD) is found to be a general, flexible, and expressive approach to embodied tasks. Paper, Project, Tweet
8) Language-Driven Representation Learning for Robotics - a framework for language-driven representation learning from human videos and captions for robotics. Paper, Models, Evaluation, Tweet
9) Dropout Reduces Underfitting - demonstrates that dropout can mitigate underfitting when used at the start of training; it counteracts SGD stochasticity and limits the influence of individual batches when training models. Paper, Tweet
10) Enabling Conversational Interaction with Mobile UI using Large Language Models - an approach that enables versatile conversational interactions with mobile UIs using a single LLM. Paper, Tweet
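The "early dropout" finding in item 9 amounts to a schedule: keep dropout on only for the first part of training to tame SGD noise, then turn it off so the model can fit the data fully. A minimal sketch, with an illustrative cutoff fraction (the paper tunes this per setting):

```python
def early_dropout_rate(step, total_steps, rate=0.1, cutoff=0.2):
    """Return the dropout rate for a training step: active early, zero later.
    The 20% cutoff here is an assumption for illustration."""
    return rate if step < cutoff * total_steps else 0.0
```

The symmetric "late dropout" variant from the same paper would flip the condition, enabling dropout only after the cutoff to fight overfitting instead.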

Top AI Papers of the Week (Feb 20-26)

Paper Links
1) LLaMA: Open and Efficient Foundation Language Models - a family of foundation language models (7B to 65B parameters) released by Meta AI; relies only on publicly available data, and the 13B model outperforms GPT-3 on most benchmarks despite being 10x smaller. Paper, Tweet
2) Composer: Creative and Controllable Image Synthesis with Composable Conditions - a 5B parameter creative and controllable diffusion model trained on billions of (text, image) pairs. Paper, Project, GitHub, Tweet
3) The Wisdom of Hindsight Makes Language Models Better Instruction Followers - an alternative algorithm to train LLMs from feedback; the feedback is converted to an instruction by relabeling the original one, and the model is trained in a supervised way for better alignment. Paper, GitHub, Tweet
4) Active Prompting with Chain-of-Thought for Large Language Models - a prompting technique to adapt LLMs to different task-specific example prompts (annotated with human-designed chain-of-thought reasoning); this process involves finding the questions where the LLM is most uncertain and annotating those. Paper, Code, Tweet
5) Modular Deep Learning - a survey offering a unified view of the building blocks of modular neural networks; it also includes a discussion about modularity in the context of scaling LMs, causal inference, and other key topics in ML. Paper, Project, Tweet
7) Learning Performance-Improving Code Edits - an approach that uses LLMs to suggest functionally correct, performance-improving code edits. Paper, Tweet
8) More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models - a comprehensive analysis of novel prompt injection threats to application-integrated LLMs. Paper, Tweet
9) Aligning Text-to-Image Models using Human Feedback - proposes a fine-tuning method to align generative models using human feedback. Paper, Tweet
10) MERF: Memory-Efficient Radiance Fields for Real-time View Synthesis in Unbounded Scenes - a memory-efficient radiance field representation for real-time view synthesis of large-scale scenes in a browser. Paper, Tweet
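The uncertainty step in item 4 (active prompting) can be sketched simply: sample several answers per question and rank questions by disagreement, so human chain-of-thought annotation effort goes where the model is least sure. A minimal sketch using the disagreement metric; the paper also explores entropy and variance.

```python
def disagreement(answers):
    """Fraction of distinct answers among the samples (1.0 = total disagreement)."""
    return len(set(answers)) / len(answers)

def rank_for_annotation(question_samples):
    """question_samples: {question: [sampled answers]} -> most uncertain first."""
    return sorted(question_samples,
                  key=lambda q: disagreement(question_samples[q]),
                  reverse=True)
```

Only the top-ranked questions are then given human-written chain-of-thought rationales and used as in-context exemplars.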

Top AI Papers of the Week (Feb 13 - 19)

Paper Links
1) Symbolic Discovery of Optimization Algorithms - a simple and effective optimization algorithm that’s more memory-efficient than Adam. Paper, Tweet
2) Transformer models: an introduction and catalog Paper, Tweet
3) 3D-aware Conditional Image Synthesis - a 3D-aware conditional generative model extended with neural radiance fields for controllable photorealistic image synthesis. Project, Tweet
4) The Capacity for Moral Self-Correction in Large Language Models - finds strong evidence that language models trained with RLHF have the capacity for moral self-correction; the capability emerges at 22B parameters and typically improves with scale. Paper, Tweet
5) Vision meets RL - uses reinforcement learning to align computer vision models with task rewards; observes large performance boosts across multiple CV tasks such as object detection and colorization. Paper
6) Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment - an unsupervised method for text-image alignment that leverages pretrained language models; it enables few-shot image classification with LLMs. Paper, Code, Tweet
7) Augmented Language Models: a Survey - a survey of language models that are augmented with reasoning skills and the capability to use tools. Paper, Tweet
8) Geometric Clifford Algebra Networks - an approach to incorporate geometry-guided transformations into neural networks using geometric algebra. Paper, Tweet
9) Auditing large language models: a three-layered approach - proposes a policy framework for auditing LLMs. Paper, Tweet
10) Energy Transformer - a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model; this follows the popularity that Hopfield Networks have gained in the field of ML. Paper, Tweet
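The optimizer discovered in item 1 (known as Lion) is why it is more memory-efficient than Adam: the parameter step is just the sign of an interpolated momentum, so only one state buffer is kept. A minimal NumPy sketch of the published update rule, with default hyperparameters as reported in the paper:

```python
import numpy as np

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: w = weights, g = gradient, m = momentum buffer."""
    update = np.sign(beta1 * m + (1 - beta1) * g)   # sign of interpolated momentum
    w = w - lr * (update + wd * w)                  # step + decoupled weight decay
    m = beta2 * m + (1 - beta2) * g                 # momentum tracks the raw gradient
    return w, m
```

Because every coordinate moves by exactly ±lr (plus weight decay), Lion typically needs a smaller learning rate and larger decay than Adam.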

Top AI Papers of the Week (Feb 6 - 12)

Paper Links
1) Toolformer: Language Models Can Teach Themselves to Use Tools - introduces language models that teach themselves to use external tools via simple API calls. Paper, Tweet
2) Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents - proposes using language models for open-world game playing. Paper, Tweet
3) A Categorical Archive of ChatGPT Failures - a comprehensive analysis of ChatGPT failures for categories like reasoning, factual errors, maths, and coding. Paper, Tweet
4) Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery - optimizing hard text prompts through efficient gradient-based optimization. Paper, Tweet
5) Data Selection for Language Models via Importance Resampling - proposes a cheap and scalable data selection framework based on an importance resampling algorithm to improve the downstream performance of LMs. Paper, Tweet
6) Structure and Content-Guided Video Synthesis with Diffusion Models - proposes an approach for structure- and content-guided video synthesis with diffusion models. Paper, Project, Tweet
7) A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity - performs a more rigorous evaluation of ChatGPT on reasoning, hallucination, and interactivity. Paper, Tweet
8) Noise2Music: Text-conditioned Music Generation with Diffusion Models - proposes diffusion models to generate high-quality 30-second music clips via text prompts. Paper, Project, Tweet
9) Offsite-Tuning: Transfer Learning without Full Model - introduces an efficient, privacy-preserving transfer learning framework to adapt foundational models to downstream data without access to the full model. Paper, Project, Tweet
10) Zero-shot Image-to-Image Translation - proposes a model for zero-shot image-to-image translation. Paper, Project, Tweet
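The data-selection framework in item 5 scores each raw example by how much more likely it is under a target-domain language model than under a raw-pool model, then resamples in proportion to that importance weight. The paper uses hashed n-gram features; the toy sketch below substitutes plain add-one-smoothed unigram models to keep it small.

```python
import math
import random

def unigram_logprob(text, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability of a whitespace-split text."""
    return sum(math.log((counts.get(w, 0) + 1) / (total + vocab_size))
               for w in text.split())

def select(raw_examples, p_target, p_raw, k, seed=0):
    """Resample raw examples with weight exp(log p_target - log p_raw)."""
    weights = [math.exp(p_target(x) - p_raw(x)) for x in raw_examples]
    rng = random.Random(seed)
    return rng.choices(raw_examples, weights=weights, k=k)
```

Examples that look like the target domain get weights far above 1 and dominate the resampled training set.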

Top AI Papers of the Week (Jan 30-Feb 5)

Paper Links
1) REPLUG: Retrieval-Augmented Black-Box Language Models - a retrieval-augmented LM framework that adapts a retriever to a large-scale, black-box LM like GPT-3. Paper, Tweet
2) Extracting Training Data from Diffusion Models - shows that diffusion-based generative models can memorize images from the training data and emit them at generation time. Paper, Tweet
3) The Flan Collection: Designing Data and Methods for Effective Instruction Tuning - releases a more extensive, publicly available collection of tasks, templates, and methods for advancing instruction-tuned models. Paper, Tweet
4) Multimodal Chain-of-Thought Reasoning in Language Models - incorporates vision features to elicit chain-of-thought reasoning in multimodality, enabling the model to generate effective rationales that contribute to answer inference. Paper, Code, Tweet
5) Dreamix: Video Diffusion Models are General Video Editors - a diffusion model that performs text-based motion and appearance editing of general videos. Paper, Project, Tweet
6) Benchmarking Large Language Models for News Summarization - benchmarks LLMs for news summarization via human evaluation, finding that instruction tuning, rather than model size, is key to summarization ability. Paper, Tweet
7) Mathematical Capabilities of ChatGPT - investigates the mathematical capabilities of ChatGPT on a new holistic benchmark called GHOSTS. Paper, Tweet
8) Emergence of Maps in the Memories of Blind Navigation Agents - trains an AI agent to navigate purely by feeling its way around, with no vision, audio, or other sensing, and finds that map-like spatial representations emerge in its memory. Paper, Project, Tweet
9) SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections - a generative model that synthesizes large-scale 3D landscapes from random noise. Paper, Tweet
10) Large Language Models Can Be Easily Distracted by Irrelevant Context - finds that performance across many prompting techniques drops dramatically when irrelevant context is added to arithmetic reasoning problems. Paper, Tweet
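REPLUG's core ensemble step, prepending each retrieved document to the input separately, querying the frozen LM, and averaging the output distributions under softmax-normalized retrieval scores, can be sketched as follows. The toy_lm function is a hypothetical stand-in for the black-box model:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def replug_next_token(query, docs, doc_scores, lm_probs):
    """Ensemble next-token distributions over retrieved-document prompts.

    lm_probs(prompt) -> {token: prob} stands in for the black-box LM;
    doc_scores are retrieval similarities, normalized with a softmax."""
    weights = softmax(doc_scores)
    combined = {}
    for w, doc in zip(weights, docs):
        probs = lm_probs(doc + "\n" + query)  # prepend one doc per pass
        for tok, p in probs.items():
            combined[tok] = combined.get(tok, 0.0) + w * p
    return combined

def toy_lm(prompt):
    # Hypothetical LM: more confident when the evidence is in context.
    if "Paris" in prompt:
        return {"Paris": 0.9, "Lyon": 0.1}
    return {"Paris": 0.4, "Lyon": 0.6}

docs = ["The capital of France is Paris.", "Lyon is a large city in France."]
dist = replug_next_token("Q: France's capital? A:", docs, [2.0, 0.5], toy_lm)
```

The higher-scoring document dominates the ensemble, so the combined distribution concentrates on the answer it supports.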

Top AI Papers of the Week (Jan 23-29)

Paper Links
1) MusicLM: Generating Music From Text - a generative model for generating high-fidelity music from text descriptions. Paper, Tweet
2) Hungry Hungry Hippos: Towards Language Modeling with State Space Models - an approach that narrows the gap between state space models and attention for language modeling, in both quality and hardware utilization. Paper, Tweet
3) A Watermark for Large Language Models - a watermarking framework for proprietary language models. Paper, Tweet
4) Text-To-4D Dynamic Scene Generation - a new text-to-4D model for dynamic scene generation from input text. Paper, GitHub, Tweet
5) ClimaX: A foundation model for weather and climate - a foundation model for weather and climate, including many capabilities for atmospheric science tasks. Paper, Tweet, Blog
6) Open Problems in Applied Deep Learning - a good reference for interesting open problems in DL; with its ~300 references, it is also useful for getting a general picture of current trends in deep learning. Paper, Tweet
7) DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature - an approach for zero-shot machine-generated text detection. Uses raw log probabilities from the LLM to determine if the passage was sampled from it. Paper, Tweet
8) StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis - a new model that aims to regain the competitiveness of GANs for fast large-scale text-to-image synthesis. Paper, Project, Code, Tweet
9) Large language models generate functional protein sequences across diverse families - an LLM that can generate protein sequences with a predictable function across large protein families. Paper, Tweet
10) The Impossibility of Parallelizing Boosting - investigates whether boosting can be parallelized and proves strong lower bounds showing it is inherently sequential. Paper, Tweet
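The watermarking scheme in (3) can be illustrated in miniature: a hash of the previous token pseudorandomly partitions the vocabulary into green/red lists, generation favors green tokens, and detection computes a z-score on the green-token count. The SHA-256 partition below is a stand-in for the paper's RNG seeding, and the toy sampler hard-selects green tokens rather than softly biasing logits:

```python
import hashlib
import math
import random

GAMMA = 0.25  # fraction of the vocabulary on the "green" list

def is_green(prev_token, token):
    # Pseudorandom green/red partition keyed by the previous token.
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GAMMA

def z_score(tokens):
    """Large z => far more green tokens than chance => likely watermarked."""
    t = len(tokens) - 1
    green = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (green - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

vocab = [f"tok{i}" for i in range(50)]
wm = ["start"]
for _ in range(40):  # the watermarked sampler prefers green-list tokens
    wm.append(next((t for t in vocab if is_green(wm[-1], t)), vocab[0]))

rng = random.Random(1)
plain = ["start"] + [rng.choice(vocab) for _ in range(40)]  # unwatermarked
```

Detection needs only the green/red partition, not the model itself, which is what makes the scheme practical for proprietary LMs.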

Top AI Papers of the Week (Jan 16-22)

Paper Links
1) Google AI Research Recap (2022 Edition) - an excellent summary of some notable research Google AI did in 2022. Blog, Tweet
2) Dissociating language and thought in large language models: a cognitive perspective - a review paper on the capabilities of LLMs from a cognitive science perspective. Paper, Tweet
3) Human-Timescale Adaptation in an Open-Ended Task Space - an agent trained at scale that exhibits a general in-context learning algorithm able to adapt to open-ended embodied 3D problems. Paper, Tweet
4) AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation - an approach to help provide explanations of generative transformer models through memory-efficient attention manipulation. Paper, Tweet
5) Everything is Connected: Graph Neural Networks - short overview of key concepts in graph representation learning. Paper, Tweet
6) GLIGEN: Open-Set Grounded Text-to-Image Generation - an approach that extends the functionality of existing pre-trained text-to-image diffusion models by enabling conditioning on grounding inputs. Paper, Tweet, Project
7) InstructPix2Pix: Learning to Follow Image Editing Instructions - proposes a method with the capability of editing images from human instructions. Paper, Tweet
8) Dataset Distillation: A Comprehensive Review Paper, Tweet
9) Learning-Rate-Free Learning by D-Adaptation - a new method for automatically adjusting the learning rate during training, applicable to more than a dozen diverse ML problems. Paper, Tweet
10) RecolorNeRF: Layer Decomposed Radiance Field for Efficient Color Editing of 3D Scenes - a user-friendly color editing approach for the neural radiance field to achieve a more efficient view-consistent recoloring. Paper, Tweet

Top AI Papers of the Week (Jan 9-15)

Paper Links
1) Mastering Diverse Domains through World Models - a general RL algorithm (DreamerV3) that collects diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in AI. Paper, Tweet
2) Tracr: Compiled Transformers as a Laboratory for Interpretability - a compiler for converting RASP programs into transformer weights. This way of constructing NN weights enables the development and evaluation of new interpretability tools. Paper, Tweet, Code
3) Multimodal Deep Learning - a new book on multimodal deep learning, published on arXiv. Book, Tweet
4) Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk - new work analyzing how generative LMs could potentially be misused for disinformation and how to mitigate these types of risks. Paper, Tweet
5) Why do Nearest Neighbor Language Models Work? - empirically identifies reasons why retrieval-augmented LMs (specifically k-nearest neighbor LMs) perform better than standard parametric LMs. Paper, Code, Tweet
6) Memory Augmented Large Language Models are Computationally Universal - investigates the use of existing LMs (e.g., Flan-U-PaLM 540B) combined with associative read-write memory to simulate the execution of a universal Turing machine. Paper, Tweet
7) A Survey on Transformers in Reinforcement Learning - a survey of transformers for RL, a fascinating research area to track; the reverse direction (RL for transformers) is also notable, e.g., using RLHF to improve LLMs such as ChatGPT. Paper, Tweet
8) Scaling Laws for Generative Mixed-Modal Language Models - introduces scaling laws for generative mixed-modal language models. Paper, Tweet
9) DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching - a transformer-based network showing robust local feature matching, outperforming the state-of-the-art methods on several benchmarks. Paper, Tweet
10) Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement - addresses the time series forecasting problem with generative modeling; involves a bidirectional VAE backbone equipped with diffusion, denoising for prediction accuracy, and disentanglement for model interpretability. Paper, Tweet
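For context on (5): a k-nearest-neighbor LM interpolates a retrieval-based next-token distribution (a softmax over negative distances to datastore keys) with the parametric LM's distribution. A minimal sketch with made-up vectors and probabilities:

```python
import math

def knn_distribution(query_vec, datastore, temperature=1.0):
    """Softmax over negative L2 distances to datastore keys; each key
    maps to the token that followed that context in the training data."""
    scored = [(-math.dist(query_vec, key) / temperature, tok)
              for key, tok in datastore]
    m = max(s for s, _ in scored)
    exps = [(math.exp(s - m), tok) for s, tok in scored]
    z = sum(e for e, _ in exps)
    dist = {}
    for e, tok in exps:  # tokens can repeat across datastore entries
        dist[tok] = dist.get(tok, 0.0) + e / z
    return dist

def interpolate(p_lm, p_knn, lam=0.25):
    """p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)."""
    toks = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0)
            for t in toks}

datastore = [((0.0, 0.0), "cat"), ((0.1, 0.0), "cat"), ((5.0, 5.0), "dog")]
p_knn = knn_distribution((0.05, 0.0), datastore)
p_lm = {"cat": 0.3, "dog": 0.7}  # hypothetical parametric LM output
p = interpolate(p_lm, p_knn, lam=0.5)
```

Because the query vector sits near the two "cat" keys, the retrieval component overrules the parametric LM's preference, the kind of correction the paper studies.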

Top AI Papers of the Week (Jan 1-8)

Paper Links
1) Muse: Text-To-Image Generation via Masked Generative Transformers - introduces Muse, a new text-to-image generation model based on masked generative transformers; significantly more efficient than other diffusion models like Imagen and DALLE-2. Paper, Project, Code, Tweet
2) VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - introduces VALL-E, a text-to-speech model that achieves state-of-the-art zero-shot performance; text-to-speech synthesis is treated as a conditional language modeling task. Project, Tweet
3) Rethinking with Retrieval: Faithful Large Language Model Inference - shows the potential of enhancing LLMs by retrieving relevant external knowledge based on decomposed reasoning steps obtained through chain-of-thought prompting. Paper, Tweet
4) SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot - presents a technique for compressing large language models while not sacrificing performance; "pruned to at least 50% sparsity in one-shot, without any retraining." Paper, Tweet
5) ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders - a performant model based on a fully convolutional masked autoencoder framework and other architectural improvements. CNNs are striking back! Paper, Code, Tweet
6) Large Language Models as Corporate Lobbyists - with more capabilities, we are starting to see a wider range of LLM applications; this paper uses large language models to conduct corporate lobbying activities. Paper, Code, Tweet
7) Superposition, Memorization, and Double Descent - aims to better understand how deep learning models overfit or memorize examples; interesting phenomena observed; important work toward a mechanistic theory of memorization. Paper, Tweet
8) StitchNet: Composing Neural Networks from Pre-Trained Fragments - a new idea for creating coherent neural networks by reusing pretrained fragments of existing NNs; not straightforward, but promising for efficiently reusing learned knowledge from pre-trained networks on complex tasks. Paper, Tweet
9) Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes - proposes iterated decomposition, an approach to improve Science Q&A through a human-in-the-loop workflow for refining compositional LM programs. Paper, Code, Tweet
10) A Succinct Summary of Reinforcement Learning - a nice overview of some important ideas in RL. Paper, Tweet
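For contrast with SparseGPT in (4), which uses approximate second-order information to choose and reconstruct weights, the naive one-shot baseline it improves on is plain magnitude pruning:

```python
def magnitude_prune(weights, sparsity=0.5):
    """One-shot baseline: zero out the smallest-magnitude fraction of
    weights. (No retraining and, unlike SparseGPT, no reconstruction.)"""
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else float("-inf")
    # Note: ties at the threshold may push sparsity slightly above target.
    return [[0.0 if abs(w) <= threshold else w for w in row]
            for row in weights]

W = [[0.1, -2.0, 0.03], [1.5, -0.2, 0.8]]
P = magnitude_prune(W, sparsity=0.5)  # half of the entries become zero
```

At 50% sparsity this thresholding is trivially cheap; SparseGPT's contribution is reaching that sparsity on 100B+ parameter models with negligible accuracy loss, which magnitude pruning alone cannot do.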

We use a combination of AI-powered tools, analytics, and human curation to build the lists of papers.

Subscribe to our NLP Newsletter to stay on top of ML research and trends.

Join our Discord.