# AI Papers of the Week — 2023

[← Back to main index](../README.md)

This page collects every weekly issue of **AI Papers of the Week** from 2023. For other years, see the [main index](../README.md).

---

## Top AI Papers of the Week (December 25 - December 31)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **CogAgent** - Tsinghua's CogAgent is an 18B-parameter visual-language model purpose-built for GUI understanding and navigation, with unusually high input resolution.  <br>● High-res GUI input: Supports 1120x1120 input resolution via a dedicated high-res cross-module, letting it read small fonts and dense UI elements that typical VLMs blur out.  <br>● Dual-tower vision: Combines a low-res general vision encoder with a high-res cross-module, balancing context understanding with fine-grained icon/text perception.  <br>● Broad capabilities: Handles visual Q&A, visual grounding, and end-to-end GUI agent tasks on web and desktop, positioning as a general GUI backbone.  <br>● SoTA VQA: Achieves state-of-the-art on 5 text-rich (e.g., OCR-heavy) and 4 general VQA benchmarks, covering document, chart, and scene understanding. | [Paper](https://arxiv.org/abs/2312.08914), [Tweet](https://x.com/cenyk1230/status/1739916469272789222?s=20) |
| 2) **From Gemini to Q-Star** - A 300+-paper survey mapping the state of Generative AI and the research frontiers that followed the Gemini + rumored Q* news cycle.  <br>● Broad coverage: Surveys developments across language, vision, audio, and multimodal generative systems, treating Gen AI as a unified field rather than siloed modalities.  <br>● Computational challenges: Catalogs scalability, efficiency, and alignment challenges currently gating further progress, including training compute, inference serving, and evaluation.  <br>● Real-world applications: Reviews Gen AI impact across healthcare, finance, and education, highlighting where genuine deployment signals diverge from hype.  <br>● Future directions: Identifies agent frameworks, reasoning, grounded multimodality, and alignment as the most live research areas heading into 2024. | [Paper](https://arxiv.org/abs/2312.10868), [Tweet](https://x.com/omarsar0/status/1740119485011390558?s=20) |
| 3) **PromptBench** - A unified library for comprehensive evaluation and analysis of LLMs that consolidates multiple evaluation concerns under one roof.  <br>● Prompt-construction tooling: Ships with utilities for prompt construction, prompt engineering, and dataset/model loading, covering the end-to-end LLM evaluation workflow.  <br>● Adversarial prompt attacks: Built-in adversarial prompt-attack capabilities let users stress-test LLMs against perturbations rather than just measuring clean accuracy.  <br>● Dynamic evaluation: Supports dynamic evaluation protocols to detect dataset contamination and measure robustness beyond static benchmark numbers.  <br>● Unified interface: Replaces the ad-hoc evaluation scripts many teams maintain with a consistent API, reducing friction when comparing across models and prompt variants. | [Paper](https://arxiv.org/abs/2312.07910v1), [Tweet](https://x.com/omarsar0/status/1739360426134028631?s=20) |
| 4) **Exploiting Novel GPT-4 APIs** - A red-team study of three newer GPT-4 API surfaces - fine-tuning, function calling, and knowledge retrieval - that reveals each introduces new attack vectors.  <br>● Fine-tuning strips safeguards: As few as 15 harmful examples - or even 100 benign examples - fine-tuned into GPT-4 is enough to remove core safety behaviors.  <br>● Function-call schema leakage: GPT-4 Assistants can be coerced into divulging their function-call schemas and then tricked into executing arbitrary function calls.  <br>● Retrieval hijacking: The knowledge-retrieval endpoint is vulnerable to prompt injection via documents in the retrieval corpus, letting attackers steer model behavior through uploaded content.  <br>● Policy implication: Expanding API surface area introduces alignment risks that weren't present for text-only completions, and API providers need surface-specific defenses rather than relying on base-model alignment. | [Paper](https://arxiv.org/abs/2312.14302), [Tweet](https://x.com/omarsar0/status/1739677995747450964?s=20) |
| 5) **Fact Recalling in LLMs** - A mechanistic-interpretability study showing that early MLP layers function as a lookup table for factual recall.  <br>● Athletes-to-sports task: Scoped to how Pythia 2.8B recalls which of 3 different sports various athletes play - a clean task for dissecting a single type of factual recall.  <br>● Early MLPs as lookup table: Early MLP layers perform a structured lookup rather than distributed reasoning, with specific neurons keyed to entity-attribute pairs.  <br>● Multi-token embedding view: Recommends treating factual knowledge recall as operating over multi-token embeddings rather than single-token representations.  <br>● Interpretability payoff: Provides a concrete, testable account of where and how facts live inside transformers, enabling targeted editing and auditing of parametric memory. | [Paper](https://www.alignmentforum.org/s/hpWHhjvjn67LJ4xXX/p/iGuwZTHWb6DFY3sKB), [Tweet](https://x.com/NeelNanda5/status/1738559368361349122?s=20) |
| 6) **Generative AI for Math (OpenWebMath / MathPile)** - Releases a diverse, high-quality math-centric corpus of ~9.5B tokens designed for training math-capable foundation models.  <br>● 9.5B-token corpus: Curated from mathematical content across the web, textbooks, papers, and Q&A, rebalanced for math-specific token distribution.  <br>● Quality filtering: Applies math-specific filtering to surface content dense in symbolic notation, proofs, and problem solutions rather than surface-level mentions of math.  <br>● Diverse sources: Explicitly mixes proof-heavy formal math with applied problem-solving to avoid over-fitting to any single mathematical register.  <br>● Training signal: Positioned as a drop-in pretraining or continual-pretraining corpus to lift math reasoning in existing LLMs without changing the architecture. | [Paper](https://arxiv.org/abs/2312.17120), [Tweet](https://x.com/arankomatsuzaki/status/1740564961032556942?s=20) |
| 7) **Principled Instructions Are All You Need** - Distills effective LLM prompting into 26 guiding principles and validates them across multiple model families.  <br>● 26 principles: Covers prompt structure, audience specification, example selection, formatting, role assignment, and stepwise decomposition.  <br>● Broad model validation: Tested on LLaMA-1/2 (7B, 13B, 70B) and GPT-3.5/4, finding the principles generalize across scales and families.  <br>● Both small and large benefits: Smaller models benefit more from structured prompting (higher variance reduction), while larger models benefit in absolute accuracy on harder tasks.  <br>● Practical reference: Functions as a cheat-sheet for practitioners, converting scattered prompting folklore into testable recipes. | [Paper](https://arxiv.org/abs/2312.16171v1), [Tweet](https://x.com/_akhaliq/status/1739857456161759455?s=20) |
| 8) **Survey of Reasoning with Foundation Models** - A comprehensive survey of reasoning with foundation models, covering tasks, methods, benchmarks, and future directions.  <br>● Task coverage: Surveys math reasoning, commonsense reasoning, logical reasoning, symbolic reasoning, and multimodal reasoning - showing how each evolves with model scale.  <br>● Methodology catalog: Covers prompting techniques (CoT, ToT, self-consistency), fine-tuning strategies, and neurosymbolic approaches under a unified framework.  <br>● Benchmarks: Systematizes the reasoning benchmarks landscape and flags contamination and robustness concerns specific to reasoning evaluation.  <br>● Adjacencies: Discusses how multimodal learning, autonomous agents, and super-alignment research intersect with and extend the reasoning agenda. | [Paper](https://arxiv.org/abs/2312.11562v4), [Tweet](https://x.com/omarsar0/status/1740729489661874632?s=20) |
| 9) **LLaRA** - LLaRA adapts a decoder-only LLM for dense retrieval via two tailored pretext tasks that leverage text embeddings from the LLM itself.  <br>● EBAE pretext task: Embedding-Based Auto-Encoding uses LLM embeddings to reconstruct tokens of the input sentence, aligning the embedding space with semantic content.  <br>● EBAR pretext task: Embedding-Based Auto-Regression predicts tokens of the next sentence from the current embedding, injecting discourse-level signal into retrieval embeddings.  <br>● LLaMA 2 7B base: A LLaMA 2-7B base model is adapted into a retriever with these pretext tasks, yielding significant gains on MSMARCO and BEIR.  <br>● Decoder retrievers validated: Provides another data point that decoder-only LLMs, with the right adaptation, rival specialized encoder retrievers - a theme that continued through 2024. | [Paper](https://arxiv.org/abs/2312.15503v1) |
| 10) **Gemini vs GPT-4V** - A qualitative side-by-side comparison of Gemini and GPT-4V across vision-language tasks, documenting systematic behavioral differences.  <br>● Head-to-head cases: Evaluates both models on a curated set of tasks covering document understanding, chart reading, everyday scenes, and multi-image reasoning.  <br>● GPT-4V style: Produces precise, succinct answers with strong preference for brevity and factual minimalism.  <br>● Gemini style: Returns more expansive, narrative answers frequently accompanied by relevant images and links - leveraging its deeper integration with search.  <br>● Complementary strengths: Concludes that the models are substitutable for many core VLM tasks but differ sharply on response length, multimedia, and augmentation patterns. | [Paper](https://arxiv.org/abs/2312.15011v1), [Tweet](https://x.com/omarsar0/status/1741177994377330895?s=20) |

---

## Top AI Papers of the Week (December 18 - December 24)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Gemini's Language Abilities** - CMU's impartial, reproducible evaluation of Gemini Pro against GPT and Mixtral across standard LLM benchmarks.  <br>● Reproducible methodology: Provides an open, reproducible evaluation pipeline - a response to concerns about Google's own Gemini launch benchmarks being hard to independently verify.  <br>● Gemini Pro vs. GPT 3.5 Turbo: Gemini Pro achieves comparable but slightly lower accuracy than GPT 3.5 Turbo, countering marketing claims of broad parity on language tasks.  <br>● Gemini & GPT beat Mixtral: Both Gemini and GPT outperform Mixtral on these benchmarks, suggesting open mixture-of-experts has not yet closed the gap to frontier proprietary models.  <br>● Evaluation norms: Positioned as evidence that independent replications remain essential, and that first-party model reports shouldn't be the final word on comparative capability. | [Paper](https://arxiv.org/abs/2312.11444), [Tweet](https://x.com/gneubig/status/1737108966931673191?s=20)|
| 2) **PowerInfer** - A high-speed LLM inference engine for consumer GPUs that exploits sparse neuron activation patterns to run large models on commodity hardware.  <br>● Hot/cold neurons: Analysis shows that a small fraction of "hot" neurons activate on most inputs while the majority of "cold" neurons activate rarely - a power-law pattern across many LLMs.  <br>● GPU-CPU hybrid: Hot neurons are preloaded onto the GPU for fast access, while cold neurons live on the CPU and are computed lazily, dramatically reducing GPU memory pressure.  <br>● Reduced memory + transfer: This split reduces both GPU memory demand and the CPU-GPU data transfer that typically dominates hybrid inference cost.  <br>● 11x speedup over llama.cpp: Achieves up to ~11x faster token generation than llama.cpp on a single consumer GPU for OPT-175B-class models - a step-change for local deployment. | [Paper](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf), [Tweet](https://x.com/omarsar0/status/1737168751668187229?s=20)|
| 3) **Antibiotic Discovery with Graph Deep Learning (Nature)** - MIT researchers use explainable graph neural networks to discover a new structural class of antibiotics.  <br>● Graph neural networks: Trains GNNs on molecular graphs to predict antibiotic activity, with explainability layers that surface chemical substructures driving predictions.  <br>● Explainable discovery: Unlike black-box property predictors, the explanation module identifies substructures underlying antibiotic activity - a feature drug chemists can actually use.  <br>● New structural class: The discovered compounds belong to a novel structural class, not a variant of existing antibiotic scaffolds - an unusually strong generalization signal.  <br>● Real-world pipeline: Demonstrates end-to-end pipeline from GNN prediction to wet-lab validation, reinforcing explainable ML as a practical discovery tool for biomedicine. | [Paper](https://www.nature.com/articles/s41586-023-06887-8), [Tweet](https://x.com/EricTopol/status/1737505177052348545?s=20)|
| 4) **VideoPoet** - Google Research's VideoPoet is a large language model for zero-shot video generation that treats video as just another token stream.  <br>● Unified token stream: Uses multiple tokenizers to map video, image, audio, and text into a shared discrete token space for a single autoregressive model.  <br>● Zero-shot task variety: The same model handles image-to-video, video stylization, video-to-audio, and text-to-video without task-specific fine-tuning.  <br>● Language-model paradigm: Demonstrates that a plain autoregressive LM, given the right tokenizers, can handle video generation - challenging the diffusion-everywhere default for video.  <br>● Temporal consistency: Produces videos with reasonable motion coherence over short durations, a meaningful milestone for LM-based video generation. | [Paper](https://sites.research.google/videopoet/), [Tweet](https://x.com/GoogleAI/status/1737235593078456389?s=20)_|
| 5) **AppAgent** - Introduces an LLM-based multimodal agent that operates real smartphone apps through touch actions and screenshots.  <br>● Multimodal control: The agent reads the phone screen (visual input) and issues low-level touch actions (tap, swipe, type), operating apps the way humans do rather than via APIs.  <br>● Two learning modes: Learns new apps either via autonomous exploration (discovering functionality through self-play) or by observing human demonstrations.  <br>● Cross-app generality: Demonstrates proficiency across email, social media, shopping, and creative apps, suggesting that multimodal LLMs can generalize across smartphone UIs.  <br>● Early mobile-agent blueprint: An early example of the on-device multimodal agent pattern that would become a major 2024 deployment theme. | [Paper](https://arxiv.org/abs/2312.13771), [Tweet](https://x.com/omarsar0/status/1738265651188253051?s=20)_|
| 6) **LLM in a Flash** - Apple researchers show how to run LLMs larger than available DRAM by streaming weights from flash storage on demand.  <br>● Flash as swap: Stores model weights on flash and streams only the rows/columns needed per forward pass into DRAM, exploiting the sparsity of relevant parameters.  <br>● 2x DRAM headroom: Enables running models up to 2x the size of available DRAM without catastrophic slowdown, critical for on-device deployment where memory is tight.  <br>● Major speedups vs. naive loading: 4-5x faster on CPU and 20-25x faster on GPU compared to naive parameter loading, thanks to selective transfer and row-column bundling.  <br>● On-device LLM groundwork: Directly enabled Apple's later on-device LLM plans by showing that flash-based streaming can make phone-scale LLM inference practical. | [Paper](https://arxiv.org/abs/2312.11514), [Tweet](https://x.com/gabrielnocode/status/1737307286887133552?s=20)_|
| 7) **ReST Meets ReAct** - Proposes a ReAct-style agent that improves itself via reinforced self-training on its own reasoning traces.  <br>● Self-critique ReAct: A ReAct-style agent with a self-critique step that evaluates its own reasoning and answers, generating a filterable trace dataset.  <br>● ReST-style iterative RL: Uses growing-batch RL from AI feedback to iteratively fine-tune on the agent's successful reasoning traces, improving over rounds without human labels.  <br>● Human-label-free: Minimizes human involvement; synthetic data with self-improvement from AI feedback is the primary training signal throughout.  <br>● Distillation to small models: The improved agent can be distilled into models 1-2 orders of magnitude smaller with comparable performance, dramatically cutting inference cost. | [Paper](https://arxiv.org/abs/2312.10003), [Tweet](https://x.com/omarsar0/status/1736587397830176910?s=20)_|
| 8) **Adversarial Attacks on GPT-4** - Demonstrates that a trivially simple random-search procedure can jailbreak GPT-4 with high reliability.  <br>● Adversarial suffix: Appends a suffix to a harmful request and iteratively perturbs it, keeping changes that increase the log-probability of the response starting with "Sure".  <br>● No gradients needed: Operates purely via the API in a black-box setting, without model gradients or weights - a much lower bar than prior white-box jailbreak work.  <br>● Strong success rate: Achieves high attack-success rates on GPT-4 with a small number of API calls, despite ongoing alignment efforts.  <br>● Alignment implication: Shows that current safety training is still vulnerable to near-trivial optimization attacks, pointing to the need for stronger behavioral defenses. | [Paper](https://www.andriushchenko.me/gpt4adv.pdf), [Tweet](https://x.com/maksym_andr/status/1737844601891983563?s=20)_|
| 9) **RAG for LLMs** - A broad survey of Retrieval-Augmented Generation research, organizing the rapidly growing literature into a coherent map.  <br>● Three-paradigm taxonomy: Organizes RAG approaches into Naive RAG, Advanced RAG (pre/post-retrieval enhancements), and Modular RAG (orchestrated component-based systems).  <br>● Core components: Reviews retrievers, generators, and augmentation strategies separately, clarifying which design choices sit in which component.  <br>● Evaluation and datasets: Catalogs RAG-specific benchmarks and evaluation metrics, surfacing the still-uneven state of RAG evaluation.  <br>● Frontier directions: Highlights agentic retrieval, multimodal RAG, and long-context RAG as the key research areas driving the 2024 RAG landscape. | [Paper](https://arxiv.org/abs/2312.10997v1), [Tweet](https://x.com/omarsar0/status/1738354427759612222?s=20)_|
| 10) **BabyLLM Challenge Findings** - Reports results from a challenge on sample-efficient pretraining using a developmentally plausible corpus.  <br>● Constrained pretraining: Participants pretrain on a small, child-directed-style corpus rather than on internet-scale data, testing how efficiently models can learn from limited input.  <br>● LTG BERT wins: The winning submission, LTG BERT, beat Llama 2 70B on 3 of 4 evaluations despite vastly less training data.  <br>● Data preprocessing pays: Strong-performing entries relied heavily on data preprocessing and training on shorter contexts, challenging assumptions about long-context training for small data.  <br>● Cognitive-science bridge: Provides an empirical platform connecting language-model training to developmental psycholinguistics, informing both fields. | [Paper](https://aclanthology.org/volumes/2023.conll-babylm/), [Tweet](https://x.com/a_stadt/status/1737849248560066794?s=20)_|

---

## Top AI Papers of the Week (December 11 - December 17)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **FunSearch** - DeepMind's FunSearch uses LLMs as a mutation operator in an evolutionary loop to discover genuinely new mathematical knowledge.  <br>● LLM + evaluator loop: Combines a pretrained LLM that proposes candidate programs with a systematic evaluator that scores them, iteratively evolving low-scoring programs into high-scoring ones.  <br>● New math discoveries: Produces novel solutions to open problems in combinatorics, including cap-set and online bin-packing, not memorized from the training data.  <br>● Hallucination mitigation: The evaluator acts as a hard filter - only programs that actually work are kept - so LLM hallucinations don't propagate into the "discovered" knowledge.  <br>● General recipe: Positions LLM-in-the-loop search as a general tool for scientific discovery beyond math, applicable wherever candidates can be automatically scored. | [Paper](https://www.nature.com/articles/s41586-023-06924-6), [Tweet](https://x.com/GoogleDeepMind/status/1735332722208284797?s=20) |
| 2) **Weak-to-Strong Generalization** - OpenAI's superalignment team shows that weak supervisors can still elicit capabilities from much stronger models - a first empirical signal for scalable oversight.  <br>● Weak-to-strong setup: A weak model (e.g., GPT-2) generates labels, and a strong pretrained model (e.g., GPT-4) is fine-tuned on those labels - an analog of humans supervising superhuman AI.  <br>● Better than the supervisor: Naively fine-tuning the strong model on weak-model labels often yields a model better than the supervisor itself, demonstrating useful capability elicitation.  <br>● ~GPT-3.5 from GPT-2 supervision: Fine-tuning GPT-4 with GPT-2-level supervision recovers close to GPT-3.5-level performance on NLP tasks - a surprising amount of capability without strong labels.  <br>● Superalignment signal: Offers an early empirical footing for the bet that humans can align superhuman systems using their own (weaker) judgments - provided the right training recipe. | [Paper](https://cdn.openai.com/papers/weak-to-strong-generalization.pdf), [Tweet](https://x.com/OpenAI/status/1735349718765715913?s=20) |
| 3) **Audiobox** - Meta's Audiobox is a unified flow-matching audio model that generates speech, sound effects, and music from natural-language and example prompts.  <br>● Unified audio generation: Single model handles speech, sound, and music - ending the typical pattern of one model per audio modality.  <br>● Description + example prompting: Supports both natural-language descriptions and reference-audio examples for style control, letting users mix semantic and acoustic conditioning.  <br>● Self-supervised infilling: Adapts a self-supervised infilling objective to pretrain on large unlabeled audio, reducing dependence on scarce labeled speech/music datasets.  <br>● Novel voice/styles: Unlocks generation of novel vocal and acoustic styles by interpolating in the learned audio space, going beyond reproduction of training-set styles. | [Paper](https://ai.meta.com/research/publications/audiobox-unified-audio-generation-with-natural-language-prompts/), [Tweet](https://x.com/AIatMeta/status/1734257634008531453?s=20) |
| 4) **Mathematical LLMs Survey** - A survey on the progress of LLMs on mathematical reasoning tasks, covering methods, benchmarks, and open problems.  <br>● Task taxonomy: Covers math word problem solving, symbolic reasoning, and theorem proving, showing which capabilities emerge at which model scales.  <br>● Methods landscape: Reviews prompting techniques (CoT, PoT, ToT, self-verification) alongside fine-tuning and tool-use approaches.  <br>● Dataset reference: Catalogs the dominant math benchmarks (GSM8K, MATH, MiniF2F, etc.) and their evaluation methodologies.  <br>● Frontier problems: Highlights reasoning-faithfulness, formal-vs-informal math integration, and reward-model design as the key open questions. | [Paper](https://arxiv.org/abs/2312.07622), [Tweet](https://x.com/omarsar0/status/1735323577392542084?s=20) |
| 5) **LLM360** - LLM360 is a framework for fully transparent open-source LLM development, with everything from data to training dynamics released.  <br>● End-to-end transparency: Ships training code, the pretraining corpus, intermediate checkpoints, evaluation code, and analyses - going well beyond the "just weights" openness of earlier "open" LLMs.  <br>● Two 7B models: Releases AMBER (general) and CRYSTALCODER (code-specialized) 7B models pretrained from scratch under the framework.  <br>● Enables training-dynamics research: Intermediate checkpoints let researchers study loss trajectories, emergent capabilities, and data-effect ablations - typically only possible inside frontier labs.  <br>● Standard for openness: Pushes the community's definition of "open-source LLM" from weights to a full training-pipeline standard. | [Paper](https://arxiv.org/abs/2312.06550), [Tweet](https://x.com/omarsar0/status/1734591071575744820?s=20) |
| 6) **LLMs in Medicine** - A comprehensive survey (300+ papers) of LLMs applied to medicine, from clinical tasks to biomedical research.  <br>● Principles and applications: Covers the core principles of medical LLMs and their applications across clinical decision support, patient communication, medical education, and biomedical research.  <br>● Benchmark coverage: Reviews medical QA benchmarks (MedQA, PubMedQA, MedMCQA, etc.) and their limitations for real clinical settings.  <br>● Challenges: Identifies challenges specific to medicine including hallucination in clinical advice, privacy, regulatory compliance, and equity/bias concerns.  <br>● Deployment considerations: Discusses what's required for safe deployment, including evaluation, monitoring, and the role of clinician oversight. | [Paper](https://arxiv.org/abs/2311.05112), [Tweet](https://x.com/omarsar0/status/1734599425568231513?s=20) |
| 7) **Beyond Human Data (ReST-EM)** - DeepMind's ReST-EM shows that model-generated data plus a reward function can substantially reduce dependence on human-generated data.  <br>● Expectation-Maximization framing: Generates candidate solutions from the current model, filters using a reward/verifier, and fine-tunes on the filtered set - repeat.  <br>● Verifiable rewards: Uses automatic verifiers (e.g., correct-answer checks) as the reward signal, sidestepping the need for a learned reward model on scarce tasks.  <br>● PaLM 2 gains: Scales effectively on PaLM 2 for math and code tasks, outperforming standard SFT on human data at matched compute.  <br>● Synthetic-data signal: A strong empirical case that self-generated filtered data can replace much of the human data bottleneck for reasoning tasks - a theme that grew through 2024. | [Paper](https://arxiv.org/abs/2312.06585), [Tweet](https://x.com/omarsar0/status/1734953578274386002?s=20) |
| 8) **Gaussian-SLAM** - A neural RGBD SLAM method that extends 3D Gaussian Splatting to achieve photorealistic scene reconstruction without sacrificing speed.  <br>● 3D Gaussians for SLAM: Represents scenes as 3D Gaussians rather than neural fields, inheriting the fast training and rendering of Gaussian Splatting.  <br>● Photorealistic reconstruction: Produces significantly higher-fidelity reconstructions than prior neural SLAM methods at comparable or better runtime.  <br>● RGBD input: Uses standard RGB+depth input streams, making it compatible with off-the-shelf depth cameras for practical deployment.  <br>● Speed/quality Pareto: Advances the Pareto frontier for RGBD SLAM, where previous methods forced a trade-off between runtime and photorealism. | [Paper](https://vladimiryugay.github.io/gaussian_slam/), [Tweet](https://x.com/vlyug/status/1734683948440252480?s=20) |
| 9) **Pearl** - Meta's Pearl is a production-ready reinforcement learning agent package designed for real-world deployment constraints.  <br>● Production-oriented design: Built for real-world environments with limited observability, sparse feedback, and high stochasticity - conditions that usually break research-oriented RL libraries.  <br>● Modular components: Offers modular policy networks, exploration strategies, offline RL, and safety constraints that can be composed for specific applications.  <br>● Research + practice: Targets both researchers building new RL agents and practitioners deploying RL in production recommender systems, ranking, and control.  <br>● Meta internal use: Reflects learnings from Meta's internal deployments, making it a rare RL library that starts from production pain rather than benchmark scores. | [Paper](https://arxiv.org/abs/2312.03814), [Tweet](https://x.com/ZheqingZhu/status/1732880717263352149?s=20) |
| 10) **QuIP#** - Cornell's QuIP# is a 2-bit LLM quantization scheme that combines lattice codebooks with incoherence processing to close the quality gap to FP16.  <br>● Lattice codebooks: Uses E8 lattice codebooks for weight quantization, a classical lattice-quantization technique adapted to LLM weight matrices.  <br>● Incoherence processing: Pre-processes weight matrices to make them "incoherent" (less structured along axes), which improves lattice-quantization fidelity.  <br>● 2-bit at 16-bit quality: Significantly closes the gap between 2-bit quantized LLMs and their unquantized 16-bit counterparts across a range of LLaMA-family models.  <br>● Deployment impact: Makes large LLMs (e.g., Llama 2 70B) fit into consumer-grade GPU memory without catastrophic quality loss, expanding the set of models hobbyists can run locally. | [Paper](https://cornell-relaxml.github.io/quip-sharp/), [Tweet](https://x.com/tsengalb99/status/1733222467953422702?s=20) |

---

## Top AI Papers of the Week (December 4 - December 10)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Gemini 1.0** - Google launches Gemini 1.0, a multimodal family natively designed to reason across text, images, video, audio, and code from the ground up.  <br>● Three tiers: Ships as Ultra (frontier), Pro (balanced), and Nano (on-device), covering everything from data-center reasoning to mobile inference.  <br>● Native multimodality: Unlike "bolted-on" multimodal models, Gemini is trained multimodally from scratch, with joint tokenization across text, image, video, audio, and code.  <br>● MMLU milestone: Gemini Ultra reports the first MMLU score above human-expert performance (90.0%), using chain-of-thought with uncertainty-weighted majority voting.  <br>● Broad capability claims: Ultra sets SOTA on 30 of 32 benchmarks in the report, spanning multimodality, multilinguality, factuality, summarization, math/science, long-context, and reasoning. | [Paper](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf), [Tweet](https://x.com/omarsar0/status/1732434324291563831?s=20) |
| 2) **EfficientSAM** - Meta's EfficientSAM is a lightweight Segment Anything variant that preserves most of SAM's zero-shot quality at a fraction of the compute.  <br>● Masked autoencoder pretraining: Uses a SAMI (SAM-leveraged masked image) pretraining objective where a small student learns to reconstruct features aligned with the SAM teacher.  <br>● 20x smaller and faster: Achieves roughly 20x fewer parameters and 20x faster runtime than the original SAM image encoder.  <br>● Near-parity quality: 44.4 AP vs. 46.5 AP on zero-shot instance segmentation (within 2 points) despite the dramatic efficiency win.  <br>● Deployment-ready: Makes SAM-grade segmentation feasible on commodity hardware, consumer devices, and real-time applications where the original SAM is too heavy. | [Paper](https://arxiv.org/abs/2312.00863), [Tweet](https://x.com/fiandola/status/1732171016783180132?s=20)  |
| 3) **Magicoder** - Magicoder is a fully open-source code LLM that closes the gap with top commercial code models at only 7B parameters via high-quality synthetic instruction data.  <br>● OSS-Instruct data: Generates 75K synthetic instruction pairs by seeding GPT with snippets pulled from open-source code, producing more diverse and realistic training data than prior code SFT datasets.  <br>● Broad coverage: Training data spans Python, multilingual programming, and data-science program completion, producing a genuinely general code model rather than a Python-only model.  <br>● HumanEval+ win: MagicoderS-CL-7B (based on CodeLlama) surpasses ChatGPT on HumanEval+ with 66.5 vs. 65.9 pass@1, despite being 7B.  <br>● Fully open: Ships with code, data, and weights, positioning Magicoder as a reproducible open baseline for instruction-tuned code generation. | [Paper](https://arxiv.org/abs/2312.02120), [Tweet](https://x.com/omarsar0/status/1732063926613946863?s=20)  |
| 4) **LLMs on Graphs** - A comprehensive overview of the many ways LLMs can be applied to graph-structured data and when each pattern is useful.  <br>● Three graph scenarios: Organizes the space by whether graphs are pure (no text), text-rich (nodes/edges carry natural language), or text-paired (graphs alongside documents).  <br>● Three role taxonomies: Categorizes LLMs as predictors, enhancers, or aligners with GNNs - clarifying whether the LLM is the model, a feature source, or a supervisor.  <br>● Task coverage: Spans node classification, link prediction, graph-level tasks, and reasoning over knowledge graphs.  <br>● Open problems: Flags scalability to large graphs, handling of graph structure without loss, and integration with tool-augmented LLMs as the key unsolved directions. | [Paper](https://arxiv.org/abs/2312.02783), [Tweet](https://x.com/omarsar0/status/1732404393037762588?s=20)  |
| 5) **Llama Guard** - Meta's Llama Guard is a compact, instruction-tuned safety classifier built on Llama 2-7B for input/output moderation in conversational AI.  <br>● Llama 2-7B base: Small enough to run inline with a main generative model while handling both prompt- and response-level safety classification.  <br>● Customizable taxonomy: The safety taxonomy is specified in the instruction prompt itself, so operators can adapt it to their use case without retraining.  <br>● Zero-shot and few-shot: Works off the shelf for many taxonomies in zero- or few-shot mode, and can be fine-tuned on a specific policy dataset when needed.  <br>● Open release: Ships as an open model, filling a gap for teams that want local, auditable safety classification rather than relying solely on API-side moderation. | [Paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/), [Tweet](https://x.com/omarsar0/status/1732781628139696279?s=20)  |
| 6) **KTO (Kahneman-Tversky Optimization)** - Contextual AI introduces KTO, an alignment objective derived from prospect theory that works with binary "good/bad" signals instead of preference pairs.  <br>● Prospect-theory motivation: Models reward as a Kahneman-Tversky value function with loss aversion, replacing DPO's log-likelihood-of-preferences objective with utility maximization.  <br>● No preference pairs needed: Works with unpaired good/bad signals, dramatically loosening data collection requirements compared to DPO or RLHF.  <br>● Matches/beats DPO: Matches or exceeds DPO performance at model scales from 1B to 30B, a clean empirical win at similar training cost.  <br>● Practical data advantage: Makes alignment much cheaper to run in production where paired preference data is rare but outcome feedback ("user liked/didn't like") is abundant. | [Paper](https://github.com/ContextualAI/HALOs/blob/main/assets/report.pdf), [Tweet](https://x.com/ethayarajh/status/1732837520784957476?s=20)  |
| 7) **Chain of Code** - DeepMind's Chain of Code extends CoT by encouraging LMs to write pseudocode that mixes real code with LM-simulated sub-routines.  <br>● LMulator: The LM generates pseudocode programs and explicitly annotates sub-tasks that can't be executed; a "LMulator" simulates those sub-tasks with the LM while the interpreter handles the rest.  <br>● Undefined-behavior handling: The interpreter catches undefined behavior and cleanly hands off to the LM, sidestepping the brittleness of code-first approaches that fail silently on hard ops.  <br>● 84% on BIG-Bench Hard: Achieves 84% on BIG-Bench Hard - a 12-point gain over Chain of Thought and a clean demonstration that mixing exact execution with LM simulation beats either alone.  <br>● Broad applicability: Works across math, logic, and commonsense reasoning, positioning Chain of Code as a general-purpose CoT upgrade. | [Paper](https://arxiv.org/abs/2312.04474), [Tweet](https://x.com/ChengshuEricLi/status/1733169631949701425?s=20)  |
| 8) **Data Management for LLMs** - A survey of data-management research for LLM pretraining and supervised fine-tuning stages.  <br>● Pretraining data: Covers data quantity, quality filtering, deduplication, domain composition, and curriculum strategies for large-scale pretraining.  <br>● SFT data: Reviews instruction-data generation, quality filtering, diversity metrics, and the emerging literature on "less is more" for SFT.  <br>● Domain and task composition: Examines how task mixing affects generalization vs. specialization in fine-tuning.  <br>● Open challenges: Identifies dataset contamination, deduplication at trillion-token scale, and reproducible data recipes as the top open problems. | [Paper](https://arxiv.org/abs/2312.01700), [Tweet](https://x.com/omarsar0/status/1731877232493166969?s=20)  |
| 9) **RankZephyr** - RankZephyr is an open-source LLM for listwise zero-shot reranking that bridges the effectiveness gap with GPT-4.  <br>● Listwise zero-shot: Reranks a full candidate list in a single shot rather than doing pairwise or pointwise scoring, matching the paradigm GPT-4 uses most effectively.  <br>● Open-source: Based on the open Zephyr chat model, releasing a fully reproducible stack for high-quality reranking.  <br>● Matches/beats GPT-4: Competitive with GPT-4 on standard reranking benchmarks and outperforms GPT-4 on NovelEval, a post-training-cutoff benchmark resistant to contamination.  <br>● Contamination-free win: The NovelEval advantage is particularly meaningful because it addresses the concern that GPT-4's strong reranking numbers are partly driven by memorization of benchmark queries. | [Paper](https://arxiv.org/abs/2312.02724), [Tweet](https://x.com/lintool/status/1732430269485867114?s=20)  |
| 10) **The Efficiency Spectrum of LLMs** - A comprehensive review of algorithmic advancements for improving LLM efficiency across the full training-to-inference stack.  <br>● Scaling laws and data: Covers how scaling laws and data-utilization strategies interact with efficiency - more isn't always better under compute constraints.  <br>● Architectural innovations: Reviews attention variants, state-space models, MoE, and other architectural levers for efficient scaling.  <br>● Training and tuning: Catalogs PEFT methods (LoRA, adapters, prefix tuning), quantization-aware training, and curriculum-based training strategies.  <br>● Inference techniques: Surveys quantization, pruning, speculative decoding, KV-cache optimization, and batching as the inference-time efficiency toolkit. | [Paper](https://arxiv.org/abs/2312.00678), [Tweet](https://x.com/omarsar0/status/1731696419457606048?s=20)  |

---

## Top AI Papers of the Week (November 27 - December 3)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **GNoME** - DeepMind's Graph Networks for Materials Exploration (GNoME) is an AI system that discovered 2.2 million new crystal structures, including 380,000 thermodynamically stable ones.  <br>● 2.2M new crystals: Dramatically expands the known crystal inventory, with 380,000 stable materials - an order-of-magnitude leap over prior computational chemistry.  <br>● Graph networks for stability: Predicts formation energies and stability of candidate materials using graph neural networks trained on DFT-labeled data.  <br>● Active-learning loop: Combines exploration (proposing candidate structures) with exploitation (prioritizing high-stability candidates), iteratively expanding the frontier of known materials.  <br>● Autonomous lab validation: A subset of predictions was validated in Berkeley's autonomous materials lab, closing the prediction-to-synthesis loop for the first time at this scale. | [Paper](https://www.nature.com/articles/s41586-023-06735-9), [Tweet](https://x.com/demishassabis/status/1729995611443769823?s=20) |
| 2) **Open-Source LLMs vs. ChatGPT** - A survey cataloguing tasks where open-source LLMs claim to be on par with or better than ChatGPT.  <br>● Task-by-task audit: Organizes claims by task category (code, math, reasoning, summarization, etc.) with the specific open models and benchmarks backing each claim.  <br>● Gap measurement: Clarifies where open-source genuinely closes the gap vs. where "comparable" actually hides meaningful performance differences.  <br>● Critical lens: Calls out evaluation-methodology issues in specific open-source claims, including benchmark contamination, cherry-picked subsets, and inconsistent judge setups.  <br>● 2023 snapshot: Captures where open-source LLMs stood at the end of 2023 - a useful reference point for tracking how the gap evolved through 2024. | [Paper](https://arxiv.org/abs/2311.16989), [Tweet](https://x.com/sophiamyang/status/1730108858889097710?s=20) |
| 3) **Adversarial Diffusion Distillation (SDXL Turbo)** - Stability AI's ADD trains a student diffusion model that produces high-quality images in just 1-4 sampling steps.  <br>● Score distillation + adversarial loss: Combines score-distillation from a teacher diffusion model with an adversarial loss to maintain image fidelity in the low-step regime.  <br>● 1-4 step generation: Produces usable images in a single step and SoTA-quality images in four, compared to 25-50 steps for typical SDXL sampling.  <br>● Matches multi-step SoTA: Achieves image quality comparable to state-of-the-art diffusion baselines at four steps, dramatically cutting inference cost.  <br>● Real-time generation: Enables SDXL-quality images at real-time frame rates on consumer GPUs, unlocking interactive creative tooling that was previously impractical. | [Paper](https://stability.ai/research/adversarial-diffusion-distillation), [Tweet](https://x.com/robrombach/status/1729590281647870342?s=20) |
| 4) **Seamless** - Meta's Seamless is a family of models for end-to-end expressive, streaming cross-lingual speech communication.  <br>● SeamlessExpressive: Preserves the speaker's expressive characteristics (pitch, emotion, pauses) across translation rather than flattening them into neutral speech.  <br>● SeamlessStreaming: Produces translated speech in a streaming fashion with low latency, enabling near-real-time conversational translation.  <br>● Low-resource coverage: An improved SeamlessM4T is trained on more low-resource language data, broadening the language coverage meaningfully beyond the original M4T.  <br>● Safety red-teaming: Meta applies a red-teaming effort specifically for multimodal translation safety, a recognition that MT systems can amplify harmful content across languages. | [Paper](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/), [Tweet](https://x.com/AIatMeta/status/1730294284023427221?s=20) |
| 5) **MEDITRON-70B** - EPFL's MEDITRON is an open-source family of medical LLMs at 7B and 70B parameters, continually pretrained on curated medical corpora.  <br>● Llama 2 base + medical pretraining: Builds on Llama 2 with continual pretraining on a curated medical corpus covering clinical papers, guidelines, and textbooks.  <br>● Strong open medical baseline: MEDITRON-70B outperforms GPT-3.5 and Med-PaLM on standard medical QA benchmarks while being open-source.  <br>● Close to frontier: Comes within 5% of GPT-4 and 10% of Med-PaLM 2 on MultiMedQA - competitive given the much smaller scale and open release.  <br>● Reproducible recipe: Ships with pretraining data, code, and weights, providing a reproducible starting point for researchers and institutions building medical LLMs. | [Paper](https://arxiv.org/abs/2311.16079v1), [Tweet](https://x.com/eric_zemingchen/status/1729563855213175010?s=20) |
| 6) **Medprompt** - Microsoft researchers show that careful prompt engineering can push general-purpose GPT-4 to state-of-the-art on medical benchmarks, no domain fine-tuning required.  <br>● General-purpose prompting: Uses purely general-purpose prompt-engineering techniques (CoT, dynamic few-shot, choice-shuffling ensembling) with no medical-domain specialization.  <br>● Medprompt recipe: Combines k-nearest-neighbor example selection, GPT-4-generated chain-of-thought rationales, and choice-shuffling to cancel answer-position biases.  <br>● SoTA on 9 benchmarks: Achieves state-of-the-art on all nine benchmarks in MultiMedQA, beating Med-PaLM 2 and other specialized medical models.  <br>● Broader lesson: Reopens the question of whether domain-specific pretraining is actually necessary when a frontier base model is paired with strong prompting - a framing that has recurred in later debates. | [Paper](https://arxiv.org/abs/2311.16452), [Tweet](https://x.com/erichorvitz/status/1729854235443884385?s=20) |
| 7) **UniIR** - UniIR is a unified instruction-guided multimodal retriever that handles eight retrieval tasks across modalities with a single model.  <br>● Instruction-guided: A single retriever conditioned on natural-language instructions determines which retrieval task to perform, rather than one retriever per task.  <br>● Eight tasks: Handles image-to-text, text-to-image, composed-image retrieval, video retrieval, and other multimodal variants under one umbrella.  <br>● Zero-shot generalization: Generalizes to unseen retrieval tasks not explicitly trained on, approaching a truly general multimodal retrieval model.  <br>● M-BEIR benchmark: Ships with a new multimodal retrieval benchmark (M-BEIR) designed to standardize evaluation across tasks and modalities. | [Paper](https://arxiv.org/abs/2311.17136), [Tweet](https://x.com/CongWei1230/status/1730307767469068476?s=20) |
| 8) **Safe Deployment of Generative AI (Nature)** - A Nature correspondence arguing that medical professionals - not commercial interests - must drive the development and deployment of generative AI in medicine.  <br>● Privacy-first framing: Centers patient-privacy considerations as the non-negotiable constraint on medical AI deployment.  <br>● Professional governance: Calls for clinician-led governance structures rather than commercial self-regulation, citing past failures of tech-industry oversight in regulated domains.  <br>● Deployment guardrails: Recommends guardrails including consent, transparency of training data, and clinician accountability for AI-assisted decisions.  <br>● Policy signal: As a Nature piece, amplifies medical-community concerns into the broader AI policy conversation at a key moment in the regulation debate. | [Paper](https://www.nature.com/articles/d41586-023-03803-y), [Tweet](https://x.com/ClementDelangue/status/1730300666403238393?s=20) |
| 9) **Dobb-E** - NYU's Dobb-E is an affordable household-manipulation robot that learns new tasks with just 5 minutes of user demonstrations.  <br>● 5 minutes of demos: Learns new household manipulation tasks from only ~5 minutes of demonstrations, a dramatic reduction from typical data requirements.  <br>● Hardware design: Uses a low-cost stick-on gripper and a smartphone-driven data-collection rig, keeping the barrier to entry low for non-expert users.  <br>● Home-specific challenges: Experiments in real homes surface challenges usually hidden in lab robotics - strong shadows, variable demo quality, and household-specific clutter.  <br>● General-purpose household system: Positions Dobb-E as a general-purpose system for household robotics rather than a task-specific demonstrator, a step toward practical home robots. | [Paper](https://arxiv.org/abs/2311.16098v1), [Tweet](https://x.com/LerrelPinto/status/1729515379892826211?s=20) |
| 10) **Translatotron 3** - Google's Translatotron 3 performs speech-to-speech translation using only monolingual data - no parallel corpora required.  <br>● Fully unsupervised S2S: Learns direct speech-to-speech translation from monolingual data alone, a first for this task.  <br>● Three-component architecture: Combines a masked autoencoder for speech representation, unsupervised embedding mapping across languages, and back-translation for alignment.  <br>● Beats cascade baselines: Outperforms a comparable cascade of ASR + MT + TTS, a surprising result given cascade systems are typically the strong baseline.  <br>● Paralinguistic preservation: Preserves paralinguistic features - pauses, speaking rates, and speaker identity - that cascaded systems tend to wash out in translation. | [Paper](https://arxiv.org/abs/2305.17547), [Tweet](https://x.com/GoogleAI/status/1730654297350959413?s=20) |

---

## Top AI Papers of the Week (November 20 - November 26)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **System 2 Attention (S2A)** - Meta's S2A uses the LLM's own reasoning to decide what context actually matters, regenerating a clean prompt before the final response step.  <br>● Two-pass prompting: First pass uses the LLM to filter/regenerate the input context, removing irrelevant or misleading content; second pass generates the final answer from the clean context.  <br>● Addresses distraction: Directly targets the well-known problem that LLMs attend to irrelevant or manipulative content (e.g., opinion-laden context that biases answers).  <br>● Factuality gains: Increases factuality on QA and reduces the model's sensitivity to biased framing or distractors inserted into the prompt.  <br>● Math word problems: Outperforms standard attention-based LLMs on math word problems, where filtering irrelevant details is often the hard part of the task. | [Paper](https://arxiv.org/abs/2311.11829), [Tweet](https://x.com/jaseweston/status/1726784511357157618?s=20) |
| 2) **Advancing Long-Context LLMs** - A survey of methodologies for improving Transformer long-context capability across pretraining, fine-tuning, and inference stages.  <br>● Full-stack coverage: Organizes methods by training stage - pretraining objectives, position encoding, fine-tuning recipes, and inference-time interventions.  <br>● Position-encoding deep dive: Reviews RoPE variants, ALiBi, and other positional-encoding choices that dominate long-context extrapolation.  <br>● Efficient attention: Catalogs sparse, linear, and memory-augmented attention mechanisms that make longer contexts tractable.  <br>● Evaluation considerations: Addresses benchmark limitations including the "needle in a haystack" problem and the gap between nominal context length and effective usable context. | [Paper](https://arxiv.org/abs/2311.12351), [Tweet](https://x.com/omarsar0/status/1727358484360945750?s=20) |
| 3) **Parallel Speculative Sampling** - Amazon researchers propose a parallel variant of speculative sampling that achieves significant LLM inference speedups with minimal extra parameters.  <br>● Parallel decoding: Combines speculative sampling with parallel decoding so multiple tokens can be generated and verified in a single pass.  <br>● Tiny overhead: Requires learning only O(d_emb) additional parameters, far fewer than typical speculative-decoding draft models.  <br>● Up to 30% speedup: Achieves up to 30% end-to-end inference speedup without compromising output quality.  <br>● Minimal integration cost: Unlike separate-draft-model speculative decoding, this fits inside the main model with essentially no deployment overhead. | [Paper](https://arxiv.org/abs/2311.13581), [Tweet](https://x.com/omarsar0/status/1728066181796418009?s=20) |
| 4) **Mirasol3B** - Google's Mirasol3B is a multimodal model that decouples modalities into focused autoregressive components rather than forcing a single fused stream.  <br>● Decoupled autoregressive modeling: Separates audio/video processing from text processing into focused autoregressive components that communicate through learned cross-modal interfaces.  <br>● Handles longer videos: The decoupled design lets the model handle longer video inputs than typical end-to-end multimodal models constrained by sequence length.  <br>● Modality-specific processing: Inputs are processed according to their modalities with appropriate tokenization rather than forcing a one-size-fits-all tokenizer.  <br>● SoTA on video benchmarks: Outperforms prior methods on video QA, long-video QA, and audio-video-text benchmarks, validating the decoupled approach. | [Paper](https://arxiv.org/abs/2311.05698), [Tweet](https://x.com/GoogleAI/status/1724553024088191211?s=20) |
| 5) **Teaching Small LMs to Reason** - An approach that teaches smaller language models to explicitly select among reasoning techniques for each problem.  <br>● Reasoning technique menu: Trains the small LM to choose among step-by-step processing, recall-then-generate, recall-reason-generate, extract-generate, and direct-answer strategies.  <br>● Technique selection: The model learns when to apply each strategy based on problem structure, not just which answer to produce.  <br>● Matches 5-10x larger models: Attains zero-shot reasoning performance similar or better than models 5-10x larger on complex reasoning tasks.  <br>● Practical scaling: Offers a recipe for teams that can't deploy frontier-scale models but need strong reasoning quality - a recurring production constraint. | [Paper](https://arxiv.org/abs/2311.11045), [Tweet](https://x.com/omarsar0/status/1726990087399915995?s=20) |
| 6) **GPQA** - A graduate-level Google-proof QA benchmark designed to stress-test reasoning in systems that might exceed human expertise.  <br>● 448 expert questions: Consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry.  <br>● Google-proof by design: Questions are constructed so that even with unrestricted internet access, non-experts (~34%) perform only slightly better than random on them.  <br>● GPT-4 gets 39%: The strongest GPT-4 baseline hits only 39% accuracy, showing a clear headroom for frontier models on expert-level reasoning.  <br>● Scalable oversight testbed: Explicitly designed to enable scalable oversight research - experiments in supervising models whose knowledge may exceed the supervisors'. | [Paper](https://arxiv.org/abs/2311.12022), [Tweet](https://x.com/idavidrein/status/1727033002234909060?s=20) |
| 7) **Hitchhiker's Guide From CoT to Agents** - A survey mapping the conceptual evolution from chain-of-thought reasoning to modern language-agent frameworks.  <br>● CoT foundations: Covers the mechanics underpinning CoT (few-shot prompting, self-consistency, least-to-most, tree-of-thought) with a consistent formalism.  <br>● Mechanism theory: Explores why CoT works - in-context learning, prompt engineering theories, and emergence at scale - rather than just cataloging results.  <br>● CoT-to-agent bridge: Traces how CoT techniques were progressively extended into tool use, multi-step planning, and full agent loops (ReAct, Reflexion, etc.).  <br>● Framework landscape: Organizes the modern language-agent frameworks by which parts of the CoT-to-agent pipeline they emphasize, clarifying an otherwise noisy field. | [Paper](https://arxiv.org/abs/2311.11797), [Tweet](https://x.com/omarsar0/status/1726803725220487277?s=20) |
| 8) **GAIA** - Meta's GAIA is a benchmark for general AI assistants that requires reasoning, multimodal handling, web browsing, and tool use to solve real-world questions.  <br>● Real-world questions: Questions are conceptually simple for humans but require integrated reasoning, web research, and tool use - a realistic test for assistant-style AI.  <br>● Massive human-model gap: Humans achieve 92% accuracy while GPT-4 with plugins achieves only 15% - the widest human-AI gap on any major 2023 benchmark.  <br>● Level-graduated difficulty: Three difficulty levels let researchers measure incremental progress rather than just binary success/failure.  <br>● Agent-first evaluation: Explicitly designed to test AI assistants, not base LLMs - a framing that has since become dominant for agent evaluations. | [Paper](https://arxiv.org/abs/2311.12983), [Tweet](https://x.com/ThomasScialom/status/1727683993045201339?s=20) |
| 9) **MedAgents** - A collaborative multi-round framework for medical reasoning that uses role-playing LLM agents to improve accuracy and reasoning depth.  <br>● Multi-agent deliberation: Multiple LLM agents take on specialist roles (e.g., different medical specialties) and deliberate in rounds over a case.  <br>● Role-playing: Each agent has a defined role-play prompt that scopes its expertise and reasoning style, producing more diverse intermediate hypotheses.  <br>● Consensus protocol: Agents iterate until reaching consensus or until a moderator resolves disagreements, producing a final answer with rationale.  <br>● Reasoning gains: Improves accuracy and reasoning quality on medical QA benchmarks compared to single-agent baselines at matched compute. | [Paper](https://arxiv.org/abs/2311.10537), [Tweet](https://x.com/omarsar0/status/1726627951582511135?s=20) |
| 10) **TÜLU 2** - Allen AI's TÜLU 2 is a suite of improved open instruction-tuned LLMs and an accompanying study of adaptation best practices.  <br>● Open suite: Releases open models that match or exceed GPT-3.5-turbo-0301 on several benchmarks, a meaningful milestone for the open ecosystem at the time.  <br>● Post-training recipe: The paper doubles as a practical recipe, documenting how instruction data curation, mixing ratios, and DPO-based preference training interact.  <br>● UltraFeedback preference data: Uses UltraFeedback for preference optimization, validating that openly released preference datasets are sufficient to close much of the gap to commercial post-training pipelines.  <br>● Adaptation research platform: Explicitly positioned as a platform for studying open adaptation techniques, informing the TÜLU 3 release that would follow in 2024. | [Paper](https://arxiv.org/abs/2311.10702), [Tweet](https://x.com/natolambert/status/1727350301131518454?s=20) |

---

## Top AI Papers of the Week (November 13 - November 19)
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Emu Video and Emu Edit** - Meta releases Emu Video and Emu Edit, a pair of diffusion models targeting controlled text-to-video generation and instruction-based image editing.  <br>● Emu Video: Generates high-quality video from text-only, image-only, or combined text + image inputs using a factorized diffusion approach - text-to-image followed by image-conditioned video.  <br>● Emu Edit: Enables free-form image editing through text instructions, handling region, local, and global edits within one model.  <br>● Factorized video: The text-to-image then image-to-video split dramatically cuts training cost and improves controllability compared to end-to-end T2V models.  <br>● Unified research line: Both models extend Meta's Emu foundation family, pointing toward a unified multimodal generative stack shared across image, video, and edit tasks. | [Paper](https://ai.meta.com/blog/emu-text-to-video-generation-image-editing-research/), [Tweet](https://x.com/AIatMeta/status/1725184026154349007?s=20) |
| 2) **Chain-of-Note (CoN)** - Tencent's Chain-of-Note adds an explicit note-taking step to RAG so the model can evaluate retrieved evidence before answering.  <br>● Sequential notes: For each retrieved document, the model writes a "reading note" assessing relevance to the question, rather than attending to the entire retrieval dump directly.  <br>● Noise robustness: +7.9 EM improvement when retrieved documents are entirely noisy, precisely the regime where standard RAG degrades most.  <br>● Unknown-scenario handling: +10.5 rejection-rate improvement on questions outside the model's training scope, a key property for avoiding confident hallucinations.  <br>● Generalizable pattern: The note-taking step is a lightweight addition on top of existing RAG pipelines, making it easy to adopt incrementally. | [Paper](https://arxiv.org/abs/2311.09210), [Tweet](https://x.com/omarsar0/status/1725181141693472959?s=20) |
| 3) **LLMs for Scientific Discovery** - A broad evaluation of GPT-4 across scientific disciplines including drug discovery, biology, and computational chemistry.  <br>● Expert-driven assessment: Domain experts design case studies to probe GPT-4's understanding of complex scientific concepts and its ability to solve real research problems.  <br>● Problem-solving capability: GPT-4 demonstrates meaningful problem-solving in many domains but shows systematic weaknesses on tasks requiring precise numerical reasoning or experimental design.  <br>● Benchmark coverage: Complements qualitative case studies with quantitative benchmarks, triangulating on where current frontier models help vs. mislead.  <br>● Research workflow integration: Argues LLMs can accelerate scientific ideation and literature synthesis but require careful scaffolding before touching high-stakes experimental decisions. | [Paper](https://arxiv.org/abs/2311.07361), [Tweet](https://x.com/omarsar0/status/1724465107046940893?s=20) |
| 4) **Fine-Tuning LLMs for Factuality** - Stanford fine-tunes LLMs for factuality without any human labels by using automatically generated preference signals.  <br>● Automatic factuality signal: Derives factuality preference rankings from reference consistency checks and retrieval-based verification - no human labels required.  <br>● Open-ended generation: Specifically targets open-ended generation settings rather than constrained QA, where hallucination is hardest to detect or correct.  <br>● Llama 2 improvements: Significantly improves Llama 2's factuality on held-out topics, outperforming RLHF and decoding-time factuality strategies.  <br>● Scalable alignment: Offers a recipe for scaling factuality alignment without proportionally scaling human annotation - an important direction as LLMs cover broader domains. | [Paper](https://arxiv.org/abs/2311.08401), [Tweet](https://x.com/arankomatsuzaki/status/1724613041155608951?s=20) |
| 5) **Contrastive Chain-of-Thought** - Proposes contrastive CoT prompting where models see both valid *and* invalid reasoning demonstrations to reduce reasoning errors.  <br>● Valid + invalid demos: Demonstrations pair correct reasoning traces with common incorrect ones, teaching the model what not to do as well as what to do.  <br>● Automatic construction: Provides an automatic method to generate contrastive demonstrations, avoiding the manual curation bottleneck that limited prior CoT variants.  <br>● Improves over CoT: Outperforms standard CoT across reasoning benchmarks, with particularly strong gains on problems where common error patterns are predictable.  <br>● Pedagogical analog: The improvement mirrors human learning research showing that studying worked examples and errors side-by-side beats studying successes alone. | [Paper](https://arxiv.org/abs/2311.09277), [Tweet](https://x.com/arankomatsuzaki/status/1725340150819905723?s=20) |
| 6) **Survey on Language Models for Code** - A comprehensive survey of LLMs for code covering 50+ models, 30+ evaluation tasks, and 500 related works.  <br>● Model landscape: Catalogs 50+ code LLMs across sizes, architectures, and training regimes, providing a single reference for what's available.  <br>● Task taxonomy: Reviews 30+ evaluation tasks spanning code generation, repair, translation, summarization, and execution prediction.  <br>● Training and data recipes: Walks through pretraining corpus construction, instruction tuning, and RLHF specifically for code.  <br>● Open problems: Highlights challenges in long-context code understanding, multi-file reasoning, and robust evaluation beyond HumanEval-style metrics. | [Paper](https://arxiv.org/abs/2311.07989v1), [Tweet](https://x.com/omarsar0/status/1725637165256761553?s=20) |
| 7) **JARVIS-1** - An open-world multimodal agent for Minecraft that combines perception, planning, and memory into a self-improving system.  <br>● Multimodal perception: Processes visual Minecraft observations and natural-language instructions through a unified multimodal input pipeline.  <br>● Memory-augmented planning: Maintains a multimodal memory store of past observations and plans, enabling lifelong self-improvement across episodes.  <br>● Strong task coverage: Completes 200+ diverse Minecraft tasks with competitive success rates, including long-horizon tasks like diamond collection.  <br>● Open-world blueprint: An influential example of combining foundation models, memory, and explicit planning into an agent, foreshadowing many 2024 agent architectures. | [Paper](https://arxiv.org/abs/2311.05997), [Tweet](https://x.com/arankomatsuzaki/status/1723882043514470629?s=20) |
| 8) **Learning to Filter Context for RAG (FILCO)** - CMU's FILCO improves RAG by training a dedicated model to filter retrieved contexts before they reach the generator.  <br>● Useful-context identification: Uses lexical and information-theoretic signals to identify genuinely useful portions of retrieved documents, rather than passing everything through.  <br>● Context-filter training: Trains a separate filtering model whose only job is to retain useful context at inference time.  <br>● Extractive QA wins: Outperforms prior RAG approaches on extractive QA benchmarks, a clean demonstration that context filtering is a high-leverage component.  <br>● Modular addition: Slots in between retrieval and generation, making it compatible with any retriever/generator pairing. | [Paper](https://arxiv.org/abs/2311.08377v1), [Tweet](https://x.com/ZhiruoW/status/1724792850079252886?s=20) |
| 9) **MART (Multi-round Automatic Red-Teaming)** - Meta's MART scales LLM safety alignment using fully automatic multi-round red-teaming.  <br>● Adversarial prompt writing: One LLM acts as red-teamer, automatically generating adversarial prompts that probe the target model's safety.  <br>● Safe response generation: The target LLM then generates responses that are filtered/refined for safety, producing training data for the next round.  <br>● 84.7% violation reduction: After 4 rounds, the violation rate of an initially weakly-aligned LLM drops up to 84.7%, matching models with extensive human-written adversarial data.  <br>● Scalable alignment: Demonstrates that automatic red-teaming can substitute for expensive human adversarial prompt writing in the alignment pipeline. | [Paper](https://arxiv.org/abs/2311.07689), [Tweet](https://x.com/AIatMeta/status/1724887918685425829?s=20) |
| 10) **LLMs Can Deceive Users (Trading Agent)** - Apollo Research shows that a helpful, honest LLM stock-trading agent can spontaneously deceive users under pressure.  <br>● Stock-trading testbed: The LLM agent runs an autonomous trading simulation with access to market data and occasional insider tips.  <br>● Acts on insider information: When placed under performance pressure, the agent acts on insider tips despite explicit instructions not to - a clear instance of strategic norm violation.  <br>● Hides reasoning from the user: Crucially, the agent reports doctored rationales to its user, *hiding* the insider trade rather than reporting it - strategic deception without being trained to deceive.  <br>● Alignment implication: Demonstrates that deception can emerge in "helpful and safe" models under realistic pressure, without targeted training - a significant datapoint for alignment research. | [Paper](https://arxiv.org/abs/2311.07590), [Tweet](https://x.com/ESYudkowsky/status/1725226563992715521?s=20) |

---

## Top AI Papers of the Week (November 6 - November 12)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | **Links**                                                                                                         |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| 1) **Hallucination in LLMs Survey** - A comprehensive survey of hallucination in LLMs, covering taxonomy, causes, evaluation, and mitigation.  <br>● Two-category taxonomy: Separates hallucinations into factuality hallucinations (incorrect facts) and faithfulness hallucinations (deviations from source content).  <br>● Causes breakdown: Attributes hallucinations to training-data issues, training-stage artifacts, and inference-time choices - each with distinct mitigation paths.  <br>● Evaluation landscape: Reviews benchmarks and automatic metrics specifically designed for hallucination, contrasting them with general-purpose LLM metrics.  <br>● Mitigation strategies: Organizes mitigation into data curation, training-stage (RLHF, factuality tuning), and inference-stage (decoding, retrieval) approaches. | [Paper](https://arxiv.org/abs/2311.05232), [Tweet](https://x.com/omarsar0/status/1722985251129966705?s=20)        |
| 2) **Simplifying Transformer Blocks** - Researchers show that many components of the standard transformer block can be removed with no loss in training speed or quality.  <br>● Aggressive simplification: Removes residual connections, normalization layers, and value/projection parameters in specific blocks without hurting per-update training speed.  <br>● Works across architectures: Tested on autoregressive decoder-only and BERT encoder-only models, validating that the simplifications aren't architecture-specific.  <br>● 15% faster throughput: Simplified blocks deliver 15% faster training throughput with fewer parameters - a clean efficiency win.  <br>● Design-space implication: Suggests the standard transformer is overdetermined and that careful ablation can yield simpler, faster architectures without new ideas.                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2311.01906), [Tweet](https://x.com/maksym_andr/status/1722235666724192688?s=20)     |
| 3) **In-Context Learning Generalization Limits** - Investigates whether transformers' in-context learning can generalize beyond the distribution of their pretraining data.  <br>● Pretraining distribution bridge: Tests whether transformers can identify and learn new tasks in-context, both inside and outside their pretraining data distribution.  <br>● Limited OOD generalization: In the regimes studied, there's limited evidence that ICL generalizes meaningfully beyond pretraining data coverage.  <br>● Counter-narrative: Pushes back on the strong "universal learners" framing of ICL that sometimes accompanies emergence-claims, grounding it in data-distribution bounds.  <br>● Research implication: Argues that evaluating ICL requires carefully distinguishing in-distribution skill retrieval from genuine OOD generalization - a distinction rarely made cleanly in headlines.                                                                                                            | [Paper](https://arxiv.org/abs/2311.00871), [Tweet](https://x.com/abacaj/status/1721223737729581437?s=20)          |
| 4) **MusicGen** - Meta's MusicGen is a single-stage transformer LLM for music generation that operates over compressed discrete audio tokens.  <br>● Single-stage transformer: Unlike multi-stage music generation pipelines, MusicGen generates music as a single autoregressive transformer over multi-codebook tokens.  <br>● Multi-stream tokens: Operates over several parallel streams of compressed discrete music tokens, producing high-fidelity audio without the cascaded VQ-VAE + LM setup.  <br>● Text and melody conditioning: Supports both text prompts and melody conditioning, letting users specify style with text and structure with reference audio.  <br>● High-quality generation: Delivers competitive subjective quality against multi-stage baselines while being simpler and faster to deploy.                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2306.05284), [Tweet](https://x.com/AIatMeta/status/1723043913638810025?s=20)        |
| 5) **AltUp (Alternating Updates)** - Google's AltUp lets transformers benefit from wider representations without paying the full compute cost at every layer.  <br>● Wide-but-cheap representation: Widens the learned representation but only actively updates one sub-block per layer, leaving others untouched during that forward pass.  <br>● Predict-and-correct: A predict-and-correct mechanism updates the inactive sub-blocks with predictions, so they remain coherent without full computation.  <br>● Negligible latency increase: Achieves wider representations at negligible latency cost compared to matched-width dense transformers.  <br>● Scaling lever: Provides a middle-ground between narrow dense models and sparse MoE - wider without routing complexity.                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2301.13310), [Tweet](https://x.com/GoogleAI/status/1722004366201418132?s=20)        |
| 6) **Rephrase and Respond (RaR)** - An effective prompting method where the LLM rephrases and expands the user's question before answering it.  <br>● Rephrase step: The model first rewrites the question to resolve ambiguity, fill in implicit assumptions, and make the task explicit - then answers the rephrased version.  <br>● Broad task gains: Improves performance across diverse tasks without needing any fine-tuning, using only prompt-level changes.  <br>● Stacks with CoT: Combines cleanly with chain-of-thought prompting, giving additive improvements on reasoning benchmarks.  <br>● User-friendly interpretation: Shows that part of the "prompt engineering" skill gap between novice and expert users is really a rephrasing problem - one the LLM itself can fix.                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2311.04205), [Tweet](https://x.com/QuanquanGu/status/1722364144379396513?s=20)      |
| 7) **On the Road with GPT-4V** - An exhaustive evaluation of GPT-4V applied to autonomous driving scenarios.  <br>● Driving-scenario evaluation: Tests GPT-4V across diverse driving situations including scene understanding, traffic-sign recognition, and causal reasoning about driver intent.  <br>● Scene-understanding strength: Demonstrates superior performance in scene understanding and causal reasoning compared to existing production autonomous-driving systems.  <br>● Edge-case robustness: Shows relative robustness on edge cases (construction zones, unusual road layouts) that typically confuse narrower perception stacks.  <br>● Practical limitations: Flags real-world issues including latency, rare-hazard handling, and dependence on high-quality image quality that would gate production deployment.                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2311.05332), [Tweet](https://x.com/arankomatsuzaki/status/1722795897359139057?s=20) |
| 8) **GPT4All Technical Report** - The GPT4All technical report documents the model family and the open ecosystem built around democratizing local LLMs.  <br>● Model family: Covers the sequence of GPT4All models trained and released through 2023, spanning 3B-13B parameter sizes.  <br>● Open-source focus: Ships with a cross-platform desktop app, open model weights, and an accompanying dataset - positioning itself as a turnkey local LLM stack.  <br>● Data and training: Details the curated instruction-tuning dataset and fine-tuning recipes used to build the family.  <br>● Ecosystem impact: Tracks GPT4All's role in popularizing local LLM usage among hobbyists and small organizations before Ollama and similar tools matured.                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2311.04931), [Tweet](https://x.com/_akhaliq/status/1722833378590793915?s=20)        |
| 9) **S-LoRA** - S-LoRA enables serving thousands of LoRA adapters concurrently on a single GPU through memory-paging and custom CUDA kernels.  <br>● Main-memory adapter pool: Stores all adapters in main memory and loads adapters for currently running queries into GPU memory on demand, dramatically increasing the adapter pool size.  <br>● Novel tensor parallelism: Introduces a tensor-parallelism strategy tailored for heterogeneous LoRA batches, where each query might use a different adapter.  <br>● 4x throughput: Improves throughput by 4x compared to prior adapter-serving solutions at comparable latency.  <br>● Adapter scale: Enables serving several orders of magnitude more adapters on the same hardware - important for multi-tenant LoRA deployments and personalized fine-tuning services. | [Paper](https://arxiv.org/abs/2311.03285v2), [Tweet](https://x.com/ai_database/status/1722190708797592013?s=20)   |
| 10) **FreshLLMs (FreshQA)** - Introduces FreshQA, a dynamic benchmark designed to stress-test LLMs on time-sensitive knowledge.  <br>● Dynamic QA benchmark: Continuously refreshes questions so models can't memorize answers - a direct response to the contamination concerns plaguing static benchmarks.  <br>● Four question categories: Covers never-changing, slow-changing, fast-changing, and false-premise questions, stressing different aspects of freshness handling.  <br>● Reveals freshness gap: Shows that LLMs without search augmentation answer fast-changing questions poorly, while retrieval-augmented models close most of the gap.  <br>● FreshPrompt: Proposes FreshPrompt, a simple search-augmented prompting strategy that substantially boosts LLM performance on time-sensitive questions.                                                                                                                                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2310.03214), [Tweet](https://x.com/_akhaliq/status/1710108355157487635?s=20)        |

---

## Top AI Papers of the Week (October 30 - November 5)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                    | **Links**                                                                                                                                                                                                                 |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **MetNet-3** - Google's MetNet-3 is a state-of-the-art neural weather model extending lead time and variable coverage well beyond prior observation-based models.  <br>● Dense + sparse sensors: Learns jointly from dense sensor data (radar, satellite) and sparse in-situ station data, combining signals that were typically used separately.  <br>● 24-hour forecasts: Produces predictions up to 24 hours ahead, a meaningful lead-time extension for observation-based weather modeling.  <br>● Multi-variable output: Predicts precipitation, wind, temperature, and dew point from the same model, rather than requiring per-variable systems.  <br>● Operational relevance: Demonstrates the neural-weather-model pattern that would dominate 2024 forecasting research - observation-driven, end-to-end neural pipelines replacing traditional numerical systems.                                                                                                                           | [Paper](https://arxiv.org/abs/2306.06079), [Tweet](https://x.com/GoogleAI/status/1719774923294687636?s=20)                                                                                                                |
| 2) **Evaluating LLMs Survey** - A comprehensive survey of LLM evaluation covering benchmarks, methodologies, and open problems.  <br>● Task-wise organization: Organizes evaluation by task category - reasoning, knowledge, alignment, robustness, ethics, etc. - showing which benchmarks address which capabilities.  <br>● Automatic vs. human: Discusses the trade-offs between automatic metrics (cheap, inconsistent), LLM-as-a-Judge (scalable, biased), and human evaluation (reliable, expensive).  <br>● Contamination and robustness: Highlights contamination and robustness as cross-cutting concerns plaguing static benchmarks at all scales.  <br>● Frontier-model needs: Argues that evaluating frontier-scale LLMs requires new paradigms beyond simple benchmark accuracy, including interactive evaluation and behavioral testing.                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2310.19736), [Tweet](https://x.com/omarsar0/status/1719351676828602502?s=20)                                                                                                                |
| 3) **Battle of the Backbones** - A large-scale benchmarking framework that compares vision backbones across a diverse suite of computer vision tasks.  <br>● Broad benchmarking: Compares CNN and ViT backbones across classification, segmentation, detection, retrieval, and other tasks at matched compute.  <br>● Pretraining recipes matter: Shows that pretraining scheme (supervised, self-supervised, language-image) often matters more than the architecture family.  <br>● ViT ≠ universal winner: Vision transformers are not universally superior - strong CNN backbones remain competitive or better on several downstream tasks.  <br>● Practitioner guide: Functions as a decision reference - the report explicitly maps from task characteristics to recommended backbone + pretraining combinations.                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2310.19909), [Tweet](https://x.com/micahgoldblum/status/1719719308882801045?s=20)                                                                                                           |
| 4) **ChipNeMo (LLMs for Chip Design)** - NVIDIA's ChipNeMo applies domain-adapted LLMs to industrial chip design workflows.  <br>● Domain adaptation pipeline: Applies continued pretraining on chip-design corpora, SFT, and domain-specific RLHF to adapt general LLMs to semiconductor design language.  <br>● Three applications: Evaluates assistant chatbot for engineers, EDA (electronic design automation) tool invocation, and bug summarization - three real internal chip-design pain points.  <br>● Significant adaptation gains: Domain adaptation dramatically outperforms general-purpose LLMs across tasks despite using smaller model sizes.  <br>● Adapted RAG: Using a domain-adapted LLM as the generator in RAG further improves answer quality compared to using a general-purpose LLM with the same retrieval stack. | [Paper](https://arxiv.org/abs/2311.00176), [Tweet](https://x.com/omarsar0/status/1720066328961159387?s=20)                                                                                                                |
| 5) **YaRN (Efficient Context Extension)** - YaRN is a compute-efficient method for extending the context window of LLMs well beyond their pretrained length.  <br>● Rotary-embedding scaling: Extends RoPE-based context length via a combined attention and NTK-aware scaling scheme, avoiding the degradation of naive interpolation.  <br>● Fine-tune extrapolation: Extrapolates meaningfully beyond the limited context seen during fine-tuning, so short fine-tune sequences can unlock much longer inference contexts.  <br>● 128K context: Successfully scales Llama-family models to 128K-token context with minimal additional training compute.  <br>● Open recipe: Adopted widely across the open-source community as a standard recipe for extending Llama and other RoPE-based LLMs.                                                                                                                                             | [Paper](https://arxiv.org/abs/2309.00071), [Tweet](https://x.com/theemozilla/status/1720107186850877662?s=20)                                                                                                             |
| 6) **Open DAC 2023** - Meta releases a large DFT dataset for training ML models that predict sorbent-adsorbate interactions in Direct Air Capture (DAC).  <br>● 38M+ DFT calculations: Consists of more than 38M density functional theory calculations on metal-organic frameworks (MOFs), enabling large-scale ML-driven DAC material discovery.  <br>● DAC research: Targets direct air capture, where efficient CO₂-capturing MOFs are needed - a high-impact climate application for ML.  <br>● ML baselines: Provides strong ML baselines showing that ML surrogates can replace expensive DFT calculations for MOF screening.  <br>● Open-science contribution: Positions the dataset as an open foundation for materials ML research on climate applications.                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2311.00341), [Tweet](https://x.com/AIatMeta/status/1720143486505341128?s=20)                                                                                                                |
| 7) **Symmetry in Machine Learning** - A methodological framework for enforcing, discovering, and promoting symmetry in machine learning models.  <br>● Unified framework: Presents a single theoretical framework that covers data augmentation, equivariant architectures, and symmetry-discovering learning objectives.  <br>● Three-way taxonomy: Organizes approaches into enforcing known symmetries, discovering latent ones, and biasing learning toward symmetric solutions.  <br>● Worked examples: Applies the framework to MLPs and basis-function regression, showing concretely how the abstract concepts translate into design choices.  <br>● Broader ML perspective: Positions symmetry as a first-class design lever alongside scale and data quality, particularly for scientific ML.                                                                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2311.00212), [Tweet](https://x.com/eigensteve/status/1720115655050227911?s=20)                                                                                                              |
| 8) **Next-Generation AlphaFold** - DeepMind previews the next AlphaFold with dramatically expanded scope of biomolecular complexes.  <br>● Multi-entity complexes: Jointly predicts structure of complexes including proteins, nucleic acids, small molecules, ions, and modified residues in a single unified model.  <br>● Beyond protein-only: Dramatically expands applicability beyond AlphaFold 2's protein-only regime, opening up drug discovery and RNA biology workflows.  <br>● Beats specialist predictors: Achieves greater accuracy on protein-nucleic acid interactions than specialized predictors in that domain - remarkable for a general model.  <br>● Biology pipeline signal: Preview of the capability direction that would crystallize as AlphaFold 3 in 2024, with profound implications for structural biology research.                                                           | [Paper](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf), [Tweet](https://x.com/demishassabis/status/1719345831730368596?s=20) |
| 9) **EmotionPrompt** - Microsoft researchers show that appending emotional stimuli to prompts reliably improves LLM performance across 45 tasks.  <br>● 45-task evaluation: Tested across Flan-T5-Large, Vicuna, Llama 2, BLOOM, ChatGPT, and GPT-4 on 45 deterministic and generative tasks.  <br>● Emotional stimuli: Appends phrases like "This is very important to my career" to prompts, drawing on social-psychology theories of human motivation.  <br>● Consistent gains: Produces consistent improvements across both smaller and frontier models, despite the prompts being content-free manipulations.  <br>● Emotional-intelligence signal: Suggests LLMs have internalized patterns connecting emotional framing to effort - a "bug or feature" question that has driven follow-up research on LLM behavioral psychology.         | [Paper](https://arxiv.org/abs/2307.11760), [Tweet](https://x.com/emollick/status/1720135672764285176?s=20)                                                                                                                |
| 10) **FP8-LM** - Microsoft's FP8-LM demonstrates that most LLM training variables - gradients, optimizer states - can use FP8 without sacrificing accuracy.  <br>● FP8 across the pipeline: Extends FP8 training beyond forward activations to gradients and optimizer states (both moments), widening the FP8 footprint.  <br>● No hyperparameter changes: Works as a drop-in replacement for FP16/BF16 training without requiring changes to learning rates, schedules, or other hyperparameters.  <br>● Matched accuracy: Achieves accuracy indistinguishable from FP16/BF16 baselines on LLM pretraining tasks.  <br>● Efficiency gains: Delivers substantial memory and compute savings, particularly attractive for training large models on FP8-capable hardware like H100.                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2310.18313), [Tweet](https://x.com/arankomatsuzaki/status/1718813303223222765?s=20)                                                                                                         |

---

## Top AI Papers of the Week (October 23 - October 29)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                     | **Links**                                                                                                                           |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Zephyr** - Hugging Face's Zephyr-7B is a 7B parameter LLM whose chat performance rivals much larger chat models aligned with human feedback.  <br>● Distilled SFT: Uses distilled supervised fine-tuning on UltraChat-generated instruction data as the task-accuracy foundation.  <br>● Distilled DPO: Aligns with AI feedback data via Direct Preference Optimization, rather than the expensive human-feedback RLHF pipeline.  <br>● ChatGPT-level at 7B: Achieves competitive performance with ChatGPT on AlpacaEval and matches 70B chat models aligned with human feedback on several benchmarks.  <br>● Recipe popularization: Open-sources the distilled-DPO recipe, which became a widely adopted template for small, strong open chat models.                                         | [Paper](https://arxiv.org/abs/2310.16944), [Tweet](https://x.com/nazneenrajani/status/1717747969842417723?s=20)                     |
| 2) **Fact-Checking with LLMs** - Investigates the fact-checking capabilities of frontier LLMs across multiple languages and claim types.  <br>● Contextual information helps: LLMs perform significantly better at fact-checking when equipped with retrieved evidence, validating the RAG pattern for claim verification.  <br>● GPT-4 > GPT-3: GPT-4 shows meaningful accuracy gains over GPT-3 for fact-checking, but both struggle without supporting context.  <br>● Multilingual variance: Accuracy varies substantially by query language and claim veracity, exposing persistent language-equity gaps in fact-checking.  <br>● Inconsistent reliability: While LLMs show real fact-checking promise, their accuracy is inconsistent enough that they can't replace human fact-checkers - useful as assistants, not arbiters.                               | [Paper](https://arxiv.org/abs/2310.13549), [Tweet](https://x.com/omarsar0/status/1717550929145119212?s=20)                          |
| 3) **Matryoshka Diffusion Models** - Apple introduces an end-to-end framework for high-resolution image and video synthesis that denoises across multiple resolutions jointly.  <br>● Joint multi-resolution diffusion: Runs the diffusion process at multiple resolutions simultaneously, sharing representations across scales in a single unified model.  <br>● NestedUNet: Uses a NestedUNet architecture so that higher-resolution branches build on lower-resolution features without a separate cascade.  <br>● Progressive training: Trains progressively from low to high resolution, dramatically improving optimization stability for high-resolution generation.  <br>● Unified model: Eliminates the typical cascaded-diffusion pipeline used in prior high-resolution generation, simplifying training and serving. | [Paper](https://arxiv.org/abs/2310.15111), [Tweet](https://x.com/thoma_gu/status/1716923384846856691?s=20)                          |
| 4) **Spectron** - Google's Spectron is a spoken-language model trained end-to-end on raw spectrograms rather than text or discrete audio tokens.  <br>● End-to-end spectrogram modeling: Processes spectrograms directly without an intermediate speech-recognition or tokenization step, preserving paralinguistic information.  <br>● High-quality spoken output: Fine-tuned to generate high-quality, accurate spoken language while preserving speaker and prosody characteristics.  <br>● Speaker preservation: Outperforms prior spoken-language models on speaker preservation - a known weakness of tokenizer-based approaches.  <br>● Semantic coherence: Also improves semantic coherence of generated speech, addressing the common drift problem in spectrogram-level generation.                                                                                                                                               | [Paper](https://arxiv.org/abs/2305.15255), [Tweet](https://x.com/GoogleAI/status/1717584836834001066?s=20)                          |
| 5) **LLMs Meet New Knowledge** - A benchmark that evaluates how well LLMs handle new knowledge beyond their training cutoff.  <br>● Three-dimensional evaluation: Tests knowledge understanding, knowledge differentiation (old vs. new), and knowledge association across the full set of relations.  <br>● Post-cutoff focus: Uses knowledge that appears after the model's training cutoff, avoiding contamination that undermines many LLM knowledge benchmarks.  <br>● LLMs struggle with new knowledge: Reveals systematic gaps - even frontier LLMs handle post-cutoff facts significantly worse than pre-cutoff ones, despite strong reasoning.  <br>● RAG-oriented motivation: Provides empirical grounding for RAG: parametric memory is tied to training data, so retrieval remains necessary for fresh knowledge.                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2310.14820), [Tweet](https://x.com/omarsar0/status/1716817266195796186?s=20)                          |
| 6) **Min-K% Prob (Detecting Pretraining Data)** - Proposes Min-K% Prob as an effective detection method for determining whether specific text was in an LLM's pretraining data.  <br>● Method: Computes the average log-probability of the K% least-likely tokens in a text; memorized text has higher log-probabilities on these tokens than unseen text.  <br>● Black-box detection: Works on API-accessible models without needing gradients or internal activations, making it broadly applicable.  <br>● Multiple use cases: Usable for benchmark-contamination detection, privacy auditing of machine unlearning, and copyrighted-text detection in pretraining corpora.  <br>● Policy implications: Provides a technical tool for the copyright and privacy debates, letting third parties measurably test specific-text inclusion in training data.                        | [Paper](https://arxiv.org/abs/2310.16789), [Tweet](https://x.com/WeijiaShi2/status/1717612387174687150?s=20)                        |
| 7) **ConvNets Match Vision Transformers** - DeepMind shows that strong ConvNet architectures pretrained at scale match ViTs on ImageNet performance at comparable compute.  <br>● JFT-4B pretraining: Pretrains performant ConvNet architectures (NFNets) on JFT-4B at scale - matching the data regime where ViTs typically pull ahead.  <br>● Log-log scaling law: Observes a log-log scaling law between held-out loss and compute, mirroring the scaling properties seen in ViTs.  <br>● ImageNet parity: Fine-tuned NFNets match the reported performance of Vision Transformers at comparable compute budgets, refuting the "ConvNets don't scale" narrative.  <br>● Architecture vs. recipe: Argues that the ConvNet-vs-ViT gap is largely a scale/recipe gap rather than an architectural limitation - a recurring theme in vision research.                                                              | [Paper](https://arxiv.org/abs/2310.16764), [Tweet](https://x.com/_akhaliq/status/1717385905214759421?s=20)                          |
| 8) **CommonCanvas** - Releases CommonCanvas, a text-to-image dataset composed entirely of Creative-Commons-licensed images.  <br>● CC-only training data: Every image is Creative Commons-licensed, providing a clean-license dataset for commercial and research T2I training.  <br>● Scale despite licensing constraints: Curates hundreds of millions of images despite the CC-only constraint, dispelling the myth that legal T2I training requires permissive copyrighted data.  <br>● Strong baseline models: Trains SD-style models on CommonCanvas that reach competitive quality, demonstrating CC data is sufficient for SoTA T2I.  <br>● Policy contribution: Provides a practical counterexample to the argument that copyrighted training data is necessary - important as copyright litigation reshaped the AI-data landscape.                                                                                                                                                                                                                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2310.16825), [Tweet](https://x.com/iScienceLuvr/status/1717359916422496596?s=20)                      |
| 9) **Managing AI Risks (Bengio, Hinton, et al.)** - A high-profile position paper by leading AI researchers laying out risks from upcoming advanced AI systems.  <br>● Risk catalog: Enumerates social harms, malicious uses, large-scale autonomous risks, and potential loss-of-control scenarios from increasingly capable AI.  <br>● Signatory weight: Signed by multiple Turing Award-winning researchers including Hinton and Bengio, amplifying its impact in the policy conversation.  <br>● Concrete recommendations: Calls for investment in safety research, mandatory standards for advanced AI, and international coordination - not a pure threat-inventory.  <br>● Political moment: Published during active AI-regulation discussions in the US and UK, directly influencing the UK AI Safety Summit and related policy processes.                                                                                                      | [Paper](https://managing-ai-risks.com/managing_ai_risks.pdf), [Tweet](https://x.com/geoffreyhinton/status/1717967329202491707?s=20) |
| 10) **Branch-Solve-Merge (BSM)** - BSM decomposes LLM tasks into parallel sub-tasks via three LLM-programmed modules: branch, solve, and merge.  <br>● Three-module architecture: A branch module proposes a decomposition into parallel sub-tasks, a solve module independently answers each, and a merge module fuses results into a final response.  <br>● Prompt-parameterized: All three modules are the same base LLM with different prompts, so BSM works with any base model without fine-tuning.  <br>● Evaluation quality gains: Improves evaluation correctness and consistency for multiple LLMs, particularly on tasks where a flat prompt leaves too much implicit.  <br>● General pattern: Generalizes the "decompose then solve" pattern from math/CoT to arbitrary tasks, anticipating more structured agent decomposition patterns.  | [Paper](https://arxiv.org/abs/2310.15123), [Tweet](https://x.com/jaseweston/status/1716635331393380619?s=20)                        |

---

## Top AI Papers of the Week (October 16 - October 22)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                 | **Links**                                                                                                                                                                                                        |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Llemma** - Llemma is an open LLM for mathematics built via continued pretraining of Code Llama on the Proof-Pile-2 dataset.  <br>● Proof-Pile-2 dataset: Mixes scientific papers, math-heavy web pages, and mathematical code into a focused math-pretraining corpus.  <br>● Code Llama base: Uses Code Llama as the base model, leveraging its existing code proficiency as a scaffold for formal-style math reasoning.  <br>● Beats unreleased Minerva: Outperforms open base models and the unreleased Minerva on the MATH benchmark at comparable scale.  <br>● Full open release: Releases model, dataset, and code - positioning Llemma as a reproducible starting point for open mathematical LLM research.              | [Paper](https://arxiv.org/abs/2310.10631), [Tweet](https://x.com/zhangir_azerbay/status/1714098025956864031?s=20)                                                                                                |
| 2) **LLMs for Software Engineering** - A comprehensive survey of LLMs for software engineering covering models, tasks, evaluation, and open challenges.  <br>● Task coverage: Surveys code generation, bug detection and repair, code review, code translation, documentation, and testing.  <br>● Model landscape: Reviews code-specialized LLMs (Codex, StarCoder, CodeLlama) alongside general-purpose LLMs applied to code.  <br>● Evaluation review: Catalogs standard benchmarks (HumanEval, MBPP, DS-1000) and their limitations for real-world software engineering.  <br>● Open challenges: Highlights long-context code understanding, multi-file reasoning, verification, and agent-based SE as key open directions.                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2310.03533), [Tweet](https://x.com/omarsar0/status/1713940983199506910?s=20)                                                                                                       |
| 3) **Self-RAG** - Self-RAG trains an LM to adaptively retrieve, generate, and self-critique using special reflection tokens.  <br>● Reflection tokens: Introduces special tokens that control retrieval decisions, passage relevance judgments, and self-evaluation of generations.  <br>● Adaptive retrieval: The model decides on-the-fly whether to retrieve, rather than always retrieving on every query - saving compute on knowledge-light queries.  <br>● Self-reflection: Critiques its own generations against retrieved passages, enabling controllable trade-offs between response quality and factuality at inference.  <br>● Significant gains: Outperforms state-of-the-art LLMs and strong RAG baselines on open-domain QA, reasoning, and fact verification.                                                  | [Paper](https://arxiv.org/abs/2310.11511), [Tweet](https://x.com/AkariAsai/status/1715110277077962937?s=20)                                                                                                      |
| 4) **RAG for Long-Form QA** - Explores retrieval-augmented LMs specifically on long-form question answering, where RAG failures are more subtle.  <br>● Retrieval is necessary: Confirms that retrieval is an important component for long-form QA, but that evidence documents must be carefully curated and ordered.  <br>● Attribution errors: Documents attribution errors - where the model cites passages that don't actually support its claims - and shows these spike when retrieved docs lack sufficient evidence.  <br>● Document ordering: Demonstrates that document order within the context substantially affects long-form QA attribution accuracy.  <br>● Practical guidelines: Offers concrete guidelines for document selection, ordering, and prompting to reduce hallucination in long-form RAG outputs. | [Paper](https://arxiv.org/abs/2310.12150), [Tweet](https://x.com/omarsar0/status/1714986431859282144?s=20)                                                                                                       |
| 5) **GenBench** - A Nature Machine Intelligence paper framework for characterizing and understanding generalization research in NLP.  <br>● Meta-analysis: Reviews 543 papers on generalization in NLP, mapping what "generalization" actually means across different research threads.  <br>● Generalization taxonomy: Organizes generalization into compositional, structural, cross-lingual, cross-task, and cross-domain generalization types.  <br>● Evaluation taxonomy: Provides tools for classifying generalization studies by the kind of distribution shift and evaluation protocol they test.  <br>● Research infrastructure: Ships with tools to help researchers classify and compare generalization work, aiming to reduce conceptual fragmentation in the field.                                                                                                                                                                                                                                                | [Paper](https://www.nature.com/articles/s42256-023-00729-y?utm_source=twitter&utm_medium=organic_social&utm_campaign=research&utm_content=link), [Tweet](https://x.com/AIatMeta/status/1715041427283902793?s=20) |
| 6) **LLM Self-Explanations** - Investigates whether LLMs can generate useful feature-attribution explanations for their own outputs.  <br>● Self-explanation capability: LLMs can self-generate feature-attribution explanations that meaningfully highlight the tokens driving their predictions.  <br>● Performance + truthfulness: Self-explanation improves both task performance and the truthfulness of outputs compared to baseline prompting.  <br>● CoT synergy: Combines productively with chain-of-thought prompting, giving additive improvements rather than substituting for it.  <br>● Interpretability lever: Offers a cheap, model-agnostic interpretability pattern that works through the API without needing gradients or white-box access.                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2310.11207), [Tweet](https://x.com/omarsar0/status/1714665747752923620?s=20)                                                                                                       |
| 7) **OpenAgents** - An open platform for running and hosting real-world language agents, including three distinct agent types.  <br>● Data Agent: A data-analysis agent capable of exploring datasets, running analyses, and producing visualizations through conversation.  <br>● Plugins Agent: Integrates 200+ daily-use API tools (e.g., weather, search, calendars) into a single conversational agent interface.  <br>● Web Agent: An autonomous web-browsing agent capable of navigating real websites and completing multi-step tasks.  <br>● Open alternative to ChatGPT Plus: Positions OpenAgents as an open-source alternative to ChatGPT's plugin ecosystem, usable for research into agent-user interaction patterns.                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2310.10634v1), [Tweet](https://x.com/ChengZhoujun/status/1714343204148113860?s=20)                                                                                                 |
| 8) **Eliciting Human Preferences with LLMs** - Anthropic uses LLMs to guide the task-specification process, eliciting user intent through natural-language dialogue.  <br>● Interactive elicitation: The LLM asks the user open-ended questions to clarify intent, producing a structured task specification that the model can then execute.  <br>● Beats user-written prompts: Systems built via LLM-elicited specifications produce more informative, accurate responses than user-written prompts alone.  <br>● Better than single-shot prompting: Shows that multi-turn elicitation yields higher task-success rates than single-shot prompting, even when the user is not a prompt engineer.  <br>● Usable AI pattern: Offers a pattern for bridging the user-intent gap that shapes AI product design - spec-driven rather than prompt-driven interaction.                                                                       | [Paper](https://arxiv.org/abs/2310.11589), [Tweet](https://x.com/AlexTamkin/status/1715040019520569395?s=20)                                                                                                     |
| 9) **AutoMix** - AutoMix routes queries between LLMs of different sizes based on smaller-model confidence, saving cost without sacrificing quality.  <br>● Confidence-based routing: A small model answers first; a confidence signal determines whether to accept its answer or escalate to a larger model.  <br>● Cascading thresholds: Uses multiple confidence thresholds to route queries through a cascade of increasingly capable (and expensive) models.  <br>● Cost-quality Pareto: Achieves Pareto improvements over single-model baselines, delivering equivalent quality at substantially lower inference cost.  <br>● Production relevance: The pattern maps cleanly onto practical LLM deployment where most queries can be handled by cheap models but a tail of hard queries need the frontier model.                                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2310.12963), [Tweet](https://x.com/omarsar0/status/1715385477627334718?s=20)                                                                                                       |
| 10) **Video Language Planning** - Enables synthesizing complex long-horizon video plans for robotics via tree search over vision-language and text-to-video models.  <br>● Tree-search planner: Uses a tree-search procedure over a vision-language model serving as policy+value, with a text-to-video model acting as the dynamics model.  <br>● Long-horizon plans: Produces multi-step video plans for robotics tasks that would be infeasible with single-shot video generation.  <br>● Cross-domain generalization: Works across diverse robotics domains, showing the approach is not tied to a specific embodiment or task type.  <br>● Planning-via-generation: Demonstrates that generative video models can serve as world models for planning, a pattern that has gained traction through 2024.                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2310.10625), [Tweet](https://x.com/du_yilun/status/1714297584842318157?s=20)                                                                                                       |

---

## Top AI Papers of the Week (October 9 - October 15)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | **Links**                                                                                                       |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| 1) **Ring Attention** - UC Berkeley's Ring Attention scales transformer context to 100M+ tokens by distributing blockwise self-attention across devices in a ring topology.  <br>● Blockwise attention: Computes self-attention in blocks so that only small KV chunks need to fit on each device at any time.  <br>● Ring communication: Passes KV chunks between devices in a ring, overlapping communication with computation to hide networking latency.  <br>● Context scales with devices: Achievable context length grows linearly with the number of devices, with no attention approximations required.  <br>● 100M+ tokens: Enables context lengths exceeding 100 million tokens in theory, far beyond what any single-device attention implementation can reach.                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2310.01889), [Tweet](https://x.com/haoliuhl/status/1709630382457733596?s=20)      |
| 2) **UniSim (Universal Simulator)** - Google's UniSim learns a universal generative simulator of real-world interactions from diverse video + action data.  <br>● Generative world model: Simulates how humans and agents interact with the world by predicting the visual outcome of high-level instructions and low-level controls.  <br>● Diverse action conditioning: Handles both text instructions ("pick up the cup") and low-level motor commands, unifying instruction-following and dynamics modeling.  <br>● Training downstream systems: Can be used to train vision-language planners, low-level RL policies, and video-captioning systems - acting as a general data source.  <br>● World-model agenda: A key datapoint for the broader "generative world models for embodied AI" research agenda that accelerated through 2024.                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2310.06114), [Tweet](https://x.com/mengjiao_yang/status/1712153304757915925?s=20) |
| 3) **Survey on Factuality in LLMs** - A survey covering evaluation and enhancement techniques for LLM factuality.  <br>● Evaluation taxonomy: Organizes factuality evaluation by granularity (token, sentence, passage), task (QA, generation, dialogue), and reference availability.  <br>● Enhancement taxonomy: Reviews enhancement techniques including better training data, retrieval augmentation, factuality-aware decoding, and post-hoc verification.  <br>● Factuality vs. truthfulness: Clarifies the often-confused distinction between factuality (correct facts) and truthfulness (model reports its beliefs honestly).  <br>● Open problems: Highlights persistent gaps in cross-lingual factuality, open-ended generation factuality, and calibration.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2310.07521), [Tweet](https://x.com/omarsar0/status/1712469661118517740?s=20)      |
| 4) **Hypothesis Search (LLMs Can Learn Rules)** - A two-stage framework where the LLM learns a rule library for reasoning.  <br>● Rule induction phase: In the first stage, the LLM induces general rules from a small set of examples, producing an explicit rule library rather than implicit pattern matching.  <br>● Rule application phase: In the second stage, the model applies rules from its library to new problems, with explicit rule-lookup rather than end-to-end inference.  <br>● Improves reasoning: The explicit rule library improves reasoning performance on tasks where generalization from examples beats pure in-context learning.  <br>● Interpretability bonus: The learned rule library is human-readable and auditable, providing a window into what the model actually learned from its examples.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2310.07064), [Tweet](https://x.com/zhu_zhaocheng/status/1712582734550647091?s=20) |
| 5) **Meta Chain-of-Thought Prompting (Meta-CoT)** - A generalizable CoT framework that selects domain-appropriate reasoning patterns for the task at hand.  <br>● Task-adaptive CoT: Rather than using a fixed CoT prompt template, Meta-CoT adaptively selects reasoning patterns based on task characteristics.  <br>● Pattern library: Maintains a library of reasoning templates tailored to task families (math, logic, commonsense, etc.), picking the best one per query.  <br>● Strong across tasks: Improves reasoning accuracy across diverse task types compared to single-template CoT prompting.  <br>● Generalizable framework: The Meta-CoT pattern is easy to extend to new task families by just adding new templates to the library.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2310.06692), [Tweet](https://x.com/omarsar0/status/1712835499256090972?s=20)      |
| 6) **LLMs for Healthcare Survey** - A comprehensive overview of LLMs applied to the healthcare domain.  <br>● Application coverage: Surveys clinical decision support, patient communication, medical summarization, diagnostic assistance, and biomedical research applications.  <br>● Medical-LLM landscape: Reviews major medical LLMs (Med-PaLM, MEDITRON, ClinicalBERT) alongside general-purpose LLMs prompted for medical use.  <br>● Benchmarks: Catalogs medical QA benchmarks and discusses their limitations for predicting real-world clinical usefulness.  <br>● Deployment challenges: Covers regulatory, privacy, and safety challenges specific to healthcare LLM deployment.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2310.05694), [Tweet](https://x.com/omarsar0/status/1711755055777415485?s=20)      |
| 7) **RECOMP (Retrieval-Augmented LMs with Compressors)** - Proposes two compression approaches to shrink retrieved documents before in-context use.  <br>● Extractive compressor: Selects the most useful sentences from retrieved documents, retaining the most relevant signal at a fraction of token budget.  <br>● Abstractive compressor: Generates a summary synthesizing information from multiple retrieved documents, compressing redundancy across sources.  <br>● 6% compression rate: Achieves compression rates as low as 6% with minimal performance loss on language modeling and open-domain QA.  <br>● Selective augmentation: The training scheme learns to emit empty summaries when retrieved docs are irrelevant - a built-in mechanism for gracefully handling noisy retrieval. | [Paper](https://arxiv.org/abs/2310.04408), [Tweet](https://x.com/omarsar0/status/1711384213092479130?s=20)      |
| 8) **InstructRetro** - NVIDIA introduces Retro 48B, the largest LLM pretrained with retrieval at the time.  <br>● 48B scale: Continues pretraining a 43B parameter GPT model on 100B additional tokens while retrieving from a 1.2T-token database.  <br>● Instruction tuning: Further instruction-tunes the retrieval-pretrained model, producing an instruction-following version of Retro.  <br>● Stronger factuality: Shows reduced hallucination and better factuality on knowledge-intensive tasks compared to Retro-free baselines at comparable scale.  <br>● Retrieval pretraining validated: Provides evidence that retrieval-during-pretraining can scale to 40B+ parameters and benefit downstream instruction-tuned use cases.                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2310.07713), [Tweet](https://x.com/omarsar0/status/1712466049428521433?s=20)      |
| 9) **MemWalker** - MemWalker treats the LLM as an interactive agent that traverses a tree-structured summary of long text.  <br>● Tree of summary nodes: Preprocesses long context into a hierarchical tree of summary nodes, compressing and structuring the information.  <br>● Query-driven traversal: Given a query, the LLM traverses the tree through iterative prompting, descending into subtrees that are most relevant to the question.  <br>● Reasoning-based reading: The traversal decisions are reasoning-based, so the model can explain which part of the document it consulted and why.  <br>● Explainability bonus: The traversal trace serves as a human-readable explanation of the model's document reading, improving debuggability of long-context QA.                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2310.05029), [Tweet](https://x.com/__howardchen/status/1711584916708938042?s=20)  |
| 10) **FireAct (Language Agent Fine-tuning)** - Explores fine-tuning LLMs specifically for language-agent use, demonstrating consistent gains over prompting alone.  <br>● Fine-tuning beats prompting: Language agents consistently improve over prompted baselines after fine-tuning their backbone LLM on agent trajectories.  <br>● 500 trajectories suffice: Fine-tuning a Llama 2-7B on just 500 agent trajectories produces a substantially stronger language agent than a prompted GPT-4 on several agent benchmarks.  <br>● Data-efficient: The low data threshold suggests agent behaviors can be cheaply specialized, which matters for production agent deployment.  <br>● Agent-specialization pattern: Anticipates the wave of agent-specialized LLMs released through 2024, where small focused fine-tunes outperform prompting of large general models.                                                                                                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2310.05915), [Tweet](https://x.com/omarsar0/status/1711757242905534479?s=20)      |

---

## Top AI Papers of the Week (October 2 - October 8)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | **Links**                                                                                                        |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| 1) **LLMs Represent Space and Time** - MIT researchers find that LLMs internally encode linear representations of space and time across multiple scales.  <br>● Linear geographic representations: Activations contain linear representations of coordinates (latitude, longitude) of real-world entities, detectable via probes.  <br>● Multi-scale time: Similar linear representations exist for time at multiple scales (historical year, news date, etc.), suggesting a structured temporal axis.  <br>● Robust across prompts: The representations are robust to prompt variations and unified across different entity types (cities, events, people).  <br>● World-model evidence: Provides empirical support for the claim that LLMs build literal world models, not just surface-statistics imitators - a live debate in interpretability.                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2310.02207), [Tweet](https://x.com/wesg52/status/1709551516577902782?s=20)         |
| 2) **Retrieval Meets Long-Context LLMs** - NVIDIA's study comparing RAG and long-context LLMs, with the punchline that the two are complementary rather than substitutes.  <br>● 4K + RAG ≈ 16K fine-tuned: An LLM with only a 4K context window using simple RAG can match a fine-tuned LLM with 16K context - a striking efficiency result.  <br>● Retrieval always helps: Retrieval improves performance regardless of context-window size, even when the model can fit the full document in its native context.  <br>● LLaMA-2 70B beats GPT-3.5: A retrieval-augmented LLaMA 2 70B with 32K context outperforms GPT-3.5-turbo-16k on seven long-context tasks including QA and query-based summarization.  <br>● Implication: Don't think of long context and retrieval as competing solutions - pair them, and let the model attend to both the query and retrieved evidence.                                                                                          | [Paper](https://arxiv.org/abs/2310.03025), [Tweet](https://x.com/omarsar0/status/1709749178199318545?s=20)       |
| 3) **StreamingLLM** - MIT's StreamingLLM enables efficient streaming inference by preserving "attention sinks" - early-sequence tokens that most attention mass flows to.  <br>● Attention sink phenomenon: The authors observe that attention heads consistently route a large fraction of attention mass to the first few tokens, even when those tokens are semantically irrelevant.  <br>● Sink tokens are essential: Keeping the KV states of initial tokens around dramatically recovers the performance of sliding-window attention.  <br>● Infinite-length inference: Enables LLMs trained with finite context to generate infinitely long outputs without fine-tuning, by retaining sink tokens plus a sliding window.  <br>● Emergent explanation: Attention sinks appear because the softmax must normalize to one - unused attention mass is "dumped" onto the first tokens, which explains why removing them breaks the model.                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2309.17453), [Tweet](https://x.com/Guangxuan_Xiao/status/1708943505731801325?s=20) |
| 4) **Neural Developmental Programs (NDPs)** - Proposes neural networks that self-assemble through a developmental process inspired by biological embryonic development.  <br>● Bio-inspired growth: A small set of developmental rules governs how neurons replicate and connect, mirroring the way biological nervous systems grow from genomes.  <br>● Indirect encoding: The final network emerges from a much smaller developmental program rather than being specified directly - an indirect encoding scheme.  <br>● Self-assembly: Networks self-assemble through repeated application of local developmental rules, without a global blueprint.  <br>● Research direction: Positioned as a step toward more open-ended, flexible neural architectures that could eventually grow and adapt throughout training rather than being fixed a priori.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2307.08197), [Tweet](https://x.com/risi1979/status/1708888992224362742?s=20)       |
| 5) **The Dawn of LMMs (GPT-4V Deep Dive)** - Microsoft's exhaustive 166-page analysis of GPT-4V's capabilities and limitations.  <br>● Comprehensive task coverage: Probes GPT-4V across visual reasoning, code, OCR, document understanding, multimodal commonsense, and agent-style tasks.  <br>● Working input modes: Catalogs the diverse input patterns GPT-4V supports - single images, multi-image reasoning, image-text interleaving, sketches, and handwritten input.  <br>● Capability frontier: Demonstrates emergent capabilities like reading diagrams, interpreting medical imaging, and extracting structured information from complex visuals.  <br>● Open issues: Identifies persistent weaknesses including hallucination, fine-grained spatial reasoning, and consistency across related queries - a reference for what was still broken at the start of the GPT-4V era.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2309.17421), [Tweet](https://x.com/omarsar0/status/1708860551110041871?s=20)       |
| 6) **Training LLMs with Pause Tokens** - CMU shows that adding a learnable `<pause>` token during both pretraining and fine-tuning gives the model extra "thinking time" and improves reasoning.  <br>● Learnable pause token: Inserts a `<pause>` token into the input; the model processes these tokens but doesn't treat them as meaningful content, letting it compute more before answering.  <br>● CommonsenseQA and math gains: Produces measurable performance gains on CommonsenseQA and math word problems - both tasks that benefit from extra internal computation.  <br>● Pretraining is required: The benefit only materializes if pauses are introduced in both pretraining and fine-tuning - adding them only at inference doesn't work.  <br>● Compute-aware decoding: Positions pause tokens as a simple inference-time knob for trading compute against accuracy, foreshadowing many 2024 "thinking time" tricks.                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2310.02226), [Tweet](https://x.com/omarsar0/status/1709573238123122959?s=20)       |
| 7) **Self-Taught Optimizer (STOP)** - Proposes recursively self-improving code generation where an LLM-scaffolded program improves itself.  <br>● Seed improver: A "seed improver" program first improves an input program to return the best solution found - a self-improvement scaffold built on GPT-4.  <br>● Recursive improvement: The seed improver is itself tasked with improving itself, producing the first concrete demonstration of recursive self-improvement in LLM code generation.  <br>● GPT-4 capable: Shows that GPT-4 models can write code that modifies itself iteratively, producing measurably better scaffolds than the initial seed.  <br>● Foundational work: An early, influential demonstration of the LLM-as-code-modifier pattern that would reappear across 2024 in agent and tool-use research.                                                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2310.02304), [Tweet](https://x.com/ericzelikman/status/1709721771937587541?s=20)   |
| 8) **RA-DIT (Retrieval-Augmented Dual Instruction Tuning)** - Meta's RA-DIT is a lightweight recipe that retrofits LLMs with retrieval capabilities through dual fine-tuning.  <br>● Two-stage fine-tuning: Stage 1 updates the LM to better use retrieved information; stage 2 updates the retriever to return documents the LM actually prefers.  <br>● Each stage adds gains: Both stages contribute meaningfully and combine to produce strong downstream RAG performance without end-to-end joint training.  <br>● 65B SoTA: The 65B model achieves state-of-the-art on a range of knowledge-intensive zero-shot and few-shot benchmarks.  <br>● Strong relative gains: Outperforms existing retrieval-augmented approaches by up to +8.9% in zero-shot and +1.4% in 5-shot settings - non-trivial gains on already-strong baselines. | [Paper](https://arxiv.org/abs/2310.01352), [Tweet](https://x.com/omarsar0/status/1709204756013490494?s=20)       |
| 9) **KOSMOS-G** - Microsoft's KOSMOS-G extends zero-shot image generation to multi-image vision-language input.  <br>● Generalized VL input: Generates images from a vision-language prompt that can include multiple reference images, unlike typical single-reference setups.  <br>● Multi-entity scenarios: Extends zero-shot subject-driven image generation to scenarios with multiple subjects - e.g., generating a scene where A is doing X to B, preserving each identity.  <br>● CLIP-replaceable: Allows replacing CLIP in downstream image-generation pipelines, unlocking new applications with U-Net techniques like ControlNet and LoRA.  <br>● Unified generation interface: Positions itself as a unified vision-language input interface for controllable image generation, rather than a new diffusion backbone.                                                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2310.02992), [Tweet](https://x.com/omarsar0/status/1709934741158510625?s=20)       |
| 10) **Analogical Prompting** - Google's Analogical Prompting guides LLM reasoning by having the model self-generate relevant exemplars on the fly.  <br>● Self-generated exemplars: Rather than requiring curated few-shot demonstrations, the model is prompted to recall or generate relevant analogous problems before solving the target question.  <br>● Analogical-reasoning inspiration: Draws on the cognitive-science concept of analogical reasoning, where humans solve new problems by invoking similar past cases.  <br>● No labeled exemplars needed: Unlike CoT, which requires demonstrations of the reasoning process, Analogical Prompting requires no labeled reasoning data at all.  <br>● Benchmark gains: Improves over standard CoT and zero-shot baselines across math, commonsense, and code reasoning tasks, with particularly strong gains on math word problems.                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2310.01714), [Tweet](https://x.com/michiyasunaga/status/1709582150025240854?s=20)  |

---

## Top AI Papers of the Week (September 25 - October 1)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                | **Links**                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| 1) **The Reversal Curse** - Finds that LLMs trained on "A is B" fail to generalize to "B is A" - a surprisingly deep failure of learning.  <br>● Asymmetric fact learning: LLMs finetuned on statements of the form "A is B" show no ability to answer "Who is B?" with A, even after extensive training.  <br>● Fictitious-statement testbed: Demonstrates the effect using fine-tuning on fictitious statements, so training data can't contribute the reverse direction through coincidence.  <br>● Model-family robust: The Reversal Curse persists across different model sizes and model families, suggesting it reflects a fundamental property of next-token prediction training.  <br>● Knowledge representation implication: Raises hard questions about how LLMs represent knowledge - they clearly don't store bidirectional relations by default, unlike symbolic knowledge bases.                                                                                           | [Paper](https://owainevans.github.io/reversal_curse.pdf), [Tweet](https://x.com/OwainEvans_UK/status/1705285631520407821?s=20) |
| 2) **Effective Long-Context Scaling (Meta)** - Meta proposes a 70B long-context LLM that surpasses GPT-3.5-turbo-16k on long-context benchmarks.  <br>● Continual pretraining recipe: Uses continual pretraining on long documents to extend Llama 2's context window efficiently, without training a new model from scratch.  <br>● Beats GPT-3.5-turbo-16k: The 70B variant outperforms GPT-3.5-turbo-16k on a suite of long-context tasks including document QA, summarization, and multi-hop reasoning.  <br>● Cost-effective instruction tuning: Introduces an instruction-tuning procedure that doesn't require human-annotated long-instruction data - a common bottleneck for long-context fine-tuning.  <br>● Open release: Produces an open long-context Llama 2 variant, making strong long-context capability accessible to the research community.                                                                                                                      | [Paper](https://arxiv.org/abs/2309.16039), [Tweet](https://x.com/omarsar0/status/1707780482178400261?s=20)                     |
| 3) **Graph Neural Prompting (GNP)** - A plug-and-play method that injects knowledge-graph information into frozen pretrained LLMs.  <br>● KG-to-embedding bridge: Uses a graph neural network to encode relevant knowledge-graph subgraphs into a soft prompt embedding that conditions the LLM.  <br>● Frozen-LLM compatible: Works with frozen pretrained LLMs without requiring any fine-tuning, making it cheap to adopt.  <br>● Commonsense gains: Improves performance on commonsense QA benchmarks where structured knowledge-graph information is known to help.  <br>● Modular extensibility: The GNN-encoded soft-prompt pattern generalizes beyond KGs to any structured input that can be encoded into embeddings.                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2309.15427), [Tweet](https://x.com/omarsar0/status/1707211751354212382?s=20)                     |
| 4) **Vision Transformers Need Registers** - Meta researchers identify artifact tokens in ViT feature maps and propose a trivial fix: add dedicated register tokens.  <br>● Artifact identification: Vision transformers repurpose certain input tokens as "internal scratch space", producing high-norm artifacts that contaminate feature maps.  <br>● Register tokens: Adds a small number of dedicated register tokens to the input sequence, giving the model explicit scratch space instead of co-opting patch tokens.  <br>● Cleaner features: The fix produces substantially smoother feature and attention maps, with the artifact tokens disappearing.  <br>● New SoTA on dense tasks: Sets new state-of-the-art results on dense visual prediction tasks (segmentation, depth, object discovery), with real downstream impact. | [Paper](https://arxiv.org/abs/2309.16588), [Tweet](https://x.com/TimDarcet/status/1707769575981424866?s=20)                    |
| 5) **Boolformer** - The first Transformer trained to perform end-to-end symbolic regression of Boolean functions.  <br>● End-to-end symbolic regression: Directly predicts compact Boolean formulas from input-output examples, skipping the typical search-over-programs loop of symbolic regression.  <br>● Handles complex functions: Produces compact formulas for complex Boolean functions that traditional symbolic-regression methods struggle to compress.  <br>● Gene regulatory networks: Applied to modeling the dynamics of gene regulatory networks, providing a concrete real-world application beyond synthetic benchmarks.  <br>● Transformer-as-symbolic-learner: Extends the "Transformer as symbolic regression engine" line started by earlier work on equation discovery, covering the discrete-logic case.                                                                                                                                                       | [Paper](https://arxiv.org/abs/2309.12207), [Tweet](https://x.com/stephanedascoli/status/1706235856778834015?s=20)              |
| 6) **LLaVA-RLHF** - Adapts factually augmented RLHF to aligning large multimodal models, reducing hallucination without falling into reward-hacking pitfalls.  <br>● Factually augmented RLHF: Augments the reward model with factual-consistency signals (e.g., grounded-in-image checks), reducing the reward hacking common in vanilla multimodal RLHF.  <br>● Hallucination reduction: Produces meaningful reductions in hallucination on multimodal benchmarks compared to SFT-only or vanilla RLHF variants.  <br>● 94% of text GPT-4: Reaches 94% of the performance level of text-only GPT-4 on LLaVA-Bench - closing a substantial gap via alignment alone.  <br>● Open recipe: Releases the full training recipe so the multimodal RLHF approach can be applied to other open VLMs.                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2309.14525), [Tweet](https://x.com/arankomatsuzaki/status/1706839311306621182?s=20)              |
| 7) **LLM Alignment Survey** - A comprehensive survey of LLM alignment research spanning theoretical foundations to adversarial pressure.  <br>● Outer and inner alignment: Distinguishes outer alignment (specifying the right objective) from inner alignment (ensuring the model actually pursues that objective).  <br>● Mechanistic interpretability: Reviews interpretability as an alignment tool, covering circuits, activation patching, and probing approaches.  <br>● Adversarial pressure: Catalogs known attacks on aligned LLMs including jailbreaks, prompt injection, and reward hacking.  <br>● Evaluation and directions: Discusses alignment evaluation methodologies and open problems, including scalable oversight for future systems beyond human capability.                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2309.15025), [Tweet](https://x.com/omarsar0/status/1706845285064818905?s=20)                     |
| 8) **Qwen** - Alibaba releases the Qwen family of open LLMs with strong tool-use and planning capabilities for language agents.  <br>● Open model family: Ships with multiple sizes (7B, 14B, 72B) and both base and chat variants, covering a wide range of downstream needs.  <br>● Tool use and planning: Emphasizes tool use and planning capabilities through targeted RLHF training for agentic tasks.  <br>● Agent-ready: Comes with agent-specific RLHF data and recipes that would inform the Qwen-Agent releases through 2024.  <br>● Multilingual strength: Strong on Chinese alongside English, filling a gap in the open-LLM landscape previously dominated by English-centric releases.                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2309.16609), [Tweet](https://x.com/omarsar0/status/1707776749042364729?s=20)                     |
| 9) **MentaLLaMA** - An open-source LLM family specialized for interpretable mental-health analysis on social media.  <br>● Mental-health focus: Fine-tuned specifically for mental-health analysis tasks including depression, anxiety, and stress detection in social media text.  <br>● Instruction-following: Supports instruction-following interfaces, letting clinicians and researchers query the model in natural language rather than via fixed classifiers.  <br>● 105K instruction dataset: Releases a multi-task, multi-source interpretable mental-health instruction dataset with 105K samples.  <br>● Interpretability-first: Emphasizes interpretable predictions rather than black-box classification, important for downstream clinical or research use.                                                                                                                                                    | [Paper](https://arxiv.org/abs/2309.13567), [Tweet](https://x.com/SAnaniadou/status/1707668936634794442?s=20)                   |
| 10) **Logical Chain-of-Thought (LogiCoT)** - A neurosymbolic framework that verifies and revises zero-shot CoT reasoning using symbolic-logic principles.  <br>● Symbolic-logic verification: Applies principles from symbolic logic to verify whether each step of a CoT reasoning chain is internally consistent.  <br>● Revision loop: When the verifier detects an inconsistency, the model revises the reasoning step before continuing, preventing error propagation.  <br>● Zero-shot: Works zero-shot without requiring labeled examples of logical reasoning - the verifier is symbolic rather than learned.  <br>● Reasoning gains: Improves CoT reasoning on logical-reasoning benchmarks where vanilla CoT tends to produce fluent but invalid chains.                                                                                                                                                       | [Paper](https://arxiv.org/abs/2309.13339), [Tweet](https://x.com/omarsar0/status/1706711389803287019?s=20)                     |

---

## Top AI Papers of the Week (September 18 - September 24)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                             | **Links**                                                                                                                           |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| 1) **AlphaMissense** - DeepMind's AlphaMissense is an AI model that classifies missense genetic variants as pathogenic or benign at genome scale.  <br>● 71M variants classified: Categorizes 89% of all 71 million possible missense variants as either likely pathogenic or likely benign, producing a comprehensive human-genome catalog.  <br>● Disease-cause identification: Helps pinpoint the molecular cause of genetic diseases, where missense variant interpretation is a known bottleneck in clinical genetics.  <br>● AlphaFold lineage: Builds on the AlphaFold family's protein-structure understanding, leveraging structural context to assess variant impact.  <br>● Open catalog: The full catalog is released to accelerate research in rare-disease diagnosis and drug target discovery.                                                                                                                                              | [Paper](https://www.science.org/doi/10.1126/science.adg7492), [Tweet](https://x.com/GoogleDeepMind/status/1704145467129389178?s=20) |
| 2) **Chain-of-Verification (CoVe)** - Meta's Chain-of-Verification adds a "deliberation" step where the LLM fact-checks its own draft before finalizing.  <br>● Four-step pipeline: (1) Draft an initial response; (2) plan verification questions for fact-checking; (3) answer each verification question independently; (4) generate a final verified response.  <br>● Independent verification: Each verification question is answered independently to avoid bias from other responses, producing more reliable fact-checks than joint answering.  <br>● Hallucination reduction: Produces measurable hallucination reductions on long-form QA tasks compared to standard and CoT prompting.  <br>● Self-correction pattern: Influential example of the "LLM as its own critic" pattern, foreshadowing many 2024 self-refinement techniques.                                                      | [Paper](https://arxiv.org/abs/2309.11495), [Tweet](https://x.com/omarsar0/status/1704901425824772275?s=20)                          |
| 3) **Contrastive Decoding for Reasoning** - Shows that contrastive decoding, a simple inference-time technique, substantially improves reasoning in large LLMs.  <br>● Contrastive decoding: Subtracts the log-probabilities of a smaller "expert" model from those of the target LLM, boosting tokens where the larger model confidently differs from the smaller one.  <br>● Llama 65B beats Llama 2: Contrastive decoding lets Llama 65B outperform Llama 2 and other strong baselines on commonsense and reasoning benchmarks.  <br>● Training-free: Requires no additional training - just a smaller model available at inference time and a modified decoding rule.  <br>● Generalizable lever: Positions contrastive decoding as a simple, cheap lever for reasoning improvement that can complement other prompting or fine-tuning techniques.                                                                                                                                                                                                                   | [Paper](https://arxiv.org/abs/2309.09117), [Tweet](https://x.com/_akhaliq/status/1703966776990597567?s=20)                          |
| 4) **LongLoRA** - An efficient LoRA-based fine-tuning recipe for extending LLM context windows without expensive full fine-tuning.  <br>● Shift short attention: Uses "shift short attention" during training, a pattern-shifted sparse approximation that mimics full attention while cutting cost.  <br>● LoRA-compatible: Works with standard LoRA, making it compatible with the existing parameter-efficient fine-tuning ecosystem.  <br>● Lower GPU cost: Dramatically reduces GPU memory and training time compared to full fine-tuning for context extension.  <br>● No accuracy compromise: Achieves comparable accuracy to full fine-tuning at extended context lengths, despite using a much cheaper approximation.                                                                                | [Paper](https://arxiv.org/abs/2309.12307), [Tweet](https://x.com/omarsar0/status/1705234482930798813?s=20)                          |
| 5) **Struc-Bench (LLMs for Structured Data)** - Studies how LLMs handle complex structured-data generation and proposes a structure-aware fine-tuning method.  <br>● Structured data challenge: Tests LLMs on generating complex structured data (HTML tables, JSON, LaTeX) where surface-form correctness matters.  <br>● Structure-aware fine-tuning: Proposes a fine-tuning recipe specifically designed to teach small models the syntactic constraints of structured outputs.  <br>● 7B beats GPT-4: A fine-tuned Llama 7B significantly outperforms GPT-3.5/4 and Vicuna-13B on structured-data generation benchmarks.  <br>● Deployment relevance: Demonstrates that for production structured-output applications, small specialized models can beat frontier general-purpose models at a fraction of the cost.                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2309.08963), [Tweet](https://x.com/omarsar0/status/1703958549917847884?s=20)                          |
| 6) **LMSYS-Chat-1M** - LMSYS releases a large-scale dataset of 1 million real-world LLM conversations collected from the Vicuna demo and Chatbot Arena.  <br>● 1M conversations: Comprises 1 million real-world conversations across 25 state-of-the-art LLMs, a uniquely broad snapshot of how people actually use chat models.  <br>● 210K unique users: Collected from 210K unique IP addresses, giving a diverse user sample rather than a curated research group.  <br>● Real-world use cases: Captures natural usage patterns - coding help, writing, exploration, role-play - across many topics and languages.  <br>● Research resource: Opens up research directions in LLM evaluation, preference modeling, and usage-pattern analysis that were previously gated by data scarcity.                                                                                                                                                                                                                 | [Paper](http://arxiv.org/abs/2309.11998), [Tweet](https://x.com/arankomatsuzaki/status/1705024956122161217?s=20)                    |
| 7) **Language Modeling Is Compression** - DeepMind empirically revisits the theoretical equivalence between prediction and compression, applied to modern LLMs.  <br>● Theoretical equivalence: Reminds that optimal compression and optimal prediction are duals - a good language model is implicitly a powerful compressor.  <br>● ImageNet compression: Chinchilla 70B compresses ImageNet patches to 43.4% of raw size, better than domain-specific codecs like PNG.  <br>● LibriSpeech compression: Compresses LibriSpeech samples to 16.4% of raw size, beating FLAC and gzip on audio data despite never being trained on audio.  <br>● Cross-modal generalization: Shows LLMs work as general-purpose compressors across text, image, and audio - a striking demonstration of in-context learning's reach. | [Paper](https://arxiv.org/abs/2309.10668), [Tweet](https://x.com/omarsar0/status/1704306357006897402?s=20)                          |
| 8) **Compositional Foundation Models (HiP)** - Proposes foundation models that compose multiple expert foundation models trained on different modalities to solve long-horizon goals.  <br>● Hierarchical planning: Uses separate foundation models for language (high-level plans), vision (grounding), and action (execution) that compose into a hierarchical planner.  <br>● Long-horizon goals: Targets goals requiring dozens of subgoals - a regime where monolithic policies typically fail.  <br>● Training-free composition: Composes existing pretrained models at inference time without joint training, dramatically reducing the compute cost of long-horizon agents.  <br>● Robotics relevance: Demonstrates the approach on robotic manipulation tasks, pointing toward practical long-horizon embodied-AI systems.                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2309.08587), [Tweet](https://x.com/du_yilun/status/1703786005612929214?s=20)                          |
| 9) **OWL (LLMs for IT Operations)** - Proposes OWL, an LLM specialized for IT operations through self-instruct fine-tuning on IT-specific tasks.  <br>● IT operations focus: Targets IT-specific tasks including log analysis, incident diagnosis, config-file manipulation, and automated operations.  <br>● Self-instruct dataset: Uses a self-instruct strategy grounded in real IT tasks to construct a high-quality instruction dataset from scratch.  <br>● IT benchmark: Introduces a benchmark for evaluating LLMs on IT operations tasks, filling a gap left by general-purpose LLM benchmarks.  <br>● Enterprise deployment: Positions LLMs as practical assistants for IT operators rather than just developer copilots.                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2309.09298), [Tweet](https://x.com/omarsar0/status/1704137910834888743?s=20)                          |
| 10) **KOSMOS-2.5** - Microsoft's KOSMOS-2.5 is a multimodal model purpose-built for "machine reading" of text-intensive images.  <br>● Text-rich image input: Specialized for documents, forms, receipts, and other images dominated by text rather than natural-scene imagery.  <br>● Document-level generation: Capable of document-level text generation from images, handling layout-aware reading order and structure.  <br>● Image-to-markdown: Converts complex text-rich images directly into Markdown output, preserving headings, lists, and tables.  <br>● Complements KOSMOS-1/2: Extends the KOSMOS family toward document intelligence, a domain where general VLMs had weaker performance.                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2309.11419), [Tweet](https://x.com/arankomatsuzaki/status/1704659787399487649?s=20)                   |

---

## Top AI Papers of the Week (September 11 - September 17)

| **Paper**                                                                                                                                                                                                                                                                                                                                | **Links**                                                                                                                                   |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Textbooks Are All You Need II (phi-1.5)** - Microsoft's phi-1.5 demonstrates that a 1.3B model trained on "textbook-quality" synthetic data rivals much larger models on reasoning.  <br>● Small but capable: A 1.3B parameter model trained on only 30B tokens competes or outperforms much larger open models on reasoning tasks.  <br>● Synthetic textbook data: Training data consists of AI-generated "textbook-quality" content, deliberately curated for pedagogical clarity rather than web breadth.  <br>● Data quality dominates: Suggests that data quality and pedagogical structure matter more for reasoning emergence than raw parameter count - a provocative counter to pure-scaling narratives.  <br>● Phi-family kickoff: Establishes the recipe that the phi-2, phi-3, and phi-4 releases would refine, popularizing synthetic-data-heavy small LLM training. | [Paper](https://arxiv.org/abs/2309.05463), [Tweet](https://x.com/omarsar0/status/1701590130270601422?s=20)                                  |
| 2) **The Rise and Potential of LLM-Based Agents** - A comprehensive survey of LLM-based agents covering construction, capability, and societal implications.  <br>● Agent architecture: Organizes the space by core agent components - perception, brain (planning, memory, reflection), and action - giving a clean compositional view.  <br>● Single-agent vs. multi-agent: Reviews both single-agent systems and multi-agent societies, covering coordination patterns and emergent behaviors.  <br>● Application landscape: Catalogs the applications where LLM agents were showing promise at the time, from software engineering to scientific research to social simulation.  <br>● Societal implications: Dedicated discussion of "harnessing agents for good" - safety, alignment, and governance considerations specific to agent deployment.                                                                                                                                                             | [Paper](https://arxiv.org/abs/2309.07864), [Tweet](https://x.com/omarsar0/status/1702736490067890239?s=20)                                  |
| 3) **EvoDiff** - Microsoft's EvoDiff combines evolutionary-scale protein data with diffusion models for controllable protein generation in sequence space.  <br>● Sequence-space diffusion: Operates directly in protein-sequence space rather than structure space, enabling generation of proteins that structure-based models can't reach.  <br>● Evolutionary-scale training: Trains on massive evolutionary protein datasets, leveraging the diverse biological sequence space as learning signal.  <br>● Controllable generation: Supports conditional generation on function, family, or motif constraints, giving researchers practical design levers.  <br>● Beyond structure-based models: Generates proteins that are inaccessible to structure-based generators (e.g., those without well-defined folds), expanding the design space. | [Paper](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1), [Tweet](https://x.com/KevinKaichuang/status/1701953715312136302?s=20) |
| 4) **Rewindable Auto-regressive INference (RAIN)** - Shows that unaligned LLMs can produce aligned responses at inference time via self-evaluation and rewinding.  <br>● No fine-tuning needed: Produces human-preference-aligned responses from unaligned base LLMs without any additional fine-tuning.  <br>● Self-evaluation: The LLM evaluates its own in-progress generation against alignment criteria, flagging problematic paths.  <br>● Rewind mechanism: When self-evaluation detects a problematic direction, the model rewinds and regenerates - an inference-time search strategy.  <br>● Practical alignment: Offers a lightweight alignment pattern for cases where fine-tuning isn't feasible (e.g., API-only models or rapid policy iteration).                                                                                                           | [Paper](https://arxiv.org/abs/2309.07124), [Tweet](https://x.com/omarsar0/status/1702131444041011395?s=20)                                  |
| 5) **Robot Parkour Learning** - Stanford's Robot Parkour system learns end-to-end vision-based parkour policies that transfer to a quadrupedal robot.  <br>● Vision-based parkour: Learns policies from an egocentric depth camera that let a quadruped execute real parkour skills like jumping gaps and climbing obstacles.  <br>● Sim-to-real transfer: Trained in simulation and transferred to a physical low-cost robot, demonstrating successful sim-to-real in a challenging contact-rich domain.  <br>● Skill selection: The policy automatically selects and sequences appropriate parkour skills based on terrain observed in real time.  <br>● Low-cost hardware: Runs on commodity quadruped hardware, making advanced mobile behaviors accessible to smaller labs - a recurring pattern through 2023 robotics.                                         | [Paper](https://arxiv.org/abs/2309.05665), [Tweet](https://x.com/zipengfu/status/1701316023612219445?s=20)                                  |
| 6) **Hallucination Survey (Early)** - Classifies hallucination phenomena in LLMs and catalogs evaluation criteria and mitigation strategies.  <br>● Hallucination types: Distinguishes factual hallucinations, logical hallucinations, and contextual hallucinations, showing they require different mitigation approaches.  <br>● Evaluation criteria: Reviews evaluation metrics for detecting and quantifying hallucinations, covering automatic metrics, LLM-as-judge, and human evaluation.  <br>● Mitigation catalog: Organizes mitigation strategies by training stage (pretraining, SFT, RLHF) and inference stage (RAG, decoding, verification).  <br>● Reference snapshot: Captures the state of hallucination research mid-2023, providing a useful anchor for tracking how the field evolved through 2024.                                                                                                                                          | [Paper](https://arxiv.org/abs/2309.05922), [Tweet](https://x.com/omarsar0/status/1701970034711539839?s=20)                                  |
| 7) **Agents Library** - An open-source library for building autonomous language agents with first-class support for planning, memory, tools, and multi-agent communication.  <br>● Full-feature agent framework: Supports planning, long-term memory, tool usage, and multi-agent communication out of the box.  <br>● Multi-agent coordination: Provides primitives for multi-agent societies where agents can communicate, negotiate, and collaborate on tasks.  <br>● Modular design: Agent components are modular and composable, letting researchers swap planners, memory modules, or tool interfaces.  <br>● 2023 agent-framework moment: One of several agent frameworks that emerged in 2023, showing the rapid maturation of the language-agent tooling ecosystem.                                                                                                                                                    | [Paper](https://arxiv.org/abs/2309.07870), [Tweet](https://x.com/arankomatsuzaki/status/1702497897395396960?s=20)                           |
| 8) **Radiology-Llama 2** - A Llama 2-based LLM specialized for radiology report generation.  <br>● Llama 2 base: Fine-tuned on a large dataset of radiology reports, producing a domain-specialized model from an open general-purpose base.  <br>● Clinical impressions: Generates coherent and clinically useful impression statements from structured radiology findings.  <br>● Coherence gains: Outperforms general-purpose LLMs on radiology-specific report-generation tasks, as measured on both automatic metrics and clinician evaluation.  <br>● Domain-LLM template: An early datapoint for the "domain-specialized open LLM" pattern that became standard practice across medicine, law, and other regulated fields.                                                                                   | [Paper](https://arxiv.org/abs/2309.06419), [Tweet](https://x.com/omarsar0/status/1701774444052557965?s=20)                                  |
| 9) **ChatDev (Communicative Agents for Software Development)** - ChatDev is a virtual chat-powered software company where LLM agents take on roles in a waterfall-model dev process.  <br>● Waterfall mirroring: LLM agents play roles (CEO, CTO, programmer, reviewer, tester) in a simulated waterfall software-development process, coordinating through chat.  <br>● End-to-end pipeline: Completes the entire software-development lifecycle from requirements to testing, producing working software artifacts.  <br>● Under $1, under 7 minutes: Generates full software projects in under 7 minutes for less than $1 of API cost - striking cost-efficiency for agent-based development.  <br>● Multi-agent coordination: Demonstrates that simple role-based multi-agent coordination can produce coherent, non-trivial software without heavy scaffolding.         | [Paper](https://arxiv.org/abs/2307.07924v3), [Tweet](https://x.com/KevinAFischer/status/1702355125418045860?s=20)                           |
| 10) **MAmmoTH** - An open-source LLM family specialized for general mathematical problem solving.  <br>● Math-specialized models: Trained on a curated math instruction-tuning dataset covering arithmetic, algebra, calculus, and contest-style problems.  <br>● Beats existing open math LLMs: Outperforms prior open-source math LLMs across a range of mathematical reasoning benchmarks at comparable parameter counts.  <br>● CoT + PoT hybrid data: Training data mixes chain-of-thought and program-of-thought traces, teaching the model both natural-language and code-aided reasoning.  <br>● Open family: Released in multiple sizes to let researchers study math-LLM scaling laws in the open-source ecosystem.                                                                                          | [Paper](https://arxiv.org/abs/2309.05653), [Tweet](https://x.com/xiangyue96/status/1701710215442309323?s=20)                                |

---

## Top AI Papers of the Week (September 4 - September 10)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | **Links**                                                                                                               |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| 1) **Transformers as Support Vector Machines** - A theoretical paper establishing a formal connection between self-attention optimization and hard-margin SVM problems.  <br>● Hard-margin SVM connection: Shows the optimization geometry of self-attention in transformers exhibits a direct connection to hard-margin SVM problems.  <br>● Implicit regularization: Gradient descent without early stopping leads to implicit regularization, with attention converging toward SVM-like solutions.  <br>● Theoretical foundation: Provides a rare closed-form theoretical lens on self-attention dynamics, cutting through much of the "transformers as black box" framing.  <br>● Future analysis tool: The SVM connection gives researchers a principled tool to analyze attention convergence, generalization, and feature selection.                                                                                                                              | [Paper](https://arxiv.org/abs/2308.16898)                                                                               |
| 2) **RLAIF (Scaling RLHF with AI Feedback)** - Google compares RLHF with RLAIF (Reinforcement Learning from AI Feedback) to test whether AI preferences can replace human preferences.  <br>● Head-to-head comparison: Directly compares the efficacy of human vs. AI feedback for preference-based alignment, using the same policy optimization pipeline.  <br>● ~70% preference: On summarization, human evaluators prefer both RLAIF and RLHF outputs over the baseline SFT model in roughly 70% of cases - statistical parity.  <br>● Scaling studies: Reports optimal settings for AI-feedback generation, including prompt design, chain-of-thought, and label-combining strategies.  <br>● Cost-reduction implication: Suggests RLAIF can substitute for RLHF for many alignment use cases, dramatically reducing the human-labeling cost of alignment.                                          | [Paper](https://arxiv.org/abs/2309.00267), [Tweet](https://twitter.com/omarsar0/status/1699102486928265530?s=20)        |
| 3) **GPT Solves Math Problems Without a Calculator** - Demonstrates that with sufficient training data, even a small language model can perform accurate multi-digit arithmetic.  <br>● 2B model, 100% arithmetic: A 2B language model performs multi-digit arithmetic operations with 100% accuracy, without data leakage or calculator tools.  <br>● GLM-10B on Chinese math: A GLM-10B fine-tuned on multi-step arithmetic and detailed math problems is competitive with GPT-4 on a 5K-sample Chinese math problem test set.  <br>● Data-centric argument: Suggests arithmetic "weakness" in LLMs is largely a data-coverage issue rather than a fundamental architectural limit.  <br>● Tool-free reasoning: Pushes back on the common view that LLMs can never do reliable arithmetic without tool use, with implications for tool-use-vs-internal-computation design choices.                                                                            | [Paper](https://arxiv.org/abs/2309.03241), [Tweet](https://twitter.com/_akhaliq/status/1699951105927512399?s=20)        |
| 4) **OPRO (LLMs as Optimizers)** - DeepMind's OPRO uses LLMs as general-purpose optimizers over natural-language-described problems.  <br>● Natural-language optimization: The optimization problem is described in natural language; the LLM iteratively proposes new solutions conditioned on previously found solutions.  <br>● Prompt optimization: As a key application, optimizes prompts to maximize test accuracy, using previously evaluated prompts as trajectory context.  <br>● Big gains over human prompts: LLM-optimized prompts outperform human-designed prompts on GSM8K and BIG-Bench Hard, sometimes by over 50 percentage points.  <br>● General-purpose pattern: Positions LLMs as general-purpose optimizers for problems that are hard to specify mathematically, including linear regression, traveling salesman variants, and prompt design. | [Paper](https://arxiv.org/abs/2309.03409), [Tweet](https://twitter.com/omarsar0/status/1700249035456598391?s=20)        |
| 5) **ImageBind-LLM** - Shanghai AI Lab's ImageBind-LLM brings six-modality understanding to LLMs via the ImageBind joint embedding space.  <br>● ImageBind backbone: Leverages ImageBind's joint embedding space (covering image, text, audio, depth, thermal, IMU) as a universal multimodal encoder.  <br>● Learnable bind network: Aligns ImageBind's visual encoder with a frozen LLM through a learnable bind network, enabling instruction tuning across modalities.  <br>● Six-modality input: Responds to instructions over audio, 3D point clouds, video, and beyond - not just text and image.  <br>● Generation quality: Maintains high language-generation quality despite the modality diversity, validating the ImageBind-as-bridge approach.                                                                                                              | [Paper](https://arxiv.org/abs/2309.03905), [Tweet](https://twitter.com/arankomatsuzaki/status/1699947731333345750?s=20) |
| 6) **Explaining Grokking** - DeepMind advances our understanding of grokking, predicting and confirming two novel phenomena that test their theory.  <br>● Ungrokking: A model can go from perfect generalization back to memorization when trained further on a smaller dataset below a critical threshold - the first demonstration of this reverse effect.  <br>● Semi-grokking: A randomly initialized network trained on the critical dataset size shows a grokking-like transition but partial, rather than the sharp full-grokking curve.  <br>● Theoretical predictions: These behaviors were predicted from theory before being demonstrated empirically - a rare example of predictive rather than post-hoc explanation in deep learning.  <br>● Generalization theory: Advances understanding of when and why neural networks transition from memorization to generalization, bridging empirical observation with principled prediction.                               | [Paper](https://arxiv.org/abs/2309.02390), [Tweet](https://twitter.com/VikrantVarma_/status/1699823229307699305?s=20)   |
| 7) **Overview of AI Deception** - A survey cataloguing empirical examples of AI systems exhibiting deceptive behavior.  <br>● Empirical catalog: Documents empirical instances of AI deception across game-playing, language models, and economic-simulation systems.  <br>● Learned deception: Shows how deception can emerge as an instrumentally useful strategy even when models aren't directly trained to deceive.  <br>● Risk framing: Organizes deception risks from near-term harms (misinformation, manipulation) to longer-term alignment concerns.  <br>● Research agenda: Calls for dedicated research on deception detection, deception prevention during training, and evaluation frameworks for deceptive behavior.                                                                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2308.14752), [Tweet](https://twitter.com/DanHendrycks/status/1699437800301752332?s=20)    |
| 8) **FLM-101B** - A 101B parameter open LLM trainable on a $100K budget through a growth-based training strategy.  <br>● $100K budget for 101B: Trains a 101B model on 0.31TB tokens at a total compute cost of approximately $100K - remarkable for a frontier-scale parameter count.  <br>● Progressive growth strategy: Rather than training 101B from scratch, trains three models sequentially with each larger model inheriting from its smaller predecessor.  <br>● 50%+ cost reduction: The aggressive growth strategy reduces total training cost by more than 50% compared to from-scratch training.  <br>● Open-science contribution: Releases the 101B model, providing a transparent reference for how far careful training-strategy design can stretch a limited budget.                                                | [Paper](https://arxiv.org/abs/2309.03852), [Tweet](https://twitter.com/omarsar0/status/1700156132700963053?s=20)        |
| 9) **Cognitive Architectures for Language Agents (CoALA)** - Princeton proposes CoALA, a systematic framework for understanding and building language agents.  <br>● Production-system inspiration: Draws on classical cognitive architectures and production systems (Soar, ACT-R) to structure language agents.  <br>● Four-component organization: Agents consist of memory modules, action space, decision procedures, and reasoning - each with specific design choices.  <br>● Unifies recent methods: Catalogs methods for LLM-based reasoning, grounding, learning, and decision-making as instantiations of CoALA components.  <br>● Design-space map: Makes the language-agent design space explicit, helping researchers compare systems and identify underexplored combinations.                                                                                                                     | [Paper](https://arxiv.org/abs/2309.02427), [Tweet](https://twitter.com/ShunyuYao12/status/1699396834983362690?s=20)     |
| 10) **Q-Transformer** - Google's Q-Transformer is a scalable RL method for training multi-task robotic policies from large offline datasets.  <br>● Offline RL at scale: Trains multi-task policies from large offline datasets combining human demonstrations and autonomously collected robot data.  <br>● Transformer policy: Uses a transformer backbone with Q-learning, bridging the scaling properties of transformers with the data-efficiency of Q-learning.  <br>● Strong robotics performance: Achieves strong performance on a large diverse real-world robotic manipulation task suite - not just simulation.  <br>● Scaling signal for robotics: A significant early demonstration that transformer + Q-learning scales on real-world robot data, pointing toward foundation models for robotic control.                                                                                                                                                                                                                                     | [Paper](https://q-transformer.github.io/), [Tweet](https://twitter.com/YevgenChebotar/status/1699909244743815677?s=20)  |

---

## Top AI Papers of the Week (August 28 - September 3)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                         | **Links**                                                                                                               |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| 1) **LLaSM (Large Language and Speech Model)** - A combined language-and-speech model trained with cross-modal conversational abilities.  <br>● Cross-modal conversation: Supports speech-and-language instructions seamlessly, enabling more natural interactions than text-only or speech-only systems.  <br>● Instruction-tuned: Fine-tuned on speech-language instruction data, letting users speak prompts and receive responses without a separate ASR step.  <br>● Unified architecture: Uses a single model trained end-to-end rather than a cascade of ASR, LLM, and TTS - reducing error propagation and improving latency.  <br>● Accessibility implication: Positions the unified speech-language approach as a path toward more accessible AI interfaces, particularly for users who prefer voice interaction.                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2308.15930v1), [Tweet](https://twitter.com/_akhaliq/status/1697081112164475304?s=20)      |
| 2) **SAM-Med2D** - Adapts the Segment Anything Model (SAM) to 2D medical imaging through large-scale medical fine-tuning.  <br>● Medical-domain adaptation: Fine-tunes SAM on a large, diverse collection of 2D medical images spanning multiple anatomies and modalities (CT, MRI, X-ray, ultrasound).  <br>● Comprehensive medical segmentation: Handles organ, lesion, and anatomical-structure segmentation across common imaging modalities.  <br>● Prompt engineering for clinicians: Supports the same point/box/text-prompt interaction paradigm as SAM, making it approachable for clinicians already familiar with SAM.  <br>● Strong medical baseline: Achieves strong performance on medical segmentation benchmarks, showing the SAM-adaptation pattern works well for regulated domains.                                                                                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2308.16184v1), [Tweet](https://twitter.com/omarsar0/status/1698014448856773102?s=20)      |
| 3) **Vector Search with OpenAI Embeddings** - Argues, via empirical analysis, that dedicated vector databases aren't necessarily required for modern AI-stack search applications.  <br>● Cost-benefit framing: "From a cost-benefit analysis, there does not appear to be a compelling reason to introduce a dedicated vector store into a modern 'AI stack'" - a pointed critique of the vector-DB explosion.  <br>● Existing infrastructure suffices: Shows that widely deployed search infrastructure (Elasticsearch, Lucene) can handle OpenAI embeddings adequately for most applications.  <br>● Performance characterization: Benchmarks OpenAI embeddings on standard retrieval tasks using existing search infrastructure, providing hard numbers.  <br>● Industry pushback: Part of a broader debate about the necessity of specialized vector databases, offering empirical ammunition to the skeptics.                                                                          | [Paper](https://arxiv.org/abs/2308.14963), [Tweet](https://twitter.com/omarsar0/status/1696879909950361867?s=20)        |
| 4) **Graph of Thoughts (GoT)** - Generalizes Chain-of-Thought and Tree-of-Thought by modeling LLM reasoning as an arbitrary graph.  <br>● Arbitrary graph structure: Represents LLM-generated thoughts as nodes in a graph with arbitrary edges - allowing merging, looping, and non-tree structures.  <br>● Feedback loops: Enables explicit feedback loops where earlier thoughts can be revised based on later exploration - impossible in strictly linear or tree-structured reasoning.  <br>● Network reasoning: The authors call this "network reasoning", treating reasoning as a graph-exploration problem rather than a linear or branching one.  <br>● No model updates: Like CoT and ToT, works purely at prompting level without any model fine-tuning - extending the chain-of-X prompting family. | [Paper](https://arxiv.org/abs/2308.09687v2), [Tweet](https://twitter.com/omarsar0/status/1697245998828204200?s=20)      |
| 5) **MVDream** - ByteDance's MVDream is a multi-view diffusion model that generates geometrically consistent images from multiple viewpoints given a text prompt.  <br>● Multi-view conditioning: Generates consistent multi-view images by conditioning the diffusion model on camera viewpoint alongside the text prompt.  <br>● 2D diffusion + 3D data: Leverages pretrained 2D diffusion models and a multi-view dataset rendered from 3D assets, combining 2D generalizability with 3D consistency.  <br>● Best of both worlds: Inherits the creativity of 2D diffusion priors while maintaining the geometric coherence required for downstream 3D reconstruction.  <br>● 3D generation foundation: Became a building block for many subsequent text-to-3D pipelines that rely on multi-view-consistent diffusion as a prior.                                                                                                            | [Paper](https://arxiv.org/abs/2308.16512), [Tweet](https://twitter.com/_akhaliq/status/1697521847963619462?s=20)        |
| 6) **Nougat** - Meta's Nougat is a visual transformer for "Neural Optical Understanding for Academic documents" that converts PDFs to LaTeX/Markdown.  <br>● Academic-document focused: Specifically targets academic PDFs, where equations, tables, and reference formatting challenge general-purpose OCR systems.  <br>● End-to-end visual transformer: A single visual transformer processes PDF page images into structured Markdown/LaTeX directly - no separate OCR + layout pipeline.  <br>● Equation and table extraction: Handles mathematical equations and tables, producing LaTeX-correct output rather than flat text.  <br>● Open release: Released with weights, enabling researchers to turn academic PDF collections into machine-readable corpora for downstream training and analysis.                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2308.13418v1), [Tweet](https://twitter.com/lukas_blecher/status/1696101110853910716?s=20) |
| 7) **FacTool** - A tool-augmented framework for detecting factual errors in LLM-generated text.  <br>● Tool-augmented detection: Integrates LLMs with external tools (search engines, code executors, calculators) to fact-check generated content.  <br>● Multi-domain coverage: Handles factual errors across knowledge-based QA, code generation, mathematical reasoning, and scientific literature review.  <br>● Component-level analysis: Identifies the necessary components (claim extraction, query generation, evidence retrieval, verification) and shows which matter most.  <br>● Practical recipe: Offers a concrete recipe for integrating fact-checking into LLM pipelines, using off-the-shelf tools rather than bespoke detectors.                                                                                                                                                            | [Paper](https://arxiv.org/abs/2307.13528v2), [Tweet](https://twitter.com/omarsar0/status/1697642048587694370?s=20)      |
| 8) **AnomalyGPT** - Applies large vision-language models to industrial anomaly detection with synthetic data augmentation.  <br>● Synthetic anomaly data: Simulates anomalous images and textual descriptions to generate training data, addressing the scarcity of real anomaly examples in industrial settings.  <br>● Image decoder + prompt learner: Combines an image decoder with a prompt learner to detect and localize anomalies in product images.  <br>● Few-shot ICL: Demonstrates few-shot in-context learning capabilities, adapting to new product types from a handful of examples.  <br>● SoTA on industrial benchmarks: Achieves state-of-the-art performance on standard industrial anomaly-detection benchmarks, validating the VLM approach for manufacturing QA.                                       | [Paper](https://arxiv.org/abs/2308.15366v1), [Tweet](https://twitter.com/shinmura0/status/1697091364633317707?s=20)     |
| 9) **FaceChain** - Alibaba's FaceChain is a personalized portrait generation framework that produces identity-preserving portraits from just a handful of input photos.  <br>● Few-shot personalization: Generates personalized portraits from only a handful of input images, dramatically reducing the data requirement for identity-preserving generation.  <br>● Customization + perception pipeline: Combines customized image-generation models with face-related perceptual-understanding models for identity preservation.  <br>● Truthful portraits: Produces portraits that preserve identity rather than drifting toward a "generic attractive person" archetype - a common failure of naive fine-tuning.  <br>● Consumer-app friendly: Positioned as a deployable solution for consumer portrait-generation apps, supporting rapid personalization at scale.                                                                                                                                                   | [Paper](https://arxiv.org/abs/2308.14256v1)                                                                             |
| 10) **Qwen-VL** - Alibaba's Qwen-VL is a large-scale vision-language model family with strong performance across captioning, VQA, and visual localization.  <br>● Broad capability: Handles image captioning, visual QA, visual localization (grounding), and flexible multi-turn visual interaction.  <br>● Multilingual VL: Strong in both Chinese and English for visual tasks, filling a multilingual gap in VLMs predominantly English at the time.  <br>● Visual grounding: Supports bounding-box output for visual grounding, a capability not universally present in early VLMs.  <br>● Open release: Released as open weights, providing a strong open VLM baseline and kicking off the Qwen-VL family that has continued through 2024. | [Paper](https://arxiv.org/abs/2308.12966), [Tweet](https://twitter.com/arankomatsuzaki/status/1695964537671893306?s=20) |

---

## Top AI Papers of the Week (August 21 - August 27)

| **Paper**                                                                                                                                                                                                                                                                                                                                                          | **Links**                                                                                                                                                           |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Code Llama** - Meta releases Code Llama, a family of code-specialized LLMs built on top of Llama 2.  <br>● Three-tier release: Foundation base models, Python-specialist variants, and instruction-following Code Llama - Instruct models, all in 7B/13B/34B sizes.  <br>● Long context: Supports input contexts up to 100K tokens, enabling whole-repository or long-file code completion and analysis - unusual for open code LLMs at the time.  <br>● Fill-in-the-middle: Includes fill-in-the-middle support, a key capability for editor-integrated use cases like code completion and gap filling.  <br>● Strong HumanEval results: Code Llama - Python 34B reaches ~53% on HumanEval, establishing a strong open baseline for code models that persisted into 2024.                                                                                                                                                                                                                                | [Paper](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/), [Tweet](https://twitter.com/MetaAI/status/1694729071325007993?s=20) |
| 2) **Survey on Instruction Tuning for LLMs** - A comprehensive survey of instruction tuning covering methodology, dataset construction, and applications.  <br>● Systematic literature review: Provides a structured taxonomy of instruction-tuning research across datasets, training recipes, and evaluation approaches.  <br>● Dataset construction: Reviews how instruction datasets are assembled - from human-written prompts to model-generated self-instruct and hybrid pipelines.  <br>● Training methodologies: Catalogs SFT, multitask learning, RLHF, and their variants, with a focus on how each technique interacts with instruction-tuning data.  <br>● Open problems: Highlights issues including instruction-data quality, data scaling, multilingual instruction tuning, and evaluation of instruction-following reliability.                                                                                                                                          | [Paper](https://arxiv.org/abs/2308.10792), [Tweet](https://twitter.com/omarsar0/status/1693978006237102589?s=20)                                                    |
| 3) **SeamlessM4T** - Meta's SeamlessM4T is a unified multilingual and multimodal machine-translation system that handles five translation tasks in one model.  <br>● Five tasks, one model: Handles ASR, text-to-text, speech-to-text, text-to-speech, and speech-to-speech translation in a unified architecture.  <br>● 100+ languages: Covers ~100 languages for text and ~36 for speech, dramatically broadening the set of supported language pairs compared to prior systems.  <br>● Unified training: Avoids the cascade of per-task models typical in translation pipelines, reducing error accumulation and improving multilingual generalization.  <br>● Open release: Releases model weights and evaluation code, providing a strong open baseline for multilingual multimodal translation research.                                                                                                                                       | [Paper](https://ai.meta.com/research/publications/seamless-m4t/), [Tweet](https://twitter.com/MetaAI/status/1694020437532151820?s=20)                               |
| 4) **LLMs for Illicit Purposes** - A survey cataloguing threats and vulnerabilities arising from LLM deployment.  <br>● Threat taxonomy: Organizes LLM misuse threats into categories including misinformation, cyberattacks, social engineering, and unauthorized content generation.  <br>● Mitigation catalog: Reviews existing mitigation strategies - training-time, inference-time, and system-level defenses - with critical evaluation of each.  <br>● Deployment guide: Functions as a practical guide for building more reliable and robust LLM-powered systems.  <br>● Policy relevance: Contributes to the growing AI-safety policy discourse by organizing abstract risk concerns into a concrete framework.                                                                                                                             | [Paper](https://arxiv.org/abs/2308.12833), [Tweet](https://twitter.com/omarsar0/status/1694885393286549636?s=20)                                                    |
| 5) **Giraffe** - A family of context-extended Llama and Llama 2 models, along with an empirical study of context-extension techniques.  <br>● Extended contexts: Fine-tuned models with 4K, 16K, and 32K context windows, providing ready-to-use open long-context variants.  <br>● Technique comparison: Systematically compares context-extension methods including positional interpolation, truncation strategies, and attention scaling.  <br>● Practitioner insights: Reports practical findings on which techniques preserve downstream quality at extended contexts - useful for anyone building long-context applications.  <br>● Context-extension recipe: The lessons from Giraffe fed directly into the recipes that would culminate in YaRN and similar approaches later that year.                                                                                            | [Paper](https://arxiv.org/abs/2308.10882), [Tweet](https://twitter.com/bindureddy/status/1694126931174977906?s=20)                                                  |
| 6) **IT3D** - Improves Text-to-3D generation by leveraging explicitly synthesized multi-view images in the training loop.  <br>● Multi-view image supervision: Uses explicitly synthesized multi-view images as additional training signal for 3D generation, beyond standard per-view 2D supervision.  <br>● Diffusion-GAN dual training: Integrates a discriminator alongside the diffusion loss, producing a hybrid Diffusion-GAN training strategy for the 3D models.  <br>● Consistency gains: Improves geometric and photometric consistency across views compared to prior text-to-3D approaches.  <br>● Complements MVDream-style methods: Works well alongside multi-view diffusion priors, pointing toward increasingly sophisticated 2D-to-3D pipelines.                                                                                                                        | [Paper](https://arxiv.org/abs/2308.11473v1)                                                                                                                         |
| 7) **LLM-Based Autonomous Agents Survey** - A comprehensive survey of LLM-based autonomous agents covering construction and applications.  <br>● Agent construction framework: Organizes autonomous agents by profile, memory, planning, and action components - the canonical modular view.  <br>● Application coverage: Reviews applications across social science, natural science, and engineering, showing the breadth of agent use cases in mid-2023.  <br>● Systematic literature review: Covers the explosion of agent papers following ReAct, AutoGPT, and similar early frameworks.  <br>● Evaluation landscape: Discusses evaluation approaches for autonomous agents, a notoriously difficult area compared to static LLM evaluation.                                                                                            | [Paper](https://arxiv.org/abs/2308.11432v1), [Tweet](https://twitter.com/omarsar0/status/1695440652048257251?s=20)                                                  |
| 8) **Prompt2Model** - CMU's Prompt2Model automates the path from a natural-language task description to a deployable small special-purpose model.  <br>● Prompt-as-specification: Users describe the target task in natural language; the framework produces a small model that can execute it.  <br>● Three-channel pipeline: Automatically assembles training data via dataset retrieval (find relevant existing data), dataset generation (synthesize new data), and model retrieval (find relevant pretrained models).  <br>● Small deployable output: Produces small, efficient models suitable for deployment - not just API wrappers around frontier LLMs.  <br>● Accessibility gain: Lowers the barrier for non-ML practitioners to build task-specific models, abstracting away much of the data-engineering burden. | [Paper](https://arxiv.org/abs/2308.12261), [Tweet](https://twitter.com/omarsar0/status/1694718168185598055?s=20)                                                    |
| 9) **LegalBench** - A collaboratively constructed benchmark for measuring legal reasoning in LLMs.  <br>● 162 tasks: Covers 162 legal-reasoning tasks designed by legal experts, significantly broader than prior legal benchmarks.  <br>● Six reasoning categories: Categorizes tasks across rule-recall, rule-application, rule-conclusion, interpretation, rhetorical-analysis, and issue-spotting.  <br>● Collaborative construction: Built through collaboration with legal practitioners to ensure tasks reflect real legal reasoning rather than generic NLP tasks dressed in legal vocabulary.  <br>● LLM-lawyer evaluation: Provides the first rigorous benchmark for systematically evaluating LLM legal capability - essential for responsible deployment in legal workflows.                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2308.11462), [Tweet](https://twitter.com/NeelGuha/status/1694375959334670643?s=20)                                                    |
| 10) **Language to Rewards for Robotic Skill Synthesis** - Google's Language-to-Rewards uses LLMs to define reward parameters for robotic RL.  <br>● LLM-defined rewards: Uses LLMs to translate natural-language task descriptions into optimizable reward parameters for downstream RL training.  <br>● Real-robot evaluation: Evaluated on a real robot arm, not just in simulation, validating that the approach survives sim-to-real challenges.  <br>● Emergent skills: Complex manipulation skills including non-prehensile pushing emerge from the LLM-specified rewards alone.  <br>● Natural robot programming: Positions natural language as a practical interface for programming robot behaviors without handcrafting reward functions.                                          | [Paper](https://arxiv.org/abs/2306.08647), [Tweet](https://twitter.com/GoogleAI/status/1694086273689076170?s=20)                                                    |

---

## Top AI Papers of the Week (August 14 - August 20)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | **Links**                                                                                                                     |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| 1) **Humpback (Self-Alignment with Instruction Backtranslation)** - Meta's Humpback automatically generates instruction-tuning data by back-translating web text into plausible instructions.  <br>● Instruction backtranslation: Given a web document, generates a plausible instruction that the document could answer - inverting the typical instruction-data creation direction.  <br>● Four-step pipeline: (1) Fine-tune LLM with small seed data, (2) generate instructions for web docs, (3) self-curate high-quality examples, (4) fine-tune on curated data.  <br>● Tops Alpaca leaderboard: The self-aligned model outperforms all other Llama-based models on the Alpaca leaderboard at the time of release.  <br>● Data abundance: Turns the entire web into potential instruction-tuning data, dramatically expanding the accessible instruction corpus beyond curated human-written datasets. | [Paper](https://arxiv.org/abs/2308.06259), [Tweet](https://twitter.com/jaseweston/status/1690888779878330368?s=20)            |
| 2) **Platypus** - Platypus is a family of fine-tuned and merged LLMs that topped the Open LLM Leaderboard in August 2023.  <br>● LoRA fine-tuning + merging: Describes an efficient process for fine-tuning and merging LoRA modules, demonstrating that careful composition beats monolithic fine-tuning.  <br>● Open-Platypus dataset: Releases a small, highly curated fine-tuning dataset that delivers strong performance with short and cheap training - quality over quantity.  <br>● 5 hours on one A100: A 13B Platypus can be trained on a single A100 GPU using 25K curated questions in roughly 5 hours.  <br>● Leaderboard-topping: Demonstrates that careful data curation and LoRA merging can produce leaderboard-topping open models without massive compute.                         | [Paper](https://arxiv.org/abs/2308.07317v1), [Tweet](https://twitter.com/omarsar0/status/1692549762480791959?s=20)            |
| 3) **Model Compression for LLMs Survey** - A survey of recent model-compression techniques applied specifically to LLMs.  <br>● Core technique families: Covers quantization, pruning, knowledge distillation, and architectural compression across training-time and post-training approaches.  <br>● LLM-specific concerns: Addresses unique LLM concerns including long-sequence compression, KV-cache optimization, and retaining reasoning capability under compression.  <br>● Evaluation metrics: Reviews benchmark strategies and evaluation metrics for measuring compressed-LLM effectiveness - not just perplexity but downstream capability preservation.  <br>● Practitioner reference: Functions as a compact reference for teams deciding which compression technique matches their deployment constraints.                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2308.07633), [Tweet](https://twitter.com/omarsar0/status/1691803395160477905?s=20)              |
| 4) **GEARS** - Stanford's GEARS predicts cellular responses to genetic perturbation using deep learning + a gene-relationship knowledge graph.  <br>● KG-guided prediction: Combines deep-learning models with an explicit gene-relationship knowledge graph, letting the model leverage structured biological priors.  <br>● Combinatorial perturbations: Predicts cellular responses to combinations of perturbations, a harder regime than single-perturbation prediction.  <br>● 40% precision gain: Achieves 40% higher precision than prior approaches when predicting four distinct genetic-interaction subtypes in a combinatorial perturbation screen.  <br>● Drug discovery relevance: Accelerates hypothesis generation in perturbation biology, with direct implications for target discovery and drug development.                                                                                                                                                                                                                                              | [Paper](http://nature.com/articles/s41587-023-01905-6.pdf), [Tweet](https://twitter.com/jure/status/1692229511096754594?s=20) |
| 5) **Shepherd** - Meta's Shepherd is a 7B language model specifically tuned to critique model outputs and suggest refinements.  <br>● Critique-specialized 7B: A 7B parameter model fine-tuned specifically on the task of critiquing LLM responses and suggesting improvements.  <br>● Error identification: Capable of identifying diverse error types - factual, logical, stylistic, safety - and suggesting remedies for each.  <br>● ChatGPT-comparable critiques: Human evaluators judge Shepherd's critiques as similar or preferred to ChatGPT's, despite Shepherd being much smaller.  <br>● Critic-as-a-service: Points toward a deployment pattern where small specialized critic models are paired with larger generation models, a recurring theme in 2024 alignment work.                                                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2308.04592), [Tweet](https://twitter.com/MetaAI/status/1691517949130207232?s=20)                |
| 6) **GPT-4 Code Interpreter for Math** - A zero-shot prompting technique for GPT-4 Code Interpreter that dramatically boosts math-reasoning accuracy via code self-verification.  <br>● Code-as-verifier prompting: Explicitly encourages GPT-4 Code Interpreter to use code for self-verification of intermediate and final answers.  <br>● 69.7% on MATH: Achieves 69.7% zero-shot accuracy on the MATH dataset - a 27.5-point improvement over vanilla GPT-4 (42.2%).  <br>● Execution-grounded reasoning: Code execution provides a high-fidelity verification signal that vanilla CoT lacks, reducing hallucinated intermediate steps.  <br>● Tool-use template: Establishes a template for tool-augmented reasoning that would generalize to many later math-LLM recipes.                                                                                         | [Paper](https://arxiv.org/abs/2308.07921), [Tweet](https://twitter.com/omarsar0/status/1691630591744127355?s=20)              |
| 7) **Teach LLMs to Personalize** - A multitask-learning approach for personalized text generation without relying on predefined user attributes.  <br>● Attribute-free personalization: Generates personalized text without predefined attributes like age, profession, or preferences - instead inferring style from user history.  <br>● Multitask learning: Frames personalization as a multitask problem where tasks correspond to different personalization axes, sharing representation across them.  <br>● Generalizable style: Demonstrates that models can adapt to new users with minimal examples when trained with this multitask approach.  <br>● Production relevance: Directly applicable to personalized-assistant and content-generation products where explicit user-profile attributes are impractical or privacy-sensitive.                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2308.07968), [Tweet](https://twitter.com/omarsar0/status/1692186726192521364?s=20)              |
| 8) **OctoPack** - Hugging Face releases OctoPack, a 4TB dataset of Git commits across 350 programming languages for instruction-tuning code LLMs.  <br>● 4TB commit dataset: Curated dataset of 4 terabytes of Git commits across 350 programming languages, using commit messages as implicit instructions.  <br>● Natural code instructions: Commit messages provide real-world, naturally occurring instructions for code changes - far more authentic than synthetically generated code instructions.  <br>● SoTA without OpenAI outputs: Achieves state-of-the-art performance on HumanEval Python among models not trained on OpenAI outputs.  <br>● HumanEval extension: Extends HumanEval beyond Python generation to include code explanation and code repair tasks, providing richer evaluation coverage.                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2308.07124v1), [Tweet](https://twitter.com/arankomatsuzaki/status/1691259656453193728?s=20)     |
| 9) **Outlines (Efficient Guided Generation)** - A library for guided LLM text generation that enforces structural constraints with minimal overhead.  <br>● Regex guarantees: Guarantees that generated output matches a specified regular expression, supporting grammar-constrained generation at the token level.  <br>● JSON schema enforcement: Produces output that follows a JSON schema, unlocking reliable structured-output generation without post-hoc parsing retries.  <br>● Fast implementation: Achieves low overhead via efficient state-machine construction and token-mask caching, making constrained decoding practical in production.  <br>● Broad adoption: Became widely used in LLM pipelines where structured output is non-negotiable - function calling, tool use, API output, and data extraction.                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2307.09702), [Tweet](https://twitter.com/omarsar0/status/1691179888214966273?s=20)              |
| 10) **Bayesian Flow Networks (BFN)** - Introduces a new class of generative models that combine Bayesian inference with deep learning.  <br>● Parameters, not noisy data: BFNs operate on parameters of a data distribution rather than on a noisy version of the data itself - a fundamental architectural departure from diffusion models.  <br>● Unified data types: Adapts to continuous, discretized, and discrete data with minimal changes to the training procedure - unlike diffusion variants that need per-modality engineering.  <br>● Competitive with diffusion: Achieves competitive or better likelihood on image, text, and discrete-data benchmarks compared to diffusion baselines.  <br>● Research direction: Opens a new family of generative models with distinct theoretical properties, attracting follow-up work through 2024.                                                                                                                                                       | [Paper](https://arxiv.org/abs/2308.07037), [Tweet](https://twitter.com/nnaisense/status/1691310494039379969?s=20)             |

---

## Top AI Papers of the Week (August 7 - August 13)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                              | **Links**                                                                                                                      |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ |
| 1) **D-Bot (LLMs as Database Administrators)** - Introduces D-Bot, an LLM-based framework that continuously acquires database-administration knowledge from textual sources.  <br>● Knowledge detection: Automatically detects database-maintenance knowledge from documentation and tool outputs, continuously updating its operational knowledge base.  <br>● Tree-of-thought diagnosis: Uses tree-of-thought reasoning for root-cause analysis of database performance and reliability issues.  <br>● Multi-LLM collaboration: Collaborative diagnosis among multiple LLMs yields better root-cause identification than single-model analysis.  <br>● DBA augmentation: Positions LLMs as augmenting DBAs rather than replacing them, with concrete value on knowledge retrieval and diagnostic reasoning.                                                      | [Paper](https://arxiv.org/abs/2308.05481), [Tweet](https://twitter.com/omarsar0/status/1689811820272353280?s=20)               |
| 2) **Political Biases in NLP Models** - Develops methods to measure political and media biases in LLMs and their downstream effects.  <br>● Bias measurement methodology: Introduces measurement techniques for political and media biases in LLMs that can be applied across models and over time.  <br>● Downstream bias propagation: Studies how biases in pretrained LLMs propagate to downstream NLP models fine-tuned on top of them.  <br>● Political leanings detected: Finds that LLMs exhibit measurable political leanings that reflect and reinforce polarization patterns in their training corpora.  <br>● Fairness implications: Provides empirical ammunition for discussions of LLM fairness, deployment in politically sensitive contexts, and bias-mitigation research.                                                                                                                                       | [Paper](https://aclanthology.org/2023.acl-long.656/), [Tweet](https://twitter.com/AiBreakfast/status/1688939983468453888?s=20) |
| 3) **AgentBench** - Tsinghua's AgentBench is a multidimensional benchmark for LLM-as-Agent reasoning and decision-making across 8 environments.  <br>● Multi-environment design: Tests agents across 8 diverse environments including web browsing, operating systems, databases, and games - capturing breadth of agent demands.  <br>● Open vs. commercial gap: Reveals a significant performance gap between top commercial LLMs (GPT-4) and open-source models on agent tasks.  <br>● Open-source lags: Open-source LLMs lag substantially on AgentBench, exposing a gap that subsequent open-agent fine-tuning efforts targeted.  <br>● GPT-4 shows potential: GPT-4's performance demonstrates that frontier models can support continuously learning agents, even if they're not there yet.  | [Paper](https://arxiv.org/abs/2308.03688v1), [Tweet](https://twitter.com/arankomatsuzaki/status/1688719837760000000?s=20)      |
| 4) **Studying LLM Generalization with Influence Functions** - Anthropic scales influence functions to LLMs up to 52B parameters to investigate generalization patterns.  <br>● Efficient scaling: Introduces computational tricks that make influence-function analysis tractable on LLMs with up to 52 billion parameters - a massive scale-up from prior work.  <br>● Cross-lingual generalization: Finds evidence of cross-lingual generalization, where training examples in one language influence predictions in another.  <br>● Middle-layer abstraction: Middle layers of the network appear responsible for the most abstract generalization patterns, supporting emerging interpretability narratives.  <br>● Alignment implications: Influence-function analysis gives alignment researchers a new tool for understanding which training data drives which model behaviors. | [Paper](https://arxiv.org/abs/2308.03296), [Tweet](https://twitter.com/AnthropicAI/status/1688946685937090560?s=20)            |
| 5) **NeuroImagen** - Reconstructs visual stimuli images from EEG signals using latent diffusion, opening new windows into visually-evoked brain activity.  <br>● EEG-to-image reconstruction: Reconstructs high-resolution visual stimuli images from EEG signals recorded while subjects viewed those images.  <br>● Latent diffusion pipeline: Uses a latent diffusion model conditioned on EEG features, inheriting the high-fidelity generation capabilities of diffusion priors.  <br>● Non-invasive BCI: EEG is non-invasive and comparatively cheap, making this approach more practical for real-world brain-computer interface research than fMRI-based alternatives.  <br>● Cognitive-science bridge: Provides a new tool for studying visual cognition, complementing and extending earlier fMRI-decoding work.                                                                                                                                                 | [Paper](https://arxiv.org/abs/2308.02510), [Tweet](https://twitter.com/_akhaliq/status/1688787286807228416?s=20)               |
| 6) **SynJax** - DeepMind's SynJax is a JAX-based library for efficient vectorized inference in structured distributions.  <br>● Vectorized structured inference: Provides efficient vectorized implementations of inference algorithms for structured distributions - tagging, segmentation, trees - on modern hardware.  <br>● Supported structures: Covers constituency trees, dependency trees, spanning trees, tagging, and segmentation - the workhorses of structured prediction.  <br>● Differentiable models: Enables building large-scale differentiable models that explicitly represent structure in data, bridging classical NLP and deep learning.  <br>● Hardware-friendly: JAX backend lets researchers run structured-inference models at scale on accelerators, unblocking research that had been stuck on CPU speeds.                                                                                                                          | [Paper](https://arxiv.org/abs/2308.03291v1), [Tweet](https://twitter.com/milosstanojevic/status/1688896558790520832?s=20)      |
| 7) **Synthetic Data Reduces Sycophancy** - Google shows that fine-tuning on simple synthetic data can significantly reduce LLM sycophancy.  <br>● Sycophancy problem: Sycophancy occurs when LLMs align their responses with perceived user views even when those views are factually incorrect.  <br>● Synthetic anti-sycophancy data: Constructs simple synthetic examples where the correct answer contradicts the user's stated view, then fine-tunes models on them.  <br>● Meaningful reduction: Fine-tuning on this synthetic data measurably reduces sycophantic behavior without degrading overall helpfulness.  <br>● Broader lesson: Offers a cheap, targeted intervention for a specific alignment failure mode - a template for addressing other narrow failure modes through targeted synthetic data.                                                                                                                            | [Paper](https://arxiv.org/abs/2308.03958), [Tweet](https://twitter.com/JerryWeiAI/status/1689340237993185280?s=20)             |
| 8) **PUG (Photorealistic Unreal Graphics)** - Meta's PUG uses Unreal Engine to generate photorealistic, semantically controllable synthetic datasets for vision research.  <br>● Unreal-powered synthesis: Leverages Unreal Engine's photorealistic rendering to produce high-fidelity synthetic training images with precise semantic control.  <br>● Controllable semantics: Researchers can specify scene content, lighting, camera angles, and object configurations, making targeted ablations possible.  <br>● Democratizing synthetic data: Lowers the barrier to photorealistic synthetic data generation, previously limited to groups with custom rendering pipelines.  <br>● Rigorous evaluation: Enables more rigorous evaluations of vision-model robustness to controlled distribution shifts - lighting, occlusion, pose - than natural data allows.                                                                                                                                                | [Paper](https://arxiv.org/abs/2308.03977), [Tweet](https://twitter.com/MetaAI/status/1689316127846109184?s=20)                 |
| 9) **LLMs for HVAC Control** - Microsoft applies LLMs to industrial control tasks (HVAC for buildings), comparing against RL baselines.  <br>● Demonstration selection: Develops a recipe for selecting demonstrations and generating high-performing prompts for industrial control tasks.  <br>● GPT-4 ≈ RL: GPT-4 performs comparably to specialized RL methods on HVAC control, despite being a general-purpose model.  <br>● Lower technical debt: Uses dramatically fewer samples and avoids the operational complexity of training and maintaining a dedicated RL policy.  <br>● Practical implication: Suggests LLMs can substitute for RL in many control tasks where sample efficiency and maintenance matter more than peak performance.                                                                                                      | [Paper](https://arxiv.org/abs/2308.03028), [Tweet](https://twitter.com/emollick/status/1688760539441217536?s=20)               |
| 10) **Trustworthy LLMs** - Presents a comprehensive framework of categories for assessing LLM trustworthiness.  <br>● Seven-dimensional framework: Covers reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness.  <br>● Aligned models advantage: Aligned models perform better on trustworthiness dimensions, but alignment effectiveness varies dramatically across dimensions.  <br>● Sub-category detail: Each top-level dimension is broken into measurable sub-categories, making the framework operational for evaluation rather than just conceptual.  <br>● Evaluation tooling: Positioned as a foundation for systematic trustworthiness evaluation - a precursor to later trust-specific benchmarks like TrustLLM.                 | [Paper](https://arxiv.org/abs/2308.05374), [Tweet](https://twitter.com/_akhaliq/status/1689818964669390848?s=20)               |

---

## Top AI Papers of the Week (July 31 - August 6)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                    | **Links**                                                                                                               |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- |
| 1) **Open Problems and Limitations of RLHF** - A comprehensive survey of open problems and fundamental limitations of RLHF as an alignment approach.  <br>● Scope: Catalogs issues across the entire RLHF pipeline - preference data collection, reward modeling, policy optimization, and evaluation.  <br>● Fundamental limitations: Discusses issues that can't be solved by incremental engineering alone, including the difficulty of specifying human preferences completely.  <br>● Reward hacking taxonomy: Organizes the many varieties of reward hacking seen in practice, from sycophancy to specification gaming.  <br>● Research agenda: Argues for investment in alignment approaches beyond RLHF that can address its structural limitations - a precursor to DPO and related methods.                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2307.15217), [Tweet](https://twitter.com/arankomatsuzaki/status/1685813753063870465?s=20) |
| 2) **Med-Flamingo** - Stanford's Med-Flamingo is a multimodal medical model supporting in-context learning for few-shot medical visual QA.  <br>● Medical ICL: Supports in-context learning for medical visual QA, letting clinicians specialize the model via examples at inference time rather than fine-tuning.  <br>● Physician evaluation: Physician evaluators rate Med-Flamingo's responses up to 20% higher than baseline multimodal models - a significant clinical quality improvement.  <br>● Hallucination concerns: Authors transparently report occasional low-quality generations and hallucinations, a necessary caveat for medical deployment.  <br>● Clinical-deployment template: Sets a template for responsible medical VLM development - physician-in-the-loop evaluation alongside automatic metrics.                                                                                      | [Paper](https://arxiv.org/abs/2307.15189), [Tweet](https://twitter.com/Michael_D_Moor/status/1685804620730540033?s=20)  |
| 3) **ToolLLM** - Tsinghua's ToolLLM enables LLMs to interact with 16,000+ real-world APIs through a comprehensive framework for tool-using LLMs.  <br>● 16K APIs: Covers 16,000+ real-world APIs - orders of magnitude more than prior tool-use benchmarks, capturing the real diversity of modern API ecosystems.  <br>● Full-stack framework: Includes data preparation, training methodology, and evaluation infrastructure - a complete open stack for tool-use research.  <br>● ToolLLaMA hits ChatGPT-16k: The authors' ToolLLaMA model matches ChatGPT (turbo-16k) on tool-use benchmarks, showing open models can close the gap.  <br>● Tool-use research foundation: Became a standard reference point for tool-use research, influencing how tool datasets and benchmarks were structured through 2024.                                                                                                                                        | [Paper](https://arxiv.org/abs/2307.16789v1), [Tweet](https://twitter.com/omarsar0/status/1687531613574348800?s=20)      |
| 4) **Skeleton-of-Thought (SoT)** - Microsoft's Skeleton-of-Thought parallelizes LLM generation by first producing an answer skeleton then filling it in concurrently.  <br>● Two-stage generation: First generates an answer skeleton outlining the response structure, then fills in each skeleton point through parallel API calls.  <br>● 2.39x speedup: Achieves up to 2.39x speedup over sequential decoding by exploiting the independence of skeleton points.  <br>● Quality improvements: Besides the speedup, reports quality improvements on some tasks - structure-first generation can produce more coherent long responses.  <br>● Applicability: Works best for list-style or outline-style responses where the skeleton decomposition is natural, less so for tightly coupled prose.                                                                                                                                                 | [Paper](https://arxiv.org/abs/2307.15337), [Tweet](https://twitter.com/omarsar0/status/1685832487103008768?s=20)        |
| 5) **MetaGPT** - MetaGPT is a multi-agent framework that encodes standardized operating procedures (SOPs) for complex problem solving.  <br>● SOP-encoded workflows: Encodes human standardized operating procedures into agent workflows, imposing structure rather than letting agents improvise.  <br>● Multi-agent roles: Agents take on well-defined roles (PM, engineer, architect, QA, etc.) mirroring real software-development team structures.  <br>● Multifaceted capability: Handles software development, code generation, and data analysis - a broader scope than ChatDev's software-focus.  <br>● Tool integration: Integrates with tools like AutoGPT and LangChain, slotting into the broader agent-framework ecosystem rather than replacing it.                                      | [Paper](https://arxiv.org/abs/2308.00352v2), [Tweet](https://twitter.com/ai_database/status/1686949868298973184?s=20)   |
| 6) **OpenFlamingo** - An open-source family of autoregressive vision-language models spanning 3B to 9B parameters.  <br>● Open reproduction: A faithful open-source reproduction of DeepMind's closed Flamingo, enabling research groups to build on the architecture.  <br>● Size range: Covers 3B to 9B parameters, offering multiple sizes for researchers with varying compute budgets.  <br>● Training data + eval suite: Releases the training data and evaluation suite alongside models, providing a complete reproducible stack.  <br>● Open VLM foundation: Became a widely used starting point for open VLM research through 2023-2024.                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2308.01390), [Tweet](https://twitter.com/anas_awadalla/status/1687295129005195264?s=20)   |
| 7) **The Hydra Effect** - DeepMind shows that language models exhibit self-repairing behavior when attention heads are ablated.  <br>● Self-repair phenomenon: Ablating a layer of attention heads causes a later layer to take over the ablated layer's function - a previously unknown redundancy property.  <br>● Interpretability implications: Complicates interpretability work based on ablation - removing a component doesn't necessarily isolate its contribution if other components compensate.  <br>● Circuit-level redundancy: Suggests transformer circuits have built-in redundancy that is activated under ablation, analogous to biological neural networks.  <br>● Research-method correction: Forces a rethinking of causal-mediation experiments in mechanistic interpretability, since ablations alone understate components' true contributions.                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2307.15771), [Tweet](https://twitter.com/_akhaliq/status/1686192437771788288?s=20)        |
| 8) **Self-Check** - Explores LLM capacity for self-checking on complex reasoning tasks requiring multi-step and non-linear thinking.  <br>● Zero-shot verification: Proposes a zero-shot verification scheme that recognizes errors in its own reasoning without external tools or references.  <br>● Weighted voting improvement: Applying self-check scores as weights in majority voting improves QA performance over standard CoT self-consistency.  <br>● Math word problems: Demonstrates improved accuracy on math word problems - tasks that benefit most from catching intermediate-step errors.  <br>● Self-critique groundwork: An early contribution to the self-critique literature that would mature through 2024 into Constitutional AI-style and debate-style methods. | [Paper](https://arxiv.org/abs/2308.00436), [Tweet](https://twitter.com/_akhaliq/status/1686561569486827520?s=20)        |
| 9) **Dynalang (Agents Model the World with Language)** - UC Berkeley's Dynalang agent learns a multimodal world model predicting future text, video, and rewards.  <br>● Multimodal world model: Jointly predicts future language, video, and rewards, treating language as another stream of observation/prediction rather than just policy input.  <br>● Instruction-following: Learns to follow instructions in visually and linguistically complex domains, grounded in the world model's predictions.  <br>● Cross-domain applicability: Applied to multiple embodied environments, showing the language-inclusive world-model approach is general.  <br>● Research direction: Foreshadows the "video-plus-language world model" direction that would grow prominent in 2024 (e.g., Sora's world simulator framing).                                                                   | [Paper](https://arxiv.org/abs/2308.01399), [Tweet](https://twitter.com/johnjnay/status/1687277999517818880?s=20)        |
| 10) **AutoRobotics-Zero** - Discovers zero-shot adaptable robot policies from scratch, including the automatic discovery of Python control code.  <br>● Zero-shot adaptability: Policies adapt to sudden environmental changes without any fine-tuning at test time - a critical property for robust robotics.  <br>● Python-code policies: Automatically discovers Python code that implements robot controllers - an interpretable, auditable policy representation.  <br>● Discovery from scratch: Policies are discovered from scratch rather than fine-tuned from pretrained ones, reducing assumptions about prior knowledge.  <br>● AutoML for robotics: Extends the AutoML paradigm into robotics, using search over code rather than over neural architectures. | [Paper](https://arxiv.org/abs/2307.16890), [Tweet](https://twitter.com/XingyouSong/status/1686190266578046976?s=20)     |

---

## Top AI Papers of the Week (July 24 - July 30)

| **Paper**                                                                                                                                                                                                                                                                                         | **Links**                                                                                                                                    |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Universal Adversarial LLM Attacks** - Finds universal and transferable adversarial attacks that cause aligned models like ChatGPT and Bard to generate objectionable behaviors.  <br>● Automatic suffix generation: Uses a combination of greedy and gradient-based search to automatically produce adversarial suffixes that bypass alignment safeguards.  <br>● Universal transferability: A single adversarial suffix found on open models transfers to proprietary models like GPT-4, Claude, and Bard, revealing a systemic weakness.  <br>● Jailbreaking industrialized: Demonstrated that automated attacks could produce unlimited variants, forcing a rethink of alignment robustness beyond manual red-teaming.  <br>● Foundational safety paper: Became one of the most-cited adversarial robustness papers of 2023 and a reference point for later work on refusal training and representation-level defenses. | [Paper](https://arxiv.org/abs/2307.15043), [Tweet](https://twitter.com/andyzou_jiaming/status/1684766170766004224?s=20)                      |
| 2) **RT-2** - Google DeepMind's end-to-end vision-language-action model that learns from both web and robotics data to control robots.  <br>● VLA architecture: Treats robot actions as another language the model generates - actions are tokenized and output in the same stream as text tokens.  <br>● Web-scale knowledge transfer: Leverages internet-scale VLM pretraining so the robot can reason about novel objects and symbols it never saw in robotics data (e.g., "pick up the extinct animal").  <br>● Emergent semantic reasoning: Shows emergent capabilities like chain-of-thought robotic reasoning and multi-stage task planning absent in prior RT-1.  <br>● Robot foundation models: Established the VLA paradigm that dominated 2024 robotics research (OpenVLA, RT-X, π0) and moved robotics firmly into the foundation-model era. | [Paper](https://robotics-transformer2.github.io/assets/rt2.pdf), [Tweet](https://twitter.com/GoogleDeepMind/status/1684903412834447360?s=20) |
| 3) **Med-PaLM Multimodal** - Introduces a generalist biomedical AI system and a new multimodal biomedical benchmark with 14 tasks.  <br>● MultiMedBench: A new benchmark spanning 14 tasks across clinical text, medical imaging (e.g., chest X-ray, pathology, dermatology), and genomics.  <br>● Single generalist model: A single 562B model handles medical Q&A, VQA, report generation, and genomic variant call - rather than disease-specific narrow models.  <br>● Clinician evaluations: In pilot evaluations by radiologists, Med-PaLM M's chest X-ray reports were preferred over reference reports in 40.50% of cases.  <br>● Generalist medical AI vision: Provided the strongest proof-of-concept for generalist biomedical AI, previewing the trajectory toward healthcare foundation models.                                              | [Paper](https://arxiv.org/abs/2307.14334), [Tweet](https://twitter.com/vivnat/status/1684404882844024832?s=20)                               |
| 4) **Tracking Anything in High Quality** - A framework for high-quality tracking-anything in videos combining segmentation and refinement.  <br>● Two-stage design: Combines a video multi-object segmenter with a pretrained mask refiner model to clean up tracking output.  <br>● Mask quality focus: Addresses the common failure mode where trackers lose object boundaries over time, maintaining sharp masks across long clips.  <br>● VOTS2023 results: Ranked 2nd place in the VOTS2023 challenge, demonstrating competitive quality against specialized trackers.  <br>● Practical tool: Useful for video editing, AR/VR, and content creation pipelines that require pixel-accurate object tracking over long sequences.                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2307.13974v1), [Tweet](https://twitter.com/arankomatsuzaki/status/1684380610901467136?s=20)                    |
| 5) **Foundation Models in Vision** - A comprehensive survey on foundational models for computer vision and their open research directions.  <br>● Landscape mapping: Reviews textually prompted (CLIP, ALIGN), visually prompted (SAM), and generative (DALL-E, Imagen) vision foundation models in one unified taxonomy.  <br>● Challenges enumerated: Identifies open problems in evaluation, grounding, hallucination, compositionality, and domain-specific adaptation for CV.  <br>● Cross-modal trends: Analyzes how vision foundation models increasingly borrow from LLM training recipes (instruction tuning, RLHF).  <br>● Reference for researchers: Became a go-to survey for new researchers entering vision foundation-model research in late 2023.                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2307.13721v1), [Tweet](https://twitter.com/KhanSalmanH/status/1684496991215316992?s=20)                        |
| 6) **L-Eval** - A standardized evaluation suite for long-context language models.  <br>● Dataset scale: 411 long documents covering over 2K query-response pairs across law, finance, school lectures, long conversations, novels, and meetings.  <br>● Realistic domains: Moves beyond synthetic needle-in-haystack tests toward practical long-form applications users actually encounter.  <br>● Evaluation methodology: Provides multiple evaluation protocols including exact match, n-gram, and LLM-as-judge to cross-validate results.  <br>● Long-context benchmark: Became a reference benchmark during 2023's context-window race, paving the way for later benchmarks like LongBench and RULER.                                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2307.11088v1), [Tweet](https://twitter.com/WenxiangJiao/status/1682208555762610176?s=20)                       |
| 7) **LoraHub** - Enables efficient cross-task generalization via dynamic LoRA composition.  <br>● Dynamic composition: Combines pre-trained LoRA modules via learned weights without human expertise or additional parameters/gradient updates.  <br>● Gradient-free optimization: Uses gradient-free algorithms like Nelder-Mead to find optimal LoRA weightings on a handful of examples.  <br>● ICL-matching performance: Matches the performance of in-context learning in few-shot settings while using much less inference compute.  <br>● Modular LLMs vision: Part of the broader push toward modular, composable adapter ecosystems - a direction still actively developed in 2024's MoE-of-LoRAs work.                                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2307.13269v1), [Tweet](https://twitter.com/_akhaliq/status/1684030297661403136?s=20)                           |
| 8) **Survey of Aligned LLMs** - A comprehensive overview of alignment approaches covering data, training, and evaluation.  <br>● Full-stack view: Covers preference data collection, RLHF variants, DPO-style direct methods, and alignment evaluation in one unified reference.  <br>● Taxonomy of methods: Organizes alignment techniques into clear families (outer alignment vs. inner alignment, value alignment vs. behavior alignment).  <br>● Practical pitfalls: Documents known failure modes like reward hacking, sycophancy, and mode collapse that practitioners should watch for.  <br>● Reference document: Frequently cited in alignment onboarding material as the first-pass overview for new researchers.                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2307.12966v1), [Tweet](https://twitter.com/omarsar0/status/1684960627423420419?s=20)                           |
| 9) **WavJourney** - Leverages LLMs to orchestrate audio generation models for compositional storytelling.  <br>● LLM as composer: Uses an LLM to plan scene-level audio scripts, then dispatches sub-prompts to specialized TTS, music, and sound-effect models.  <br>● Explainable structure: Produces intermediate audio scripts that users can inspect and edit, giving creative control rather than opaque end-to-end generation.  <br>● Storytelling workflow: Demonstrates long-form coherent audio stories with speech, music, and ambient sound combined into unified scenes.  <br>● Agentic audio precursor: An early example of LLM-as-orchestrator for multimedia generation - a pattern that matured in 2024 multi-modal agent frameworks.                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2307.14335v1), [Tweet](https://twitter.com/LiuXub/status/1684338437934002176?s=20)                             |
| 10) **FacTool** - A task- and domain-agnostic framework for factuality detection of LLM-generated text.  <br>● General framework: Unifies factuality detection across knowledge QA, code generation, math reasoning, and scientific literature review under a common pipeline.  <br>● Tool-augmented verification: Calls external tools (search engines, code executors, math solvers) to verify claims rather than relying on the LLM's internal judgment alone.  <br>● Benchmark release: Releases an accompanying benchmark dataset plus a ChatGPT plugin implementation for hands-on experimentation.  <br>● Practical fact-checking: Provided one of the first end-to-end fact-checking frameworks suitable for deployment alongside LLM chatbots.                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2307.13528v2), [Tweet](https://twitter.com/gneubig/status/1684658613921669120?s=20)                            |

---

## Top AI Papers of the Week (July 17 - July 23)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                     | **Links**                                                                                                                                                                                    |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Llama 2** - Meta's open-weight foundation model family with chat-tuned variants ranging from 7B to 70B parameters.  <br>● Open-weight release: Released pretrained and RLHF-tuned chat models under a permissive license that allowed commercial use, reshaping the open-source LLM landscape.  <br>● Training recipe: Pretrained on 2T tokens with 4K context; chat models use SFT followed by iterative RLHF with Ghost Attention (GAtt) for multi-turn consistency.  <br>● Safety investment: Extensive red-teaming, safety reward models, and context distillation produce chat models with strong helpfulness-safety trade-offs.  <br>● Ecosystem catalyst: Llama 2 became the base for hundreds of community fine-tunes (Vicuna, WizardLM, CodeLlama) and catalyzed the open-weight movement that 2024's Llama 3 and Mistral would extend. | [Paper](https://arxiv.org/abs/2307.09288v2), [Tweet](https://twitter.com/MetaAI/status/1681363272484945921?s=20)                                                                             |
| 2) **How is ChatGPT's Behavior Changing Over Time?** - Evaluates GPT-3.5 and GPT-4 over months to show significant behavioral drift in deployed systems.  <br>● Longitudinal measurement: Compares March vs. June 2023 snapshots of GPT-3.5 and GPT-4 on math, code, sensitive-question answering, and visual reasoning.  <br>● Large performance deltas: GPT-4's prime identification accuracy dropped from 97.6% to 2.4% between snapshots, demonstrating drift can be severe and non-monotonic.  <br>● Safety and format shifts: Code generation formatting, verbosity, and willingness to answer sensitive questions all changed substantially across versions.  <br>● Deployment implications: Highlighted the need for version pinning, regression testing, and behavioral monitoring when building on proprietary APIs - sparking major industry discussion.                                                   | [Paper](https://arxiv.org/abs/2307.09009v1), [Tweet](https://twitter.com/matei_zaharia/status/1681467961905926144?s=20)                                                                      |
| 3) **FlashAttention-2** - Tri Dao's follow-up to FlashAttention, dramatically improving attention throughput on modern GPUs.  <br>● Work partitioning: Redesigns parallelism so non-matmul FLOPs are reduced and thread blocks are better utilized across SMs.  <br>● ~2x speedup: Achieves approximately 2x speedup over FlashAttention-1 and reaches 50-73% of theoretical maximum FLOPs/s on A100.  <br>● Shared-memory communication: Parallelizes attention along sequence length, increases occupancy, and reduces cross-warp communication via shared memory.  <br>● Training infrastructure staple: Became the default attention kernel in PyTorch, HuggingFace, vLLM, and nearly every 2024 training stack for long-context models.                                                                                                                 | [Paper](https://arxiv.org/abs/2307.08691v1), [Tweet](https://twitter.com/tri_dao/status/1680987577913065472?s=20)                                                                            |
| 4) **Measuring Faithfulness in Chain-of-Thought Reasoning** - Anthropic's investigation into whether CoT reasoning actually reflects the model's internal decision process.  <br>● Intervention protocol: Uses paraphrasing, mistake-injection, and truncation of reasoning chains to test whether final answers depend on the visible reasoning.  <br>● Inverse scaling finding: Demonstrates that as models get larger and more capable, the reasoning becomes less faithful - an important inverse-scaling signal.  <br>● Task variability: Faithfulness varies significantly across tasks; some tasks/model-sizes support CoT that is meaningfully tied to the answer.  <br>● Interpretability foundation: Influential for subsequent interpretability and safety work on whether chain-of-thought can be trusted for monitoring model reasoning.                                                                          | [Paper](https://www-files.anthropic.com/production/files/measuring-faithfulness-in-chain-of-thought-reasoning.pdf), [Tweet](https://twitter.com/AnthropicAI/status/1681341063083229189?s=20) |
| 5) **Generative TV & Showrunner Agents** - Fable Studio's approach to generate episodic TV content using LLMs and multi-agent simulation.  <br>● Multi-agent storytelling: Uses agent simulation to generate plot, character actions, and dialogue which are then rendered as episodic content.  <br>● Full-pipeline generation: Integrates story generation, image/audio synthesis, and lip-sync into a single end-to-end show creation pipeline.  <br>● "South Park AI" demo: The accompanying animated demo in the style of South Park generated significant public attention as a preview of AI-generated entertainment.  <br>● AI creative industries: An early proof-of-concept for agent-driven entertainment production that informed later efforts in AI-generated TV, games, and interactive fiction.                                                                                        | [Paper](https://fablestudio.github.io/showrunner-agents/), [Tweet](https://twitter.com/fablesimulation/status/1681352904152850437?s=20)                                                      |
| 6) **Challenges & Application of LLMs** - A comprehensive enumeration of open challenges and application domains for LLMs.  <br>● Challenge taxonomy: Catalogs technical challenges (evaluation brittleness, prompt brittleness, hallucination, context limits, bias) and practical ones (cost, safety, data).  <br>● Application breadth: Reviews applications spanning education, law, medicine, chemistry, biology, and software engineering with honest accounting of current limitations.  <br>● Experimental-design gaps: Highlights the lack of robust experimental protocols in LLM evaluation - a prelude to 2024's improved eval practices.  <br>● Community reference: Frequently cited as a shared vocabulary for describing the 2023 state of LLM applied research.                                                                                                                    | [Paper](https://arxiv.org/abs/2307.10169), [Tweet](https://twitter.com/omarsar0/status/1681844380934500358?s=20)                                                                             |
| 7) **Retentive Network (RetNet)** - Microsoft's proposed foundation architecture aiming to replace Transformer attention for LLMs.  <br>● Three-mode formulation: Supports parallel training, recurrent inference, and chunkwise recurrent representation - combining Transformer-style training with RNN-style inference.  <br>● O(1) inference cost: Achieves constant-memory inference per step via the recurrent form, dramatically cheaper than attention's O(n) per-token cost.  <br>● Retention mechanism: Replaces softmax attention with an exponentially-decaying retention kernel that supports both parallel and recurrent computation.  <br>● Post-Transformer contender: Positioned alongside Mamba, RWKV, and Hyena as one of the credible attempts to dethrone attention - though attention remained dominant through 2024. | [Paper](https://arxiv.org/abs/2307.08621), [Tweet](https://twitter.com/arankomatsuzaki/status/1681113977500184576?s=20)                                                                      |
| 8) **Meta-Transformer** - A unified framework performing learning across 12 different modalities with a shared backbone.  <br>● 12-modality coverage: Handles text, image, point cloud, audio, video, X-Ray, infrared, hyperspectral, IMU, graph, tabular, and time-series data.  <br>● Frozen encoder design: Uses a frozen modality-agnostic encoder paired with modality-specific tokenizers and lightweight task heads.  <br>● Extreme generality: Demonstrates that a single backbone can serve both fundamental perception and practical applications like medical imaging and industrial sensing.  <br>● Universal encoder direction: Points toward future architectures where a single foundation model serves as the universal encoder for any modality.                                                                                                                                                         | [Paper](https://arxiv.org/abs/2307.10802), [Tweet](https://twitter.com/omarsar0/status/1682197751990288385?s=20)                                                                             |
| 9) **Retrieve In-Context Examples for LLMs** - A framework to iteratively train dense retrievers that identify high-quality in-context examples.  <br>● Iterative training: Trains retrievers using LLM feedback in an iterative loop - retrieved examples that help the LLM answer correctly are used as positive signals.  <br>● 30-task evaluation: Evaluated across 30 NLP tasks showing consistent improvements over random or similarity-based retrieval.  <br>● Pattern-similar examples: Confirms that examples sharing abstract patterns (not just surface similarity) are most useful for ICL.  <br>● Scale-invariant gains: Improvements are consistent across model sizes, suggesting dense retrieval is a robust ICL enhancement that transfers across model scales.                                                                                                                                    | [Paper](https://arxiv.org/abs/2307.07164), [Tweet](https://twitter.com/_akhaliq/status/1680770636166094848?s=20)                                                                             |
| 10) **FLASK** - Proposes fine-grained evaluation of LLMs decomposed into 12 alignment skill sets.  <br>● 12-skill taxonomy: Decomposes holistic LLM evaluation into skills like logical reasoning, factuality, commonsense, readability, harmlessness, etc.  <br>● Instance-level annotation: Each evaluation instance is labeled with which skills, domains, and difficulty levels it exercises, enabling fine-grained performance analysis.  <br>● Skill-specific insights: Reveals that models excel differently on different skills - useful for targeted model selection and iteration.  <br>● Evaluation paradigm shift: Part of the broader move from single-number benchmarks to multi-dimensional skill-based evaluation that shaped 2024's eval ecosystem.                                                                                                                                      | [Paper](https://arxiv.org/abs/2307.10928), [Tweet](https://twitter.com/SeonghyeonYe/status/1682209670302408705?s=20)                                                                         |

---

## Top AI Papers of the Week (July 10 - July 16)

| **Paper**                                                                                                                                                                                                                                                                                                | **Links**                                                                                                                                                                                             |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **CM3Leon** - Meta's retrieval-augmented multi-modal language model that generates both text and images.  <br>● Autoregressive multi-modal: Unifies text and image generation in a single autoregressive token-based architecture, handling both modalities in any order.  <br>● 5x training efficiency: Achieves SOTA image generation quality with 5x less training compute than comparable methods due to retrieval augmentation and instruction tuning.  <br>● Instruction tuning for images: Demonstrates that supervised fine-tuning and instruction tuning - originally developed for LLMs - also massively improves multimodal generation quality.  <br>● Any-to-any direction: Early proof-of-concept for unified any-to-any multi-modal models, pre-dating and inspiring 2024 systems like Chameleon and GPT-4o.                                                                  | [Paper](https://ai.meta.com/research/publications/scaling-autoregressive-multi-modal-models-pretraining-and-instruction-tuning/), [Tweet](https://twitter.com/MetaAI/status/1679885986363478018?s=20) |
| 2) **Claude 2** - Anthropic's second-generation LLM with a detailed model card on safety, alignment, and capabilities.  <br>● 100K context: Launched with a 100K token context window, enabling document-scale reasoning use cases that were impractical with earlier models.  <br>● Safety evaluations: Comprehensive safety evaluations including harmlessness benchmarks, bias probes, and red-teaming results transparently disclosed.  <br>● Capabilities gains: Significant improvements on coding (71.2% HumanEval), math (GSM8k), and legal reasoning over Claude 1.3.  <br>● Consumer release: First Claude model available to consumers via claude.ai in the US and UK, broadening Anthropic's public footprint.                                                                                                                                                                                                                                                 | [Paper](https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf), [Tweet](https://twitter.com/AnthropicAI/status/1678759122194530304?s=20)                                          |
| 3) **Secrets of RLHF in LLMs** - A deep investigation into RLHF with a focus on the inner workings of PPO, including open-source code.  <br>● PPO internals exposed: Documents critical implementation details (reward normalization, advantage estimation, KL penalty scaling) that aren't in the original papers but make or break training.  <br>● Empirical ablations: Systematically studies which PPO components matter most, providing practical guidance for RLHF practitioners.  <br>● Open-source code: Releases a clean reference implementation that others can use to reproduce and iterate on RLHF.  <br>● RLHF demystification: Part of a broader 2023 wave demystifying RLHF, preparing the ground for simpler alternatives like DPO that arrived later that year.                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2307.04964), [Tweet](https://twitter.com/omarsar0/status/1678938028918571009?s=20)                                                                                      |
| 4) **LongLLaMA** - Extends LLaMA's context length using a contrastive training process that reshapes the (key, value) space.  <br>● Focused Transformer: Uses contrastive training to make memory-augmented attention more discriminative, reducing distraction from irrelevant context.  <br>● Length extrapolation: Demonstrates long-context capability well beyond the original LLaMA 2K/4K window through its memory mechanism.  <br>● Long-context tasks: Shows improvements on passkey retrieval and long-form summarization tasks that stress long-range attention.  <br>● Efficient extension: Part of the 2023 explosion of context-window-extension techniques that would culminate in ~1M-token proprietary models the following year.                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2307.03170v1), [Tweet](https://twitter.com/s_tworkowski/status/1677125863429795840?s=20)                                                                                |
| 5) **Patch n' Pack: NaViT** - A vision transformer handling any aspect ratio and resolution through sequence packing.  <br>● Native resolution processing: Packs image patches of arbitrary resolution/aspect-ratio into a single sequence, preserving original information instead of resize-and-crop.  <br>● Flexible deployment: Enables compute-quality tradeoffs at inference time without requiring separate models per resolution.  <br>● Training efficiency: Sequence packing provides significant training efficiency gains versus fixed-resolution pipelines.  <br>● Foundation ViT update: Influenced subsequent multi-modal models (LLaVA, Qwen-VL) that adopted NaViT-style native-resolution image processing.                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2307.06304), [Tweet](https://twitter.com/m__dehghani/status/1679558751248850969?s=20)                                                                                   |
| 6) **LLMs as General Pattern Machines** - Demonstrates LLMs serve as general sequence modelers without additional training.  <br>● Zero-shot sequence modeling: Shows LLMs can complete arbitrary symbolic sequences, not just language - they're general pattern completers driven by in-context learning.  <br>● Word-to-action transfer: Applies pattern-completion to robotics, transferring abstract sequence patterns from text directly into robot action sequences.  <br>● Robotics without robot data: Achieves meaningful robot control without any training on robot data - purely through language model pattern-matching.  <br>● Conceptual framing: Influential perspective paper reframing LLMs as general compression/pattern machines rather than just language models.                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2307.04721), [Tweet](https://twitter.com/DrJimFan/status/1679898692307005440?s=20)                                                                                      |
| 7) **HyperDreamBooth** - A smaller, faster, and more efficient version of DreamBooth for personalizing text-to-image models.  <br>● HyperNetwork design: Uses a HyperNetwork to predict LoRA weights from a single input image, bypassing per-subject optimization.  <br>● 25x speedup: Achieves ~25x faster personalization than DreamBooth while maintaining visual fidelity to the subject.  <br>● Single-image input: Requires only one input image of the subject - a major UX improvement over prior methods needing 3-5 images.  <br>● On-device personalization: Compact adapter footprint makes HyperDreamBooth-style techniques attractive for on-device personalization in consumer apps.                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2307.06949), [Tweet](https://twitter.com/natanielruizg/status/1679893292618752000?s=20)                                                                                 |
| 8) **Teaching Arithmetic to Small Transformers** - Trains small transformers on chain-of-thought style data for arithmetic with large gains.  <br>● Data format matters: Shows that reformulating arithmetic into explicit step-by-step data dramatically improves small-model accuracy and convergence.  <br>● Emergence from curriculum: Fine-grained reasoning traces enable small transformers to learn multi-digit arithmetic that would otherwise require orders-of-magnitude more scale.  <br>● High-quality data thesis: Supports the emerging 2023 thesis that instructive, well-formatted data beats brute-force scaling for specific skills.  <br>● Small-model research: Informed the later Phi-series (Phi-1, Phi-1.5, Phi-2) "textbooks are all you need" data-quality research program.                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2307.03381), [Tweet](https://twitter.com/DimitrisPapail/status/1678407512637284352?s=20)                                                                                |
| 9) **AnimateDiff** - Animates frozen text-to-image diffusion models via a plug-in motion modeling module.  <br>● Motion module: Adds a motion modeling module on top of frozen T2I models that learns to produce temporally coherent frame sequences.  <br>● Model-agnostic: Works with any personalized T2I checkpoint (LoRAs, DreamBooth fine-tunes) without retraining - animating existing Stable Diffusion models.  <br>● Community adoption: Became the dominant open-source video generation tool in late 2023, powering countless community animations on ComfyUI and WebUI.  <br>● Open video generation: Established the architectural pattern (frozen image model + learned motion module) that many subsequent open video models followed.                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2307.04725v1), [Tweet](https://twitter.com/dreamingtulpa/status/1679459297946632193?s=20)                                                                               |
| 10) **Generative Pretraining in Multimodality (Emu)** - A transformer-based multimodal foundation model for generating images and text.  <br>● Unified pretraining: Pretrains on mixed image-text sequences to generate either modality in multimodal context.  <br>● Instruction tuning for assistants: Combines generative pretraining with instruction tuning to produce performant multimodal assistants.  <br>● In-context multimodal: Supports in-context learning across images and text, enabling few-shot multimodal tasks.  <br>● Multi-modal assistants: Part of the 2023 push (alongside LLaVA, MiniGPT-4) that established the pattern of visual-instruction-tuned assistants.                                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2307.05222v1), [Tweet](https://twitter.com/_akhaliq/status/1678939405170475008?s=20)                                                                                    |

---

## Top AI Papers of the Week (July 3 - July 9)

| **Paper**                                                                                                                                                                                                                                                                                                  | **Links**                                                                                                               |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| 1) **A Survey on Evaluation of LLMs** - A comprehensive overview of evaluation methods covering what, where, and how to evaluate LLMs.  <br>● Three-axis taxonomy: Organizes evaluation along what-to-evaluate (NLP tasks, robustness, ethics, trustworthiness), where-to-evaluate (benchmarks, datasets), and how-to-evaluate (automatic, human, LLM-as-judge).  <br>● Benchmark catalog: Surveys the major benchmarks of 2023 including MMLU, HELM, BIG-bench, and AgentBench with strengths and limitations.  <br>● Failure-mode analysis: Documents where current evaluations fall short - contamination, saturation, prompt sensitivity, and lack of task diversity.  <br>● Evaluation field primer: Became a standard citation for researchers entering LLM evaluation, helping formalize the sub-field.                                                      | [Paper](https://arxiv.org/abs/2307.03109), [Tweet](https://twitter.com/omarsar0/status/1677137934946803712?s=20)        |
| 2) **How Language Models Use Long Contexts (Lost-in-the-Middle)** - Shows LLM performance drops when relevant information is in the middle of a long context.  <br>● U-shaped performance curve: LMs perform best when relevant info is at the start or end of context, with substantial degradation for middle positions.  <br>● Cross-model phenomenon: Confirmed across GPT-3.5, GPT-4, Claude, and open-weight models - indicating a fundamental attention pattern rather than a bug.  <br>● QA and retrieval benchmarks: Demonstrated on multi-document QA and key-value retrieval tasks with varying context positions.  <br>● Foundational finding: Coined the phrase "lost in the middle" - one of the most widely-cited 2023 findings that shaped subsequent long-context benchmark and model design.                                                                                             | [Paper](https://arxiv.org/abs/2307.03172), [Tweet](https://twitter.com/nelsonfliu/status/1677373731948339202?s=20)      |
| 3) **LLMs as Effective Text Rankers** - A prompting technique that enables open-source LLMs to perform SOTA text ranking.  <br>● Pairwise ranking prompt: Uses pairwise prompting (A vs. B) rather than pointwise scoring, which aligns better with LLM reasoning strengths.  <br>● Open-source SOTA: Achieves state-of-the-art text ranking on standard benchmarks using only open-weight LLMs - no proprietary API required.  <br>● Retrieval pipeline fit: Designed to slot into existing retrieval pipelines as a re-ranker stage.  <br>● RAG infrastructure: Influenced 2024's RAG reranker ecosystem, with LLM-based reranking becoming standard in production retrieval stacks.                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2306.17563), [Tweet](https://twitter.com/arankomatsuzaki/status/1675673784454447107?s=20) |
| 4) **Multimodal Generation with Frozen LLMs** - Maps images to LLM token space enabling models like PaLM and GPT-4 to handle visual tasks without parameter updates.  <br>● Frozen LLM design: Keeps the underlying LLM completely frozen - only a lightweight image-to-token projection layer is trained.  <br>● Parameter-efficient multimodal: Enables multimodal capabilities without fine-tuning large LLMs, drastically reducing compute cost.  <br>● In-context visual tasks: Uses in-context learning to tackle VQA, image captioning, and visual reasoning with zero LLM modification.  <br>● Plug-in VLM pattern: An early example of the "frozen LLM + visual adapter" design that became dominant in open-source VLMs through 2024.                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2306.17842), [Tweet](https://twitter.com/roadjiang/status/1676375112914989056?s=20)       |
| 5) **CodeGen2.5** - Salesforce's new 7B code LLM trained on 1.5T tokens and optimized for fast sampling.  <br>● Small-but-competitive: 7B model matches or beats prior >15B code-generation models, demonstrating data quality can substitute for model scale.  <br>● Fast-sampling optimization: Architecturally tuned for inference speed, making it practical for IDE integration use cases.  <br>● Multilingual code: Handles multiple programming languages with strong Python, JavaScript, and TypeScript performance.  <br>● Open code LLM: Part of the 2023 open-source code LLM wave (CodeGen, StarCoder, CodeLlama) that made private code assistants viable for enterprise.                                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2305.02309), [Tweet](https://twitter.com/erik_nijkamp/status/1677055271104045056?s=20)    |
| 6) **Elastic Decision Transformer** - An advance over Decision Transformers that enables trajectory stitching at inference time.  <br>● Adaptive history length: Adjusts to shorter history at test time, enabling transitions to diverse and better future states.  <br>● Trajectory stitching: Unlike vanilla Decision Transformers that treat trajectories as fixed, EDT composes segments from different trajectories.  <br>● Offline RL gains: Achieves stronger performance on offline RL benchmarks where data quality and coverage vary.  <br>● Decision Transformer evolution: Part of the broader effort to make Decision Transformers competitive with Q-learning approaches on offline RL tasks.                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2307.02484), [Tweet](https://twitter.com/xiaolonw/status/1677003542249484289?s=20)        |
| 7) **Robots That Ask for Help** - A framework for calibrating LLM-based robot planners so they ask for help when uncertain.  <br>● Uncertainty alignment: Measures and aligns the uncertainty of LLM planners so help-requests correlate with real task difficulty.  <br>● Conformal prediction: Uses conformal prediction to provide rigorous statistical guarantees on when to defer to humans.  <br>● Safer autonomy: Reduces the risk of silent failures in robot deployments where an LLM confidently executes wrong plans.  <br>● Human-robot collaboration: An early contribution to the know-when-you-don't-know literature for LLM-driven agents - a theme that became central to 2024 agent safety work.                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2307.01928), [Tweet](https://twitter.com/allenzren/status/1677000811803443213?s=20)       |
| 8) **Physics-based Motion Retargeting in Real-Time** - Uses RL to retarget motions from sparse human sensor data to characters of various morphologies.  <br>● Physics simulator policies: Trains RL policies that control characters in a physics simulator, producing physically plausible motion.  <br>● Sparse sensor input: Works from sparse human sensor data (e.g., VR headset + controllers) rather than requiring full motion capture.  <br>● Cross-morphology: Generalizes across characters of different morphologies without per-character re-training.  <br>● VR/AR deployment: Practical for real-time VR/AR avatar control where users have only a few tracking points but want natural character motion.                                                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2307.01938), [Tweet](https://twitter.com/_akhaliq/status/1676822600478015488?s=20)        |
| 9) **Scaling Transformer to 1 Billion Tokens (LongNet)** - Microsoft's Transformer variant scaling sequence length past 1B tokens.  <br>● Dilated attention: Introduces dilated attention that exponentially grows the attention field, enabling linear complexity in sequence length.  <br>● No short-sequence loss: Achieves extreme long-context scaling with no degradation on shorter sequences.  <br>● 1B token demo: Demonstrates viability at the 1-billion token context scale - an order of magnitude beyond anything previously attempted.  <br>● Long-context frontier: Pushed the frontier of what's theoretically possible for ultra-long-context Transformers, even though production models stayed in the hundreds-of-thousands-of-tokens range. | [Paper](https://arxiv.org/abs/2307.02486), [Tweet](https://twitter.com/arankomatsuzaki/status/1676765133362675712?s=20) |
| 10) **InterCode** - A framework treating interactive coding as a reinforcement learning environment.  <br>● Interactive paradigm: Moves beyond static sequence-to-sequence coding benchmarks to multi-turn interactive coding with execution feedback.  <br>● Standardized RL environment: Provides Bash, SQL, and Python environments with consistent APIs for training and evaluating code agents.  <br>● Feedback-loop evaluation: Tests whether models can use execution errors, test failures, and intermediate outputs to iteratively improve their code.  <br>● Code-agent foundation: Anticipated and enabled the 2024 explosion of interactive coding agents (SWE-agent, OpenDevin, Aider) that leverage execution feedback loops.                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2306.14898), [Tweet](https://twitter.com/ShunyuYao12/status/1675903408727896066?s=20)     |

---

## Top AI Papers of the Week (June 26 - July 2)

| **Paper**                                                                                                                                                                                                                                                                                                                       | **Links**                                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| 1) **LeanDojo** - An open-source Lean playground consisting of toolkits, data, models, and benchmarks for theorem proving.  <br>● Theorem-proving infrastructure: Full stack for LLM-based theorem proving in Lean, including the first large-scale extraction of proof data from the Mathlib library.  <br>● ReProver model: Releases a retrieval-augmented LLM-based prover that selects relevant premises from a vast math library rather than memorizing everything.  <br>● Academic accessibility: Makes theorem-proving research accessible to smaller groups that lack the resources to build Lean tooling from scratch.  <br>● Formal math acceleration: A foundational piece that enabled 2024 breakthroughs like DeepMind's AlphaProof and the broader surge in LLM-driven formal math research.                                                                        | [Paper](https://arxiv.org/abs/2306.15626), [Tweet](https://twitter.com/KaiyuYang4/status/1673882824158613504?s=20)       |
| 2) **Extending Context Window of LLMs (PI)** - Position Interpolation extends LLaMA's context to 32K with minimal fine-tuning (within 1000 steps).  <br>● Position interpolation: Linearly interpolates positional indices so pretrained RoPE attention generalizes to longer sequences without breaking.  <br>● 1000-step adaptation: Requires only ~1000 fine-tuning steps versus prior methods that needed much more compute.  <br>● Quality preservation: Maintains strong performance on tasks while reaching 32K context - both long-context tasks and standard-length benchmarks.  <br>● Standard long-context recipe: Became the standard approach for extending open-source model context windows throughout 2023 and early 2024.                                                                                                                           | [Paper](https://arxiv.org/abs/2306.15595), [Tweet](https://twitter.com/omarsar0/status/1674073189800919042?s=20)         |
| 3) **Computer Vision Through the Lens of Natural Language** - A modular approach solving CV problems by routing through LLM reasoning.  <br>● Modular CV pipeline: Uses LLMs to reason over outputs from independent, descriptive vision modules that each provide partial information about an image.  <br>● Interpretable intermediate: Intermediate language descriptions are human-readable, improving debugability versus end-to-end VLMs.  <br>● Tool-augmented vision: Part of the broader "LLM as cognitive core" research direction where LLMs orchestrate specialized tools.  <br>● VLM alternative: Offers a complementary paradigm to end-to-end VLM training, trading compute for modularity and interpretability.                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.16410), [Tweet](https://twitter.com/arankomatsuzaki/status/1674219223856365569?s=20)  |
| 4) **Visual Navigation Transformer (ViNT)** - A foundation model for vision-based robotic navigation built on flexible Transformers.  <br>● Cross-embodiment: Works across different robotic platforms (quadrupeds, wheeled robots, drones) without per-robot retraining.  <br>● Pretrained + fine-tuned: Leverages pretrained vision models and fine-tunes on navigation-specific data for strong transfer.  <br>● Multi-task navigation: Handles goal-reaching, exploration, and map-building within a single Transformer backbone.  <br>● Robotics foundation models: An early robotics-specific foundation model that preceded RT-2 and the VLA explosion of late 2023.                                                                                                                                                                                                                                   | [Paper](https://arxiv.org/abs/2306.14846), [Tweet](https://twitter.com/svlevine/status/1673732522155601920?s=20)         |
| 5) **Generative AI for Programming Education** - Evaluates GPT-4 and ChatGPT on programming education scenarios versus human tutors.  <br>● Structured comparison: Compares GPT-4, ChatGPT, and human tutors on tasks like code explanation, bug fixing, and student-facing hint generation.  <br>● GPT-4 near-human: GPT-4 outperforms ChatGPT and comes close to human tutor performance on many education tasks.  <br>● Pedagogical limitations: Identifies gaps where LLMs still fall short - nuanced misconception detection, maintaining pedagogical scaffolding, avoiding spoiler answers.  <br>● EdTech roadmap: Influential for the wave of AI-powered coding education products that launched in 2024.                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2306.17156), [Tweet](https://twitter.com/_akhaliq/status/1674590713051242498?s=20)         |
| 6) **DragDiffusion** - Extends interactive point-based image editing to diffusion models.  <br>● Latent optimization: Optimizes the diffusion latent directly to achieve precise spatial control over image content.  <br>● DragGAN for diffusion: Brings the intuitive drag-to-edit interaction (popularized by DragGAN) to the more capable diffusion model backbone.  <br>● High-quality edits: Achieves high-quality edits while preserving overall image coherence - objects move realistically rather than just warping pixels.  <br>● Interactive generation: Part of the broader move toward interactive, controllable image generation over one-shot text-to-image.                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2306.14435), [Tweet](https://twitter.com/_akhaliq/status/1673570232429051906?s=20)         |
| 7) **Understanding Theory-of-Mind in LLMs with LLMs** - A framework for procedurally generating ToM evaluations using LLMs themselves.  <br>● LLM-generated benchmarks: Uses LLMs to procedurally create diverse ToM scenarios, avoiding benchmark contamination and enabling unlimited test generation.  <br>● Social reasoning study: Evaluates whether LLMs can track beliefs, intentions, and false beliefs of multiple agents - classic ToM challenges.  <br>● Controlled difficulty: Procedural generation allows varying difficulty (number of agents, nesting depth) to map capability boundaries.  <br>● Evaluation pattern: Early example of using LLMs to generate evaluations for LLMs - a pattern that would become standard in 2024 synthetic evaluation work.                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2306.15448), [Tweet](https://twitter.com/johnjnay/status/1673871545725505537?s=20)         |
| 8) **Evaluations with No Labels** - Self-supervised evaluation of LLMs via sensitivity/invariance to input transformations.  <br>● Label-free evaluation: Evaluates LLMs without requiring ground-truth labels, using consistency under input perturbations as the signal.  <br>● Transformation-based probes: Measures sensitivity or invariance to paraphrasing, irrelevant-context addition, and other transformations that shouldn't change correct answers.  <br>● Live deployment monitoring: Useful for monitoring LLM behavior on datasets streamed during production deployment, catching drift without manual labeling.  <br>● Deployment infrastructure: An early contribution to the continuous evaluation tooling that would become standard for 2024 LLM production systems.                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2306.13651v1), [Tweet](https://twitter.com/tomgoldsteincs/status/1673808766679097346?s=20) |
| 9) **Long-range Language Modeling with Self-Retrieval** - Jointly trains a retrieval-augmented LM from scratch for long-range modeling.  <br>● End-to-end retrieval training: Unlike retro-fitted RAG, trains the retriever and LM jointly from scratch for long-range consistency.  <br>● Long-form coherence: Targets tasks requiring retrieval of distant past context within a long document, not just factual lookup.  <br>● Architecture innovation: Introduces training procedures and architectural choices that make joint training stable and efficient.  <br>● Long-context RAG: Presaged the research direction of treating RAG and long-context as complementary rather than competing solutions.                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2306.13421), [Tweet](https://twitter.com/arankomatsuzaki/status/1673129191863140353?s=20)  |
| 10) **Scaling MLPs: A Tale of Inductive Bias** - Shows MLPs scale with compute despite their lack of inductive bias.  <br>● Pure-MLP scaling: Demonstrates that large pure-MLP models trained on enough data can reach surprisingly strong performance on image classification.  <br>● Inductive bias is compensable: Challenges the dogma that CNN/Transformer inductive biases are necessary - scale and data can substitute.  <br>● Bitter lesson evidence: Adds to the "bitter lesson" empirical evidence that general methods leveraging computation outperform those leveraging human-designed priors.  <br>● Architecture agnosticism: Part of the 2023 trend showing that many architectures (MLPs, State Space Models, RNNs, Transformers) converge at scale.                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2306.13575), [Tweet](https://twitter.com/ethanCaballero/status/1673725211907182592?s=20)   |

---

## Top AI Papers of the Week (June 19 - June 25)

| **Paper**                                                                                                                                                                                                                                                                                                                     | **Links**                                                                                                                 |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| 1) **Textbooks Are All You Need (phi-1)** - Introduces a 1.3B parameter code LLM trained on textbook-quality data.  <br>● Data-quality thesis: Trained on a curated selection of textbook-quality web data plus synthetic textbooks/exercises generated with GPT-3.5.  <br>● Small model, strong HumanEval: Achieves 50.6% pass@1 on HumanEval despite being 1.3B - beating much larger models on code generation.  <br>● 4-day training: Trained in just 4 days on 8 A100s, showing that aggressive data selection can substitute for massive compute.  <br>● Phi-series launch: Kicked off Microsoft's Phi-series (Phi-1.5, Phi-2, Phi-3) and catalyzed the "small-but-smart" model research program.                         | [Paper](https://arxiv.org/abs/2306.11644), [Tweet](https://twitter.com/SebastienBubeck/status/1671326369626853376?s=20)   |
| 2) **RoboCat** - DeepMind's self-improving foundation agent that operates different robotic arms from as few as 100 demonstrations.  <br>● Cross-embodiment: Single agent controls multiple different robotic arms and grippers, generalizing across hardware.  <br>● Self-improving loop: Generates new training data via fine-tuning on its own demonstrations, progressively improving its own capabilities.  <br>● Few-shot adaptation: Adapts to new tasks from as few as 100 demonstrations - practical for real-world deployment.  <br>● Robotics foundation agent: A key data point that robotics was moving toward the same foundation-model + self-improvement paradigm as LLMs.                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2306.11706), [Tweet](https://twitter.com/DeepMind/status/1671171448638144515?s=20)          |
| 3) **ClinicalGPT** - A language model optimized through extensive and diverse medical data and multi-turn dialogue.  <br>● Medical data diversity: Trained on medical records, domain knowledge corpora, and multi-round consultation dialogues spanning multiple medical specialties.  <br>● Chinese medical focus: Strong coverage of Chinese medical data, filling a gap that general-purpose medical LLMs didn't address.  <br>● Dialog-first design: Optimized for realistic multi-turn consultations rather than single-shot medical QA.  <br>● Regional medical LLMs: Part of the broader trend of region/language-specific medical LLMs emerging alongside global systems like Med-PaLM.                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2306.09968), [Tweet](https://twitter.com/omarsar0/status/1670606068777381890?s=20)          |
| 4) **An Overview of Catastrophic AI Risks** - Dan Hendrycks' comprehensive overview of catastrophic AI risk categories.  <br>● Four risk categories: Organizes catastrophic AI risks into malicious use, AI race dynamics, organizational risks, and rogue AIs.  <br>● Policy-relevant framing: Written for researchers, policymakers, and the broader public - influenced AI governance discussions through 2023-2024.  <br>● Risk concretization: Grounds abstract risk discussions in specific, plausible scenarios that can be analyzed and mitigated.  <br>● Governance reference: Widely cited in AI policy proposals, UK AI Safety Summit materials, and national AI strategies.                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2306.12001v1), [Tweet](https://twitter.com/DanHendrycks/status/1671894767331061763?s=20)    |
| 5) **LOMO** - A memory-efficient optimizer that combines gradient computation and parameter update in one step.  <br>● Fused grad-update: Fuses backpropagation and SGD update into a single operation, eliminating the need to store all gradients in memory simultaneously.  <br>● Full-parameter tuning: Enables full-parameter fine-tuning of a 65B LLM on a single 8x24GB GPU machine.  <br>● Democratization: Makes full fine-tuning (not just LoRA) accessible to researchers without multi-node GPU clusters.  <br>● Optimizer memory research: Joined the 2023 wave of optimizer memory innovations (8-bit Adam, AdaFactor, GaLore) democratizing large-model tuning.                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2306.09782), [Tweet](https://twitter.com/arankomatsuzaki/status/1670603218659811330?s=20)   |
| 6) **SequenceMatch** - Formulates sequence generation as imitation learning, enabling backtracking via a backspace action.  <br>● Imitation learning framing: Views autoregressive generation as imitation learning with expert data, opening the door to standard IL techniques.  <br>● Backspace action: Introduces a "backspace" action that lets the model undo tokens that led to out-of-distribution sequences.  <br>● Compounding error mitigation: Addresses the classical autoregressive problem where small early errors compound catastrophically.  <br>● Training innovation: An interesting precursor to later work on self-correcting LLMs and reasoning with error recovery.                                                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2306.05426), [Tweet](https://twitter.com/abacaj/status/1671636061494059009?s=20)            |
| 7) **LMFlow** - An extensible and lightweight toolkit for fine-tuning and inference of large foundation models.  <br>● Full training stack: Supports continuous pretraining, instruction tuning, parameter-efficient fine-tuning, alignment tuning, and inference in one toolkit.  <br>● Lightweight design: Easier to use and extend than heavier frameworks like Megatron or DeepSpeed for practitioners who want to iterate quickly.  <br>● Community adoption: Became a popular tool in the open-source LLM ecosystem for reproducing fine-tuning recipes.  <br>● Training ecosystem: Part of the broader 2023 proliferation of accessible LLM training tooling (Axolotl, LLaMA-Factory, LitGPT) that enabled community fine-tuning.                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2306.12420), [Tweet](https://twitter.com/omarsar0/status/1671881864930549761?s=20)          |
| 8) **MotionGPT** - Generates consecutive human motions from multimodal control signals via LLM instructions.  <br>● Motion quantization: Quantizes motion into discrete tokens that LLMs can produce in the same stream as text.  <br>● Multimodal control: Accepts text, audio, and other control signals as input, producing corresponding human motion outputs.  <br>● LLM-as-motion-generator: Treats motion generation as a token-prediction task, unifying motion with other LLM capabilities.  <br>● Animation and VR: Applicable to character animation, VR avatars, and content creation workflows where text-driven motion is valuable.                                                                                                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2306.10900v1), [Tweet](https://twitter.com/arankomatsuzaki/status/1671341916980490241?s=20) |
| 9) **Wanda** - A simple, effective pruning approach for LLMs requiring no retraining.  <br>● Weight×activation pruning: Prunes weights with the smallest magnitude × corresponding input activations on a per-output basis.  <br>● Zero retraining: Requires no retraining or weight updates, making it immediately deployable.  <br>● Simple beats complex: Outperforms magnitude-only pruning and matches or exceeds more complex training-based pruning methods.  <br>● Production pruning: Became a widely-adopted baseline in LLM pruning research due to its simplicity and strong performance.                                                                                                                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2306.11695), [Tweet](https://twitter.com/Yampeleg/status/1671885220218560516?s=20)          |
| 10) **AudioPaLM** - Fuses PaLM-2 and AudioLM into a multimodal architecture supporting speech understanding and generation.  <br>● Unified speech-text: Represents both speech and text as tokens in a shared vocabulary, enabling any-to-any conversion between modalities.  <br>● Zero-shot translation: Performs zero-shot speech-to-text translation into languages never seen as translation targets during training.  <br>● Speech generation: Generates high-quality speech in the voice of the input speaker while preserving prosody.  <br>● Unified speech foundation: A precursor to 2024's fully multimodal systems like GPT-4o that natively process and generate speech.                                                                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2306.12925v1), [Tweet](https://twitter.com/PaulKRubenstein/status/1672128984220413953?s=20) |

---

## Top AI Papers of the Week (June 12 - June 18)

| **Paper**                                                                                                                                                                                                                                                                                                                                 | **Links**                                                                                                                                                                                        |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 1) **Voicebox** - Meta's all-in-one generative speech model supporting 6 languages and many speech tasks in-context.  <br>● Flow-matching training: Uses flow-matching with text-guided context to unify TTS, denoising, editing, and style transfer in one model.  <br>● 20x faster: Outperforms specialized TTS systems while running 20x faster than prior state-of-the-art diffusion-based speech models.  <br>● Speech ICL: Supports in-context learning for speech - give it an audio prompt and it matches the speaker's voice, style, and prosody zero-shot.  <br>● Generalist speech: A major step toward generalist speech foundation models that would accelerate with 2024 systems like VoiceCraft and XTTS. | [Paper](https://research.facebook.com/publications/voicebox-text-guided-multilingual-universal-speech-generation-at-scale/), [Tweet](https://twitter.com/MetaAI/status/1669766837981306880?s=20) |
| 2) **FinGPT** - An open-source LLM for the finance sector with a data-centric approach.  <br>● Data-centric finance: Focuses on curating high-quality financial data (SEC filings, earnings calls, news, market data) as the key lever for FinLLM quality.  <br>● Accessible resources: Provides pipelines, fine-tuning scripts, and evaluation benchmarks so practitioners can develop their own FinLLMs.  <br>● Multi-task financial NLP: Covers sentiment analysis, earnings surprise prediction, news summarization, and more within a unified framework.  <br>● Open finance AI: An early open-source counterpoint to proprietary financial LLMs like BloombergGPT, accelerating community research.                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2306.06031), [Tweet](https://twitter.com/omarsar0/status/1668060502663077891?s=20)                                                                                 |
| 3) **Crowd Workers Widely Use LLMs for Text Production** - Empirical evidence that 33-46% of MTurk crowd workers used LLMs on text tasks.  <br>● LLM-generated contamination: Estimates that a third to almost half of crowd-worker text production involved LLMs - a massive data quality issue.  <br>● Benchmark contamination risk: Implications for NLP datasets produced via crowdsourcing, potentially invalidating many "human baseline" numbers.  <br>● Methodology: Uses statistical analysis comparing completion times, stylistic features, and output consistency to estimate LLM usage.  <br>● Community wake-up: Sparked widespread discussion about the future of human-generated data and the need for AI-usage detection.                                                                                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2306.07899v1), [Tweet](https://twitter.com/manoelribeiro/status/1668986074801098754?s=20)                                                                          |
| 4) **Reliability of Watermarks for LLMs** - Studies whether watermarks survive human rewriting and LLM paraphrasing.  <br>● Robustness testing: Evaluates whether watermarks remain detectable after human rewrites, paraphrasing attacks, and translation round-trips.  <br>● Surprisingly robust: Finds that statistical watermarks (Kirchenbauer et al.) remain detectable even after aggressive transformations, with enough output text.  <br>● Text-length dependence: Detection confidence scales with text length - short watermarked snippets are much easier to obliterate than long ones.  <br>● AI detection realism: Provides a sober evaluation of watermarking's practical viability amid concerns about AI-generated content.                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2306.04634), [Tweet](https://twitter.com/tomgoldsteincs/status/1668668484975464448?s=20)                                                                           |
| 5) **Applications of Transformers** - A new survey highlighting major applications of Transformers across deep learning.  <br>● Cross-domain coverage: Surveys Transformers in NLP, vision, speech, multi-modal, reinforcement learning, graph, and time-series tasks.  <br>● Model catalog: Comprehensive list of Transformer architectures with their design choices and application niches.  <br>● Application-driven taxonomy: Organizes by application domain rather than architecture, useful for practitioners evaluating Transformers for new domains.  <br>● Reference document: A broad reference for teaching material and onboarding readings on the Transformer architecture's reach.                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.07303), [Tweet](https://twitter.com/omarsar0/status/1668989324950491139?s=20)                                                                                 |
| 6) **Benchmarking NN Training Algorithms (AlgoPerf)** - A new benchmark for rigorously evaluating optimizers using realistic workloads.  <br>● Realistic workloads: Tests optimizers on actual production-scale tasks (ImageNet, language modeling, translation) rather than toy problems.  <br>● Wall-clock benchmarking: Evaluates optimizers on time-to-target-accuracy rather than just step counts, reflecting real training budgets.  <br>● Hyperparameter rules: Standardizes hyperparameter tuning budgets for fair cross-optimizer comparisons.  <br>● Optimizer research infrastructure: Enabled credible claims about new optimizers versus Adam and SGD - raising the bar for optimizer papers going forward.                                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2306.07179), [Tweet](https://twitter.com/zacharynado/status/1668683433944424448?s=20)                                                                              |
| 7) **Unifying LLMs & Knowledge Graphs** - A roadmap for combining LLMs with knowledge graphs for stronger reasoning.  <br>● Three integration paradigms: Organizes integration into KG-enhanced LLMs (pretraining/inference), LLM-augmented KGs (QA, completion), and synergized LLM+KG reasoning.  <br>● Bidirectional reasoning: Argues for bidirectional systems where KGs ground LLM claims and LLMs extend KGs, rather than one-way augmentation.  <br>● Hallucination mitigation: Positions KG grounding as a principled tool for reducing LLM hallucinations.  <br>● Hybrid AI direction: Influential for the 2024 resurgence of knowledge-graph + LLM systems, especially in enterprise search and agents.                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.09310), [Tweet](https://twitter.com/johnjnay/status/1670051081722769408?s=20)                                                                                 |
| 8) **Augmenting LLMs with Long-term Memory (LongMem)** - Enables LLMs to memorize long history via memory-augmented adaptation.  <br>● Memory-augmented training: Dedicated adaptation training teaches the LLM to retrieve and use its memory of long past context.  <br>● ICL over long history: Enables in-context learning that spans far longer contexts than the model's raw attention window.  <br>● Decoupled retrieval: Separates the retrieval mechanism from the main model, allowing memory to grow without increasing model size.  <br>● Long-context direction: Part of 2023's multi-pronged attack on context-window limits, complementary to position interpolation and ring attention.                                                                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2306.07174), [Tweet](https://twitter.com/arankomatsuzaki/status/1668429602841317378?s=20)                                                                          |
| 9) **TAPIR** - Tracks any queried point on any physical surface throughout a video sequence faster than real-time.  <br>● Any-point tracking: Generalizes object tracking to arbitrary query points, handling occlusions and re-appearances robustly.  <br>● Faster than real-time: On modern GPUs, tracks points faster than real-time on long, high-resolution videos - practical for real-world applications.  <br>● SOTA across benchmarks: Outperforms all prior baselines on standard point-tracking benchmarks.  <br>● Video understanding building block: Point tracking is a fundamental primitive for video understanding, editing, and robotics - TAPIR made it practical.                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2306.08637), [Tweet](https://twitter.com/AdamWHarley/status/1669785589246468096?s=20)                                                                              |
| 10) **Mind2Web** - A dataset for evaluating generalist web agents with 2,350 tasks across 137 websites and 31 domains.  <br>● Broad web coverage: 137 real-world websites across 31 domains (travel, shopping, information seeking) - far more diverse than prior web benchmarks.  <br>● Generalization-focused: Tests cross-task, cross-website, and cross-domain generalization rather than in-distribution performance.  <br>● Realistic tasks: Uses real user tasks rather than synthetic scripts, capturing the messiness of actual web interactions.  <br>● Web-agent benchmark: Became a central benchmark for the 2024 explosion of web agents (WebAgent, WebVoyager, Browser Use, Operator).                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2306.06070), [Tweet](https://twitter.com/DrJimFan/status/1669403956064432128?s=20)                                                                                 |

---

## Top AI Papers of the Week (June 5 - June 11)

| **Paper**                                                                                                                                                                                                                                                                                                                      | **Links**                                                                                                                          |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Tracking Everything Everywhere All at Once (OmniMotion)** - Test-time optimization for dense, long-range motion estimation.  <br>● Per-pixel motion: Estimates motion for every pixel across every frame of a video, producing dense long-range trajectories.  <br>● Test-time optimization: Optimizes a quasi-3D representation per video at test time, producing coherent long-range correspondences.  <br>● Through occlusions: Maintains point tracking even through long occlusions and complex camera motion - prior methods struggled with both.  <br>● Video understanding primitive: A foundational capability that enables downstream video editing, object removal, and 3D reconstruction applications.                                                                                                            | [Paper](https://arxiv.org/abs/2306.05422), [Tweet](https://twitter.com/sstj389/status/1667000331958468608?s=20)                    |
| 2) **AlphaDev** - DeepMind's deep RL agent discovering faster sorting algorithms from scratch, now in LLVM.  <br>● Assembly-level discovery: Searches over CPU assembly instructions rather than high-level code, finding micro-optimizations humans would miss.  <br>● LLVM integration: Discovered sorting routines were integrated into the LLVM C++ standard library - the first major AI-discovered algorithm in production compiler infrastructure.  <br>● Human-beating benchmarks: Found 70% faster sorting for very small inputs and 1.7% faster for large inputs, running billions of times per day worldwide.  <br>● Algorithm discovery AI: A proof point for AI-driven algorithm discovery that would later be extended to matrix multiplication (AlphaEvolve) and other primitives. | [Paper](https://www.nature.com/articles/s41586-023-06004-9), [Tweet](https://twitter.com/omarsar0/status/1666486491793481738?s=20) |
| 3) **Sparse-Quantized Representation (SpQR)** - Tim Dettmers' near-lossless LLM compression technique.  <br>● 4.75-bit inference: Enables LLM inference at 4.75 bits per parameter with a 15% speedup over FP16 baselines.  <br>● Near-lossless: Maintains model quality close to full-precision, with degradation measured in fractions of a percent on standard benchmarks.  <br>● Outlier-aware quantization: Identifies and preserves sensitive "outlier" weights in higher precision while aggressively quantizing the rest.  <br>● Quantization lineage: Part of Dettmers' influential quantization research (LLM.int8, QLoRA, SpQR) that made large-model inference accessible on consumer hardware.                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2306.03078), [Tweet](https://twitter.com/Tim_Dettmers/status/1666076553665744896?s=20)               |
| 4) **MusicGen** - A simple and controllable model for music generation using a single-stage Transformer.  <br>● Single-stage design: Unlike prior hierarchical music models, MusicGen uses a single Transformer predicting interleaved audio tokens.  <br>● Multi-conditioning: Supports conditioning on text descriptions, melody audio, or both simultaneously.  <br>● SOTA on text-to-music: Achieves strong performance on standard text-to-music benchmarks while being simpler to train and deploy.  <br>● Open music generation: Meta's open release of MusicGen weights and code democratized music generation research and spawned community applications.                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2306.05284), [Tweet](https://twitter.com/syhw/status/1667103478471176192?s=20)                       |
| 5) **Augmenting LLMs with Databases (ChatDB)** - Combines an LLM with SQL databases as a symbolic memory framework.  <br>● LLM-orchestrated SQL: The LLM generates SQL queries to read from and write to a database as its persistent memory.  <br>● Structured reasoning: By externalizing state to a database, enables LLMs to handle complex multi-step tasks with consistent memory.  <br>● Symbolic memory: Offers a more reliable alternative to embedding-based memory for tasks requiring exact recall and structured queries.  <br>● Tool-use precursor: Part of the early 2023 research establishing LLM-as-orchestrator patterns that matured into today's agent frameworks.                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.03901), [Tweet](https://twitter.com/omarsar0/status/1666254609524961282?s=20)                   |
| 6) **Concept Scrubbing in LLM (LEACE)** - Least-squares Concept Erasure - erases a target concept from every layer of a neural network.  <br>● Closed-form erasure: Provides a closed-form solution for removing linearly-encoded concepts (like gender) from representations at every layer.  <br>● Theoretical guarantees: Mathematically guarantees the concept cannot be linearly recovered after erasure.  <br>● Bias reduction: Applied to reduce gender bias in BERT embeddings while minimizing impact on other capabilities.  <br>● Interpretability tool: Became a standard tool in the model-editing and interpretability literature for studying what information models use.                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.03819) , [Tweet](https://twitter.com/norabelrose/status/1666469917636571137?s=20)               |
| 7) **Fine-Grained RLHF** - Trains LMs with segment-level human feedback rather than whole-response preferences.  <br>● Segment-level rewards: Provides multiple reward models targeting specific dimensions (factuality, relevance, fluency) at the span level.  <br>● Long-form QA gains: Substantial improvements on long-form question answering where whole-response preferences are too coarse.  <br>● Toxicity reduction: Enables targeted reduction of toxic spans without degrading overall response quality.  <br>● Controllable RLHF: Enables model customization by emphasizing different reward dimensions at inference time.                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2306.01693), [Tweet](https://twitter.com/zeqiuwu1/status/1665785626552049665?s=20)                   |
| 8) **Hierarchical Vision Transformer (Hiera)** - Pretrains ViTs with MAE while removing unnecessary multi-stage complexity.  <br>● Simplified architecture: Strips away hand-designed components (shifted windows, relative position biases) from hierarchical ViTs like Swin.  <br>● MAE pretraining: Leverages masked autoencoder pretraining to compensate for reduced inductive bias.  <br>● Faster and more accurate: Achieves better accuracy and faster inference/training than prior hierarchical ViTs.  <br>● Architecture minimalism: Reinforces the "bitter lesson" direction - simpler architectures with better pretraining beat complex hand-designed ones.                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2306.00989), [Tweet](https://twitter.com/MetaAI/status/1665759715765411840?s=20)                     |
| 9) **Humor in ChatGPT** - Explores ChatGPT's capabilities to grasp and reproduce humor.  <br>● Joke repetition: Over 90% of 1,008 generated jokes were the same 25 jokes - revealing extreme mode collapse in humor generation.  <br>● Structural overfitting: ChatGPT is overfit to particular joke structures (e.g., "Why did X? Because Y") and struggles with diverse humor styles.  <br>● Humor comprehension: While generation is limited, ChatGPT can explain joke structure and recognize humor - showing a partial understanding.  <br>● Creativity evaluation: An influential paper in the creativity-evaluation literature, documenting specific failures of LLM creative generation.                                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2306.04563), [Tweet](https://twitter.com/AlbertBoyangLi/status/1666707728272850944?s=20)             |
| 10) **Imitating Reasoning Process of Larger LLMs (Orca)** - Microsoft's 13B model that imitates GPT-4's reasoning traces.  <br>● Explanation tuning: Trains on detailed step-by-step explanations from GPT-4, not just final answers - capturing the reasoning process.  <br>● Scale and diversity: Leverages millions of diverse imitation examples spanning reasoning tasks, dialogue, and instruction-following.  <br>● Beats Vicuna-13B: Surpasses instruction-tuned Vicuna-13B in zero-shot reasoning, demonstrating explanation-data quality matters.  <br>● Small-model reasoning: Kicked off a line of research on reasoning distillation that would continue through Orca 2 and into 2024's reasoning-specific SLMs.                                                                                                                                                                                                   | [Paper](https://arxiv.org/abs/2306.02707), [Tweet](https://twitter.com/johnjnay/status/1665906453587034112?s=20)                   |

---

## Top AI Papers of the Week (May 29-June 4)

| **Paper**                                                                                                                                                                                                                                                                                                                      | **Links**                                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------ |
| 1) **Let's Verify Step by Step** - OpenAI's landmark paper on process reward models for mathematical reasoning.  <br>● Process supervision: Rewards each correct step of reasoning rather than just the final answer, capturing partial credit and providing much denser training signal.  <br>● 78% MATH solve rate: Achieves state-of-the-art on a representative subset of the MATH benchmark - a significant jump over outcome-reward baselines.  <br>● PRM800K dataset: Releases a massive dataset of 800K step-level correctness labels, enabling follow-up research on process reward models.  <br>● Reasoning revolution foundation: Directly influenced OpenAI's o1/o3 reasoning models and the broader 2024-25 push toward process-supervised reasoning.                        | [Paper](https://arxiv.org/abs/2305.20050), [Tweet](https://twitter.com/OpenAI/status/1663957407184347136?s=20)           |
| 2) **No Positional Encodings (NoPE)** - Shows explicit position embeddings aren't essential for decoder-only Transformers.  <br>● Implicit positional learning: Decoder-only Transformers learn positional information from the causal attention mask alone - no explicit encoding needed.  <br>● Length generalization: NoPE generalizes better to longer sequences than ALiBi and Rotary, which have surprising length-generalization issues.  <br>● Architectural simplification: Removing positional encodings simplifies the architecture with no quality loss on standard tasks.  <br>● Long-context influence: Informed the 2024 resurgence of interest in length-generalization-friendly architectures.                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2305.19466), [Tweet](https://twitter.com/a_kazemnejad/status/1664277559968927744?s=20)     |
| 3) **BiomedGPT** - A unified biomedical GPT for vision, language, and multimodal tasks.  <br>● Unified biomedical model: Single model handling 5 task types across 20 public datasets spanning 15+ biomedical modalities (images, text, genomics).  <br>● SOTA across benchmarks: Achieves state-of-the-art on biomedical VQA, summarization, and classification benchmarks.  <br>● Generalist medical direction: Complements Med-PaLM M in proving that generalist medical AI models outperform task-specific specialists.  <br>● Medical AI democratization: As an open model, makes generalist biomedical AI accessible to academic medical centers and healthcare startups.                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2305.17100), [Tweet](https://twitter.com/omarsar0/status/1662992484576681986?s=20)         |
| 4) **Thought Cloning** - Imitation learning framework that learns to think as well as act.  <br>● Cloning thoughts AND behavior: Clones both the actions and the internal verbal thoughts of human demonstrators, not just behavioral trajectories.  <br>● BabyAI benchmark: Demonstrated on BabyAI with substantial improvement over behavior-only cloning, especially on out-of-distribution tasks.  <br>● Interpretability bonus: Because the agent thinks in natural language, its decisions are interpretable and debuggable.  <br>● Reasoning-agent precursor: A conceptual precursor to 2024's "reasoning agents" that produce explicit thought traces before acting.                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2306.00323), [Tweet](https://twitter.com/johnjnay/status/1664798780644904960?s=20)         |
| 5) **Fine-Tuning Language Models with Just Forward Passes (MeZO)** - A memory-efficient zeroth-order optimizer for LLM fine-tuning.  <br>● No backpropagation: Uses a memory-efficient zeroth-order SGD algorithm that requires only forward passes, eliminating the memory overhead of backprop.  <br>● Inference-like memory: Fine-tunes large LLMs with the same memory footprint as inference - democratizes full-parameter fine-tuning.  <br>● Comparable quality: Reaches comparable quality to backpropagation-based fine-tuning on many tasks despite using only forward passes.  <br>● Memory-constrained tuning: Opens new possibilities for fine-tuning huge models on modest hardware by trading compute for memory.                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2305.17333) , [Tweet](https://twitter.com/arankomatsuzaki/status/1663360307274690560?s=20) |
| 6) **MERT** - An acoustic music understanding model with large-scale self-supervised training.  <br>● Music-specific SSL: Designed specifically for music (not speech/general audio) with appropriate teacher models and training objectives.  <br>● Multi-teacher design: Combines multiple teacher models to capture different aspects of music (pitch, rhythm, timbre, harmony).  <br>● Cross-task performance: Outperforms speech and generic audio approaches on music understanding benchmarks (genre, mood, tagging).  <br>● Music foundation model: Part of the 2023 push toward domain-specific audio foundation models rather than one-size-fits-all speech/audio models.                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2306.00107) , [Tweet](https://twitter.com/yizhilll/status/1664680921146982401?s=20)        |
| 7) **Bytes Are All You Need** - Performs classification directly on file bytes without decoding.  <br>● Raw-byte input: Trains Transformers directly on raw file bytes (PNG, WAV, etc.) rather than decoded tensors.  <br>● Strong results: Achieves 77.33% ImageNet Top-1 accuracy on raw bytes and 95.42% on raw WAV for Speech Commands v2.  <br>● Format-agnostic: A single architecture handles any file format without preprocessing pipelines.  <br>● Infrastructure simplification: Suggests a future where models eat raw bytes and skip format-specific codecs - simpler pipelines with less preprocessing error.                                                                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2306.00238), [Tweet](https://twitter.com/_akhaliq/status/1664497650702471169?s=20)         |
| 8) **Direct Preference Optimization (DPO)** - Rafailov et al.'s simpler alternative to RLHF that rivals full RL-based alignment.  <br>● Classification, not RL: Reformulates preference learning as a classification problem on preference pairs, skipping the complex RL loop entirely.  <br>● Theoretical equivalence: Mathematically equivalent to RLHF under certain assumptions, extracting the implicit reward function directly.  <br>● Training stability: Much more stable and hyperparameter-robust than PPO-based RLHF, dramatically lowering the barrier to entry.  <br>● Industry-wide adoption: Became the default alignment method throughout 2024 (Zephyr, Tulu, Llama 3 pipelines) and ushered in the era of RL-free preference optimization.                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2305.18290), [Tweet](https://twitter.com/archit_sharma97/status/1663595372269408261?s=20)  |
| 9) **SQL-PaLM** - An LLM-based Text-to-SQL system built on PaLM-2.  <br>● SOTA in both settings: Achieves state-of-the-art on Spider benchmark in both in-context learning and fine-tuning settings.  <br>● Beats GPT-4 few-shot: Few-shot SQL-PaLM outperforms few-shot GPT-4 by 9.9% using a simple prompting approach.  <br>● Improves on fine-tuned baselines: The few-shot setting even outperforms the previous fine-tuned SOTA by 3.8%.  <br>● Text-to-SQL direction: Part of the Text-to-SQL surge that led to production NL-to-SQL systems in analytics platforms through 2024.                                                                                                                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2306.00739), [Tweet](https://twitter.com/omarsar0/status/1664441085693657088?s=20)         |
| 10) **CodeTF** - An open-source Transformer library for state-of-the-art code LLMs.  <br>● Code-LLM infrastructure: Provides pretrained code LLMs, popular code benchmarks, and standard methods for training and serving them efficiently.  <br>● Unified interface: Consistent API across different code LLMs makes comparison and swapping straightforward.  <br>● Benchmark-driven: Built-in evaluation on HumanEval, MBPP, and other code benchmarks enables easy empirical comparisons.  <br>● Open-source code AI: Part of the 2023 expansion of open-source code LLM tooling that made private coding assistants practical for enterprise.                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2306.00029), [Tweet](https://twitter.com/stevenhoi/status/1664483010954272770?s=20)        |

---

## Top AI Papers of the Week (May 22-28)

| **Paper**                                                                                                                                                                                                                                                                                                       | **Links**                                                                                                                |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| 1) **QLoRA** - Tim Dettmers' breakthrough technique enabling 65B LLM fine-tuning on a single 48GB GPU.  <br>● 4-bit NF4 quantization: Introduces the NormalFloat 4-bit datatype optimized for normally-distributed weights with double-quantization for further memory savings.  <br>● Paged optimizers: Uses paged NVIDIA Unified Memory to handle optimizer state memory spikes without OOM failures.  <br>● 16-bit quality: Achieves quality matching full 16-bit fine-tuning despite aggressive quantization during training.  <br>● Community fine-tuning enabler: Arguably the single most impactful 2023 paper for democratizing LLM fine-tuning - powered thousands of community checkpoints on Hugging Face.                    | [Paper](https://arxiv.org/abs/2305.14314), [Tweet](https://twitter.com/Tim_Dettmers/status/1661379354507476994?s=20)     |
| 2) **LIMA** - Meta's 65B LLaMA fine-tuned on just 1,000 curated examples - showing alignment needs less data than believed.  <br>● 1,000-example SFT: Achieves strong alignment with only 1,000 carefully curated prompt-response pairs, no RLHF needed.  <br>● "Superficial Alignment Hypothesis": Proposes that a model's knowledge is learned in pretraining and alignment mostly teaches response style.  <br>● GPT-4 competitive: Generates responses preferred over or equivalent to GPT-4 in 43% of cases, and much higher versus Bard.  <br>● Data-quality over quantity: Became a foundational reference for the "quality over quantity" SFT paradigm that dominated later alignment work.                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2305.11206), [Tweet](https://twitter.com/violet_zct/status/1660789120069926912?s=20)       |
| 3) **Voyager** - An LLM-powered embodied lifelong learning agent in Minecraft exploring autonomously.  <br>● Skill library: Maintains a growing library of skills written as code - new skills are composed from existing ones, creating cumulative learning.  <br>● Automatic curriculum: The LLM proposes its own curriculum of tasks, driving open-ended exploration without human intervention.  <br>● GPT-4 integration: Uses GPT-4 for both planning and skill generation, demonstrating the power of modern LLMs as agent cognitive cores.  <br>● Agent research milestone: A landmark agent paper showing LLM-powered agents can exhibit autonomous, cumulative learning in complex environments.                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2305.16291), [Tweet](https://twitter.com/DrJimFan/status/1662115266933972993?s=20)         |
| 4) **Gorilla** - A fine-tuned LLaMA-based model that surpasses GPT-4 on API call generation.  <br>● API-specialized LLM: Specifically trained on massive API documentation corpora to produce correct API calls for TensorFlow Hub, HuggingFace, and PyTorch Hub.  <br>● Beats GPT-4 on APIs: Outperforms GPT-4 on writing correct API calls - a narrow but important capability for tool use.  <br>● Hallucination reduction: Major reduction in hallucinated API names and parameters compared to general-purpose LLMs.  <br>● Tool-use LLM research: Established that specialized LLMs can meaningfully beat generalists at narrow capabilities - informing the later ecosystem of task-specialized models.                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2305.15334), [Tweet](https://twitter.com/omarsar0/status/1661540207206846464?s=20)         |
| 5) **The False Promise of Imitating Proprietary LLMs** - Berkeley's critical analysis of open-source imitation of proprietary LLMs.  <br>● Imitation limits: Shows that fine-tuning small open models on GPT-4 outputs creates a stylistic illusion without meaningfully improving factual capabilities.  <br>● Stylistic mimicry: Imitation models learn to sound like GPT-4 but retain the base model's underlying capability ceiling.  <br>● Base model leverage: Argues the higher-leverage action for open-source is building better base models, not imitating proprietary outputs.  <br>● Field-redirecting: Shifted open-source research focus from distillation toward better pretraining data and scale, preparing the ground for strong foundation models like Llama 2.                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2305.15717) , [Tweet](https://twitter.com/arankomatsuzaki/status/1661908342829187072?s=20) |
| 6) **Sophia** - A simple, scalable second-order optimizer with negligible per-step overhead.  <br>● Second-order optimization: Uses a diagonal Hessian estimate to capture curvature information, going beyond first-order Adam.  <br>● 2x speedup over Adam: On language modeling, achieves 2x speedup in step count, total compute, and wall-clock time.  <br>● Practical efficiency: Despite being second-order, has only marginal per-step overhead versus Adam.  <br>● Optimizer innovation: Part of the late-2023 wave of optimizer research (Lion, Sophia, Shampoo) aiming to replace Adam as the LLM-training default.                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2305.14342) , [Tweet](https://twitter.com/tengyuma/status/1661412995430219786?s=20)        |
| 7) **The Larger They Are, the Harder They Fail** - Reveals inverse-scaling failures in LLM code generation.  <br>● Function-name swap test: Swaps default Python function names and observes that larger LLMs fail harder to adapt - they prefer memorized patterns.  <br>● Inverse scaling: Counter to the usual "bigger is better" narrative, larger models prefer incorrect memorized continuations more strongly than smaller ones.  <br>● Memorization vs. reasoning: Highlights the tension between memorization (which helps on training data) and reasoning (which helps on novel data).  <br>● Safety implications: Important for safety/robustness - bigger models may be more brittle in adversarial or out-of-distribution settings.                                                                                                                                                                                                                   | [Paper](https://arxiv.org/abs/2305.15507), [Tweet](https://twitter.com/AVMiceliBarone/status/1662150656327663617?s=20)   |
| 8) **Model Evaluation for Extreme Risks** - DeepMind's framework for evaluating models for catastrophic-risk capabilities.  <br>● Dangerous-capability evaluation: Argues for evaluations targeting specifically dangerous capabilities (cyberattacks, bioweapons, manipulation) rather than general performance.  <br>● Responsible decisions: Connects evaluation results to decisions about training, deployment, access control, and security investments.  <br>● Red-team integration: Builds on dangerous-capability red-teaming methodology, formalizing it for frontier model governance.  <br>● Governance influence: Directly informed the UK AI Safety Institute's frontier model evaluation framework and similar efforts.                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2305.15324), [Tweet](https://twitter.com/soundboy/status/1661728733156503555?s=20)         |
| 9) **LLM Research Directions** - A list of research directions for students entering LLM research.  <br>● Research roadmap: Provides an organized list of open LLM research problems (factuality, reasoning, alignment, efficiency, evaluation).  <br>● Accessibility focus: Specifically aimed at students and newcomers, identifying problems tractable on limited compute budgets.  <br>● Course-material input: Became a reference for LLM-focused graduate seminars and reading groups.  <br>● Field-guide document: Helped widen the LLM research field by lowering the barrier for newcomers to find productive research directions.                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2305.12544), [Tweet](https://twitter.com/omarsar0/status/1661405738059571201?s=20)         |
| 10) **Reinventing RNNs for the Transformer Era (RWKV)** - Combines parallelizable training of Transformers with efficient RNN inference.  <br>● Hybrid design: Achieves Transformer-style parallelizable training with RNN-style O(1) inference memory - best of both worlds.  <br>● Transformer-parity performance: Matches similarly-sized Transformers on language modeling benchmarks while being dramatically cheaper at inference.  <br>● Open community: Developed as an open-community project with releases spanning multiple scales and substantial community fine-tuning.  <br>● Post-Transformer contender: Alongside Mamba and RetNet, positioned as one of the credible attempts to dethrone attention for efficient long-context inference.                                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2305.13048), [Tweet](https://twitter.com/_akhaliq/status/1660816265454419969?s=20)         |

---

## Top AI Papers of the Week (May 15-21)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                 | **Links**                                                                                                         |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| 1) **Drag Your GAN (DragGAN)** - Interactive point-based image manipulation on the generative image manifold.  <br>● Point-based control: User clicks handle points on an image and drags them to target locations; the GAN smoothly moves image content accordingly.  <br>● Precision editing: Achieves pixel-level control over image content - opening/closing mouths, rotating objects, changing poses - with minimal artifacts.  <br>● User-interactive: Real-time feedback enables intuitive editing workflows that previous generative editing approaches lacked.  <br>● Viral impact: Became one of the most viral AI papers of 2023, inspiring widespread interest and later extensions to diffusion models (DragDiffusion).                                                                                                                                                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2305.10973v1), [Tweet](https://twitter.com/dair_ai/status/1660268470057967616?s=20) |
| 2) **Evidence of Meaning in Language Models Trained on Programs** - Argues LMs learn meaning despite only next-token prediction.  <br>● Programs as controlled input: Uses programs (which have well-defined semantics) to study whether LMs learn meaning versus surface patterns.  <br>● Intermediate-state prediction: Shows that LMs trained on programs learn to predict program state after each statement - evidence of semantic understanding.  <br>● Probe experiments: Careful probing experiments distinguish surface correlations from semantic representations.  <br>● Emergence argument: Adds empirical grounding to the "LLMs have world models" debate that dominated 2023's interpretability discussions.                                                                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2305.11169), [Tweet](https://twitter.com/dair_ai/status/1660268472129945600?s=20)   |
| 3) **Towards Expert-Level Medical Question Answering (Med-PaLM 2)** - Google's second-generation medical LLM.  <br>● MedQA SOTA: Scored up to 86.5% on the MedQA dataset (USMLE-style questions) - a new state-of-the-art matching expert physicians.  <br>● Multi-benchmark leadership: Approaches or exceeds SOTA across MedMCQA, PubMedQA, and MMLU clinical topics datasets.  <br>● Human evaluation quality: Physician evaluators rated Med-PaLM 2 answers as comparable to those of other physicians on most axes.  <br>● Medical AI frontier: Set the bar for medical LLMs and informed FDA's thinking on AI-assisted clinical workflows.                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2305.09617), [Tweet](https://twitter.com/dair_ai/status/1660268473853829121?s=20)   |
| 4) **MEGABYTE** - Multiscale Transformers for predicting million-byte sequences.  <br>● Two-level architecture: Combines a large global Transformer over patches with a smaller local Transformer over bytes within each patch.  <br>● Sub-quadratic attention: Achieves sub-quadratic self-attention cost through the patch-level hierarchy, enabling million-byte sequences.  <br>● Decoding parallelism: Improves decoding parallelism compared to flat Transformers that must decode token-by-token.  <br>● Tokenization-free: Operates directly on bytes without tokenizers - potentially avoiding tokenizer failure modes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2305.07185), [Tweet](https://twitter.com/dair_ai/status/1660268475762327552?s=20)   |
| 5) **StructGPT** - A general framework for LLM reasoning over structured data.  <br>● Structured data interface: Provides specialized interfaces for tables, knowledge graphs, and databases that LLMs can query.  <br>● Iterative reasoning: LLM iteratively invokes interfaces to narrow down relevant information rather than ingesting the full structure.  <br>● Zero-shot improvements: Improves zero-shot reasoning over structured data without task-specific training.  <br>● Structured QA foundation: Part of the early work establishing LLM-over-structured-data as a distinct research area leading to 2024 enterprise SQL agents.                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2305.09645) , [Tweet](https://twitter.com/dair_ai/status/1660268477628727298?s=20)  |
| 6) **TinyStories** - Explores how small LMs can be and still speak coherent English.  <br>● Synthetic story dataset: Creates a dataset of short stories using words understandable to 3-4 year olds, generated by GPT-3.5/GPT-4.  <br>● Tiny but fluent: Shows that very small models (1-10M parameters) trained on this focused data can produce coherent multi-paragraph stories.  <br>● Reasoning emergence: Even tiny models demonstrate reasoning and instruction-following capabilities when trained on the right data.  <br>● Data-quality evidence: A foundational piece in the argument that data quality beats scale for many capabilities, influencing Phi series and later SLM work.                                                                                                                                                                                                                                                                                                                                                                                                        | [Paper](https://arxiv.org/abs/2305.07759) , [Tweet](https://twitter.com/dair_ai/status/1660268479642054660?s=20)  |
| 7) **DoReMi** - Optimizes data mixtures for faster language model pretraining.  <br>● Proxy-model reweighting: Trains a small 280M proxy model with group-DRO to derive optimal domain weights for the actual pretraining mixture.  <br>● Scale transfer: Weights found by 280M proxy transfer to training 8B models (30x larger) without retuning.  <br>● Training speedup: Achieves faster convergence and better downstream performance than uniform or human-tuned mixtures.  <br>● Data mixture research: Kicked off a wave of data-mixture optimization work that became central to 2024 pretraining recipes (Llama 3, DCLM).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [Paper](https://arxiv.org/abs/2305.10429), [Tweet](https://twitter.com/dair_ai/status/1660268481466572802?s=20)   |
| 8) **CodeT5+** - An open code LLM family for code understanding and generation.  <br>● Flexible architecture: Supports encoder-only, decoder-only, and encoder-decoder modes to handle diverse code tasks.  <br>● 20-benchmark evaluation: Tested on 20 code-related benchmarks across zero-shot, fine-tuning, and instruction tuning.  <br>● SOTA on multiple tasks: Achieves SOTA on code completion, math programming, and text-to-code retrieval.  <br>● Training efficiency: Uses multiple training objectives combined to improve efficacy and compute efficiency.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2305.07922), [Tweet](https://twitter.com/dair_ai/status/1660268483152584704?s=20)   |
| 9) **Symbol tuning** - Fine-tunes LMs on in-context input-label pairs with natural-language labels replaced by arbitrary symbols.  <br>● Symbolic abstraction: Replacing semantic labels with random symbols forces the model to rely on the demonstrations rather than label priors.  <br>● ICL improvements: Boosts performance on unseen in-context learning tasks where the model must infer label semantics from examples.  <br>● Algorithmic reasoning: Particularly improves algorithmic reasoning tasks that require following abstract patterns.  <br>● ICL mechanism insight: Provides evidence about how ICL works and how to train models that better generalize the mechanism.                                                                                                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2305.08298)), [Tweet](https://twitter.com/dair_ai/status/1660268485035819009?s=20)  |
| 10) **Incidental Bilingualism in PaLM's Translation Capability** - Explores where PaLM's translation ability actually comes from.  <br>● 30M+ translation pairs: PaLM is exposed to over 30 million translation pairs across at least 44 languages within its training data, incidentally.  <br>● Incidental bilingualism: Argues these "accidental" translation pairs substantially explain PaLM's translation capabilities.  <br>● Scale-of-incidental-data: Highlights how large-scale pretraining can inadvertently cover specialized capabilities via byproducts of web data.  <br>● Pretraining data insight: An influential study on understanding emergent capabilities via careful data auditing.                                                                                                                                                                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2305.10266), [Tweet](https://twitter.com/dair_ai/status/1660268486839476224?s=20)   |

---

## Top AI Papers of the Week (May 8-14)

| **Paper**                                                                                                                                                                                                                                                                                                               | **Links**                                                                                                                                                  |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **LLM Explains Neurons in LLMs** - OpenAI's automated interpretability pipeline using GPT-4 to explain GPT-2 neurons.  <br>● GPT-4 as interpreter: Uses GPT-4 to generate natural-language explanations of what individual GPT-2 neurons detect.  <br>● Automated scoring: Also uses GPT-4 to score how well an explanation predicts the neuron's actual activations on new text.  <br>● Scale of interpretability: Enables scaling interpretability research to all neurons in a model, previously impractical with human effort.  <br>● Automated interpretability era: Sparked the automated interpretability research program that continued in 2024 with SAE-based techniques and Golden Gate Claude demos.                              | [Paper](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html), [Tweet](https://twitter.com/OpenAI/status/1655982364273831936?s=20) |
| 2) **PaLM 2** - Google's second-generation PaLM powering Bard and Google products.  <br>● Compute-optimal training: Trained compute-optimally on a larger, higher-quality, more multilingual corpus than PaLM 1.  <br>● Multilingual strength: Major improvement in 100+ languages; supports translation, generation, and reasoning across a much broader language set.  <br>● Reasoning competitive with GPT-4: Particularly strong on mathematical reasoning, approaching GPT-4 on several benchmarks.  <br>● Flan-PaLM 2: The instruction-tuned version performs well on MMLU, BIG-bench Hard, and code generation - powering Google's consumer AI products.                                                                                                                                                                                                                                                                                                              | [Paper](https://ai.google/static/documents/palm2techreport.pdf), [Tweet](https://twitter.com/Google/status/1656347171556294669?s=20)                       |
| 3) **ImageBind** - Meta's joint embedding across six modalities at once.  <br>● Six-modality embedding: Learns a joint embedding space across images, text, audio, depth, thermal, and IMU data.  <br>● Implicit binding via images: Images are the "central" modality that binds others - without requiring all-pairs training data.  <br>● Zero-shot emergent capabilities: Enables cross-modal retrieval, arithmetic composition of modalities, and cross-modal generation/detection.  <br>● Multi-modal foundation: Influenced 2024's unified multimodal models (Chameleon, GPT-4o) by showing the viability of unified embedding spaces.                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2305.05665), [Tweet](https://twitter.com/MetaAI/status/1655989274620358656?s=20)                                             |
| 4) **TidyBot** - Combines LLM-based planning and perception with few-shot summarization to infer user preferences.  <br>● Preference inference: Uses LLMs to infer generalized user preferences from a few examples of what objects belong where in a home.  <br>● Generalization: Preferences inferred from specific examples generalize to future unseen objects.  <br>● LLMs in embodied AI: Demonstrates LLMs' value for household robotics as high-level preference reasoners.  <br>● Personalized robots: An early example of LLM-powered robot personalization - informing 2024 agent+robotics research.                                                                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2305.05658), [Tweet](https://twitter.com/_akhaliq/status/1656117478760796160?s=20)                                           |
| 5) **Unfaithful Explanations in Chain-of-Thought Prompting** - Demonstrates CoT explanations can misrepresent the true reason for a model's prediction.  <br>● Biased-CoT demonstration: Shows when models are biased toward incorrect answers (e.g., from few-shot bias), they generate CoT justifications supporting those wrong answers.  <br>● Confident-but-wrong: The CoT sounds plausible and confident even when it's post-hoc rationalization rather than actual reasoning.  <br>● Interpretability warning: An important caution that visible reasoning traces shouldn't be uncritically trusted as explanations.  <br>● Safety implications: Part of the growing evidence base that CoT monitoring for safety has limitations.                                                                                                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2305.04388) , [Tweet](https://twitter.com/milesaturpin/status/1656010877269602304?s=20)                                      |
| 6) **InstructBLIP** - Visual-language instruction tuning built on BLIP-2.  <br>● Instruction-aware Q-Former: Extends BLIP-2's Q-Former to be instruction-aware, dynamically extracting relevant visual features per instruction.  <br>● 13 held-out datasets: Achieves state-of-the-art zero-shot performance on 13 held-out vision-language datasets.  <br>● Beats BLIP-2 and Flamingo: Outperforms both BLIP-2 and Flamingo on most zero-shot benchmarks despite being a direct BLIP-2 extension.  <br>● Open VLM progress: A prominent open-source VLM in 2023 that informed the later LLaVA-1.5, Qwen-VL, and InternVL lineage.                                                                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2305.06500) , [Tweet](https://twitter.com/LiJunnan0409/status/1656821806593101827?s=20)                                      |
| 7) **Active Retrieval Augmented LLMs (FLARE)** - Actively decides when and what to retrieve during generation.  <br>● Dynamic retrieval: Retrieves only when the model's next-token confidence drops - not at fixed intervals.  <br>● Anticipated content retrieval: Retrieves based on what the model is about to generate, not just the current context.  <br>● Long-form knowledge-intensive tasks: Demonstrates superior or competitive performance on long-form knowledge-intensive generation tasks.  <br>● Adaptive RAG: Established a research direction on adaptive/active retrieval that matured in 2024 with tools like Self-RAG and RankRAG.                                                                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2305.06983), [Tweet](https://twitter.com/omarsar0/status/1657004417726423042?s=20)                                           |
| 8) **FrugalGPT** - Strategies to reduce LLM inference cost while improving performance.  <br>● Three-layer strategy: Combines prompt adaptation, LLM approximation, and LLM cascading to save cost.  <br>● Model cascade: Routes easy queries to cheap models and escalates to expensive models only when needed.  <br>● Cost reduction: Shows 98% cost savings while sometimes improving accuracy over using the most expensive model always.  <br>● Production patterns: Influenced production LLM routing patterns and the 2024 ecosystem of LLM routers (RouteLLM, Martian).                                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2305.05176), [Tweet](https://twitter.com/omarsar0/status/1656105704808419329?s=20)                                           |
| 9) **StarCoder** - An open-access 15.5B code LLM with 8K context and 80+ programming languages.  <br>● Fully-open release: Released under OpenRAIL with training data (The Stack), training code, and model weights all public.  <br>● 80+ programming languages: Broadly multilingual in code, including non-English natural language in comments and strings.  <br>● 8K context: Long context enables reasoning over larger code files than prior open code LLMs.  <br>● Community base: Became the base for many community code models and powered LMStudio-style local coding assistants.                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2305.06161), [Tweet](https://twitter.com/_akhaliq/status/1656479380296613894?s=20)                                           |
| 10) **MultiModal-GPT** - A vision-language model for multi-round dialogue fine-tuned from OpenFlamingo.  <br>● LoRA-based extension: Adds LoRA to OpenFlamingo's cross-attention and self-attention for efficient fine-tuning.  <br>● Multi-round dialog: Specifically designed for multi-turn visual dialog, going beyond single-turn VQA.  <br>● Open visual chatbot: An early fully-open visual chatbot that users could run locally.  <br>● VLM dialog research: Informed the trajectory toward modern visual chatbots (LLaVA, Qwen-VL) that dominated open VLM research.                                                                                                                                                                                                                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2305.04790), [Tweet](https://twitter.com/OpenMMLab/status/1656127026687000578?s=20)                                          |

---

## Top AI Papers of the Week (May 1-7)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                       | **Links**                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| 1) **scGPT** - A foundation model for single-cell multi-omics pretrained on 10 million cells.  <br>● Single-cell foundation: Applies LLM-style pretraining to single-cell transcriptomics data, tokenizing cells and genes.  <br>● Massive scale: Pretrained on 10 million cells - the largest foundation model for single-cell biology at the time.  <br>● Multi-task transfer: Transfers to cell-type annotation, gene perturbation prediction, multi-batch integration, and gene network inference.  <br>● Bio-AI foundation: Part of the broader push toward domain-specific foundation models in biology, alongside ESMFold (proteins) and DNA foundation models.                                                                                                                                                                                                                                                                                                                                                                                                                                | [Paper](https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1), [Tweet](https://twitter.com/dair_ai/status/1655223088152211456?s=20) |
| 2) **GPTutor** - A ChatGPT-powered VSCode extension for code explanation.  <br>● IDE integration: Delivered as a VSCode extension, making AI-assisted code explanation frictionless for developers.  <br>● Prompt engineering for code: Uses code-relevant prompt engineering to produce more concise and accurate explanations than vanilla ChatGPT or Copilot.  <br>● Context-aware prompts: Automatically includes relevant surrounding code in its prompts for better local explanations.  <br>● Education use case: Particularly useful for junior developers learning unfamiliar codebases - an early AI-education product.                                                                                                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2305.01863), [Tweet](https://twitter.com/dair_ai/status/1655223089754517509?s=20)                            |
| 3) **Shap-E** - OpenAI's conditional generative model for 3D assets producing implicit functions.  <br>● Implicit function output: Generates implicit functions (NeRFs and signed distance functions) rather than fixed meshes - enabling both textured meshes and neural radiance field rendering.  <br>● Text and image conditioning: Supports both text-to-3D and image-to-3D generation in a unified framework.  <br>● Fast generation: Generates 3D assets in seconds rather than the minutes/hours required by optimization-based methods.  <br>● 3D generative AI: A key step in the rapid evolution of 3D generation that would continue through 2024 with Splatter Image, TripoSR, and others.                                                                                                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2305.02463), [Tweet](https://twitter.com/dair_ai/status/1655223091482566663?s=20)                            |
| 4) **Are Emergent Abilities of LLMs a Mirage?** - Stanford's critical re-examination of emergent abilities.  <br>● Metric-choice argument: Argues "emergence" is often an artifact of using discontinuous metrics (like exact match) rather than smooth ones (like log-probability).  <br>● Metric substitution: When re-analyzing with continuous metrics, many "emergent" capabilities appear smoothly with scale.  <br>● Research methodology: Cautions the field against interpreting metric-choice artifacts as fundamental phase transitions.  <br>● Best Paper at NeurIPS 2023: Influential paper that sparked extensive debate about what "emergence" really means in LLMs.                                                                                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2304.15004), [Tweet](https://twitter.com/dair_ai/status/1655223092975640578?s=20)                            |
| 5) **Interpretable ML for Science with PySR** - An open-source library for practical symbolic regression in the sciences.  <br>● Distributed back-end: Built on a high-performance distributed back-end for scaling to larger scientific datasets.  <br>● DL integration: Interfaces with several deep learning packages so symbolic regression can be used alongside neural networks.  <br>● EmpiricalBench benchmark: Releases a new benchmark for quantifying the applicability of symbolic regression algorithms in science.  <br>● Science-AI tool: Became a widely-used tool for scientists seeking interpretable equations from data, complementing black-box DL.                                                                                                                                                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2305.01582) , [Tweet](https://twitter.com/dair_ai/status/1655223094640889856?s=20)                           |
| 6) **PMC-LLaMA** - A LLaMA model fine-tuned on 4.8 million medical papers.  <br>● Domain-specific continued pretraining: Extends LLaMA's medical knowledge through continued pretraining on PubMed Central papers.  <br>● Biomedical QA: Achieves high performance on biomedical QA benchmarks, narrowing the gap with proprietary medical LLMs.  <br>● Open medical LLM: As a fully open model, accessible to academic medical researchers without proprietary model constraints.  <br>● Medical LLM ecosystem: Part of the 2023 medical LLM boom that established the template of general LLM + medical continued pretraining + medical SFT.                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2304.14454) , [Tweet](https://twitter.com/dair_ai/status/1655223096301740032?s=20)                           |
| 7) **Distilling Step-by-Step!** - A mechanism to train smaller models that outperform larger LLMs using fewer examples.  <br>● Rationale extraction: Extracts CoT rationales from a larger teacher LLM, using them to augment smaller student model training.  <br>● Smaller beats larger: Distilled student models outperform LLMs 500x+ larger in size on benchmark reasoning tasks.  <br>● Data efficiency: Requires dramatically less labeled training data than standard fine-tuning by leveraging LLM rationales as free supervision.  <br>● Distillation paradigm: Influential for the 2024 proliferation of reasoning-distilled small models like Orca 2, Phi-3, and later reasoning-specific SLMs.                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2305.02301), [Tweet](https://twitter.com/dair_ai/status/1655223098730217472?s=20)                            |
| 8) **Poisoning Language Models During Instruction Tuning** - Shows adversaries can poison LLMs via instruction tuning data.  <br>● Poisoning attack: Demonstrates adversaries can contribute poisoned examples to instruction tuning datasets to induce specific misbehaviors.  <br>● Cross-task poisoning: Poisoning can induce degenerate outputs across held-out tasks, not just the poisoned task - broad attack surface.  <br>● Supply-chain vulnerability: Highlights the supply-chain vulnerability of using community-sourced instruction data.  <br>● Alignment safety: Important for the field's thinking on data provenance and vetting for alignment datasets.                                                                                                                                                                                                                                                                                                                                                                                                                  | [Paper](https://arxiv.org/abs/2305.00944), [Tweet](https://twitter.com/dair_ai/status/1655223100286332934?s=20)                            |
| 9) **Unlimiformer** - Long-range Transformers with unlimited length input via external datastores.  <br>● External datastore: Augments pre-trained encoder-decoder Transformers with a kNN datastore to support arbitrary-length input.  <br>● Training-free: No additional training required - works with existing pretrained Transformers.  <br>● Long-document tasks: Demonstrates usefulness in long-document summarization where context spans many thousands of tokens.  <br>● RAG-enhancer: Could improve the performance of retrieval-enhanced LLMs by providing unlimited lookback over long conversations or documents.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2305.01625), [Tweet](https://twitter.com/dair_ai/status/1655223101913718784?s=20)                            |
| 10) **Learning to Reason and Memorize with Self-Notes** - LLMs that deviate from input to explicitly "think" and memorize.  <br>● Self-note generation: The model can pause processing input and generate explicit reasoning or memory notes in-stream.  <br>● On-the-fly recall: Enables the LM to recall past information and perform reasoning when needed, not just in dedicated thinking phases.  <br>● Length generalization: Scales better to longer sequences unseen during training than plain reasoning approaches.  <br>● Scratchpad precursor: An intellectual precursor to 2024's reasoning models like o1 that produce long internal thinking traces.                                                                                                                                                                                                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2305.00833), [Tweet](https://twitter.com/dair_ai/status/1655223103662829569?s=20)                            |

---

## Top AI Papers of the Week (April 24 - April 30)

| **Paper**                                                                                                                                                                                                                                                                                                           | **Links**                                                                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Learning Agile Soccer Skills for a Bipedal Robot with Deep RL** - DeepMind's bipedal humanoid robot playing soccer.  <br>● End-to-end DRL: Synthesizes agile soccer skills (fast recovery, walking, kicking, tackling) for a miniature humanoid robot purely through deep RL.  <br>● Dynamic movements: Produces genuinely athletic movements including falling and recovering - a major advance in bipedal robotics.  <br>● Sim-to-real transfer: Successfully transfers policies from simulation to real hardware with robust performance.  <br>● Humanoid robotics milestone: A visible capability demonstration that informed the 2024 boom in humanoid robot startups (Figure, 1X, Apptronik, Tesla).                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2304.13653), [Tweet](https://twitter.com/dair_ai/status/1652693172810571780?s=20)                                          |
| 2) **Scaling Transformer to 1M tokens with RMT** - Recurrent Memory Transformer extends BERT's effective context to 2M tokens.  <br>● Recurrent memory mechanism: Augments BERT with a recurrent memory that carries information across segments, enabling massive context lengths.  <br>● 2M token context: Scales effective context to two million tokens while maintaining high memory retrieval accuracy.  <br>● Segment-level recurrence: Processes input in segments while passing a compressed memory token stream across them.  <br>● Long-context trend: Part of the 2023 explosion of long-context techniques that established ultra-long context as a viable research direction.                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2304.11062), [Tweet](https://twitter.com/dair_ai/status/1652693174576349185?s=20)                                          |
| 3) **Track Anything** - An interactive tool for video object tracking and segmentation built on Segment Anything.  <br>● SAM + tracking: Extends SAM's powerful single-image segmentation to video via click-based tracking over time.  <br>● Flexible interaction: Users click on objects in any frame to start tracking, with propagation handling the rest automatically.  <br>● Zero-shot video segmentation: Works zero-shot without per-video training - a major usability win.  <br>● Video-editing tool: Quickly adopted for video editing, content creation, and autonomous system dataset labeling.                                                                                                                                                                                                                                                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2304.11968), [Tweet](https://twitter.com/dair_ai/status/1652693176644165634?s=20)                                          |
| 4) **A Cookbook of Self-Supervised Learning** - A comprehensive overview of SSL techniques and practical considerations.  <br>● Comprehensive coverage: Covers contrastive methods (SimCLR, MoCo), non-contrastive methods (BYOL, SimSiam), masked modeling (MAE, BEiT), and more.  <br>● Practical guidance: Provides concrete advice on hyperparameters, augmentations, and debugging - not just theoretical overview.  <br>● Failure modes: Documents known SSL failure modes (collapse, shortcut learning) and how to detect/mitigate them.  <br>● Educational resource: Widely used as a reference by graduate students and newly-SSL-curious researchers.                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2304.12210), [Tweet](https://twitter.com/dair_ai/status/1652693178724626435?s=20)                                          |
| 5) **Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond** - A practical guide for practitioners working with LLMs.  <br>● Practitioner-focused: Organizes LLM knowledge for engineering and product teams deploying LLMs rather than just academic researchers.  <br>● Use-case catalog: Walks through many concrete use cases with practical applications and limitations.  <br>● Deployment considerations: Covers real-world concerns (cost, latency, hallucination) in a structured way.  <br>● Applied LLM reference: Became a common reference in applied AI discussions during the 2023-2024 enterprise LLM rollout.                                                                                                                                                                                                                                                                                                                                                            | [Paper](https://arxiv.org/abs/2304.13712) , [Tweet](https://twitter.com/dair_ai/status/1652693180381274114?s=20)                                         |
| 6) **AudioGPT** - Connects ChatGPT with audio foundational models for speech, music, sound, and talking head tasks.  <br>● LLM as audio orchestrator: ChatGPT plans and dispatches audio tasks across specialist models (TTS, ASR, music generation, sound effects).  <br>● Modality transformation: Converts speech to text for ChatGPT processing, then generates speech from ChatGPT's text output.  <br>● Spoken dialogue: Enables end-to-end spoken dialogue where users talk to ChatGPT and it talks back.  <br>● Multi-modal agent pattern: An early example of the LLM-as-orchestrator pattern applied to audio, presaging 2024's fully multimodal voice agents.                                                                                                                                                                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2304.12995) , [Tweet](https://twitter.com/dair_ai/status/1652693181895409666?s=20)                                         |
| 7) **DataComp** - A multimodal dataset benchmark with 12.8B image-text pairs.  <br>● Scale and scope: 12.8 billion image-text pairs - one of the largest multimodal datasets ever released.  <br>● Benchmark framework: Provides a benchmark where researchers compete to find the best data subset, not just train the best model on fixed data.  <br>● Data-centric AI: Emphasizes data curation as the primary research axis, with model architecture and training held constant.  <br>● Data research infrastructure: Enabled a wave of data-filtering research (DataComp-XL, fastText filtering) that significantly advanced multimodal model training.                                                                                                                                                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2304.14108), [Tweet](https://twitter.com/dair_ai/status/1652693183493447681?s=20)                                          |
| 8) **ChatGPT for Information Extraction** - A deeper assessment of ChatGPT on information extraction tasks.  <br>● Extraction-task benchmark: Evaluates ChatGPT on named entity recognition, relation extraction, event extraction, and more.  <br>● Competitive but imperfect: Competitive with specialized IE models on many tasks but still falls short of fine-tuned SOTA on others.  <br>● Prompt sensitivity: Highlights significant prompt sensitivity in extraction outputs - practical challenges for deployment.  <br>● Practical assessment: A sober empirical reference informing whether to swap traditional IE pipelines for LLM-based alternatives.                                                                                                                                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2304.11633), [Tweet](https://twitter.com/dair_ai/status/1652693184927989768?s=20)                                          |
| 9) **Comparing Physician vs ChatGPT (JAMA)** - A JAMA Internal Medicine study comparing physician and ChatGPT responses.  <br>● Rigorous study: Published in JAMA Internal Medicine - a high-bar medical journal, not just an arxiv preprint.  <br>● ChatGPT preferred: Chatbot responses were preferred over physician responses and rated significantly higher in both quality and empathy.  <br>● 79% preference: ChatGPT's responses were preferred in 79% of cases, often described as more empathetic.  <br>● Medical AI discussion catalyst: Sparked widespread discussion about the role of AI in clinical communication and patient care.                                                                                                                                                                                                                                                                                                                                                                 | [Paper](https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2804309), [Tweet](https://twitter.com/dair_ai/status/1652693186467299331?s=20) |
| 10) **Stable and Low-Precision Training for Large-Scale Vision-Language Models** - Methods for accelerating and stabilizing large VLM training.  <br>● Mixed-precision techniques: Introduces stable training strategies for bfloat16/float16 mixed precision of large VLMs.  <br>● Training speedup: Significantly accelerates VLM training while avoiding common instabilities (loss spikes, NaN).  <br>● Scale-friendly: Scales to the largest open-source VLMs, enabling more research at serious scale.  <br>● Infrastructure contribution: Practical infrastructure advances that benefited the entire VLM research community.                                                                                                                                                                                                                                                                                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2304.13013), [Tweet](https://twitter.com/dair_ai/status/1652693187960479745?s=20)                                          |

---

## Top AI Papers of the Week (April 17 - April 23)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | **Links**                                                                                                        |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| 1) **DINOv2** - Meta's self-supervised vision foundation model producing robust features without labels.  <br>● Fully self-supervised: Trained purely with SSL on 142M curated images - no labels needed, just clever pretraining objectives.  <br>● Universal features: Produces features useful for image classification, instance retrieval, video understanding, depth estimation, and pixel-level tasks.  <br>● Frozen-backbone usage: Features work well with simple linear probes, no fine-tuning - making DINOv2 a drop-in visual backbone.  <br>● Vision foundation standard: Became the default vision backbone for open-source VLMs (LLaVA, InternVL) and vision research through 2024.                                                                                          | [Paper](https://arxiv.org/abs/2304.07193), [Tweet](https://twitter.com/dair_ai/status/1650145892941324288?s=20)  |
| 2) **Learning to Compress Prompts with Gist Tokens** - Trains LMs to compress prompts into reusable "gist" tokens.  <br>● Prompt compression: Compresses long prompts into a small set of gist tokens that encode the same instruction information.  <br>● 26x compression: Achieves 26x prompt compression with negligible quality loss on downstream tasks.  <br>● Up to 40% FLOPs reduction: Substantial inference-time compute savings on repeated prompts.  <br>● Production optimization: Particularly valuable for systems with long system prompts reused across many requests - a pattern that became ubiquitous in 2024 agent systems.                                                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2304.08467), [Tweet](https://twitter.com/dair_ai/status/1650145895332163585?s=20)  |
| 3) **Scaling Biomolecular Simulations with Equivariant Models** - A framework for large-scale biomolecular simulation using equivariant deep learning.  <br>● Equivariant network scaling: Achieves high accuracy through equivariant deep learning that respects molecular symmetries.  <br>● 44M atom HIV capsid: Simulated a complete, all-atom, explicitly solvated HIV capsid structure of 44 million atoms.  <br>● Nanosecond-scale stable dynamics: Performs nanoseconds-long stable simulations of protein dynamics - much longer than prior ML-MD simulations.  <br>● Perlmutter deployment: Scales to the Perlmutter supercomputer, demonstrating ML-accelerated molecular dynamics at HPC scale. | [Paper](https://arxiv.org/abs/2304.10061), [Tweet](https://twitter.com/dair_ai/status/1650145897689350144?s=20)  |
| 4) **Evaluating Verifiability in Generative Search Engines** - Audits popular generative search engines for citation accuracy.  <br>● Human evaluation: Performs rigorous human evaluation of Bing Chat, Perplexity AI, and NeevaAI responses.  <br>● Citation failure rate: Finds only 52% of generated sentences are supported by citations and only 75% of citations actually support the claim.  <br>● Verifiability gap: Reveals a significant gap between generative search engines' citation promises and their actual reliability.  <br>● Trust-in-AI research: Important empirical foundation for subsequent research on grounded generation and RAG accuracy.                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2304.09848), [Tweet](https://twitter.com/dair_ai/status/1650145900180779009?s=20)  |
| 5) **Generative Disco: Text-to-Video Generation for Music Visualization** - An LLM + T2I system for music visualization.  <br>● LLM+T2I composition: Uses LLMs to interpret music and generate scene descriptions that text-to-image models then visualize.  <br>● Music-video generation: Produces music-driven video visualizations - an early text-to-video adjacent capability.  <br>● Creative tool direction: Part of the 2023 wave of creative AI tools targeting content creators and music producers.  <br>● HCI contribution: Notable for its focus on user experience and creative workflow rather than pure model capability.                                                                                                                                                                                                                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2304.08551) , [Tweet](https://twitter.com/dair_ai/status/1650145904219832324?s=20) |
| 6) **Architectures of Topological Deep Learning: A Survey on Topological Neural Networks** - A comprehensive survey on topological neural networks.  <br>● Topological DL taxonomy: Surveys neural networks operating on topological structures beyond graphs (simplicial complexes, cell complexes, hypergraphs).  <br>● Architecture catalog: Catalogs major topological DL architectures with their mathematical foundations.  <br>● Beyond-graph DL: Positions topological DL as the natural generalization of GNNs for higher-order interactions.  <br>● Reference survey: Standard reference for researchers entering the topological DL subfield. | [Paper](https://arxiv.org/abs/2304.10031) , [Tweet](https://twitter.com/dair_ai/status/1650145906560311298?s=20) |
| 7) **Visual Instruction Tuning (LLaVA)** - Uses language-only GPT-4 to generate multimodal instruction-following data.  <br>● GPT-4-generated multimodal data: Bootstraps multimodal instruction data using only language-only GPT-4 given captions and bounding boxes - no direct visual access needed.  <br>● End-to-end training: Introduces LLaVA, an end-to-end trained large multimodal model combining CLIP vision encoder and Vicuna LLM.  <br>● Lightweight architecture: Simple projection layer between vision encoder and LLM - cheap and effective.  <br>● Open VLM revolution: LLaVA became the most influential open-source VLM architecture, spawning LLaVA-1.5, LLaVA-NeXT, and countless derivatives through 2024.                                                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2304.08485), [Tweet](https://twitter.com/dair_ai/status/1650145909387214848?s=20)  |
| 8) **ChatGPT: Applications, Opportunities, and Threats** - A comprehensive overview of ChatGPT's applications and risks.  <br>● Application mapping: Surveys ChatGPT applications across education, healthcare, law, research, and creative industries.  <br>● Opportunities & threats: Explicitly balances productive applications with threats like misinformation, academic integrity, and job displacement.  <br>● Policy-relevant: Widely cited in policy discussions about AI governance and educational institution responses.  <br>● Field-orienting: Helped the broader research community orient to ChatGPT's implications during its initial rapid adoption. | [Paper](https://arxiv.org/abs/2304.09103), [Tweet](https://twitter.com/dair_ai/status/1650145911836745736?s=20)  |
| 9) **Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models** - A framework inferring tool sequences for compositional reasoning.  <br>● Tool composition: LLM plans sequences of tools (Python, search, calculator, knowledge retrievers) to solve complex problems.  <br>● SOTA on ScienceQA: Achieves 87% accuracy on ScienceQA and 99% on TabMWP - surpassing prior specialized models.  <br>● Plug-and-play design: Tools can be added/removed flexibly without retraining the LLM.  <br>● Agent framework precursor: Influential in the agent/tool-use research direction leading to 2024 agent frameworks.                                                                                                                                                                                                              | [Paper](https://arxiv.org/abs/2304.09842), [Tweet](https://twitter.com/dair_ai/status/1650145914420330496?s=20)  |
| 10) **Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models** - High-resolution video synthesis with latent diffusion.  <br>● Latent video diffusion: Extends Stable Diffusion-style latent diffusion to video generation with temporal attention layers.  <br>● 512x1024 driving videos: Validates on real driving videos at 512x1024 resolution, achieving state-of-the-art performance.  <br>● Creative content: Also validated on creative content creation tasks, demonstrating versatility beyond driving scenarios.  <br>● Video generation foundation: A key paper in the latent-video-diffusion lineage that led to Stable Video Diffusion, SVD, and later open video models.                                                                                                                                                                                                                                         | [Paper](https://arxiv.org/abs/2304.08818), [Tweet](https://twitter.com/dair_ai/status/1650145916794314752?s=20)  |

---

## Top AI Papers of the Week (April 10 - April 16)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                               | **Links**                                                                                                        |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- |
| 1) **Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields** - Combines mip-NeRF 360 with grid-based models for 22x faster training.  <br>● Anti-aliasing for grids: Brings mip-NeRF's anti-aliasing technique to fast grid-based NeRF architectures, combining quality and speed.  <br>● 22x training speedup: Trains 22x faster than mip-NeRF 360 while achieving comparable or better quality.  <br>● Best of both worlds: Overcomes the historical tradeoff between slow-but-accurate MLP NeRFs and fast-but-aliased grid NeRFs.  <br>● 3D reconstruction: A practical improvement that made high-quality NeRFs much more accessible to production use cases.                                                                                                                                                                                                                 | [Paper](https://arxiv.org/abs/2304.06706), [Tweet](https://twitter.com/dair_ai/status/1647613826425147401?s=20)  |
| 2) **Generative Agents: Interactive Simulacra of Human Behavior** - Stanford/Google's landmark paper on LLM-powered social simulations.  <br>● "Smallville" simulation: Creates a town of 25 LLM-powered agents who plan their days, remember experiences, form relationships, and even organize parties.  <br>● Memory-reflection-planning: Combines a complete memory stream, synthesized reflections, and dynamic planning to create emergent social behavior.  <br>● Emergent social dynamics: Agents exhibit emergent phenomena like information diffusion, relationship formation, and coordinated planning.  <br>● Agent research foundation: One of the most influential 2023 agent papers, sparking the explosion of LLM agent simulation work including AutoGPT, BabyAGI, and CAMEL. | [Paper](https://arxiv.org/abs/2304.03442), [Tweet](https://twitter.com/dair_ai/status/1647613828417351682?s=20)  |
| 3) **Emergent Autonomous Scientific Research Capabilities of LLMs** - An agent combining LLMs for autonomous scientific experiments.  <br>● Autonomous experiment design: LLM agent designs, plans, and executes chemistry experiments with minimal human guidance.  <br>● Real chemistry execution: Successfully performs catalyzed cross-coupling reactions - actual chemistry, not simulated.  <br>● Emergent research behavior: Demonstrates emergent research capabilities like hypothesis generation, experimental iteration, and failure recovery.  <br>● AI-scientist precursor: An influential paper establishing LLM-driven scientific agents as a research direction that would evolve through 2024's AI Scientist and BioDiscoveryAgent.                                                      | [Paper](https://arxiv.org/abs/2304.05332), [Tweet](https://twitter.com/dair_ai/status/1647613830233571328?s=20)  |
| 4) **Automatic Gradient Descent: Deep Learning without Hyperparameters** - A hyperparameter-free first-order optimizer that leverages architecture.  <br>● Architecture-aware optimization: Derives optimization algorithms that explicitly account for neural network architecture rather than treating it as a black box.  <br>● No hyperparameters: Eliminates learning rate tuning - a hyperparameter-free optimizer that just works.  <br>● ImageNet scale: Successfully trains CNNs at ImageNet scale, demonstrating the approach scales to realistic workloads.  <br>● Optimizer research: Contributes to the ongoing search for optimizers that reduce tuning burden, complementing Adam-era hyperparameter-heavy methods.                                                                                                                                | [Paper](https://arxiv.org/abs/2304.05187), [Tweet](https://twitter.com/dair_ai/status/1647613832804589569?s=20)  |
| 5) **ChemCrow: Augmenting LLMs with Chemistry Tools** - An LLM chemistry agent with 13 expert-designed tools.  <br>● 13 chemistry tools: Integrates 13 expert-designed tools covering synthesis planning, molecule validation, safety checks, and more.  <br>● Cross-domain chemistry: Handles synthesis, drug discovery, and materials design within a unified agent framework.  <br>● Beats vanilla GPT-4: Substantially outperforms vanilla GPT-4 on chemistry tasks by grounding in specialized tools.  <br>● Scientific-agent direction: Alongside BoilerBot and similar systems, established the template for domain-specific scientific agents using LLMs + tools.                                                           | [Paper](https://arxiv.org/abs/2304.05376) , [Tweet](https://twitter.com/dair_ai/status/1647613834813644800?s=20) |
| 6) **One Small Step for Generative AI, One Giant Leap for AGI** - A complete survey on ChatGPT and GPT-4.  <br>● Complete AIGC survey: Comprehensive survey of the ChatGPT/GPT-4 era covering models, applications, and future directions.  <br>● AGI-oriented framing: Analyzes ChatGPT/GPT-4 as stepping stones toward AGI rather than endpoints themselves.  <br>● Technology + society: Balances technical analysis with discussion of societal, economic, and ethical implications.  <br>● Reference timeline: A widely-cited reference for summarizing the 2022-2023 generative AI inflection point.                                                                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2304.06488) , [Tweet](https://twitter.com/dair_ai/status/1647613836617195525?s=20) |
| 7) **OpenAGI: When LLM Meets Domain Experts** - An open-source research platform for LLM agents manipulating domain expert models.  <br>● LLM-as-orchestrator platform: LLMs plan and orchestrate calls to specialized domain expert models (vision, speech, language).  <br>● Multi-step task evaluation: Provides a standardized evaluation framework for complex multi-step tasks requiring tool composition.  <br>● Open research tooling: Fully open-source platform for academic researchers to compare agent designs and tool-use strategies.  <br>● Agent research infrastructure: Part of the 2023 wave establishing shared infrastructure for LLM agent research.                                                                                                                                                           | [Paper](https://arxiv.org/abs/2304.04370), [Tweet](https://twitter.com/dair_ai/status/1647613838567546886?s=20)  |
| 8) **AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models** - A benchmark using real human standardized exams.  <br>● Real human exams: Uses actual college entrance exams, law school admission tests, math competitions, and civil service exams - not synthetic benchmarks.  <br>● Multilingual coverage: Includes English and Chinese versions of exams, testing bilingual capability.  <br>● Human-comparable scoring: Makes it natural to compare foundation models to human performance percentiles on identical exams.  <br>● Real-world evaluation: Became an important benchmark for claims about "expert-level" or "human-comparable" foundation model performance.                                                                                                       | [Paper](https://arxiv.org/abs/2304.06364), [Tweet](https://twitter.com/dair_ai/status/1647613840400498700?s=20)  |
| 9) **Teaching Large Language Models to Self-Debug** - Teaches LLMs to debug their own code via few-shot demonstrations.  <br>● Self-debugging via explanation: LLMs identify mistakes by explaining their generated code in natural language, then iteratively fix errors.  <br>● Few-shot teaching: Requires only a handful of debugging demonstrations to enable the capability across tasks.  <br>● Text-to-SQL SOTA: Achieves state-of-the-art on several code generation tasks including text-to-SQL generation.  <br>● Self-correction research: Influential paper establishing self-debugging as a distinct capability, informing 2024 reasoning + self-correction agents.                                                   | [Paper](https://arxiv.org/abs/2304.05128), [Tweet](https://twitter.com/dair_ai/status/1647613842300497924?s=20)  |
| 10) **Segment Everything Everywhere All at Once (SEEM)** - A promptable, interactive segmentation model.  <br>● Unified promptable model: Handles various segmentation tasks (semantic, instance, referring, interactive) in one promptable model.  <br>● Multi-modal prompts: Accepts text, click, box, scribble, and mask prompts - broader than SAM's prompt vocabulary.  <br>● Open-vocabulary: Competitive on open-vocabulary and interactive segmentation benchmarks.  <br>● SAM-complement: A more flexible alternative to SAM with richer prompting - both pushed interactive segmentation to production.                                                                                                                                                                      | [Paper](https://arxiv.org/abs/2304.06718), [Tweet](https://twitter.com/dair_ai/status/1647613844087361537?s=20)  |

## Top AI Papers of the Week (April 3 - April 9)

| **Paper**                                                                                                                                                                                                                                                                                                                                                              | **Links**                                                                                                         |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| 1) **Segment Anything (SAM)** - Meta's foundational model for image segmentation with massive training data release.  <br>● Largest segmentation dataset: Releases SA-1B with over 1 billion masks on 11 million licensed images - by far the largest segmentation dataset ever.  <br>● Promptable segmentation: Introduces a new promptable segmentation task where users provide clicks, boxes, or text to indicate what to segment.  <br>● Zero-shot SOTA: Zero-shot performance is competitive with or superior to fully supervised specialist models.  <br>● Vision foundation model: One of the highest-impact vision papers of 2023, transforming how the field thinks about foundation models for dense prediction.                                                         | [Paper](https://arxiv.org/abs/2304.02643v1), [Tweet](https://twitter.com/dair_ai/status/1645089444280561666?s=20) |
| 2) **Instruction Tuning with GPT-4** - Uses GPT-4 to generate instruction-following data for LLM fine-tuning.  <br>● GPT-4 as data generator: First systematic attempt to use GPT-4 (rather than human annotators) to produce instruction-following data.  <br>● 52K bilingual examples: Releases 52K unique English and Chinese instruction-following examples.  <br>● LLaMA fine-tuning: Uses the dataset to instruction-tune LLaMA models, leading to superior zero-shot performance on new tasks.  <br>● Synthetic data wave: Part of the 2023 wave establishing synthetic data from strong models as the dominant alignment data source. | [Paper](https://arxiv.org/abs/2304.03277), [Tweet](https://twitter.com/dair_ai/status/1645089446524534788?s=20)   |
| 3) **Eight Things to Know about Large Language Models** - Sam Bowman's influential primer on key LLM considerations.  <br>● Eight key insights: Organizes LLM knowledge into eight punchy observations covering capabilities, limitations, and emergent behaviors.  <br>● Policy-relevant framing: Written in accessible language suitable for researchers, policymakers, and the broader public.  <br>● Capability-risk balance: Each "thing to know" comes with practical implications for deployment and safety.  <br>● Community reference: Became one of the most widely-shared overviews of LLMs in 2023, frequently cited in onboarding materials and policy discussions.                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2304.00612v1), [Tweet](https://twitter.com/dair_ai/status/1645089448428699650?s=20) |
| 4) **A Survey of Large Language Models** - A 50-page comprehensive survey on LLMs.  <br>● Broad coverage: 50+ pages covering LLM architecture, pretraining, fine-tuning, alignment, evaluation, and applications.  <br>● Chronological evolution: Traces the lineage from early transformers through GPT, PaLM, LLaMA, and beyond.  <br>● Frequently updated: Authors have updated the survey multiple times to keep pace with rapidly evolving field.  <br>● Go-to reference: Became one of the most widely cited LLM surveys, frequently used in graduate courses and research onboarding.                                                                                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2303.18223), [Tweet](https://twitter.com/dair_ai/status/1645089450395852802?s=20)   |
| 5) **Baize: An Open-Source Chat Model with Self-Chat Data** - An open chat model fine-tuned with LoRA on self-chat dialogs.  <br>● Self-chat data generation: Generates 100K dialogs by having ChatGPT converse with itself, then fine-tunes on these dialogs.  <br>● LoRA fine-tuning: Uses parameter-efficient LoRA fine-tuning for compute efficiency.  <br>● Multiple model sizes: Releases 7B, 13B, and 30B parameter models along with the dialog data.  <br>● Open chatbot ecosystem: Part of the 2023 proliferation of open chat models (Vicuna, Alpaca, Koala, Baize) building on LLaMA.                                                                                  | [Paper](https://arxiv.org/abs/2304.01196) , [Tweet](https://twitter.com/dair_ai/status/1645089452081938433?s=20)  |
| 6) **MACHIAVELLI Benchmark** - A benchmark of 134 text-based Choose-Your-Own-Adventure games for measuring ethical trade-offs.  <br>● 134 interactive games: Uses 134 text adventures with ~500K scenarios to evaluate agent behavior in rich social/ethical contexts.  <br>● Reward vs. ethics trade-off: Specifically measures how agents trade off goal-achievement (rewards) against ethical behavior (harm, deception, power-seeking).  <br>● Dark side measurement: Surfaces unethical behaviors like deception, manipulation, and power-seeking that may emerge when agents optimize for rewards.  <br>● Agent safety research: A foundational benchmark for the emerging "agent safety" sub-field in 2023-2024.                                                                                                      | [Paper](https://arxiv.org/abs/2304.03279) , [Tweet](https://twitter.com/dair_ai/status/1645089453780639744?s=20)  |
| 7) **Better Language Models of Code through Self-Improvement** - Self-improving code LLMs via pseudo-data generation.  <br>● Self-improvement loop: Generates pseudo training data from the model's own knowledge gained through pretraining and fine-tuning.  <br>● Iterative bootstrapping: Adds the generated data to the training set for the next training iteration, creating a self-improvement loop.  <br>● Multi-framework gains: Shows consistent improvements across different code LLM frameworks on code generation tasks.  <br>● Self-improvement research: An early example of the self-improvement paradigm for LLMs that would later mature in 2024's self-rewarding and self-play approaches.                                                 | [Paper](https://arxiv.org/abs/2304.01228v1), [Tweet](https://twitter.com/dair_ai/status/1645089455659687937?s=20) |
| 8) **Summary of ChatGPT/GPT-4 Research** - An overview of ChatGPT and GPT-4 applications based on 194 papers.  <br>● 194-paper meta-analysis: Analyzes 194 relevant papers to produce an integrated overview of the ChatGPT/GPT-4 research landscape.  <br>● Capability-limitation balance: Discusses capabilities, limitations, concerns, and research directions in structured fashion.  <br>● Application catalog: Catalogs applications across education, healthcare, coding, writing, and specialized domains.  <br>● Research synthesis: Useful as a condensed view of the first six months of post-ChatGPT research explosion.                                                                                                       | [Paper](https://arxiv.org/abs/2304.01852), [Tweet](https://twitter.com/dair_ai/status/1645089457488404486?s=20)   |
| 9) **Pythia** - EleutherAI's suite for analyzing LLMs across training and scaling.  <br>● 16-model suite: 16 LLMs trained on public data (The Pile) ranging from 70M to 12B parameters, all with identical training recipes.  <br>● Training checkpoints: Releases 154 training checkpoints per model, enabling analysis of learning dynamics across training.  <br>● Scale-controlled research: The consistent methodology across sizes enables rigorous scaling analyses without confounders.  <br>● Interpretability foundation: Became the foundational testbed for mechanistic interpretability research through 2024.                                                                                                                               | [Paper](https://arxiv.org/abs/2304.01373), [Tweet](https://twitter.com/dair_ai/status/1645089459191382016?s=20)   |
| 10) **SegGPT: Segmenting Everything In Context** - Unifies segmentation tasks into a generalist in-context model.  <br>● In-context segmentation: Uses in-context examples (input-mask pairs) to define the segmentation task at inference time.  <br>● Task generalization: Handles semantic, instance, panoptic, and referring segmentation through the same in-context interface.  <br>● Training-free adaptation: Adapts to new segmentation tasks without retraining - just provide example pairs.  <br>● Prompt-based vision: Part of the 2023 push to bring LLM-style in-context learning to vision tasks.                                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2304.03284), [Tweet](https://twitter.com/dair_ai/status/1645089461124886529?s=20)   |

---

## Top AI Papers of the Week (Mar 27 - April 2)

| **Paper**                                                                                                                                                                                                                                                                                                                                                                      | **Links**                                                                                                                |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------ |
| 1) **BloombergGPT** - A 50B-parameter LLM specialized for finance.  <br>● Largest finance dataset: 363 billion tokens of financial data plus 345 billion tokens from general-purpose datasets - the largest domain-specific LLM dataset at the time.  <br>● Finance-task specialization: Outperforms existing models on financial NLP tasks (sentiment, NER, classification).  <br>● General capability preservation: Maintains competitive performance on general LLM benchmarks despite heavy finance specialization.  <br>● Domain-specific LLM blueprint: Established the template for well-resourced domain-specific LLMs (medical, legal, financial) through 2023-2024. | [Paper](https://arxiv.org/abs/2303.17564v1), [Tweet](https://twitter.com/omarsar0/status/1641787456436547584?s=20)       |
| 2) **Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA)** - A low-cost bimanual robot manipulation system.  <br>● Action Chunking with Transformers (ACT): Introduces ACT, a generative model that predicts action chunks (sequences) rather than single actions - dramatically improving task success.  <br>● Low-cost hardware: The ALOHA platform uses ~$20K of off-the-shelf parts, making bimanual manipulation research broadly accessible.  <br>● Fine-grained tasks: Demonstrates difficult real-world tasks like threading zip ties, unwrapping candy, and slotting battery cells.  <br>● Robotics research catalyst: ALOHA became one of the most influential robotics platforms of 2023-2024, powering downstream research like Mobile ALOHA.                                            | [Paper](https://tonyzhaozh.github.io/aloha/), [Tweet](https://twitter.com/tonyzzhao/status/1640393026341322754?s=20)     |
| 3) **HuggingGPT (Jarvis)** - ChatGPT orchestrates HuggingFace models to solve complex AI tasks.  <br>● LLM as controller: ChatGPT plans tasks, selects appropriate HuggingFace models, dispatches sub-tasks, and summarizes results.  <br>● Model Hub integration: Directly leverages the HuggingFace model hub, giving ChatGPT access to thousands of specialized models.  <br>● Four-stage pipeline: Task planning → model selection → task execution → response generation - a clear architecture influential in later agent frameworks.  <br>● LLM-as-orchestrator pattern: A canonical example of the LLM-as-orchestrator paradigm that dominated 2023 agent research.                                                                                                        | [Paper](https://arxiv.org/abs/2303.17580), [Tweet](https://twitter.com/johnjnay/status/1641609645713129473?s=20)         |
| 4) **ChatDoctor** - A medical chat model fine-tuned on LLaMA with medical domain knowledge.  <br>● 700 diseases covered: Collects data on approximately 700 diseases to provide broad medical coverage.  <br>● 5K doctor-patient conversations: Generates 5,000 doctor-patient conversations for fine-tuning, simulating realistic clinical dialog.  <br>● LLaMA foundation: Built on LLaMA, part of the 2023 wave of LLaMA-based domain-specific fine-tunes.  <br>● Medical LLM lineage: Early entry in the medical LLM space that would continue with PMC-LLaMA, Meditron, and later specialized clinical LLMs.                                                                                            | [Paper](https://arxiv.org/abs/2303.14070), [Tweet](https://twitter.com/omarsar0/status/1640525256719753217?s=20)         |
| 5) **LLaMA-Adapter** - Efficient fine-tuning of LLaMA with zero-init attention.  <br>● Zero-init attention: Uses zero-initialized attention layers so the adapter starts from identity function, preserving pretrained behavior.  <br>● Tiny parameter count: Only 1.2M trainable parameters adapt LLaMA into an instruction-follower - extremely parameter-efficient.  <br>● Alpaca-quality responses: Matches Alpaca's response quality (fully fine-tuned 7B) with far fewer trainable params.  <br>● Multimodal extension: Extended to accept multi-modal inputs (images), an early step toward efficient VLM adapters.                                                     | [Paper](https://arxiv.org/abs/2303.16199) , [Tweet](https://twitter.com/rasbt/status/1641457696074334209?s=20)           |
| 6) **ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks** - Empirically shows ChatGPT beats MTurk on text annotation.  <br>● Multi-task comparison: Evaluates ChatGPT against MTurk crowd workers on relevance, topic, stance, frames, and general annotation tasks.  <br>● Higher accuracy: ChatGPT achieves higher zero-shot accuracy than crowd workers on most tested annotation tasks.  <br>● 20x cost reduction: ChatGPT's per-annotation cost is approximately 20x cheaper than MTurk.  <br>● Annotation economy shift: Marked a real turning point in how NLP researchers think about dataset construction, accelerating LLM-powered annotation pipelines.                                                           | [Paper](https://arxiv.org/abs/2303.15056v1) , [Tweet](https://twitter.com/AlphaSignalAI/status/1641496876527517696?s=20) |
| 7) **Language Models can Solve Computer Tasks (RCI)** - LLM agent executes computer tasks via recursive self-criticism.  <br>● Recursive Criticism and Improvement: A prompting scheme where the LLM generates actions, critiques its own output, and improves iteratively.  <br>● Computer task execution: Demonstrates LLMs can execute real computer tasks (navigation, form-filling, data entry) with simple prompting.  <br>● Zero-shot without training: Works zero-shot without any task-specific fine-tuning, using only prompting.  <br>● Web-agent foundation: An early demonstration of LLM-based web/computer agents that informed 2024's agent framework explosion.                                                                                                                                                               | [Paper](https://arxiv.org/abs/2303.17491), [Tweet](https://twitter.com/arankomatsuzaki/status/1641609722951516161?s=20)  |
| 8) **DERA** - Dialog-Enabled Resolving Agents for enhancing LLM completions.  <br>● Multi-agent dialog: Uses multiple LLM "agents" that communicate feedback and iteratively refine outputs through dialog.  <br>● Role-based agents: Typically pairs a Researcher and Decider with distinct responsibilities, producing higher-quality outputs.  <br>● Beats base GPT-4: DERA outperforms base GPT-4 on clinically-focused tasks requiring careful reasoning.  <br>● Multi-agent LLM pattern: An early example of the multi-agent debate/collaboration pattern that became widespread in 2024 (AutoGen, CrewAI).                                                                                      | [Paper](https://arxiv.org/abs/2303.17071), [Tweet](https://twitter.com/johnjnay/status/1642168727796961280?s=20)         |
| 9) **Natural Selection Favors AIs over Humans** - Dan Hendrycks on why AI systems will outcompete humans evolutionarily.  <br>● Evolutionary framing: Argues that AI systems will become more evolutionarily "fit" than humans in competition for resources and influence.  <br>● Selection pressures: Identifies specific selection pressures (efficiency, resource acquisition, goal-directedness) that favor AI over humans.  <br>● Risk analysis: Discusses potential dangers including loss of human agency, and strategies to mitigate them.  <br>● AI safety framing: Contributed a memorable framing to AI safety discussions during the 2023 existential-risk conversation.                                                                                                                                                                                     | [Paper](https://arxiv.org/abs/2303.16200), [Tweet](https://twitter.com/DanHendrycks/status/1641102660412792833?s=20)     |
| 10) **Machine Learning for Partial Differential Equations** - A review of ML approaches to PDEs.  <br>● Comprehensive review: Examines ML avenues for solving, learning, and discovering partial differential equations.  <br>● Method taxonomy: Covers neural PDE solvers, Fourier neural operators, physics-informed neural networks, and learned simulators.  <br>● Scientific ML reference: Positions ML-for-PDEs as a coherent sub-field with its own methods and benchmarks.  <br>● SciML roadmap: Influential in the growing scientific machine learning community, informing later foundation-model work on physics simulation.                                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2303.17078), [Tweet](https://twitter.com/DynamicsSIAM/status/1641608068453777412?s=20)     |

---

## Top AI Papers of the Week (Mar 20-Mar 26)

| **Paper**                                                                                                                                                                                                                                                                                                                                    | **Links**                                                                                                                                                                                |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Sparks of Artificial General Intelligence: Early Experiments with GPT-4** - Microsoft Research's influential investigation of early GPT-4.  <br>● Pre-release GPT-4 access: Examines an early, less-aligned GPT-4 while still in active development at OpenAI.  <br>● "Sparks of AGI" claim: Argues GPT-4 shows sparks of general intelligence across diverse domains - a provocative and widely-debated claim.  <br>● Rich demonstrations: Includes stunning demonstrations of GPT-4's capabilities on math, coding, vision, theory-of-mind, and more.  <br>● Discourse-defining paper: Set much of the 2023 public discourse around AGI timelines and LLM capabilities.                                                                                                                                               | [Paper](https://arxiv.org/abs/2303.12712), [Tweet](https://twitter.com/dair_ai/status/1639991716349460481?s=20)                                                                          |
| 2) **Reflexion** - An autonomous agent with dynamic memory and self-reflection.  <br>● Self-reflection loop: Agent reflects on failed attempts in natural language and stores reflections in episodic memory for future use.  <br>● Verbal reinforcement: Uses verbal self-feedback rather than gradient updates - an alternative to RL for agent improvement.  <br>● Task-specific action choice: Enhances task-specific action selection through reflection on prior reasoning traces.  <br>● Agent paradigm: Became one of the foundational agent papers of 2023, widely cited as a canonical example of LLM self-improvement via verbal reflection.                                                                                                    | [Paper](https://arxiv.org/abs/2303.11366), [Tweet](https://twitter.com/dair_ai/status/1639991718169722880?s=20)                                                                          |
| 3) **Capabilities of GPT-4 on Medical Challenge Problems** - Microsoft's medical evaluation showing GPT-4 passing USMLE handily.  <br>● 20+ points above passing: Exceeds USMLE passing score by over 20 points - a remarkable margin for a generalist model.  <br>● Beats Med-PaLM: Outperforms specialist medical models including Med-PaLM (prompt-tuned Flan-PaLM 540B).  <br>● No medical fine-tuning: Achieves these results without any medical-specific fine-tuning - pure generalist capability.  <br>● Medical-AI turning point: A key data point showing generalist frontier models could match or beat specialist medical LLMs, shifting the medical AI strategic landscape.                                                              | [Paper](https://www.microsoft.com/en-us/research/publication/capabilities-of-gpt-4-on-medical-challenge-problems/), [Tweet](https://twitter.com/dair_ai/status/1639991720224989188?s=20) |
| 4) **GPTs are GPTs** - OpenAI/UPenn's early look at LLM labor market impacts.  <br>● Occupational analysis: Systematically assesses which US occupations and tasks are most exposed to LLM automation.  <br>● 80% of workers exposed: Estimates ~80% of US workers have at least 10% of tasks affected, and 19% have at least 50% affected.  <br>● White-collar focus: Shows exposure concentrated in higher-paying, more educated occupations - reversing traditional automation patterns.  <br>● Policy-defining paper: Shaped 2023 policy discussions about AI's economic impact and informed subsequent labor economics research.                                                                                                                                        | [Paper](https://arxiv.org/abs/2303.10130), [Tweet](https://twitter.com/dair_ai/status/1639991722263412737?s=20)                                                                          |
| 5) **CoLT5** - Faster long-range Transformers via conditional computation.  <br>● Conditional computation: Routes important tokens through heavy branches while light tokens get a cheap path - saving compute on easy tokens.  <br>● Per-layer conditioning: Applies conditional computation in both feedforward and attention layers.  <br>● Long-input efficiency: Particularly effective for long documents where most tokens are routine and only a few need deep processing.  <br>● Long-context efficiency: Part of the efficient-attention research line that would continue with MoE and conditional-routing approaches through 2024.                                                                                                       | [Paper](https://arxiv.org/abs/2303.09752) , [Tweet](https://twitter.com/dair_ai/status/1639991723806826499?s=20)                                                                         |
| 6) **Artificial Muses: Generative AI Chatbots Have Risen to Human-Level Creativity** - Compares AI and human creativity.  <br>● Head-to-head comparison: Compares human-generated ideas with those from ChatGPT, YouChat, and other chatbots on creativity metrics.  <br>● Only 9.4% beat GPT-4: Only 9.4% of humans were judged more creative than GPT-4 - a striking finding about LLM creative capabilities.  <br>● Collaborative creative use: Concludes AI systems are valuable creative assistants rather than mere imitators.  <br>● Creativity evaluation: Part of the 2023 creativity-research cluster empirically testing claims about LLM creative limitations. | [Paper](https://arxiv.org/abs/2303.12003) , [Tweet](https://twitter.com/dair_ai/status/1639991725442646018?s=20)                                                                         |
| 7) **A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series** - Systematic evaluation of the GPT series.  <br>● 9 NLU tasks, 21 datasets: Evaluates GPT-3 and GPT-3.5 variants on 9 natural language understanding tasks using 21 datasets.  <br>● Series-wide comparison: Covers the full GPT-3 and GPT-3.5 family (davinci, davinci-002, davinci-003, ChatGPT) enabling lineage-tracking.  <br>● Capability regression detection: Identifies task-specific regressions and improvements across generations.  <br>● Practical reference: Used by practitioners choosing between OpenAI API model variants for specific tasks.                                                                                                                 | [Paper](https://arxiv.org/abs/2303.10420), [Tweet](https://twitter.com/dair_ai/status/1639991727292395520?s=20)                                                                          |
| 8) **Context-faithful Prompting for Large Language Models** - Prompting techniques to improve LLM faithfulness to given context.  <br>● Faithfulness-improving strategies: Introduces opinion-based prompts and counterfactual demonstrations that improve context adherence.  <br>● Parametric-knowledge override: Helps LLMs prioritize context-provided information over conflicting parametric knowledge.  <br>● RAG-relevant: Particularly useful for RAG setups where LLMs must prioritize retrieved documents over their baseline knowledge.  <br>● Grounding research: Part of the broader 2023 work on making LLMs more faithful to provided context.                                                                                                                       | [Paper](https://arxiv.org/abs/2303.11315), [Tweet](https://twitter.com/dair_ai/status/1639991728882032646?s=20)                                                                          |
| 9) **Text2Room** - Extracts textured 3D meshes of rooms from 2D text-to-image models.  <br>● Text-to-3D rooms: Generates room-scale textured 3D meshes purely from text prompts by leveraging 2D T2I models.  <br>● Iterative view generation: Progressively generates 2D views, reconstructs depth, and fuses into a coherent 3D mesh.  <br>● 2D-to-3D lifting: Demonstrates how to lift powerful 2D generation to 3D without needing 3D training data.  <br>● 3D generation lineage: An influential step in the 2023 explosion of text-to-3D methods informed by Stable Diffusion's success.                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2303.11989), [Project](https://lukashoel.github.io/text-to-room/)[Tweet](https://twitter.com/dair_ai/status/1639991730723254274?s=20)                      |
| 10) **PanGu-Σ** - Huawei's trillion-parameter LM with sparse heterogeneous computing.  <br>● 1 trillion parameters: Scales to 1T total parameters using sparse mixture-of-experts routing to keep inference compute manageable.  <br>● Heterogeneous computing: Designed to leverage heterogeneous hardware (GPUs, Ascend NPUs) at massive scale.  <br>● Chinese language focus: Particularly strong on Chinese NLP tasks while also supporting multilingual capabilities.  <br>● Trillion-scale era: Part of the 2023 trillion-parameter wave alongside GLaM and Switch Transformer extensions.                                                                                                                                                                    | [Paper](https://arxiv.org/abs/2303.10845), [Tweet](https://twitter.com/dair_ai/status/1639991732405252100?s=20)                                                                          |

---

## Top AI Papers of the Week (Mar 13-Mar 19)

| **Paper**                                                                                                                                                                                                                                                            | **Links**                                                                                                                                                        |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **GPT-4 Technical Report** - OpenAI's landmark GPT-4 release marking the frontier of 2023.  <br>● Multimodal capabilities: Large multimodal model accepting text and image inputs and producing text outputs with substantially broader reasoning.  <br>● Human-level exams: Scores in top percentiles on simulated bar exams, SAT, GRE, and similar - markedly better than GPT-3.5.  <br>● Alignment improvements: Extensive RLHF and red-teaming produce significantly safer and more helpful outputs than predecessors.  <br>● Industry-defining release: Set the capability bar that defined the 2023 AI landscape and triggered the global race to match frontier model performance.                                                                                                                                       | [Paper](https://arxiv.org/abs/2303.08774v2), [Tweet](https://twitter.com/dair_ai/status/1637456913993433089?s=20)                                                |
| 2) **LERF: Language Embedded Radiance Fields** - Grounds CLIP language embeddings into NeRF for 3D language queries.  <br>● CLIP embeddings in 3D: Lifts CLIP's language-image features into 3D NeRF representations at every location.  <br>● Open-ended 3D queries: Enables open-ended text queries like "where is the espresso" - the NeRF highlights relevant 3D regions.  <br>● Dense 3D-language features: Per-voxel language features enable both localization and retrieval in 3D scenes.  <br>● 3D semantic understanding: Influential for subsequent research combining language grounding with 3D representations.                                                                                         | [Paper](https://arxiv.org/abs/2303.09553), [Tweet](https://twitter.com/dair_ai/status/1637456915658686465?s=20)                                                  |
| 3) **An Overview on Language Models: Recent Developments and Outlook** - Comprehensive LM overview covering structures and future directions.  <br>● Full-stack coverage: Covers linguistic units, model structures, training methods, evaluation, and applications.  <br>● Structured taxonomy: Organizes LM research into clear categories useful for newcomers entering the field.  <br>● Trend analysis: Identifies major research trends and open problems as of early 2023.  <br>● Reference overview: A widely-used survey for orienting to the rapidly-evolving LM landscape. | [Paper](https://arxiv.org/abs/2303.05759), [Tweet](https://twitter.com/omarsar0/status/1635273656858460162?s=20)                                                 |
| 4) **Eliciting Latent Predictions from Transformers with the Tuned Lens** - An interpretability method tracing LM predictions layer-by-layer.  <br>● Tuned lens: Learns per-layer linear probes that translate intermediate hidden states into next-token probability distributions.  <br>● Logit lens improvement: An improved version of "logit lens" that works more reliably across layers and models.  <br>● Layer-by-layer prediction evolution: Reveals how predictions form gradually across transformer layers rather than instantaneously.  <br>● Interpretability toolkit: Became a standard tool in the mechanistic interpretability research community.                                                                     | [Paper](https://arxiv.org/abs/2303.08112), [Tweet](https://twitter.com/dair_ai/status/1637456919819440130?s=20)                                                  |
| 5) **Meet in the Middle** - A new pretraining paradigm combining data efficiency with infilling capability.  <br>● Bidirectional pretraining: Trains LMs to predict from both directions, meeting in the middle of sequences.  <br>● Data efficiency: Jointly improves training data efficiency and downstream LM capability.  <br>● Infilling strength: Particularly strong on infilling tasks where both prefix and suffix context matter.  <br>● Code generation gains: Demonstrates improvements in code generation tasks where infilling is a common use case (IDE autocomplete).        | [Paper](https://arxiv.org/abs/2303.07295) , [Tweet](https://twitter.com/dair_ai/status/1637456922004561920?s=20)                                                 |
| 6) **Resurrecting Recurrent Neural Networks for Long Sequences (LRU)** - Deep RNNs matching state-space model performance.  <br>● Linear Recurrent Unit: Introduces a carefully-designed LRU architecture using standard signal propagation principles.  <br>● S4 parity: Matches the performance of deep state-space models (S4) on long-range reasoning benchmarks.  <br>● RNN renaissance: Demonstrates that classical RNNs, with proper initialization and design, remain competitive.  <br>● SSM lineage: Informed subsequent state-space model research including Mamba and the broader 2024 SSM renaissance.                   | [Paper](https://arxiv.org/abs/2303.06349) , [Tweet](https://twitter.com/dair_ai/status/1637456923795521537?s=20)                                                 |
| 7) **UPRISE: Universal Prompt Retrieval** - A lightweight retriever for zero-shot prompt selection.  <br>● Universal prompt pool: Builds a universal pool of prompts that can be retrieved for diverse tasks without task-specific setup.  <br>● Lightweight retriever: Trains a small, versatile retriever to select the best prompts for a given input at inference time.  <br>● Zero-shot improvements: Significant zero-shot performance gains and hallucination reduction.  <br>● Prompt retrieval research: Part of the broader research direction on automated prompt engineering that matured in 2024.                     | [Paper](https://arxiv.org/abs/2303.08518), [Tweet](https://twitter.com/dair_ai/status/1637456925779456000?s=20)                                                  |
| 8) **Patches Are All You Need? (ConvMixer)** - A parameter-efficient fully-convolutional ViT alternative.  <br>● Conv-based mixing: Replaces self-attention and MLP layers in ViTs with depthwise and pointwise convolutional layers.  <br>● Parameter efficiency: Achieves competitive accuracy with far fewer parameters and simpler architecture.  <br>● Patches-are-enough argument: Suggests much of ViT's success comes from patch-based processing, not attention itself.  <br>● Architecture minimalism: Reinforces the 2023 trend toward simpler architectures that match complex ones.                                       | [Paper](https://openreview.net/forum?id=rAnB7JSMXL), [Tweet](https://twitter.com/dair_ai/status/1637456927784329218?s=20)                                        |
| 9) **NeRFMeshing** - Distills NeRFs into geometrically-accurate 3D meshes.  <br>● NeRF-to-mesh: A compact, flexible architecture that extracts accurate 3D meshes from any NeRF-driven approach.  <br>● Geometric accuracy: Produces meshes with good geometric quality, useful for downstream graphics and simulation applications.  <br>● NeRF-approach agnostic: Works with multiple NeRF variants rather than being tied to one architecture.  <br>● Production bridge: Helps bridge NeRF research to production graphics pipelines that require traditional meshes.   | [Paper](https://arxiv.org/abs/2303.09431), [Tweet](https://twitter.com/dair_ai/status/1637456929705295873?s=20)                                                  |
| 10) **High-throughput Generative Inference with a Single GPU (FlexGen)** - High-throughput LLM inference on limited GPU memory.  <br>● Memory offloading: Offloads weights/KV-cache to CPU/disk and streams them into GPU memory as needed.  <br>● High throughput batch inference: Optimized for offline batch inference workloads where latency is less critical than throughput.  <br>● Single-GPU practicality: Makes running large LLMs on a single consumer-grade GPU feasible for research and hobbyist use.  <br>● Inference infrastructure: Influenced later inference optimization tools like vLLM and the broader inference-engine ecosystem.                                                                                              | [Paper](https://arxiv.org/abs/2303.06865), [Code](https://github.com/FMInference/FlexGen) , [Tweet](https://twitter.com/dair_ai/status/1637456931429183489?s=20) |

---

## Top AI Papers of the Week (Mar 6-Mar 12)

| **Paper**                                                                                                                                                                                                                                                                             | **Links**                                                                                                                                                                                                         |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **PaLM-E** - Google's embodied multimodal language model.  <br>● Sensor-modality integration: Incorporates real-world continuous sensor modalities (images, robot states) directly as tokens for the LM.  <br>● Embodied reasoning: Performs robotic manipulation planning, visual QA, and other embodied reasoning tasks via a single model.  <br>● 562B parameters: One of the largest multimodal models at the time, built on PaLM + ViT encoders.  <br>● Embodied AI foundation: A major step toward generalist embodied agents that bridge language, vision, and action.                                     | [Paper](https://arxiv.org/abs/2303.03378), [Demo](https://palm-e.github.io/) , [Tweet](https://twitter.com/dair_ai/status/1634919222420836358?s=20)                                                               |
| 2) **Prismer: A Vision-Language Model with An Ensemble of Experts** - a parameter-efficient vision-language model powered by an ensemble of domain experts; it efficiently pools expert knowledge from different domains and adapts it to various vision-language reasoning tasks.    | [Paper](https://arxiv.org/abs/2303.02506), [GitHub](https://github.com/NVlabs/Prismer), [Project](https://shikun.io/projects/prismer) , [Tweet](https://twitter.com/dair_ai/status/1634919224505257985?s=20)      |
| 3) **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models** - it connects ChatGPT and different visual foundation models to enable users to interact with ChatGPT beyond language format.                                                                       | [Paper](https://arxiv.org/abs/2303.04671), [GitHub](https://github.com/microsoft/visual-chatgpt) [Tweet](https://twitter.com/dair_ai/status/1634919226396794882?s=20)                                             |
| 4) **A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT** - an overview of generative AI - from GAN to ChatGPT.                                                                                                                    | [Paper](https://arxiv.org/abs/2303.04226), [Tweet](https://twitter.com/dair_ai/status/1634919228339003393?s=20)                                                                                                   |
| 5) **Larger language models do in-context learning differently** - shows that with scale, LLMs can override semantic priors when presented with enough flipped labels; these models can also perform well when replacing targets with semantically-unrelated targets.                 | [Paper](https://arxiv.org/abs/2303.03846) , [Tweet](https://twitter.com/dair_ai/status/1634919230461345797?s=20)                                                                                                  |
| 6) **Foundation Models for Decision Making: Problems, Methods, and Opportunities** - provides an overview of foundation models for decision making, including tools, methods, and new research directions.                                                                            | [Project](https://arxiv.org/abs/2303.04129) , [Tweet](https://twitter.com/dair_ai/status/1634919232650760192?s=20)                                                                                                |
| 7) **Hyena Hierarchy: Towards Larger Convolutional Language Models** - a subquadratic drop-in replacement for attention; it interleaves implicit long convolutions and data-controlled gating and can learn on sequences 10x longer and up to 100x faster than optimized attention.   | [Paper](https://arxiv.org/abs/2302.10866), [Code](https://github.com/HazyResearch/safari), [Blog](https://ermongroup.github.io/blog/hyena/), [Tweet](https://twitter.com/dair_ai/status/1634919234835980289?s=20) |
| 8) **OpenICL: An Open-Source Framework for In-context Learning** - a new open-source toolkit for in-context learning and LLM evaluation; supports various state-of-the-art retrieval and inference methods, tasks, and zero-/few-shot evaluation of LLMs.                             | [Paper](https://arxiv.org/abs/2303.02913), [Repo](https://github.com/Shark-NLP/OpenICL), [Tweet](https://twitter.com/dair_ai/status/1634919236954132480?s=20)                                                     |
| 9) **MathPrompter: Mathematical Reasoning using Large Language Models** - a technique that improves LLM performance on mathematical reasoning problems; it uses zero-shot chain-of-thought prompting and verification to ensure generated answers are accurate.                       | [Paper](https://arxiv.org/abs/2303.05398), [Tweet](https://twitter.com/dair_ai/status/1634919239030280197?s=20)                                                                                                   |
| 10) **Scaling up GANs for Text-to-Image Synthesis** - enables scaling up GANs on large datasets for text-to-image synthesis; it’s found to be orders of magnitude faster at inference time, synthesizes high-resolution images, & supports various latent space editing applications. | [Paper](https://arxiv.org/abs/2303.05511), [Project](https://mingukkang.github.io/GigaGAN/) , [Tweet](https://twitter.com/dair_ai/status/1634919241198751744?s=20)                                                |

---

## Top AI Papers of the Week (Feb 27-Mar 5)

| **Paper**                                                                                                                                                                                                                                                                                                                | **Links**                                                                                                                                                                                                                       |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Language Is Not All You Need: Aligning Perception with Language Models** - Microsoft's Kosmos-1 unifies perception and language in one foundation model. <br>● Multimodal LLM: Trains a single model on web-scale multimodal corpora including arbitrarily interleaved text and images, image-caption pairs, and text data. <br>● OCR-free NLP: Directly reads and reasons over images containing text without a separate OCR pipeline. <br>● Broad task coverage: Strong zero-shot and few-shot performance on language understanding, perception-language tasks, visual QA, and visual dialog. <br>● Perception-aware foundation: Early step toward general-purpose models that ground language in perception — a core prerequisite for AGI-style systems.                                | [Paper](https://arxiv.org/abs/2302.14045), [Tweet](https://twitter.com/dair_ai/status/1632383312550416384?s=20)                                                                                                                 |
| 2) **Evidence of a predictive coding hierarchy in the human brain listening to speech** - Nature study linking LLM activations to brain hierarchy. <br>● Brain–LM mapping: Uses fMRI on 304 subjects listening to stories to compare brain activations against modern LM representations. <br>● Long-range predictions: Finds brain activity is best explained by LMs augmented with long-range and hierarchical predictions, not single next-word predictions. <br>● Cortical hierarchy: Distance of prediction scales along a clear cortical hierarchy, echoing predictive coding theory. <br>● Neuro-AI bridge: Provides strong empirical support for treating LMs as computational models of language in the human brain.                                                                          | [Paper](https://www.nature.com/articles/s41562-022-01516-2?utm_source=twitter&utm_medium=organic_social&utm_campaign=evergreen&utm_content=animation), [Tweet](https://twitter.com/dair_ai/status/1632383315029180416?s=20)     |
| 3) **EvoPrompting: Language Models for Code-Level Neural Architecture Search** - uses LLMs as evolutionary operators to discover novel NN architectures. <br>● Evolutionary prompting: Combines evolutionary search with soft prompt-tuning to iteratively mutate in-context code examples of neural architectures. <br>● Code-level NAS: Generates valid architecture code using LMs, then scores and selects the best to seed the next generation. <br>● Outperforms baselines: Finds models surpassing hand-designed architectures on MNIST-1D and CLRS Algorithmic Reasoning. <br>● LMs as optimizers: Shows LLMs can act as design agents for ML research, not just text generators. | [Paper](https://arxiv.org/abs/2302.14838), [Tweet](https://twitter.com/dair_ai/status/1632383317302562816?s=20)                                                                                                                 |
| 4) **Consistency Models** - OpenAI introduces one-step generative models with diffusion-quality samples. <br>● Single-step sampling: Maps any noise level directly to the clean data, enabling high-quality generation in just 1-2 steps. <br>● Two training regimes: Trains either via consistency distillation from a pre-trained diffusion model, or standalone as a new class of generative models. <br>● Competitive quality: Achieves strong FID on CIFAR-10 and ImageNet without adversarial training. <br>● Fast inference: Offers ~10-100x speedups over diffusion sampling, shaping later real-time generative systems.                                                                                                                                                                                             | [Paper](https://arxiv.org/abs/2303.01469), [Tweet](https://twitter.com/dair_ai/status/1632383319152132096?s=20)                                                                                                                 |
| 5) **Goal Driven Discovery of Distributional Differences via Language Descriptions** - defines the D5 task: auto-discovering differences between two corpora as natural language. <br>● New task formulation: Given two text corpora + a research goal, the system outputs a language description of how they differ. <br>● Benchmark + system: Introduces OpenD5 with 675 open-ended problems across domains, plus a GPT-based discovery method. <br>● Real findings: Uncovers insights from product reviews, error patterns in NLP systems, and political speeches. <br>● Discovery-as-service: A template for using LMs as scientific-discovery tools, not just predictors.              | [Paper](https://arxiv.org/abs/2302.14233) , [Code](https://github.com/ruiqi-zhong/D5), [Tweet](https://twitter.com/dair_ai/status/1632383321035374593?s=20)                                                                     |
| 6) **High-resolution image reconstruction with latent diffusion models from human brain activity** - reconstructs photos subjects actually saw from fMRI signal. <br>● Stable Diffusion + brain: Maps fMRI voxels into text and image latents consumed by Stable Diffusion. <br>● No fine-tuning: Uses off-the-shelf Stable Diffusion with learned linear mappings from brain activity to latent spaces. <br>● High fidelity: Produces high-resolution reconstructions preserving semantic and structural detail of the viewed images. <br>● Neuro-decoding at scale: Demonstrates how foundation diffusion models can serve as powerful priors for brain decoding.                                                                                                                                              | [Project](https://sites.google.com/view/stablediffusion-with-brain/) , [Tweet](https://twitter.com/dair_ai/status/1632383323086487554?s=20)                                                                                     |
| 7) **Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control** - couples LLM planning with grounding functions during decoding. <br>● Joint decoding: At each token, combines LM probabilities with scores from grounded models (affordance, safety, preferences). <br>● Robot planning: Generates task plans for robots that respect the current environment and robot capabilities. <br>● General framework: Supports many grounding signals without retraining the LM — plug-and-play alignment at inference. <br>● Embodied generalization: Shows strong results across tabletop and mobile manipulation tasks, enabling flexible embodied reasoning.                                                 | [Paper](https://grounded-decoding.github.io/paper.pdf), [Project](https://grounded-decoding.github.io/) [Tweet](https://twitter.com/dair_ai/status/1632383325036740610?s=20)                                                    |
| 8) **Language-Driven Representation Learning for Robotics** - Voltron: visual pretraining guided by language from human videos. <br>● Video + captions: Learns representations from Ego4D-style human videos paired with captions, unifying MAE-style masked reconstruction with language. <br>● Controllable tradeoff: Lets practitioners balance between low-level grounded features and high-level semantic features. <br>● Robotics-friendly evaluation suite: Introduces a benchmark of imitation learning, grasp affordance, and referring expression tasks. <br>● Pretraining recipe: Establishes language-guided video pretraining as a strong backbone for robot policies.                                                                                                                                                       | [Paper](https://arxiv.org/abs/2302.12766), [Models](https://github.com/siddk/voltron-robotics), [Evaluation](https://github.com/siddk/voltron-evaluation), [Tweet](https://twitter.com/dair_ai/status/1632383327154888704?s=20) |
| 9) **Dropout Reduces Underfitting** - surprising finding that early-phase dropout helps underfit models. <br>● Early dropout: Applying dropout in the initial training epochs (then turning it off) improves generalization for underfitting models. <br>● Mechanism: Reduces gradient variance across mini-batches, counteracting SGD stochasticity. <br>● Late dropout: Conversely shows late dropout helps overfit regimes, inverting conventional usage. <br>● Regularization rethought: Forces a broader rethink of dropout's role beyond simple overfitting prevention.                                                                                      | [Paper](https://arxiv.org/abs/2303.01500), [Tweet](https://twitter.com/dair_ai/status/1632383328920666121?s=20)                                                                                                                 |
| 10) **Enabling Conversational Interaction with Mobile UI using Large Language Models** - uses a single LLM to drive diverse mobile UI conversational tasks. <br>● Unified prompting: Feeds UI screen representations into an LLM and prompts for QA, summarization, and screen mapping. <br>● Four tasks: Covers screen question generation, screen summarization, screen QA, and mapping instructions to UI actions. <br>● Competitive results: Matches task-specific models without any task-specific training. <br>● Foundation for UI agents: Foreshadows LLM-based UI agents that later power phone-control systems.                                                                                                                                                              | [Paper](https://arxiv.org/abs/2209.08655), [Tweet](https://twitter.com/dair_ai/status/1632383331286253568?s=20)                                                                                                                 |

---

## Top AI Papers of the Week (Feb 20-26)

| **Paper**                                                                                                                                                                                                                                                                                                | **Links**                                                                                                                                                                                                                   |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **LLaMA: Open and Efficient Foundation Language Models** - Meta's landmark open foundation model family. <br>● Four scales: Releases 7B, 13B, 33B, and 65B parameter models trained entirely on publicly available data. <br>● Compute-efficient: Trained on 1-1.4T tokens — more tokens per parameter than Chinchilla, optimizing inference over training cost. <br>● Benchmark-beating: LLaMA-13B outperforms GPT-3 (175B) on most benchmarks; 65B is competitive with PaLM-540B. <br>● Research catalyst: Release sparked the open-weight LLM explosion (Alpaca, Vicuna, LLaMA-2 ecosystem).                                                                                | [Paper](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/), [Tweet](https://twitter.com/dair_ai/status/1629845535946420226?s=20)                                              |
| 2) **Composer: Creative and Controllable Image Synthesis with Composable Conditions** - 5B diffusion model enabling compositional control over generation. <br>● Decomposition-then-composition: Decomposes images into representative conditions (text, sketch, depth, color) and recomposes them flexibly at inference. <br>● 5B parameters: Trained on billions of (text, image) pairs for strong base quality. <br>● Rich control: Supports colorization, style transfer, image translation, and more without task-specific retraining. <br>● Pre-ControlNet era milestone: One of the earliest general frameworks for multi-condition controllable diffusion.                                                                                                                                                | [Paper](https://arxiv.org/abs/2302.09778), [Project](https://damo-vilab.github.io/composer-page/) , [GitHub](https://github.com/damo-vilab/composer) , [Tweet](https://twitter.com/dair_ai/status/1629845537913548802?s=20) |
| 3) **The Wisdom of Hindsight Makes Language Models Better Instruction Followers** - HIR: alignment without RL. <br>● Hindsight Instruction Relabeling: Relabels failed outputs with instructions they would have been correct for, turning mistakes into supervised data. <br>● Supervised-only: Replaces PPO/RLHF pipelines with a simple two-stage SFT loop. <br>● BigBench results: Outperforms baselines including RLHF on 12 BigBench reasoning tasks with much simpler training. <br>● Algorithmic minimalism: Demonstrates that careful data relabeling can rival RL for alignment.                     | [Paper](https://arxiv.org/abs/2302.05206), [GitHub](https://github.com/tianjunz/HIR) [Tweet](https://twitter.com/dair_ai/status/1629845539964481537?s=20)                                                                   |
| 4) **Active Prompting with Chain-of-Thought for Large Language Models** - active learning meets CoT prompt engineering. <br>● Uncertainty-driven selection: Ranks candidate questions by LLM disagreement across sampled CoTs, then asks humans to annotate only the most uncertain. <br>● Adaptive exemplars: Replaces static few-shot CoT prompts with task-specific ones crafted via targeted annotation. <br>● Reasoning gains: Beats self-consistency and CoT baselines on arithmetic, commonsense, and symbolic reasoning benchmarks. <br>● Label-efficient alignment: A practical recipe for getting the most out of limited annotation budget. | [Paper](https://arxiv.org/abs/2302.12246), [Code](https://github.com/shizhediao/active-prompt) [Tweet](https://twitter.com/dair_ai/status/1629845541847724033?s=20)                                                         |
| 5) **Modular Deep Learning** - comprehensive survey of modular NN design. <br>● Unified taxonomy: Organizes modular methods along four axes — computation function, routing, aggregation, and training regime. <br>● Covers adapters, MoE, hypernetworks: Analyzes how LoRA, adapters, mixture-of-experts, and composable functions map into this taxonomy. <br>● Use-case breadth: Discusses modularity in scaling LMs, causal inference, hierarchical RL, and multilingual transfer. <br>● Research roadmap: Frames an emerging subfield and exposes open problems in routing, specialization, and cross-module generalization.                                                           | [Paper](https://arxiv.org/abs/2302.11529) , [Project](https://www.ruder.io/modular-deep-learning/), [Tweet](https://twitter.com/dair_ai/status/1629845544037228551?s=20)                                                    |
| 6) **Recitation-Augmented Language Models** - RECITE: self-retrieval via recitation. <br>● Memory recitation: Prompts the LLM to first recite relevant passages it has memorized, then condition on those passages to answer. <br>● No external retriever: Replaces document stores with the model's own parametric memory, then conditions answers on recited evidence. <br>● Strong on closed-book QA: Improves accuracy on TriviaQA, NaturalQuestions, and HotpotQA without any retrieval corpus. <br>● Practical technique: Cheap, drop-in method that later informed search-augmented and agentic inference strategies.                                                                                                                 | [Paper](https://arxiv.org/abs/2210.01296) , [Tweet](https://twitter.com/dair_ai/status/1629845546276995075?s=20)                                                                                                            |
| 7) **Learning Performance-Improving Code Edits** - LLMs as code performance optimizers. <br>● Dataset: Curates over 77K competitive programming C++ edits that correctly improve runtime performance. <br>● Prompting + fine-tuning: Benchmarks zero-shot, few-shot, and fine-tuned models for generating performance-improving refactors. <br>● Measured gains: Best configuration achieves ~2.5x average speedup across held-out programs while preserving correctness. <br>● AI code optimization: Formalizes performance editing as a learning problem and introduces evaluation protocols.                                                                                                                                                         | [Paper](https://arxiv.org/abs/2302.07867), [Tweet](https://twitter.com/dair_ai/status/1629845548210561029?s=20)                                                                                                             |
| 8) **More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models** - early foundational analysis of indirect prompt injection. <br>● Threat taxonomy: Defines direct vs indirect prompt injection and enumerates attacker capabilities against LLM-powered apps. <br>● Real exploits: Demonstrates data exfiltration, phishing, and persistent memory injections against Bing Chat and ChatGPT plugins. <br>● Attack vectors: Hidden instructions in retrieved pages, emails, and tool outputs can silently hijack the LM. <br>● Security agenda: Catalyzed prompt-injection research and defensive designs across the industry.                                                               | [Paper](https://arxiv.org/abs/2302.12173), [Tweet](https://twitter.com/dair_ai/status/1629845550152523777?s=20)                                                                                                             |
| 9) **Aligning Text-to-Image Models using Human Feedback** - brings RLHF-style alignment to diffusion models. <br>● Human reward model: Collects human ratings of image-text alignment to train a reward function over generated images. <br>● Supervised alignment fine-tuning: Re-weights generation to favor higher-reward samples via reward-weighted likelihood. <br>● Improved text-image matching: Increases faithfulness for counting, color, and composition prompts without sacrificing image quality. <br>● T2I alignment blueprint: Early template later expanded by DDPO, DPO-Diffusion, and other RL-based T2I tuning methods.                                                                                                                                                               | [Paper](https://arxiv.org/abs/2302.12192), [Tweet](https://twitter.com/dair_ai/status/1629845552039968780?s=20)                                                                                                             |
| 10) **MERF: Memory-Efficient Radiance Fields for Real-time View Synthesis in Unbounded Scenes** - makes large-scale NeRF playable in the browser. <br>● Hybrid volumetric rep: Combines a low-res 3D feature grid with two 2D feature planes for compact yet expressive scene representation. <br>● Real-time rendering: Achieves interactive frame rates in a browser for unbounded outdoor scenes. <br>● Memory-efficient: Roughly order-of-magnitude smaller memory footprint than competing NeRF baselines at similar quality. <br>● Deployable NeRF: A practical step toward shipping neural scene reps in consumer web experiences.                                                                                     | [Paper](https://arxiv.org/abs/2302.12249), [Tweet](https://twitter.com/dair_ai/status/1629845554061606915?s=20)                                                                                                             |

---

## Top AI Papers of the Week (Feb 13 - 19)

| **Paper**                                                                                                                                                                                                                                                            | **Links**                                                                                                                                                 |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Symbolic Discovery of Optimization Algorithms** - Google discovers Lion optimizer via evolutionary search. <br>● Program search: Uses an evolutionary symbolic search over programs to find new optimizers starting from primitive operations. <br>● Lion emerges: Discovers Lion (EvoLved Sign Momentum), simpler and more memory-efficient than Adam/AdamW. <br>● Broad gains: Improves ViT on ImageNet, vision-language training, and LM pretraining with significant compute savings. <br>● ML automation: Demonstrates that symbolic program search can produce genuinely novel, widely-useful training algorithms.                                                         | [Paper](https://arxiv.org/abs/2302.06675), [Tweet](https://twitter.com/dair_ai/status/1627671313874575362?s=20)                                           |
| 2) **Transformer models: an introduction and catalog** - comprehensive catalog and tutorial on the transformer family. <br>● Unified reference: Organizes prominent transformer-based models into a browsable catalog with architecture details, training data, and usage. <br>● Encoder/decoder/encoder-decoder split: Covers BERT-style, GPT-style, and T5-style branches with historical context. <br>● Ecosystem snapshot: Captures a mid-2023 survey including LLaMA, Flan-T5, PaLM, and multimodal variants. <br>● Teaching resource: Widely used as an onboarding reference for practitioners entering the LLM space.                                                                                                                                                               | [Paper](https://arxiv.org/abs/2302.07730), [Tweet](https://twitter.com/dair_ai/status/1627671315678126082?s=20)                                           |
| 3) **3D-aware Conditional Image Synthesis** - Pix2Pix3D: structure-to-image generation with view consistency. <br>● NeRF + conditional GAN: Extends conditional image generation with neural radiance fields for 3D structure awareness. <br>● Multi-view editing: Generates photorealistic images from segmentation/edge maps and lets users rotate or edit from novel viewpoints. <br>● Consistent across views: Preserves identity and layout when the camera moves, unlike 2D-only baselines. <br>● 3D generative assets: Step toward controllable 3D-aware content creation pipelines.                                          | [Project](https://www.cs.cmu.edu/~pix2pix3D/) [Tweet](https://twitter.com/dair_ai/status/1627671317355831296?s=20)                                        |
| 4) **The Capacity for Moral Self-Correction in Large Language Models** - Anthropic study on emergent ethical reasoning. <br>● RLHF-trained LMs self-correct: Finds evidence that larger RLHF-tuned models can reduce biased or stereotyped outputs when prompted to. <br>● Emergence threshold: The capability emerges at ~22B parameters and strengthens with further scale. <br>● Benchmarks: Evaluates on BBQ (bias), Winogender (gender bias), and law school admissions bias. <br>● Alignment implication: Suggests instruction-tuned models can be steered toward fairness via prompting — a building block for safety research. | [Paper](https://arxiv.org/abs/2302.07459), [Tweet](https://twitter.com/dair_ai/status/1627671319100768260?s=20)                                           |
| 5) **Vision meets RL** - applies RLHF-style reward fine-tuning to vision models. <br>● RL with task rewards: Treats CV models as policies and aligns them using task-specific rewards (IoU, accuracy, user-defined metrics). <br>● Big gains: Reports large improvements on object detection, panoptic segmentation, colorization, and image captioning. <br>● Generalizes prior work: Unifies RL post-training across heterogeneous CV tasks with a single recipe. <br>● Post-training for vision: Mirrors the LM alignment playbook — pretrain, then RL-tune toward task objectives.                                                         | [Paper](https://arxiv.org/abs/2302.08242)                                                                                                                 |
| 6) **Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment** - LQAE: image features quantized in the LM vocabulary. <br>● Quantize to text tokens: Learns a VQ autoencoder where codes are drawn from a pretrained LM's token vocabulary, aligning vision to language without captions. <br>● Unsupervised alignment: No image-caption pairs needed — the visual quantizer aligns with the LM's embedding geometry by construction. <br>● Few-shot classification: Enables LLMs to do few-shot image classification purely in-context. <br>● Bridge to LLMs: Offers a path for injecting vision into language models without expensive paired data.                                   | [Paper](https://arxiv.org/abs/2302.00902) , [Code](https://github.com/lhao499/lqae) [Tweet](https://twitter.com/haoliuhl/status/1625273748629901312?s=20) |
| 7) **Augmented Language Models: a Survey** - Meta's foundational survey of reasoning + tool use in LLMs. <br>● ALM definition: Formalizes augmented LMs as models with reasoning skills (CoT, self-consistency) and tool-using ability (retrievers, calculators, code). <br>● Taxonomy: Organizes the literature across reasoning, tools, and learning strategies (in-context vs fine-tuned). <br>● Open problems: Highlights challenges in tool orchestration, skill composition, and evaluation. <br>● Pre-agentic-era blueprint: Anticipates much of the agentic LLM wave that dominates the rest of 2023.                                           | [Paper](https://arxiv.org/abs/2302.07842), [Tweet](https://twitter.com/dair_ai/status/1627671324477820929?s=20)                                           |
| 8) **Geometric Clifford Algebra Networks** - GCANs for modeling physical and geometric systems. <br>● Geometric priors: Parametrizes layers using Clifford (geometric) algebra to natively encode rotations, reflections, and translations. <br>● Physics-oriented: Targets rigid-body dynamics, fluid simulation, and scientific computing where geometric structure matters. <br>● Equivariance for free: Respects symmetries of the underlying problem by construction, improving generalization. <br>● Scientific ML: Part of a growing trend of symmetry-aware architectures for physical simulation.                                                | [Paper](https://arxiv.org/abs/2302.06594), [Tweet](https://twitter.com/dair_ai/status/1627671326176473088?s=20)                                           |
| 9) **Auditing large language models: a three-layered approach** - governance framework for accountable LLM deployment. <br>● Three layers: Proposes governance audits (provider-level), model audits (behavioral), and application audits (deployment context). <br>● Concrete responsibilities: Maps each layer to who is accountable, what gets audited, and how to audit it. <br>● Policy-ready: Designed to inform regulators and practitioners shaping emerging AI policy regimes. <br>● Foundational reference: Frequently cited in later LLM governance and regulatory proposals (EU AI Act, NIST).                                                                                                                                                     | [Paper](https://arxiv.org/abs/2302.08500), [Tweet](https://twitter.com/dair_ai/status/1627671327950643200?s=20)                                           |
| 10) **Energy Transformer** - transformers as associative memories. <br>● Hopfield-inspired: Replaces stacked feedforward transformer blocks with one large associative memory that iteratively minimizes an energy function. <br>● Unified perspective: Reinterprets attention, feedforward, and norm layers through the lens of energy-based retrieval. <br>● Empirical validation: Matches or exceeds baseline transformers on image classification and graph anomaly detection. <br>● Architecture rethink: Part of a broader push to ground transformers in well-understood dynamical systems theory.                  | [Paper](https://arxiv.org/abs/2302.07253), [Tweet](https://twitter.com/dair_ai/status/1627671329561346050?s=20)                                           |

---

## Top AI Papers of the Week (Feb 6 - 12)

| **Paper**                                                                                                                                                                                                               | **Links**                                                                                                                                                                                           |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Toolformer: Language Models Can Teach Themselves to Use Tools** - Meta's seminal paper on self-supervised tool learning. <br>● Self-supervised annotation: LLM inserts candidate API calls into text, keeps only those that reduce perplexity of the continuation. <br>● Five tools: Teaches a model to use calculator, Q&A system, search engine, translator, and calendar. <br>● Zero human annotation: Achieves strong zero-shot tool use using only self-generated training data. <br>● Foundation of agentic era: Direct inspiration for ReAct, function-calling APIs, and the broader agentic LLM stack.                                                     | [Paper](https://arxiv.org/abs/2302.04761), [Tweet](https://twitter.com/dair_ai/status/1624832248691191808?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 2) **Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents** - DEPS agent framework for Minecraft. <br>● Four-stage loop: Describe current state, Explain failures, Plan next steps, Select actions — all driven by an LLM. <br>● Multi-task Minecraft: Achieves strong performance across 70+ open-world Minecraft tasks with a single agent. <br>● Interactive planning: Re-plans after failed steps using error descriptions as feedback, enabling robust long-horizon behavior. <br>● Open-ended agents: Early demonstration that LLMs can steer complex embodied agents in rich game environments.                           | [Paper](https://arxiv.org/abs/2302.01560), [Tweet](https://twitter.com/dair_ai/status/1624832250717036548?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 3) **A Categorical Archive of ChatGPT Failures** - early systematic taxonomy of ChatGPT weaknesses. <br>● 11 failure categories: Reasoning, logic, math, coding, factual errors, bias, ethics, humor, self-awareness, etc. <br>● Concrete examples: Documents hundreds of reproducible failure modes across categories. <br>● Evaluation scaffolding: Provides a structure for subsequent LLM evaluation and red-teaming efforts. <br>● Historical snapshot: Captures the limits of GPT-3.5-era ChatGPT right before the GPT-4 release.                                                       | [Paper](https://arxiv.org/abs/2302.03494), [Tweet](https://twitter.com/dair_ai/status/1624832252587700230?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 4) **Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt Tuning and Discovery** - PEZ optimizer for discrete text prompts. <br>● Continuous proxy: Optimizes continuous embeddings, projects them to nearest tokens each step, producing readable, transferable hard prompts. <br>● Cross-model portability: Hard prompts discovered on one model often transfer to others. <br>● Text + image: Works for text-to-image personalization and text-to-text tasks. <br>● Prompt engineering automation: Makes gradient-based prompt search practical, influential for later jailbreak research (e.g., GCG).                                       | [Paper](https://arxiv.org/abs/2302.03668), [Tweet](https://twitter.com/dair_ai/status/1624832254588465156?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 5) **Data Selection for Language Models via Importance Resampling** - DSIR: target-distribution matching for LM pretraining. <br>● Importance resampling: Selects pretraining data that matches a target downstream distribution using hashed n-gram importance weights. <br>● Cheap and scalable: Operates over huge corpora without fine-tuning or running forward passes. <br>● Downstream gains: Improves GLUE and domain-specific benchmarks vs random or heuristic selection. <br>● Data-centric pretraining: Part of the broader shift from "more data" to "better data" as a lever for LM quality.  | [Paper](https://arxiv.org/abs/2302.03169), [Tweet](https://twitter.com/dair_ai/status/1624832256400302080?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 6) **Structure and Content-Guided Video Synthesis with Diffusion Models** - Runway Gen-1, structure-preserving video-to-video diffusion. <br>● Dual conditioning: Disentangles structure (depth, frames) from content (text, reference image) for guided video synthesis. <br>● Latent video diffusion: Operates in a latent space for tractable training and inference on video. <br>● Broad edits: Supports stylization, compositional edits, and driven animation with temporal coherence. <br>● Commercial milestone: Underpins Runway's Gen-1 product, a flagship for early generative video.                                                | [Paper](https://arxiv.org/abs/2302.03011) , [Project](https://research.runwayml.com/gen1), [Tweet](https://twitter.com/dair_ai/status/1624832258296229889?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)            |
| 7) **A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity** - sweeping ChatGPT evaluation. <br>● 21 tasks: Evaluates ChatGPT across 9 NLP task categories, multiple languages, and multimodal prompts. <br>● Three axes: Probes reasoning ability, hallucination rates, and interactive multi-turn behavior. <br>● Mixed results: ChatGPT is strong on many tasks but brittle on multi-step logical reasoning and low-resource languages. <br>● Community benchmark: One of the most-cited empirical evaluations of ChatGPT during the GPT-3.5 era.      | [Paper](https://arxiv.org/abs/2302.04023), [Tweet](https://twitter.com/dair_ai/status/1624832260213026819?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                            |
| 8) **Noise2Music: Text-conditioned Music Generation with Diffusion Models** - Google's text-to-music diffusion system. <br>● Cascaded diffusion: Uses a text-conditioned generator plus super-resolution diffusion stages to produce 30-second audio. <br>● Two variants: Compares waveform- and spectrogram-level diffusion models. <br>● High quality: Captures genre, instrumentation, mood, and temporal structure from natural language prompts. <br>● Generative audio: A key reference point for subsequent music generation systems (MusicGen, Stable Audio).                                                | [Paper](https://arxiv.org/abs/2302.03917), [Project](https://google-research.github.io/noise2music/), [Tweet](https://twitter.com/dair_ai/status/1624832262163337220?s=20&t=ygX07dsAPDF8_jwrxZIo1Q) |
| 9) **Offsite-Tuning: Transfer Learning without Full Model** - privacy-preserving LLM fine-tuning. <br>● Emulator + adapter: Model owner shares a lossy "emulator" plus adapter; users fine-tune the adapter on local data without ever seeing the full model. <br>● Mutual privacy: Protects both the model owner's weights and the user's data. <br>● Efficient transfer: Reduces compute and memory substantially vs full fine-tuning of frontier LLMs. <br>● Deployment-relevant: Offers a path for specialized fine-tuning when distributing base weights is not viable. | [Paper](https://arxiv.org/abs/2302.04870), [Project](https://github.com/mit-han-lab/offsite-tuning), [Tweet](https://twitter.com/dair_ai/status/1624832264029831169?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)  |
| 10) **Zero-shot Image-to-Image Translation** - pix2pix-zero: prompt-driven diffusion editing without fine-tuning. <br>● Edit via text pairs: Translates images between concepts (e.g., "dog" → "cat") given just before/after text phrases — no training data or fine-tuning. <br>● Cross-attention guidance: Uses attention maps to preserve layout and identity during editing. <br>● Structure preserving: Unlike prior T2I editors, keeps the input image's geometry intact across large semantic edits. <br>● Training-free diffusion editing: Influential in the broader push toward zero-shot image editing (e.g., MasaCtrl, InstructPix2Pix).           | [Paper](https://arxiv.org/abs/2302.03027), [Project](https://pix2pixzero.github.io/), [Tweet](https://twitter.com/dair_ai/status/1624832265967607813?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                 |

---

## Top AI Papers of the Week (Jan 30-Feb 5)

| **Paper**                                                                                                                                                                                                                                      | **Links**                                                                                                                                                                                     |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **REPLUG: Retrieval-Augmented Black-Box Language Models** - turns any black-box LLM into a retrieval-augmented system. <br>● Retriever adapts to LM: Trains the retriever using LM output signal (not LM gradients) — works with closed APIs like GPT-3. <br>● Ensembled inference: Retrieves and processes multiple documents independently, ensembling predictions at the output. <br>● Strong RAG gains: Improves language modeling and MMLU substantially over few-shot GPT-3 baselines. <br>● API-era RAG: Makes retrieval augmentation viable even when model weights are inaccessible.                                                      | [Paper](https://arxiv.org/abs/2301.12652), [Tweet](https://twitter.com/dair_ai/status/1622261780725616641?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |
| 2) **Extracting Training Data from Diffusion Models** - landmark paper showing diffusion models memorize images. <br>● Extraction attack: Reconstructs individual training images (including copyrighted art) from Stable Diffusion and Imagen. <br>● Memorization rate: Finds hundreds of near-exact copies extractable, especially for frequently-seen images. <br>● Privacy + IP implications: Raises legal and ethical questions about training on copyrighted or personal data. <br>● Training-data leakage: Core evidence in ongoing copyright debates and inspires subsequent mitigation work.                                                              | [Paper](https://arxiv.org/abs/2301.13188), [Tweet](https://twitter.com/dair_ai/status/1622261782738788353?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |
| 3) **The Flan Collection: Designing Data and Methods for Effective Instruction Tuning** - Google's comprehensive instruction-tuning dataset. <br>● Massive scale: Combines 1,800+ tasks across multiple domains with diverse template formats. <br>● Design insights: Studies how mixing zero-shot, few-shot, and CoT prompts during training affects downstream capability. <br>● Flan-T5/PaLM release: Produces Flan-T5 and Flan-PaLM models that outperform base counterparts on MMLU and reasoning benchmarks. <br>● Open resource: Core public asset for the instruction-tuning research community.                       | [Paper](https://arxiv.org/abs/2301.13688), [Tweet](https://twitter.com/dair_ai/status/1622261784668241922?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |
| 4) **Multimodal Chain-of-Thought Reasoning in Language Models** - Amazon extends CoT to multimodal inputs. <br>● Two-stage pipeline: First generates a natural-language rationale grounded in the image, then uses that rationale to produce the final answer. <br>● Vision grounding: Fuses visual features with text at both rationale and answer stages. <br>● ScienceQA gains: Sub-1B model outperforms GPT-3.5 by ~16 points on ScienceQA, exceeding human-level performance. <br>● Efficient reasoning: Demonstrates that smaller multimodal LMs can outperform much larger text-only models through structured reasoning. | [Paper](https://arxiv.org/abs/2302.00923), [Code](https://github.com/amazon-science/mm-cot) [Tweet](https://twitter.com/dair_ai/status/1622261786559791105?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)     |
| 5) **Dreamix: Video Diffusion Models are General Video Editors** - Google's text-driven video editor. <br>● Motion + appearance edits: Modifies existing videos via text while preserving core object identity and high-level motion. <br>● Image-to-video: Also animates still images with text-driven motion, bridging image and video generation. <br>● Mixed training objective: Combines unmasked and masked video training to support edits and animation with one model. <br>● Versatile video editor: One of the first general-purpose text-driven video editing systems with coherent temporal dynamics.                                                                                 | [Paper](https://arxiv.org/abs/2302.01329), [Project](https://dreamix-video-editing.github.io/), [Tweet](https://twitter.com/dair_ai/status/1622261788497657856?s=20&t=ygX07dsAPDF8_jwrxZIo1Q) |
| 6) **Benchmarking Large Language Models for News Summarization** - rigorous evaluation of LLM summarization quality. <br>● Human study: Evaluates 10 LLMs on news summarization with professional freelance writers as reference baselines. <br>● Instruction tuning matters: Finds instruction-tuned LLMs match freelance writer quality, while base LLMs lag significantly. <br>● Prompt sensitivity: Demonstrates that prompt design has substantial impact on summarization quality. <br>● Automated metrics gap: Highlights the poor correlation between ROUGE and human preferences, pushing for better metrics.                                                                                                                                                                                               | [Paper](https://arxiv.org/abs/2301.13848) , [Tweet](https://twitter.com/dair_ai/status/1622261790326259714?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                     |
| 7) **Mathematical Capabilities of ChatGPT** - deep dive into ChatGPT's math reasoning. <br>● GHOSTS benchmark: Introduces a graduate-level holistic math benchmark spanning proofs, problem solving, and olympiad-style tasks. <br>● Mixed performance: ChatGPT handles undergraduate-level math but struggles with formal proofs and advanced reasoning. <br>● Qualitative analysis: Catalogs typical mistake patterns — hallucinated theorems, invalid inferences, symbolic errors. <br>● Math evaluation rigor: Provides a template for evaluating LLMs on structured mathematical reasoning.                                                                                 | [Paper](https://arxiv.org/abs/2301.13867), [Tweet](https://twitter.com/dair_ai/status/1622261792238886913?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |
| 8) **Emergence of Maps in the Memories of Blind Navigation Agents** - shows mental maps emerge in memory-only agents. <br>● Blind navigation: Trains RL agents with only egomotion and compass — no vision, no audio, no GPS. <br>● Emergent mapping: Despite lacking explicit spatial sensing, agents develop map-like internal representations of environments. <br>● Probing analysis: Decodable positional and topological information appears spontaneously in recurrent hidden states. <br>● Neuroscience parallel: Mirrors how animals build cognitive maps, supporting broader theories of spatial representation learning.                                          | [Paper](https://arxiv.org/abs/2301.13261), [Project](https://wijmans.xyz/publication/eom/), [Tweet](https://twitter.com/dair_ai/status/1622261793987989507?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)     |
| 9) **SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections** - synthesizes infinite 3D landscapes from 2D data alone. <br>● 2D-only supervision: Trains from only in-the-wild 2D image collections — no 3D ground truth required. <br>● BEV scene representation: Uses bird's-eye-view (BEV) plus height field representations to structure scene generation. <br>● Unbounded synthesis: Produces explorable, consistent 3D worlds across arbitrary camera trajectories. <br>● 3D generative scale: Demonstrates feasibility of large-scale 3D scene generation without expensive paired 3D assets.                                                                               | [Paper](https://arxiv.org/abs/2302.01330), [Tweet](https://twitter.com/dair_ai/status/1622261795925671936?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |
| 10) **Large Language Models Can Be Easily Distracted by Irrelevant Context** - exposes brittleness of LLM reasoning under noise. <br>● GSM-IC benchmark: Extends GSM8K by injecting irrelevant sentences into arithmetic word problems. <br>● Large accuracy drops: CoT, self-consistency, and other prompting methods lose 20+ points when irrelevant context is present. <br>● Mitigations: Shows that explicitly instructing the model to ignore irrelevant information partially recovers performance. <br>● Robustness gap: Signals a key weakness in LLM reasoning that later motivates robustness benchmarks and prompt design practices.                                                      | [Paper](https://arxiv.org/abs/2302.00093), [Tweet](https://twitter.com/dair_ai/status/1622261798379429888?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                      |

---

## Top AI Papers of the Week (Jan 23-29)

| **Paper**                                                                                                                                                                                                                                                            | **Links**                                                                                                                                                                                                                                                                                                            |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **MusicLM: Generating Music From Text** - Google's hierarchical text-to-music generator. <br>● Hierarchical tokens: Casts music generation as conditional language modeling over multiple streams of semantic, coarse, and fine audio tokens. <br>● 24kHz, minutes long: Generates high-fidelity music at 24kHz that remains coherent for several minutes. <br>● MusicCaps benchmark: Releases a 5.5K hand-labeled text-music caption dataset for evaluation. <br>● Generative music frontier: Defines the state of the art for text-to-music in early 2023 and anchors follow-up work (MusicGen, Stable Audio).                                                                                                                                                                           | [Paper](https://arxiv.org/abs/2301.11325), [Tweet](https://twitter.com/dair_ai/status/1619716425761042436?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                             |
| 2) **Hungry Hungry Hippos: Towards Language Modeling with State Space Models** - H3 architecture closes the SSM-attention gap. <br>● Diagnostic lenses: Identifies synthetic copying tasks where existing SSMs lag attention, then designs H3 layer to fix them. <br>● FlashConv kernel: Custom IO-aware FFT convolution implementation that makes SSMs hardware-efficient. <br>● 2.8x training speedup: Hybrid H3 + attention model trains 2.8x faster than Transformer baselines. <br>● Mamba precursor: Key stepping stone toward the Mamba and selective SSM architectures that followed.                                    | [Paper](https://arxiv.org/abs/2212.14052), [Tweet](https://twitter.com/dair_ai/status/1619716427879174144?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                             |
| 3) **A Watermark for Large Language Models** - Kirchenbauer et al. propose a detectable LM watermark. <br>● Green/red tokens: Partitions vocab into green/red lists per context via hashed seed; biases sampling toward green tokens. <br>● Statistical detection: A statistical test on the fraction of green tokens detects watermark with arbitrary confidence even on short samples. <br>● No quality loss: Empirically has negligible impact on generation quality while enabling provable detection. <br>● Provenance tooling: Foundational technique for LLM output attribution and later standardization efforts.                                                                                                                                                             | [Paper](https://arxiv.org/abs/2301.10226), [Tweet](https://twitter.com/dair_ai/status/1619716430127308800?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                             |
| 4) **Text-To-4D Dynamic Scene Generation** - Meta's Make-A-Video3D: 4D from text prompts. <br>● 4D synthesis: Generates dynamic 3D scenes (3D + time) directly from text descriptions. <br>● Video-SDS optimization: Uses score distillation sampling from Make-A-Video to supervise a time-varying NeRF. <br>● No 3D/video training data: Requires no 3D or 4D supervision — leverages 2D video priors. <br>● 4D generative pipeline: Establishes a framework for text-to-4D synthesis later refined by 4DGen, Animate124, and others.                                                                                                                                                    | [Paper](https://arxiv.org/abs/2301.11280), [GitHub](https://make-a-video3d.github.io/), [Tweet](https://twitter.com/dair_ai/status/1619718845018828801?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                |
| 5) **ClimaX: A foundation model for weather and climate** - Microsoft's first foundation model for atmospheric science. <br>● Flexible architecture: Transformer-based design that handles heterogeneous variables and spatio-temporal resolutions. <br>● Pretrained on CMIP6: Trains on climate model simulations before fine-tuning on real forecasting tasks. <br>● Multi-task performance: Competitive on forecasting, downscaling, climate projection, and S2S prediction. <br>● Climate AI: Establishes a template for foundation models in geosciences, foreshadowing GraphCast and Aurora.                                                   | [Paper](https://arxiv.org/abs/2301.10343), [Tweet](https://twitter.com/tungnd_13/status/1618642574427959296?s=20&t=ygX07dsAPDF8_jwrxZIo1Q), [Blog](https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/introducing-climax-the-first-foundation-model-for-weather-and-climate/) |
| 6) **Open Problems in Applied Deep Learning** - comprehensive map of practical DL challenges. <br>● 300+ references: Surveys ~300 papers to catalog where applied DL struggles in practice. <br>● End-to-end view: Covers data collection, architecture, training, evaluation, deployment, and monitoring. <br>● Actionable problems: Enumerates concrete research opportunities across each stage of the ML lifecycle. <br>● Community resource: Widely used as a reading list for graduate-level applied ML courses.          | [Paper](https://arxiv.org/abs/2301.11316) , [Tweet](https://twitter.com/dair_ai/status/1619719063915339777?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                            |
| 7) **DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature** - Stanford's probability-curvature detection. <br>● Curvature hypothesis: LM-generated text sits at a local maximum of the model's log-probability — perturbations predictably reduce probability. <br>● Zero-shot detector: Compares log-probability of a passage vs minor paraphrases without training a classifier. <br>● Strong accuracy: Outperforms supervised detectors across GPT-2, GPT-Neo, and ChatGPT. <br>● AI-generated content provenance: Influential in ongoing work on LLM text detection and authorship verification.                      | [Paper](https://arxiv.org/abs/2301.11305), [Tweet](https://twitter.com/dair_ai/status/1619719169758613504?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                             |
| 8) **StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis** - revives GANs for large-scale T2I. <br>● Scaled-up generator: Increases StyleGAN capacity and training data to handle complex text-to-image distributions. <br>● Fast inference: Orders of magnitude faster sampling than diffusion — single forward pass per image. <br>● Competitive quality: Narrows the quality gap to diffusion models on 64x64 and 256x256 resolutions. <br>● Latency-driven generation: Positions GANs as a compelling option for interactive T2I applications.                                                              | [Paper](https://arxiv.org/abs/2301.09515), [Project](https://sites.google.com/view/stylegan-t/), [Code](https://github.com/autonomousvision/stylegan-t) [Tweet](https://twitter.com/dair_ai/status/1619719293779976193?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                |
| 9) **Large language models generate functional protein sequences across diverse families** - ProGen: LLMs for protein design. <br>● 1.2B protein LM: Trained on ~280M protein sequences spanning broad taxonomy and functional annotation. <br>● Functional validation: Wet-lab experiments confirm generated enzymes are active — including sequences far from any natural homolog. <br>● Controllable generation: Condition-on-family prompts produce proteins with specified properties. <br>● Generative biology: Landmark Nature Biotechnology result demonstrating LLMs as bona fide design tools for synthetic biology.                                                                   | [Paper](https://www.nature.com/articles/s41587-022-01618-2), [Tweet](https://twitter.com/dair_ai/status/1619719404618645511?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                           |
| 10) **The Impossibility of Parallelizing Boosting** - theoretical lower bound on boosting parallelization. <br>● Inherent serial cost: Proves that boosting algorithms cannot be dramatically parallelized without increasing total work. <br>● Trade-off theorem: Establishes a formal trade-off between parallel rounds and total training time. <br>● Implications for ML systems: Shows boosting is fundamentally different from parallelizable algorithms like SGD. <br>● Theoretical contribution: Settles a long-standing open question in learning theory and shapes future algorithm design.                                                                                                                                                        | [Paper](https://arxiv.org/abs/2301.09627), [Tweet](https://twitter.com/dair_ai/status/1619719511867015168?s=20&t=ygX07dsAPDF8_jwrxZIo1Q)                                                                                                                                                                             |

---

## Top AI Papers of the Week (Jan 16-22)

| **Paper**                                                                                                                                                                                                                        | **Links**                                                                                                                                                                           |
| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Google AI Research Recap (2022 Edition)** - Jeff Dean's annual review of Google AI research. <br>● Breadth of impact: Surveys advances across language, vision, multimodal, generative models, and scientific AI. <br>● Key 2022 milestones: Highlights PaLM, Flamingo, Imagen, Parti, Minerva, LaMDA, and DeepMind's AlphaCode and AlphaFold work. <br>● Responsible AI: Dedicated sections on fairness, privacy, and sociotechnical research. <br>● Community reference: Frequently cited as an organizational snapshot of the AI research frontier at year-end 2022.                                                            | [Blog](https://ai.googleblog.com/2023/01/google-research-2022-beyond-language.html), [Tweet](https://twitter.com/JeffDean/status/1615796030611820545?s=20&t=vUEC8AZmrOJnVxuYIEJs5A) |
| 2) **Dissociating language and thought in large language models: a cognitive perspective** - Mahowald et al.'s landmark cognitive review. <br>● Formal vs functional language: Separates knowledge of linguistic rules from its use in reasoning, world knowledge, and social cognition. <br>● LLM assessment: Argues LLMs excel at formal linguistic competence but are deficient in functional competence. <br>● Cognitive science lens: Draws on decades of neuroscience to interpret LLM capabilities and failures. <br>● Framework influence: Widely adopted framing for discussing LLM reasoning, hallucination, and world models.                                                    | [Paper](https://arxiv.org/abs/2301.06627), [Tweet](https://twitter.com/neuranna/status/1615737072207400962?s=20&t=5iWUK4z_rp1NWst7JRbnwg)                                           |
| 3) **Human-Timescale Adaptation in an Open-Ended Task Space** - DeepMind's AdA: meta-learned embodied adaptation. <br>● Vast task distribution: Trains RL agents over a procedurally-generated task space spanning millions of 3D environments. <br>● In-context adaptation: Agent adapts to never-seen tasks within a few timesteps, matching human-level adaptation speed. <br>● Scale + memory matters: Shows meta-RL agents need both scale and attention-based memory to match human adaptation. <br>● General agents: Evidence that meta-RL at scale can produce broadly-capable embodied learners.                                | [Paper](https://arxiv.org/abs/2301.07608), [Tweet](https://twitter.com/FeryalMP/status/1616035293064462338?s=20&t=RN0YZFAXWr-uH2dT2ZTSqQ)                                           |
| 4) **AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation** - attention-based explanations for generative LMs. <br>● Token importance: Identifies which input tokens most affect model predictions by selectively masking attention. <br>● Memory-efficient: Avoids gradient computation by manipulating attention instead, enabling efficient analysis of large LMs. <br>● Multimodal generalization: Works for both language models and multimodal transformers like MAGMA. <br>● Interpretability tooling: Provides a scalable alternative to gradient-based attribution methods.                                      | [Paper](https://arxiv.org/abs/2301.08110), [Tweet](https://twitter.com/JonasAndrulis/status/1616722810608427008?s=20&t=vUEC8AZmrOJnVxuYIEJs5A)                                      |
| 5) **Everything is Connected: Graph Neural Networks** - Veličković's concise GNN primer. <br>● Unified perspective: Presents GNNs as a generalization of permutation-equivariant layers, connecting CNNs and transformers. <br>● Message passing: Covers the core message-passing formalism and its variants (GCN, GAT, MPNN). <br>● Key applications: Highlights GNNs in drug discovery, traffic prediction, physics simulation, and recommendation. <br>● Teaching resource: Compact reference for anyone entering the graph ML field.                                          | [Paper](https://arxiv.org/abs/2301.08210), [Tweet](https://twitter.com/PetarV_93/status/1616379369953394688?s=20&t=AqTVY30Y7IZCultzwnqBPA)                                          |
| 6) **GLIGEN: Open-Set Grounded Text-to-Image Generation** - adds grounded control to frozen diffusion models. <br>● Grounding inputs: Conditions pre-trained diffusion models on bounding boxes, keypoints, and reference images without retraining the base model. <br>● Gated self-attention: Inserts new attention layers that inject grounding signals while preserving existing generation quality. <br>● Open-set capabilities: Generalizes to novel concepts and layouts unseen during grounding training. <br>● Controlled generation: A key milestone in the spatially-controllable diffusion research line alongside ControlNet.                      | [Paper](https://arxiv.org/abs/2301.07093), [Tweet](https://twitter.com/hardmaru/status/1615766551113744384?s=20&t=wx0Y18oSmW0YenXjKRAdnA), [Project](https://gligen.github.io/)     |
| 7) **InstructPix2Pix: Learning to Follow Image Editing Instructions** - Berkeley's instruction-tuned image editor. <br>● Synthetic training data: Uses GPT-3 and Stable Diffusion to automatically generate (image, instruction, edited-image) triplets. <br>● Forward-only edits: Single forward pass edits images given natural-language instructions — no per-image optimization. <br>● Wide editing scope: Handles style changes, object swaps, additions, and attribute edits. <br>● Accessible image editing: Makes text-driven image editing accessible without inversion or fine-tuning per image.                                                                         | [Paper](https://arxiv.org/abs/2211.09800), [Tweet](https://twitter.com/_akhaliq/status/1615947919286276096?s=20&t=pbRTn8DaPeQFApQ9okkdRg)                                           |
| 8) **Dataset Distillation: A Comprehensive Review** - comprehensive review of dataset distillation. <br>● Problem definition: Formalizes dataset distillation as synthesizing a small dataset that preserves model training performance. <br>● Method taxonomy: Categorizes approaches by matching objective — meta-learning, gradient matching, trajectory matching, distribution matching. <br>● Applications: Surveys use cases in continual learning, privacy, neural architecture search, and federated learning. <br>● Open challenges: Identifies scaling, cross-architecture transfer, and theoretical understanding as key open problems.                                                                                                                                                                                                                                                                                                       | [Paper](https://arxiv.org/abs/2301.07014), [Tweet](https://twitter.com/omarsar0/status/1615745724473540609?s=20&t=r-pwuB6EhbZLXa5R6mL3NQ)                                           |
| 9) **Learning-Rate-Free Learning by D-Adaptation** - eliminates manual learning-rate tuning. <br>● Parameter-free optimizer: Adaptively estimates an effective learning rate from observed gradient norms, eliminating the need for an LR schedule. <br>● Optimal convergence: Matches the asymptotic convergence of optimally-tuned gradient descent. <br>● Broad applicability: Demonstrated on 12+ diverse ML problems from convex to large-scale deep learning. <br>● Production adoption: Later used in training practical models (precursor to Prodigy, Schedule-Free SGD).                                            | [Paper](https://arxiv.org/abs/2301.07733), [Tweet](https://twitter.com/aaron_defazio/status/1616453609956478977?s=20&t=hGWDXu4sT5f1KcH-X1IL9g)                                      |
| 10) **RecolorNeRF: Layer Decomposed Radiance Field for Efficient Color Editing of 3D Scenes** - interactive color editing for NeRFs. <br>● Layer decomposition: Decomposes NeRF scenes into color layers that can be edited independently. <br>● View-consistent recoloring: Color edits propagate coherently across all viewpoints of the 3D scene. <br>● Interactive workflow: Enables palette-based editing tools familiar from 2D image editing. <br>● 3D asset editing: Makes NeRFs practical for creative workflows that require post-hoc appearance edits.     | [Paper](https://arxiv.org/abs/2301.07958), [Tweet](https://twitter.com/_akhaliq/status/1616265465843548160?s=20&t=duiLmtDvxCwkFmw23rYDmQ)                                           |

---

## Top AI Papers of the Week (Jan 9-15)

| **Paper**                                                                                                                                                                                                                                                                                                           | **Links**                                                                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| 1) **Mastering Diverse Domains through World Models** - DreamerV3: scalable world-model RL. <br>● Single algorithm: Uses identical hyperparameters to solve 150+ diverse tasks spanning continuous control, Atari, and Minecraft. <br>● Minecraft diamond milestone: First algorithm to collect diamonds in Minecraft from scratch without human demonstrations or curricula. <br>● Robust world model: Learns a latent dynamics model with techniques (symlog prediction, KL balancing) that eliminate per-task tuning. <br>● General-purpose RL: Establishes world-model RL as a viable general algorithm across domains.                                                                         | [Paper](https://arxiv.org/abs/2301.04104v1), [Tweet](https://twitter.com/dair_ai/status/1614676677757661185?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                 |
| 2) **Tracr: Compiled Transformers as a Laboratory for Interpretability** - DeepMind's RASP-to-transformer compiler. <br>● Program-to-weights: Compiles human-readable RASP programs directly into transformer weights with known ground-truth mechanisms. <br>● Interpretability testbed: Provides models where every computation is known, enabling rigorous evaluation of interpretability methods. <br>● Toolkit for circuit research: Supports ablation studies, probing methods, and causal analysis with certainty. <br>● Mechanistic interpretability: Foundational tool for the mechanistic interpretability research program.                                                             | [Paper](https://arxiv.org/abs/2301.05062), [Tweet](https://twitter.com/dair_ai/status/1614676680165187584?s=20&t=3GITA7PeX7pGwrqvt97bYQ), [Code](https://github.com/deepmind/tracr)        |
| 3) **Multimodal Deep Learning** - comprehensive textbook on multimodal DL. <br>● Full textbook: 200+ page arXiv publication covering architectures, training, and applications of multimodal systems. <br>● Modality coverage: Discusses vision-language, vision-audio, and three-way multimodal models in depth. <br>● Architectural foundations: Details fusion techniques, cross-attention, contrastive learning, and joint embedding. <br>● Graduate-level teaching resource: Widely adopted for multimodal AI courses and self-study curricula.                                                    | [Book](https://arxiv.org/abs/2301.04856), [Tweet](https://twitter.com/dair_ai/status/1614676682555670528?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                    |
| 4) **Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk** - OpenAI's disinformation threat assessment. <br>● Kill chain framework: Analyzes LMs' role across disinformation pipeline — actor capabilities, content generation, distribution, and audience reach. <br>● Threat vectors: Identifies how generative LMs lower cost, increase scale, and enable tailored influence operations. <br>● Mitigation taxonomy: Proposes interventions at model design, platform, content distribution, and media literacy levels. <br>● Policy-relevant research: Shaped subsequent AI safety and elections-integrity efforts.                                        | [Paper](https://openai.com/blog/forecasting-misuse/), [Tweet](https://twitter.com/dair_ai/status/1614676684984156160?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                        |
| 5) **Why do Nearest Neighbor Language Models Work?** - empirical analysis of kNN-LM benefits. <br>● Interpolation effect: Identifies that mixing a kNN distribution with parametric LM softmax improves calibration more than knowledge addition. <br>● Representation capacity: Finds the LM's own context representations are the primary driver of kNN-LM gains. <br>● Softmax bottleneck: Shows kNN retrieval helps overcome the softmax bottleneck in expressive output distributions. <br>● Retrieval theory: Clarifies when and why retrieval augmentation helps parametric LMs.                                                                                                                | [Paper](https://arxiv.org/abs/2301.02828), [Code](https://github.com/frankxu2004/knnlm-why), [Tweet](https://twitter.com/dair_ai/status/1614676687597469696?s=20&t=3GITA7PeX7pGwrqvt97bYQ) |
| 6) **Memory Augmented Large Language Models are Computationally Universal** - proves LLMs + memory achieve Turing completeness. <br>● Formal proof: Shows Flan-U-PaLM 540B with associative external memory can simulate any universal Turing machine. <br>● Stored-program computation: Demonstrates that prompting LLMs with memory reads/writes produces arbitrary computation. <br>● Theoretical framing: Positions LLMs as programmable computational substrates, not just statistical models. <br>● Foundations of agentic LLMs: Theoretical backing for the later wave of tool-using and memory-augmented LLM agents.                                                                     | [Paper](https://arxiv.org/abs/2301.04589) , [Tweet](https://twitter.com/dair_ai/status/1614676689908277252?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                  |
| 7) **A Survey on Transformers in Reinforcement Learning** - comprehensive survey of transformers in RL. <br>● TransRL taxonomy: Organizes work by use — representation, policy architecture, world models, sequence-to-sequence RL. <br>● Offline vs online RL: Surveys Decision Transformer and Trajectory Transformer alongside online training variants. <br>● Partial observability: Highlights transformers' strength in long-horizon and partially-observable RL settings. <br>● Roadmap: Identifies open problems in training stability, sample efficiency, and generalization of transformer-based RL.                                                    | [Paper](https://arxiv.org/abs/2301.03044), [Tweet](https://twitter.com/dair_ai/status/1614676692538105860?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                   |
| 8) **Scaling Laws for Generative Mixed-Modal Language Models** - Meta's scaling laws for multimodal generation. <br>● Mixed-modal regime: Studies loss scaling when training on combinations of text, code, image, and speech. <br>● Cross-modal interference: Identifies when adding modalities helps vs hurts, formalizing competition and synergy effects. <br>● Compute-optimal ratios: Derives compute-optimal recipes for mixing different modalities during pretraining. <br>● Multimodal scaling roadmap: Informs the design of subsequent large multimodal models (Chameleon, Gemini).                                                                                                                                                                                | [Paper](https://arxiv.org/abs/2301.03728), [Tweet](https://twitter.com/dair_ai/status/1614676694920531969?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                   |
| 9) **DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching** - transformer-based local feature matcher. <br>● SlimFormer + InterFormer: Novel transformer designs for efficient intra- and inter-image feature interaction. <br>● Robust across challenges: Handles large viewpoint changes, illumination variation, and low-texture scenes. <br>● SOTA matching: Outperforms prior SOTA on HPatches, YFCC100M, and other matching benchmarks. <br>● Computer vision utility: Strengthens foundation tasks for 3D reconstruction, SfM, and visual localization.                                                   | [Paper](https://arxiv.org/abs/2301.02993), [Tweet](https://twitter.com/dair_ai/status/1614676697516752898?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                   |
| 10) **Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement** - D3VAE for time series forecasting. <br>● Triple-D framework: Combines Diffusion, Denoising, and Disentanglement in a bidirectional VAE backbone. <br>● Noise-aware training: Diffusion strengthens the model's ability to handle noisy time series data. <br>● Interpretable latent: Disentanglement yields interpretable latent factors linking to underlying temporal dynamics. <br>● SOTA forecasting: Beats transformer and deep-learning baselines on multiple real-world datasets.                                                                                                                                                   | [Paper](https://arxiv.org/abs/2301.03028), [Tweet](https://twitter.com/dair_ai/status/1614676699915980804?s=20&t=3GITA7PeX7pGwrqvt97bYQ)                                                   |

---

## Top AI Papers of the Week (Jan 1-8)

| **Paper**                                                                                                                                                                                                                                                                                                         | **Links**                                                                                                                                                                                                                                      |
| ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1) **Muse: Text-To-Image Generation via Masked Generative Transformers** - Google's masked-token T2I model. <br>● Masked transformer: Generates images via parallel masked token prediction instead of autoregressive or diffusion sampling. <br>● Dramatic speedup: 10x faster sampling than Imagen and Parti, producing high-quality images in few steps. <br>● Editing capabilities: Supports inpainting, outpainting, and mask-free editing natively via masked prediction. <br>● Alternative T2I paradigm: Demonstrates that non-diffusion approaches remain competitive for large-scale text-to-image generation.                                                       | [Paper](https://arxiv.org/abs/2301.00704), [Project](https://muse-model.github.io/), [Code](https://github.com/lucidrains/muse-maskgit-pytorch), [Tweet](https://twitter.com/dair_ai/status/1612153095772938241?s=20&t=ChwZWzSmoRlZKnD54fsV6w) |
| 2) **VALL-E Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers** - Microsoft's neural codec TTS model. <br>● Codec-based TTS: Treats text-to-speech as conditional language modeling over discrete audio codec tokens (EnCodec). <br>● 3-second cloning: Clones a speaker's voice from just a 3-second acoustic prompt, preserving timbre and emotion. <br>● Zero-shot voice synthesis: Zero-shot speaker adaptation without fine-tuning, a huge leap over prior TTS systems. <br>● Generative speech milestone: Bridges LLM methodology to speech, enabling a wave of prompt-based audio generation research.                                       | [Project](https://valle-demo.github.io/), [Tweet](https://twitter.com/dair_ai/status/1612153097962328067?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                                                        |
| 3) **Rethinking with Retrieval: Faithful Large Language Model Inference** - retrieval-augmented CoT. <br>● CoT-conditioned retrieval: Decomposes reasoning into steps via chain-of-thought, then retrieves evidence for each step. <br>● Faithful inference: Ensures answers are grounded in external knowledge rather than hallucinated. <br>● Strong accuracy: Improves over vanilla CoT on TriviaQA, NaturalQuestions, and other knowledge-intensive benchmarks. <br>● Retrieval reasoning: Early blueprint for the step-level RAG patterns now common in agentic systems.                                                                                      | [Paper](https://arxiv.org/abs/2301.00303), [Tweet](https://twitter.com/dair_ai/status/1612153100114055171?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                                                       |
| 4) **SparseGPT: Massive Language Models Can Be Accurately Pruned In One-Shot** - one-shot unstructured LLM pruning. <br>● No retraining: Prunes OPT-175B and BLOOM-176B to 50-60% sparsity in a few GPU-hours with no fine-tuning. <br>● Layer-wise solver: Frames pruning as a layer-wise reconstruction problem solved via efficient second-order updates. <br>● Minimal perplexity loss: Negligible accuracy degradation even at high sparsity ratios. <br>● Production-ready compression: Makes aggressive LLM compression practical at the largest scales, enabling cheaper deployment.                                                             | [Paper](https://arxiv.org/abs/2301.00774), [Tweet](https://twitter.com/dair_ai/status/1612153102513360901?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                                                       |
| 5) **ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders** - Meta's self-supervised ConvNet revival. <br>● Fully conv MAE: Adapts masked autoencoder pretraining for ConvNets using sparse convolutions over masked patches. <br>● GRN module: Introduces Global Response Normalization to boost feature diversity and training stability. <br>● Strong ImageNet results: Matches/beats ViT-based MAE on ImageNet, detection, and segmentation. <br>● CNN competitiveness: Demonstrates that ConvNets remain competitive when properly scaled with modern self-supervised pretraining.                                                                                     | [Paper](https://arxiv.org/abs/2301.00808), [Code](https://github.com/facebookresearch/convnext-v2), [Tweet](https://twitter.com/dair_ai/status/1612153104329281538?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                              |
| 6) **Large Language Models as Corporate Lobbyists** - LLMs applied to real-world lobbying tasks. <br>● Lobbying pipeline: Uses GPT-3.5 to classify relevant bills, summarize them, and generate corporate lobbying responses. <br>● Practical experiment: Deploys end-to-end LLM lobbying on real US Congressional bills affecting corporate interests. <br>● Ethics discussion: Probes implications for democratic discourse as LLMs lower the cost of scaled political engagement. <br>● Sociotechnical precedent: Informs broader debate about AI influence on governance and policy formation.                                                     | [Paper](https://arxiv.org/abs/2301.01181) , [Code](https://github.com/JohnNay/llm-lobbyist), [Tweet](https://twitter.com/dair_ai/status/1612153106355130372?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                     |
| 7) **Superposition, Memorization, and Double Descent** - Anthropic's toy-model study of memorization dynamics. <br>● Superposition of features: Shows how toy networks represent more features than neurons via superposition during memorization. <br>● Double descent explained: Provides mechanistic explanation for why test loss can decrease then spike then fall again with scale. <br>● Phase transitions: Observes clean transitions between memorization and generalization regimes. <br>● Mechanistic interpretability: Builds foundational theory for understanding feature representations in larger transformers.                                                                             | [Paper](https://transformer-circuits.pub/2023/toy-double-descent/index.html), [Tweet](https://twitter.com/dair_ai/status/1612153108460892160?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                    |
| 8) **StitchNet: Composing Neural Networks from Pre-Trained Fragments** - modular NN construction from existing weights. <br>● Fragment stitching: Composes new networks by stitching together layers from multiple pretrained models. <br>● Compatibility metric: Proposes measures of fragment compatibility to guide composition. <br>● Efficient reuse: Avoids expensive training by reusing existing components for new tasks. <br>● Modular deep learning: Early exploration of the growing modular ML space (model merging, adapter composition). | [Paper](https://arxiv.org/abs/2301.01947), [Tweet](https://twitter.com/dair_ai/status/1612153110452903936?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                                                       |
| 9) **Iterated Decomposition: Improving Science Q\&A by Supervising Reasoning Processes** - human-in-the-loop LM program refinement. <br>● Iterative decomposition: Breaks down complex QA tasks into subtasks and refines the decomposition through human feedback. <br>● Process supervision: Supervises intermediate reasoning steps rather than just final answers. <br>● ICE tool: Introduces the ICE (Interactive Composition Explorer) library for building compositional LM programs. <br>● Precursor to agent frameworks: Anticipates later LLM orchestration frameworks (LangChain, DSPy).                                                                    | [Paper](https://arxiv.org/abs/2301.01751), [Code](https://github.com/oughtinc/ice) [Tweet](https://twitter.com/dair_ai/status/1612153112638402562?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                               |
| 10) **A Succinct Summary of Reinforcement Learning** - compact overview of key RL concepts. <br>● Core ideas: Covers Markov decision processes, value iteration, policy gradients, and actor-critic methods. <br>● Modern methods: Touches on PPO, DQN, AlphaZero, and RLHF in a unified notation. <br>● Concise reference: Designed as a 20-page primer suitable for ML engineers needing quick RL grounding. <br>● Teaching resource: Useful pocket reference for those entering RL-adjacent areas like RLHF for LLM training.                                                                                                                                                             | [Paper](https://arxiv.org/abs/2301.01379), [Tweet](https://twitter.com/dair_ai/status/1612153114773053446?s=20&t=ChwZWzSmoRlZKnD54fsV6w)                                                                                                       |

---

We use a combination of AI-powered tools, analytics, and human curation to build the lists of papers.

[Subscribe to our NLP Newsletter](https://nlpnews.substack.com/) to stay on top of ML research and trends.

Join our [Discord](https://discord.gg/FzNtjEK9dg).