Andrew Lee

  • About
  • Publications
  • Talks
  • CV

© 2026

  • Publications

    2026

    1. From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
      Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, and Patrik Reizinger
      ACL 2026
    2. Decomposing Query-Key Feature Interactions Using Contrastive Covariances
      Andrew Lee, Yonatan Belinkov, Fernanda Viegas, and Martin Wattenberg
      Preprint 2026
    3. Valence–Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control
      Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, and Jing Shao
      Preprint 2026
    4. Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry
      Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, and Martin Wattenberg
      ICLR 2026

    2025

    1. Shared Global and Local Geometry of Language Model Embeddings
      Andrew Lee, Melanie Weber, Fernanda Viegas, and Martin Wattenberg
      COLM 2025 - Outstanding Paper Award
    2. ICLR: In-Context Learning of Representations
      *Core Francisco Park, *Andrew Lee, *Ekdeep Singh Lubana, *Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka
      ICLR 2025
    3. Better World Models Can Lead to Better Post-Training Performance
      Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, and Andrew Lee
      Mechanistic Interpretability @ NeurIPS 2025 - Spotlight
    4. Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
      Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viegas, Martin Wattenberg, and Andrew Lee
      Preprint 2025
    5. How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
      Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, and Adam Mahdi
      EMNLP 2025
    6. Eeyore: Realistic Depression Simulation via Expert-in-the-Loop Supervised and Preference Optimization
      Siyang Liu, Bianca Brie, Wenda Li, Laura Biester, Andrew Lee, James Pennebaker, and Rada Mihalcea
      Findings of ACL 2025
    7. Agentic Reinforcement Learning for Search is Unsafe
      Yushi Yang, Shreyansh Padarha, Andrew Lee, and Adam Mahdi
      Preprint 2025

    2024

    1. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
      Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea
      ICML 2024 - Oral (Top 1.5% of submissions)
    2. Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
      Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, and Hidenori Tanaka
      NeurIPS 2024 - Spotlight

    2023

    1. Emergent Linear Representations in World Models of Self-Supervised Sequence Models
      *Neel Nanda, *Andrew Lee, and Martin Wattenberg
      BlackboxNLP (EMNLP) 2023 - Honorable Mention, Best Paper
    2. Empathy Identification Systems are not Accurately Accounting for Context
      Andrew Lee, Jonathan Kummerfeld, Larry An, and Rada Mihalcea
      EACL 2023
    3. A PhD Student's Perspective on Research in NLP in the Era of Very Large Language Models
      Oana Ignat, Zhijing Jin, Artem Abzaliev, Laura Biester, Santiago Castro, Naihao Deng, Xinyi Gao, Aylin Gunal, Jacky He, Ashkan Kazemi, and others
      2023
    4. Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss
      Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston
      Preprint 2023

    2022

    1. Augmenting Task-Oriented Dialogue Systems with Relation Extraction
      Andrew Lee, Zhenguo Chen, Kevin Leach, and Jonathan K. Kummerfeld
      AAAI 2022 DSTC10 Workshop
    2. Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines
      Andrew Lee, David Wu, Emily Dinan, and Mike Lewis
      Preprint 2022

    2021

    1. Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health
      Andrew Lee, Jonathan Kummerfeld, Lawrence An, and Rada Mihalcea
      Findings of EMNLP 2021
      [Code]

    2019

    1. An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction
      Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, and Jason Mars
      EMNLP 2019
      [Data]
    2. Outlier Detection for Improved Data Quality and Diversity in Dialog Systems
      Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K Kummerfeld, Parker Hill, Michael A Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars
      NAACL 2019