Publications

Andrew Lee

2026

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

Andrew Lee, Fernanda Viegas, and Martin Wattenberg

Workshop on Connecting Low-rank Representations in AI (@ ICML); Mechanistic Interpretability Workshop (@ ICML) 2026
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Sungjun Lim, Heedong Kim, Andrew Lee, and Kyungwoo Song

Mechanistic Interpretability Workshop (@ ICML) 2026
Agentic Reinforcement Learning for Search Misaligns Instruction-Tuning

Yushi Yang, Shreyansh Padarha, Sarah Ball, Andrew Lee, and Adam Mahdi

Workshop on Agents in the Wild: Safety, Security, and Beyond (@ ICML); Failure Modes of Agentic AI (@ ICML) 2026
Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee, Yonatan Belinkov, Fernanda Viegas, and Martin Wattenberg

ICML 2026
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, and Patrik Reizinger

ACL 2026
Valence–Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, and Jing Shao

Mechanistic Interpretability Workshop (@ ICML) 2026
Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, and Martin Wattenberg

ICLR 2026

2025

Shared Global and Local Geometry of Language Model Embeddings

Andrew Lee, Melanie Weber, Fernanda Viegas, and Martin Wattenberg

COLM 2025 - Outstanding Paper Award
ICLR: In-Context Learning of Representations

*Core Francisco Park, *Andrew Lee, *Ekdeep Singh Lubana, *Yongyi Yang, Maya Okawa, Kento Nishi, Martin Wattenberg, and Hidenori Tanaka

ICLR 2025
Better World Models Can Lead to Better Post-Training Performance

Prakhar Gupta, Henry Conklin, Sarah-Jane Leslie, and Andrew Lee

Mechanistic Interpretability @ NeurIPS 2025 - Spotlight
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls

Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viegas, Martin Wattenberg, and Andrew Lee

Preprint 2025
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis

Yushi Yang, Filip Sondej, Harry Mayne, Andrew Lee, and Adam Mahdi

EMNLP 2025
Eeyore: Realistic Depression Simulation via Expert-in-the-Loop Supervised and Preference Optimization

Siyang Liu, Bianca Brie, Wenda Li, Laura Biester, Andrew Lee, James Pennebaker, and Rada Mihalcea

Findings of ACL 2025
Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang, Shreyansh Padarha, Andrew Lee, and Adam Mahdi

Preprint 2025

2024

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K Kummerfeld, and Rada Mihalcea

ICML 2024 - Oral (Top 1.5% of submissions)
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

Core Francisco Park, Maya Okawa, Andrew Lee, Ekdeep Singh Lubana, and Hidenori Tanaka

NeurIPS 2024 - Spotlight

2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

*Neel Nanda, *Andrew Lee, and Martin Wattenberg

BlackboxNLP (EMNLP) 2023 - Honorable Mention, Best Paper
Empathy Identification Systems are not Accurately Accounting for Context

Andrew Lee, Jonathan Kummerfeld, Larry An, and Rada Mihalcea

EACL 2023
A PhD Student's Perspective on Research in NLP in the Era of Very Large Language Models

Oana Ignat, Zhijing Jin, Artem Abzaliev, Laura Biester, Santiago Castro, Naihao Deng, Xinyi Gao, Aylin Gunal, Jacky He, Ashkan Kazemi, and others

2023
Some things are more CRINGE than others: Preference Optimization with the Pairwise Cringe Loss

Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston

Preprint 2023

2022

Augmenting Task-Oriented Dialogue Systems with Relation Extraction

Andrew Lee, Zhenguo Chen, Kevin Leach, and Jonathan K. Kummerfeld

AAAI 2022 DSTC10 Workshop
Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines

Andrew Lee, David Wu, Emily Dinan, and Mike Lewis

Preprint 2022

2021

Micromodels for Efficient, Explainable, and Reusable Systems: A Case Study on Mental Health

Andrew Lee, Jonathan Kummerfeld, Lawrence An, and Rada Mihalcea

Findings of EMNLP 2021

2019

An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction

Stefan Larson, Anish Mahendran, Joseph J Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K Kummerfeld, Kevin Leach, Michael A Laurenzano, Lingjia Tang, and Jason Mars

EMNLP 2019
Outlier Detection for Improved Data Quality and Diversity in Dialog Systems

Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K Kummerfeld, Parker Hill, Michael A Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars

NAACL 2019